Simmons et al. 2011

Author

Rick Gilmore

Purpose

This document simulates some data to illustrate Study 1 from (Simmons, Nelson, & Simonsohn, 2011). We discussed this study on 2023-10-04 and 2023-10-06.

Generate data

There are two song groups:

song <- c("kid", "ctrl")
song <- rep(song, each = 15)
song
 [1] "kid"  "kid"  "kid"  "kid"  "kid"  "kid"  "kid"  "kid"  "kid"  "kid" 
[11] "kid"  "kid"  "kid"  "kid"  "kid"  "ctrl" "ctrl" "ctrl" "ctrl" "ctrl"
[21] "ctrl" "ctrl" "ctrl" "ctrl" "ctrl" "ctrl" "ctrl" "ctrl" "ctrl" "ctrl"

There is one covariate, father’s age. Let’s say that these are uniformly distributed between 40 and 70.

father_age <- runif(n = 30, min = 40, max = 70)
hist(father_age)

And there is one dependent (outcome) measure that is the focus, “How old do you feel right now?”, with five options: very young, young, neither young nor old, old, and very old. Let’s give these ratings numbers between 1 and 5, and let’s say that the average rating for the “kid” group was 3.75 and the average rating for the “ctrl” group was 3. We’ll assume that each of the groups ratings are normally distributed around these two means.

old_feeling <- c(rnorm(15, mean = 3.75, sd = 0.5), rnorm(15, mean = 3, sd = 0.5))
hist(old_feeling)

But of course since respondents only give integer (1, 2, 3, 4, 5) ratings, we have to change the continuous scores to integer values.

old_feeling <- round(old_feeling, 0)
hist(old_feeling, main = "Rounded `old_feeling`", breaks = c(1, 2, 3, 4, 5))

Visualize data

To visualize the data, we’ll make a data frame. This is a convenient ‘rectangular’ format for our data.

false_pos_study_1 <- data.frame(sub_id = 1:30, song, father_age, old_feeling)
str(false_pos_study_1)
'data.frame':   30 obs. of  4 variables:
 $ sub_id     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ song       : chr  "kid" "kid" "kid" "kid" ...
 $ father_age : num  48.5 55.3 68.7 50.2 45.2 ...
 $ old_feeling: num  3 4 4 5 4 4 5 4 4 3 ...

Now, let’s plot this to show how analysis of covariance (ANCOVA) might work.

Note

My graduate statistics professor, Dr. Mike Meyer, always exhorted us to “plot our data!” That is, he thought we should always plot first, then run our statistics later. I’ve taken that advice to heart in my own work.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
false_pos_study_1 |>
  ggplot() +
  aes(x = father_age, y = old_feeling, color = song) +
  geom_point() +
  geom_smooth(method = lm) # linear fit
`geom_smooth()` using formula = 'y ~ x'

Figure 1: Illustration of a simulated ANCOVA based on (Simmons et al., 2011)

This plot adds two lines to the data, one for each song group. The ANCOVA will let us evaluate whether the old_feeling is predicted by a linear combination of father_age and song.

In the code below, notice that we set formula = old_feeling ~ song + father_age. This is how we tell the aov() command that the linear combination of song and father_age are the variables we want to use to predict old_feeling.

glm_ancova <- aov(formula = old_feeling ~ song + father_age, data = false_pos_study_1)
summary(glm_ancova)
            Df Sum Sq Mean Sq F value  Pr(>F)    
song         1  8.533   8.533  34.576 2.9e-06 ***
father_age   1  0.003   0.003   0.012   0.913    
Residuals   27  6.664   0.247                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, let’s imagine that father_age wasn’t entirely random for some reason.

father_age <- (69:40) + rnorm(30, 0, 1)
plot(1:30, father_age)

hist(father_age)

false_pos_study_1a <- data.frame(sub_id = 1:30, song, father_age, old_feeling)

false_pos_study_1a |>
  ggplot() +
  aes(x = father_age, y = old_feeling, color = song) +
  geom_point() +
  geom_smooth(method = lm) # linear fit
`geom_smooth()` using formula = 'y ~ x'

glm_ancova <- aov(formula = old_feeling ~ song + father_age, data = false_pos_study_1a)
summary(glm_ancova)
            Df Sum Sq Mean Sq F value   Pr(>F)    
song         1  8.533   8.533  34.827 2.74e-06 ***
father_age   1  0.051   0.051   0.209    0.652    
Residuals   27  6.616   0.245                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Or, maybe worse, that how old participants’ felt related to their fathers’ age.

old_feeling <- (father_age-40)/(70-40) + rnorm(n=30, 0, 0.25)
old_feeling <- round(old_feeling/(max(old_feeling)-min(old_feeling))*4 + 1, 0)

false_pos_study_1b <- data.frame(sub_id = 1:30, song, father_age, old_feeling)

false_pos_study_1b |>
  ggplot() +
  aes(x = father_age, y = old_feeling, color = song) +
  geom_point() +
  geom_smooth(method = lm) # linear fit
`geom_smooth()` using formula = 'y ~ x'

glm_ancova <- aov(formula = old_feeling ~ song + father_age, data = false_pos_study_1b)
summary(glm_ancova)
            Df Sum Sq Mean Sq F value   Pr(>F)    
song         1 17.633  17.633  31.269 6.24e-06 ***
father_age   1  1.307   1.307   2.318     0.14    
Residuals   27 15.226   0.564                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

References

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632