Negligence

2024-10-21 Mon

Rick Gilmore

Prelude

Thought question

  • What would happen to the ‘file drawer effect’ if we simulated a large effect size with the same small sample size (n)?
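One way to explore this question is to rerun that kind of simulation with a large true effect plugged in and count how many studies would still land in the file drawer. A minimal Python sketch, where the per-group n of 10, the effect size of 0.8, and the number of simulated studies are illustrative assumptions:

```python
# Hypothetical sketch: simulate many small-n two-group studies with a LARGE
# true effect and count how many would be nonsignificant (the "file drawer").
# The n, effect size, and number of studies are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group = 10      # same small sample size as before
d = 0.8               # large true effect (Cohen's d)
n_studies = 10_000

p_values = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(d, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    p_values.append(p)

p_values = np.array(p_values)
print(f"Significant (published?):      {np.mean(p_values < 0.05):.2f}")
print(f"Nonsignificant (file drawer?): {np.mean(p_values >= 0.05):.2f}")
```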


Tend to the small things. More people are defeated by blisters than by mountains.

You choose to be lucky by believing that any setbacks are just temporary.

Kelly (2023), reviewed in

https://www.outsideonline.com/culture/love-humor/excellent-advice-for-living-kevin-kelly/

Overview

Announcements

Last time…

File drawer effect

Important

What is the file drawer effect?

Is the file drawer effect a problem? Why or why not?

If it’s a problem, what’s a solution?

Today

Negligence

  • Discuss
    • Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts (2015)
    • Szucs & Ioannidis (2017)

Types of negligence

Definitions of negligence (from the macOS Dictionary app)

Data mistakes

Alexander (2013)

Statistical reporting errors

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period.

Nuijten et al. (2015)

In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom.

Nuijten et al. (2015)

One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined.

Nuijten et al. (2015)

The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results.

Nuijten et al. (2015)

Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

Nuijten et al. (2015)

statcheck

https://michelenuijten.shinyapps.io/statcheck-web/
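statcheck itself is an R package, but the core check it automates is easy to state: recompute the p-value from the reported test statistic and degrees of freedom, and see whether it matches the reported p-value at the reported precision. A minimal Python sketch of that idea (the reported t, df, and p below are invented for illustration; the real statcheck also handles F, r, \(\chi^2\), and z tests, inequality signs, and one-tailed reporting):

```python
# Hypothetical sketch of the consistency check statcheck automates:
# recompute p from the reported test statistic and df, then compare
# at the precision the paper reports. The "reported" values are invented.
from scipy import stats

def check_t_report(t, df, reported_p, decimals=2, two_tailed=True):
    """Recompute p from a reported t and df; consistent if it rounds to the reported p."""
    p = stats.t.sf(abs(t), df) * (2 if two_tailed else 1)
    return p, round(p, decimals) == round(reported_p, decimals)

# e.g., a paper reports t(28) = 2.20, p = .04 -- consistent
p, ok = check_t_report(t=2.20, df=28, reported_p=0.04)
print(f"recomputed p = {p:.3f}; consistent with report: {ok}")

# a grossly inconsistent report: t(58) = 1.50 cannot give p = .03
p, ok = check_t_report(t=1.50, df=58, reported_p=0.03)
print(f"recomputed p = {p:.3f}; consistent with report: {ok}")
```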

How the GRIM test works

  • n=10 people fill out a Likert scale survey question with permissible values of 1, 2, or 3.
  • Is a mean score of 2.10 possible?
  • How about 2.15? Why? (See the GRIM-style check sketched below.)
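The logic behind these questions is the granularity check that the GRIM test formalizes: with n integer-valued responses, the mean must equal some integer total divided by n. A minimal sketch of that check (the function name and the two-decimal rounding assumption are mine, not Brown & Heathers'):

```python
# Minimal GRIM-style granularity check: a mean reported to `decimals` places
# is consistent only if some integer total of n responses rounds to it.
import math

def grim_consistent(reported_mean, n, decimals=2):
    """Is reported_mean achievable as the rounded mean of n integer-valued responses?"""
    target = round(reported_mean, decimals)
    # The true total must be an integer near reported_mean * n; check both neighbors.
    for total in (math.floor(reported_mean * n), math.ceil(reported_mean * n)):
        if round(total / n, decimals) == target:
            return True
    return False

print(grim_consistent(2.10, n=10))  # True: a total of 21 gives exactly 2.10
print(grim_consistent(2.15, n=10))  # False: no integer total / 10 rounds to 2.15
```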

We tested this technique with a sample of 260 recent empirical articles in leading journals. Of the articles that we could test with the GRIM technique (N = 71), around half (N = 36) appeared to contain at least one inconsistent mean, and more than 20% (N = 16) contained multiple such inconsistencies.

Brown & Heathers (2017)

We requested the data sets corresponding to 21 of these articles, receiving positive responses in 9 cases. We confirmed the presence of at least one reporting error in all cases, with three articles requiring extensive corrections. The implications for the reliability and replicability of empirical psychology are discussed.

Brown & Heathers (2017)

Note

How do these kinds of errors arise?

What practices could researchers adopt to address the problems identified by Brown & Heathers (2017) and Nuijten et al. (2015)?

Inadequate power

  • Power: If there is an effect, what's the probability that my test/decision procedure will detect it (i.e., avoid a false negative)?
  • If \(\beta\) is \(p\)(false negative), then power is \(1-\beta\).
  • Sample size and alpha (\(\alpha\)) or \(p\)(false positive) affect power, as does the actual (unknown in advance) effect size (\(d\)).
  • Conventions for categorizing effect sizes: small (\(d\) = 0.2), medium (\(d\) = 0.5), and large (\(d\) = 0.8); a power calculation using these values is sketched below.
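To make these relationships concrete, power for a two-sided, two-sample t-test can be computed directly from the noncentral t distribution given the per-group n, \(\alpha\), and an assumed \(d\). A minimal sketch (the per-group n of 20 is an illustrative assumption):

```python
# Power of a two-sided, two-sample t-test, computed from the noncentral t
# distribution. Effect sizes follow the small/medium/large conventions above;
# the per-group n of 20 is an illustrative assumption.
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided independent-samples t-test."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)         # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-sided critical value
    # Probability of landing in either rejection region under the alternative
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label:6s} d = {d}: power = {two_sample_power(d, n_per_group=20):.2f}")
```

In practice you would usually reach for a dedicated tool (e.g., G*Power, or the pwr package in R), but the direct computation makes it clear that power depends only on n, \(\alpha\), and the true \(d\).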

We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically significant results and D = 0.24 (0.11–0.42) for nonsignificant results.

Szucs & Ioannidis (2017)

Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology.

Szucs & Ioannidis (2017)

d vs. observed power (Szucs & Ioannidis, 2017)

Effect size | \(d\) | Median observed power | \(\beta\)
------------|-------|-----------------------|----------
Small       | 0.2   | 0.12                  | 0.88
Medium      | 0.5   | 0.44                  | 0.56
Large       | 0.8   | 0.73                  | 0.27

Important

Remember: \(\beta\) = \(p\)(false negative)

Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.

Szucs & Ioannidis (2017)
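The claim about false report probability can be unpacked with the Ioannidis-style calculation behind it: if \(\pi\) is the prior probability that a tested effect is real, the positive predictive value of a significant result is \(\mathrm{PPV} = \frac{(1-\beta)\pi}{(1-\beta)\pi + \alpha(1-\pi)}\), and the false report probability is \(1 - \mathrm{PPV}\). A sketch using the median power values from the table above and an illustrative range of priors:

```python
# False report probability (1 - PPV) for a significant finding.
# Power values are the medians reported above; the priors are
# illustrative assumptions, not values from the paper.
def false_report_probability(power, prior, alpha=0.05):
    """P(H0 true | significant result), given power, prior P(H1), and alpha."""
    ppv = (power * prior) / (power * prior + alpha * (1 - prior))
    return 1 - ppv

for prior in (0.1, 0.25, 0.5):
    for label, power in [("small", 0.12), ("medium", 0.44), ("large", 0.73)]:
        frp = false_report_probability(power, prior)
        print(f"prior = {prior:.2f}, {label:6s} effect (power {power}): FRP = {frp:.2f}")
```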

Figure 3 from Szucs & Ioannidis (2017)

Reading (Figure 3 from Szucs & Ioannidis, 2017)

  • Horizontal axis: Fraction of studies, so median can be found at 0.5 (half above/half below).
  • Vertical axis: A. Degrees of freedom (~ sample size); B. Power \((1-\beta)\)

Note

Why do so many studies have such low power?

If power is low, what should researchers do going forward to increase it? Why might increasing power be difficult?

A new mantra

  • Plan your study (to have adequate power)!
  • Plot your data!
  • Script your analyses!
  • Publish your results, especially null findings!

Next time

Hype

Resources

References

Alexander, R. (2013). Reinhart, Rogoff...and Herndon: The student who caught out the profs. BBC News. Retrieved from https://www.bbc.com/news/magazine-22223190
Brown, N. J. L., & Heathers, J. A. J. (2017). The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science, 8(4), 363–369. https://doi.org/10.1177/1948550616673876
Carlisle, J. B. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia, 72(8), 944–952. https://doi.org/10.1111/anae.13938
Carlisle, J. B. (2018). Seeking and reporting apparent research misconduct: Errors and integrity - a reply. Anaesthesia, 73(1), 126–128. https://doi.org/10.1111/anae.14148
Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363–1368. https://doi.org/10.1177/0956797610383437
Kelly, K. (2023). Excellent advice for living: Wisdom I wish I’d known earlier. New York, NY: Viking Press. Retrieved from https://www.penguinrandomhouse.com/books/725357/excellent-advice-for-living-by-kevin-kelly/
Kharasch, E. D., & Houle, T. T. (2018). Seeking and reporting apparent research misconduct: Errors and integrity. Anaesthesia, 73(1), 125–126. https://doi.org/10.1111/anae.14147
Nuijten, M. B., Hartgerink, C. H. J., Assen, M. A. L. M. van, Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. https://doi.org/10.3758/s13428-015-0664-2
Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the robustness of power posing: No effect on hormones and risk tolerance in a large sample of men and women. Psychological Science, 26(5), 653–656. https://doi.org/10.1177/0956797614553946
Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15(3), e2000797. https://doi.org/10.1371/journal.pbio.2000797