Negligence

In the news

Youyou, Yang, & Uzzi (2023)

Figure 70: Youyou et al. (2023)

Conjecture about the weak replicability in social sciences has made scholars eager to quantify the scale and scope of replication failure for a discipline. Yet small-scale manual replication methods alone are ill-suited to deal with this big data problem. Here, we conduct a discipline-wide replication census in science. Our sample (N = 14,126 papers) covers nearly all papers published in the six top-tier Psychology journals over the past 20 y. Using a validated machine learning model that estimates a paper’s likelihood of replication, we found evidence that both supports and refutes speculations drawn from a relatively small sample of manual replications. First, we find that a single overall replication rate of Psychology poorly captures the varying degree of replicability among subfields. Second, we find that replication rates are strongly correlated with research methods in all subfields. Experiments replicate at a significantly lower rate than do non-experimental studies. Third, we find that authors’ cumulative publication number and citation impact are positively related to the likelihood of replication, while other proxies of research quality and rigor, such as an author’s university prestige and a paper’s citations, are unrelated to replicability. Finally, contrary to the ideal that media attention should cover replicable research, we find that media attention is positively related to the likelihood of replication failure. Our assessments of the scale and scope of replicability are important next steps toward broadly resolving issues of replicability.

– Youyou et al. (2023)

Roadmap

Discuss
- (Ritchie, 2020), Chapter 5
- Nuijten et al. (2015)
- Szucs & Ioannidis (2017)
Discuss
- Assignment Exercise 03: Alpha, Power, Effect Sizes, & Sample Size, due Thursday, March 16
Work session
- proposals, due Thursday, March 2.

Types of negligence

Definitions of

negligent

Figure 71: negligence from Mac OS dictionary app

Data mistakes

e.g., Reinhart & Rogoff spreadsheet

Statistical reporting errors

e.g., Nuijten et al. (2015))

“This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.”

– Nuijten et al. (2015)

statcheck

https://michelenuijten.shinyapps.io/statcheck-web/

Granularity-Releated Inconsistency of Means (GRIM)

Brown & Heathers (2017)

“We present a simple mathematical technique that we call granularity-related inconsistency of means (GRIM) for verifying the summary statistics of research reports in psychology. This technique evaluates whether the reported means of integer data such as Likert-type scales are consistent with the given sample size and number of items. We tested this technique with a sample of 260 recent empirical articles in leading journals. Of the articles that we could test with the GRIM technique (N = 71), around half (N = 36) appeared to contain at least one inconsistent mean, and more than 20% (N = 16) contained multiple such inconsistencies. We requested the data sets corresponding to 21 of these articles, receiving positive responses in 9 cases. We confirmed the presence of at least one reporting error in all cases, with three articles requiring extensive corrections. The implications for the reliability and replicability of empirical psychology are discussed.”

– Brown & Heathers (2017)

A possible final project might involve assessing some retracted papers using either statcheck or GRIMM. It would be interesting to see whether the tools could have detected problems in advance of publication.

Inadequate power

Power: If there is an effect, what’s the probability my test/decision procedure will detect it (avoid a false negative).
If \(\beta\) is \(p\)(false negative), then power is \(1-\beta\).
Sample size and alpha (\(\alpha\)) or \(p\)(false positive) affect power, as does the actual (unknown in advance) effect size (\(d\)).
Conventions for categorizing effect sizes: small (\(d\) = 0.2), medium (\(d\) = 0.5), and large (\(d\) = 0.8).

“We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically significant results and D = 0.24 (0.11–0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.”

– Szucs & Ioannidis (2017)

Figure 3 from [@Szucs2017-fc](http://dx.doi.org/10.1371/journal.pbio.2000797)

Figure 72: Figure 3 from Szucs & Ioannidis (2017)

Discussion of Exercise 03: Alpha, Power, Effect Sizes, & Sample Size

Goal
- To gain a better understanding of how these concepts relate to one another and affect statistical decision-making.
App
- https://rogilmore.shinyapps.io/PSYCH490-2023-APES/

Next time…

Hype
- (Ritchie, 2020), Chapter 6
- Carney, Cuddy, & Yap (2010)
- (Optional) Ranehill et al. (2015)
Watch
- Cuddy (2012)
Due
- Final project proposal

References

Brown, N. J. L., & Heathers, J. A. J. (2017). The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science, 8(4), 363–369. https://doi.org/10.1177/1948550616673876

Carlisle, J. B. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia, 72(8), 944–952. https://doi.org/10.1111/anae.13938

Carlisle, J. B. (2018). Seeking and reporting apparent research misconduct: Errors and integrity - a reply. Anaesthesia, 73(1), 126–128. https://doi.org/10.1111/anae.14148

Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363–1368. https://doi.org/10.1177/0956797610383437

Cuddy, A. (2012). Your body language may shape who you are. Retrieved from https://www.ted.com/talks/amy_cuddy_your_body_language_may_shape_who_you_are

Kharasch, E. D., & Houle, T. T. (2018). Seeking and reporting apparent research misconduct: Errors and integrity. Anaesthesia, 73(1), 125–126. https://doi.org/10.1111/anae.14147

Nuijten, M. B., Hartgerink, C. H. J., Assen, M. A. L. M. van, Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. https://doi.org/10.3758/s13428-015-0664-2

Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the robustness of power posing: No effect on hormones and risk tolerance in a large sample of men and women. Psychological Science, 26(5), 653–656. https://doi.org/10.1177/0956797614553946

Ritchie, S. (2020). Science fictions: Exposing fraud, bias, negligence and hype in science (1st ed.). Penguin Random House. Retrieved from https://www.amazon.com/Science-Fictions/dp/1847925669

Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15(3), e2000797. https://doi.org/10.1371/journal.pbio.2000797

Youyou, W., Yang, Y., & Uzzi, B. (2023). A discipline-wide investigation of the replicability of psychology papers over the past two decades. Proceedings of the National Academy of Sciences of the United States of America, 120(6), e2208863120. https://doi.org/10.1073/pnas.2208863120