Visualizing ‘Many Analysts’ data

Modified

July 29, 2024

Purpose

This document generates some visualizations of the data described in (R. Silberzahn et al., 2018).

library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Gathering

An OSF project hosts the data at: https://osf.io/fv8c3. The parent OSF project is (R. Silberzahn et al., 2023).

Gilmore likes to make these workflows automated using specialized packages, so he downloaded the osfr package from CRAN, see also (Wolen et al., 2020).

Since the file does not change often and has not changed recently, we set eval = params$download_data in this document’s YAML header so that we could update the file in the future by rendering the document with params$download_data set to TRUE.

if (params$download_data) {
  message("Re-downloading data from 'https://osf.io/fv8c3'.")
} else {
  message("`params$upload_data` is FALSE. Data not downloaded again.")
  message("The following chunk is shown for completeness, but is skipped here.")
}
`params$upload_data` is FALSE. Data not downloaded again.
The following chunk is shown for completeness, but is skipped here.
osf_silberzahn_node = "https://osf.io/fv8c3"

# Save file to temp directory
this_file <-
  osfr::osf_retrieve_file(basename(osf_silberzahn_node)) |>
  osfr::osf_download(path = tempdir(),
                     progress = TRUE,
                     conflicts = "overwrite")

# Extract to target directory
if (xfun::file_ext(this_file$name) == "zip") {
  utils::unzip(this_file$local_path,
               exdir = paste0(here::here(), "/src/surveys/csv"))
}

The data is saved as a CSV in /src/surveys/csv/CrowdstormingDataJuly1st.csv.

red_cards <-
  readr::read_csv(
    paste0(
      here::here(),
      "/src/surveys/csv/CrowdstormingDataJuly1st.csv"
    ),
    show_col_types = FALSE
  )

str(red_cards)
spc_tbl_ [146,028 × 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ playerShort  : chr [1:146028] "lucas-wilchez" "john-utaka" "abdon-prats" "pablo-mari" ...
 $ player       : chr [1:146028] "Lucas Wilchez" "John Utaka" "Abdón Prats" "Pablo Marí" ...
 $ club         : chr [1:146028] "Real Zaragoza" "Montpellier HSC" "RCD Mallorca" "RCD Mallorca" ...
 $ leagueCountry: chr [1:146028] "Spain" "France" "Spain" "Spain" ...
 $ birthday     : chr [1:146028] "31.08.1983" "08.01.1982" "17.12.1992" "31.08.1993" ...
 $ height       : num [1:146028] 177 179 181 191 172 182 187 180 193 180 ...
 $ weight       : num [1:146028] 72 82 79 87 70 71 80 68 80 70 ...
 $ position     : chr [1:146028] "Attacking Midfielder" "Right Winger" NA "Center Back" ...
 $ games        : num [1:146028] 1 1 1 1 1 1 1 1 1 1 ...
 $ victories    : num [1:146028] 0 0 0 1 1 0 1 0 0 1 ...
 $ ties         : num [1:146028] 0 0 1 0 0 0 0 0 1 0 ...
 $ defeats      : num [1:146028] 1 1 0 0 0 1 0 1 0 0 ...
 $ goals        : num [1:146028] 0 0 0 0 0 0 0 0 0 0 ...
 $ yellowCards  : num [1:146028] 0 1 1 0 0 0 0 0 0 0 ...
 $ yellowReds   : num [1:146028] 0 0 0 0 0 0 0 0 0 0 ...
 $ redCards     : num [1:146028] 0 0 0 0 0 0 0 0 0 0 ...
 $ photoID      : chr [1:146028] "95212.jpg" "1663.jpg" NA NA ...
 $ rater1       : num [1:146028] 0.25 0.75 NA NA NA 0.25 0 1 0.25 0 ...
 $ rater2       : num [1:146028] 0.5 0.75 NA NA NA 0 0.25 1 0.25 0 ...
 $ refNum       : num [1:146028] 1 2 3 3 3 4 4 4 4 4 ...
 $ refCountry   : num [1:146028] 1 2 3 3 3 4 4 4 4 4 ...
 $ Alpha_3      : chr [1:146028] "GRC" "ZMB" "ESP" "ESP" ...
 $ meanIAT      : num [1:146028] 0.326 0.203 0.37 0.37 0.37 ...
 $ nIAT         : num [1:146028] 712 40 1785 1785 1785 ...
 $ seIAT        : num [1:146028] 0.000564 0.010875 0.000229 0.000229 0.000229 ...
 $ meanExp      : num [1:146028] 0.396 -0.204 0.588 0.588 0.588 ...
 $ nExp         : num [1:146028] 750 49 1897 1897 1897 ...
 $ seExp        : num [1:146028] 0.0027 0.0615 0.001 0.001 0.001 ...
 - attr(*, "spec")=
  .. cols(
  ..   playerShort = col_character(),
  ..   player = col_character(),
  ..   club = col_character(),
  ..   leagueCountry = col_character(),
  ..   birthday = col_character(),
  ..   height = col_double(),
  ..   weight = col_double(),
  ..   position = col_character(),
  ..   games = col_double(),
  ..   victories = col_double(),
  ..   ties = col_double(),
  ..   defeats = col_double(),
  ..   goals = col_double(),
  ..   yellowCards = col_double(),
  ..   yellowReds = col_double(),
  ..   redCards = col_double(),
  ..   photoID = col_character(),
  ..   rater1 = col_double(),
  ..   rater2 = col_double(),
  ..   refNum = col_double(),
  ..   refCountry = col_double(),
  ..   Alpha_3 = col_character(),
  ..   meanIAT = col_double(),
  ..   nIAT = col_double(),
  ..   seIAT = col_double(),
  ..   meanExp = col_double(),
  ..   nExp = col_double(),
  ..   seExp = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Exploration

Let’s see if we can understand more about the data.

Each row appears to include the number of yellow and red cards awarded to a given player by a given referee, presumably within the sample period of the study.

Cards per specific player/ref

red_cards |>
  ggplot() +
  aes(x = yellowCards) +
  geom_histogram(fill = "yellow") +
  ggtitle("Distribution of yellow cards given specific player*ref interaction")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 1

Most player by referee interactions resulted in no yellow cards at all, as expected.

red_cards |>
  ggplot() +
  aes(redCards) +
  geom_histogram(fill = "red") +
  ggtitle("Distribution of red cards given specific player*ref interaction") +
  xlab("N red cards")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 2

And there are relatively few red cards, too.

n_red_cards <- red_cards |>
  dplyr::filter(redCards > 0)

xtabs(formula = ~ redCards, data = red_cards)
redCards
     0      1      2 
144219   1784     25 

Specifically, there are \(n=\) 1809 red cards, or \(p=\) 0.012 of all player/referee interactions. So, these are rare events in the database. But there are some interactions with two red cards.

n_red_cards |>
  dplyr::filter(redCards == 2) |>
  dplyr::select(playerShort, refNum) |>
  dplyr::arrange(refNum, playerShort) |>
  knitr::kable(format = 'html') |>
  kableExtra::kable_classic()
playerShort refNum
bodipo 140
antonio-lopez_2 367
javi-martinez 432
jonas 432
phil-jagielka 487
cyril-jeunechamp 586
sergio-ramos 804
aly-cissokho 985
eugen-polanski 1114
emmanuel-adebayor 1214
jose-nunes 1349
jesus-rueda 1574
marc-planus 1696
de-lucas 1886
fallou-diagne 1931
laurent-koscielny 2080
saber-khelifa 2246
ivo-ilicevic 2370
fabien-audard 2796
cristiano-ronaldo 2803
gary-cahill 2822
lee-cattermole 2822
cyril-jeunechamp 2961
mouhamadou-dabo 2961
rafael-van-der-vaart 3099

It doesn’t look like there are three referees who gave two red cards out to more than one player, referees 432, 2822, and 2961.

n_red_cards |>
  dplyr::filter(redCards == 2) |>
  dplyr::select(playerShort, refNum) |>
  dplyr::arrange(playerShort, refNum) |>
  knitr::kable(format = 'html') |>
  kableExtra::kable_classic()
playerShort refNum
aly-cissokho 985
antonio-lopez_2 367
bodipo 140
cristiano-ronaldo 2803
cyril-jeunechamp 586
cyril-jeunechamp 2961
de-lucas 1886
emmanuel-adebayor 1214
eugen-polanski 1114
fabien-audard 2796
fallou-diagne 1931
gary-cahill 2822
ivo-ilicevic 2370
javi-martinez 432
jesus-rueda 1574
jonas 432
jose-nunes 1349
laurent-koscielny 2080
lee-cattermole 2822
marc-planus 1696
mouhamadou-dabo 2961
phil-jagielka 487
rafael-van-der-vaart 3099
saber-khelifa 2246
sergio-ramos 804

And one player, cyril-jeunechamp, who got two red cards from different referees.

Just for fun, let’s see how red and yellow cards relate to one another. We add some jitter and use alpha=.1 to make it easier to see any pattern.

red_cards |>
  ggplot() +
  aes(x = yellowCards, y = redCards) +
  geom_jitter(alpha = .1, width = .2, height = .2) +
  geom_smooth(method = "lm") +
  ggtitle("N yellow vs. N red cards for each player by referee combination") +
  xlab("N yellow cards") +
  ylab("N red cards")
`geom_smooth()` using formula = 'y ~ x'

Number of yellow and red cards for each player/referee pair in data from (R. Silberzahn et al., 2023)

By player

Let’s group by player and see how the number of yellow and red cards appear as a function of player.

by_player <- red_cards |>
  dplyr::group_by(playerShort) |>
  dplyr::summarise(n_refs = n(), n_yellow = sum(yellowCards), n_red = sum(redCards)) |>
  dplyr::mutate(p_yellow = n_yellow/n_refs, p_red = n_red/n_refs)
by_player |>
  ggplot() +
  aes(x = n_refs) +
  geom_histogram() +
  ggtitle("Distribution of referee interactions for each player")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let’s look at these distributions.

by_player |>
  ggplot() +
  aes(x = p_yellow) +
  geom_histogram(fill = "yellow") +
  ggtitle("Proportion of referee interactions with yellow cards")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 3
by_player |>
  ggplot() +
  aes(x = p_red) +
  geom_histogram(fill = "red") +
  ggtitle("Proportion of referee interactions with red cards")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 4

So, there’s something odd going on with p_yellow. This should be in [0,1].

by_player |> dplyr::filter(p_yellow > 1)
# A tibble: 63 × 6
   playerShort   n_refs n_yellow n_red p_yellow  p_red
   <chr>          <int>    <dbl> <dbl>    <dbl>  <dbl>
 1 adam-bodzek       56       66     0     1.18 0     
 2 aduriz            63       68     4     1.08 0.0635
 3 albelda          147      197     6     1.34 0.0408
 4 alberto-botia     45       69     3     1.53 0.0667
 5 alexis            53       77     4     1.45 0.0755
 6 allan-nyom        37       42     0     1.14 0     
 7 amorebieta        85       93     4     1.09 0.0471
 8 apono             43       45     1     1.05 0.0233
 9 arribas_2         38       48     2     1.26 0.0526
10 ballesteros      110      128     5     1.16 0.0455
# ℹ 53 more rows

Wait, a player can get more than one yellow card per referee interactdion. Doh!

Let’s see if the yellow and red cards relate to one another.

by_player |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("Proportion of yellow vs. red cards for each player*referee")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Figure 5

By player adding skin tone

Let’s try to visualize the ratings of skin tone along side this.

by_player_skin <- red_cards |>
  dplyr::group_by(playerShort) |>
  dplyr::summarise(
    n_refs = n(),
    n_yellow = sum(yellowCards),
    n_red = sum(redCards),
    rater1 = mean(rater1),
    rater2 = mean(rater2)
  ) |>
  dplyr::mutate(p_yellow = n_yellow / n_refs, p_red = n_red / n_refs, skin = (rater1 + rater2)/2)
by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_yellow, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "yellow") +
  ggtitle("Distribution of player skin tone and number of yellow cards given")
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 6
by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_yellow) +
  geom_point(color = "yellow") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and number of yellow cards given")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 7
by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_yellow, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "yellow") +
  ggtitle("Distribution of player skin tone and proportion of yellow cards given")
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 8
by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_yellow) +
  geom_point(color = "yellow") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and proportion of yellow cards given")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 9
by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_red, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "red") +
  ggtitle("Distribution of player skin tone and number of red cards given")
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 10
by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_red) +
  geom_point(color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and proportion of red cards given")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 11
by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_red, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "red") +
  ggtitle("Distribution of skin tone and red cards given")
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 12
by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_red) +
  geom_point(color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and proportion of red cards given")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 468 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 13
by_player_skin |>
  ggplot() +
  aes(x = n_yellow, y = n_red, group = 1) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Number yellow vs. (per ref encounter/game) red cards")
Warning: Using size for a discrete variable is not advised.
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 14
by_player_skin |>
  ggplot() +
  aes(x = p_yellow, y = p_red, group = 1) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. (per ref encounter/game) red cards")
Warning: Using size for a discrete variable is not advised.
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 468 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 15

Darker-skinned players

That’s hard to read, so let’s focus on players with skin > .5.

by_player_skin |>
  dplyr::filter(skin > .5) |>
  ggplot() +
  aes(x = n_yellow, y = n_red) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. (per ref encounter/game) red cards for darker-skinned players")
Warning: Using size for a discrete variable is not advised.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Figure 16
by_player_skin |>
  dplyr::filter(skin > .5) |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. (per ref encounter/game) red cards for darker-skinned players")
Warning: Using size for a discrete variable is not advised.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Figure 17

Lighter-skinned players

And for completeness, let’s look at the lighter-skinned players.

by_player_skin |>
  dplyr::filter(skin <= .5) |>
  ggplot() +
  aes(x = n_yellow, y = n_red) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. red cards (per ref encounter/game) for lighter skinned players")
Warning: Using size for a discrete variable is not advised.
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Figure 18
by_player_skin |>
  dplyr::filter(skin <= .5) |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("Yellow vs. red cards (per ref encounter/game) for lighter skinned players")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Figure 19

Referees

Now let’s focus on the referees.

by_ref <- red_cards |>
  dplyr::group_by(refNum) |>
  dplyr::summarise(n_interactions = n(),
    n_yellow = sum(yellowCards),
    n_red = sum(redCards)) |>
  dplyr::mutate(p_yellow = n_yellow / n_interactions, p_red = n_red / n_interactions)
by_ref |>
  ggplot() +
  aes(x = n_interactions) +
  geom_histogram() +
  ggtitle("Player interactions per referee")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by_ref |>
  ggplot() +
  aes(x = n_yellow) +
  geom_histogram(fill = "yellow") +
  ggtitle("N yellow cards given by referees")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 20
by_ref |>
  ggplot() +
  aes(x = p_yellow) +
  geom_histogram(fill = "yellow") +
  ggtitle("Yellow cards per player interaction given by referees")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 21
by_ref |>
  ggplot() +
  aes(x = n_red) +
  geom_histogram(fill = "red")  +
  ggtitle("N red cards given by referees")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 22
by_ref |>
  ggplot() +
  aes(x = p_red) +
  geom_histogram(fill = "red")  +
  ggtitle("Red cards per player interaction given by referees")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 23
by_ref |>
  ggplot() +
  aes(x = n_yellow, y = n_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("N yellow vs. red cards by referee*player interaction")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Figure 24
by_ref |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("Proportion yellow vs. red cards by referee*player interaction")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Figure 25

References

Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E. C., … al., et. (2023, May). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. OSF. Retrieved from osf.io/gvm2z
Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., … Nosek, B. A. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337–356. https://doi.org/10.1177/2515245917747646
Wolen, A. R., Hartgerink, C. H. J., Hafen, R., Richards, B. G., Soderberg, C. K., & York, T. P. (2020). osfr: An R interface to the Open Science Framework. Journal of Open Source Software, 5(46), 2071. https://doi.org/10.21105/joss.02071