Visualizing ‘Many Analysts’ data

Purpose

This document generates some visualizations of the data described in (R. Silberzahn et al., 2018).

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Gathering

An OSF project hosts the data at: https://osf.io/fv8c3. The parent OSF project is (R. Silberzahn et al., 2023).

Gilmore likes to make these workflows automated using specialized packages, so he downloaded the osfr package from CRAN, see also (Wolen et al., 2020).

Since the file does not change often and has not changed recently, we set eval = params$download_data in this document’s YAML header so that we could update the file in the future by rendering the document with params$download_data set to TRUE.

if (params$download_data) {
  message("Re-downloading data from 'https://osf.io/fv8c3'.")
} else {
  message("`params$upload_data` is FALSE. Data not downloaded again.")
  message("The following chunk is shown for completeness, but is skipped here.")
}

`params$upload_data` is FALSE. Data not downloaded again.

The following chunk is shown for completeness, but is skipped here.

osf_silberzahn_node = "https://osf.io/fv8c3"

# Save file to temp directory
this_file <-
  osfr::osf_retrieve_file(basename(osf_silberzahn_node)) |>
  osfr::osf_download(path = tempdir(),
                     progress = TRUE,
                     conflicts = "overwrite")

# Extract to target directory
if (xfun::file_ext(this_file$name) == "zip") {
  utils::unzip(this_file$local_path,
               exdir = paste0(here::here(), "/src/surveys/csv"))
}

The data is saved as a CSV in /src/surveys/csv/CrowdstormingDataJuly1st.csv.

red_cards <-
  readr::read_csv(
    paste0(
      here::here(),
      "/src/surveys/csv/CrowdstormingDataJuly1st.csv"
    ),
    show_col_types = FALSE
  )

str(red_cards)

spc_tbl_ [146,028 × 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ playerShort  : chr [1:146028] "lucas-wilchez" "john-utaka" "abdon-prats" "pablo-mari" ...
 $ player       : chr [1:146028] "Lucas Wilchez" "John Utaka" "Abdón Prats" "Pablo Marí" ...
 $ club         : chr [1:146028] "Real Zaragoza" "Montpellier HSC" "RCD Mallorca" "RCD Mallorca" ...
 $ leagueCountry: chr [1:146028] "Spain" "France" "Spain" "Spain" ...
 $ birthday     : chr [1:146028] "31.08.1983" "08.01.1982" "17.12.1992" "31.08.1993" ...
 $ height       : num [1:146028] 177 179 181 191 172 182 187 180 193 180 ...
 $ weight       : num [1:146028] 72 82 79 87 70 71 80 68 80 70 ...
 $ position     : chr [1:146028] "Attacking Midfielder" "Right Winger" NA "Center Back" ...
 $ games        : num [1:146028] 1 1 1 1 1 1 1 1 1 1 ...
 $ victories    : num [1:146028] 0 0 0 1 1 0 1 0 0 1 ...
 $ ties         : num [1:146028] 0 0 1 0 0 0 0 0 1 0 ...
 $ defeats      : num [1:146028] 1 1 0 0 0 1 0 1 0 0 ...
 $ goals        : num [1:146028] 0 0 0 0 0 0 0 0 0 0 ...
 $ yellowCards  : num [1:146028] 0 1 1 0 0 0 0 0 0 0 ...
 $ yellowReds   : num [1:146028] 0 0 0 0 0 0 0 0 0 0 ...
 $ redCards     : num [1:146028] 0 0 0 0 0 0 0 0 0 0 ...
 $ photoID      : chr [1:146028] "95212.jpg" "1663.jpg" NA NA ...
 $ rater1       : num [1:146028] 0.25 0.75 NA NA NA 0.25 0 1 0.25 0 ...
 $ rater2       : num [1:146028] 0.5 0.75 NA NA NA 0 0.25 1 0.25 0 ...
 $ refNum       : num [1:146028] 1 2 3 3 3 4 4 4 4 4 ...
 $ refCountry   : num [1:146028] 1 2 3 3 3 4 4 4 4 4 ...
 $ Alpha_3      : chr [1:146028] "GRC" "ZMB" "ESP" "ESP" ...
 $ meanIAT      : num [1:146028] 0.326 0.203 0.37 0.37 0.37 ...
 $ nIAT         : num [1:146028] 712 40 1785 1785 1785 ...
 $ seIAT        : num [1:146028] 0.000564 0.010875 0.000229 0.000229 0.000229 ...
 $ meanExp      : num [1:146028] 0.396 -0.204 0.588 0.588 0.588 ...
 $ nExp         : num [1:146028] 750 49 1897 1897 1897 ...
 $ seExp        : num [1:146028] 0.0027 0.0615 0.001 0.001 0.001 ...
 - attr(*, "spec")=
  .. cols(
  ..   playerShort = col_character(),
  ..   player = col_character(),
  ..   club = col_character(),
  ..   leagueCountry = col_character(),
  ..   birthday = col_character(),
  ..   height = col_double(),
  ..   weight = col_double(),
  ..   position = col_character(),
  ..   games = col_double(),
  ..   victories = col_double(),
  ..   ties = col_double(),
  ..   defeats = col_double(),
  ..   goals = col_double(),
  ..   yellowCards = col_double(),
  ..   yellowReds = col_double(),
  ..   redCards = col_double(),
  ..   photoID = col_character(),
  ..   rater1 = col_double(),
  ..   rater2 = col_double(),
  ..   refNum = col_double(),
  ..   refCountry = col_double(),
  ..   Alpha_3 = col_character(),
  ..   meanIAT = col_double(),
  ..   nIAT = col_double(),
  ..   seIAT = col_double(),
  ..   meanExp = col_double(),
  ..   nExp = col_double(),
  ..   seExp = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Exploration

Let’s see if we can understand more about the data.

Each row appears to include the number of yellow and red cards awarded to a given player by a given referee, presumably within the sample period of the study.

Cards per specific player/ref

red_cards |>
  ggplot() +
  aes(x = yellowCards) +
  geom_histogram(fill = "yellow") +
  ggtitle("Distribution of yellow cards given specific player*ref interaction")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most player by referee interactions resulted in no yellow cards at all, as expected.

red_cards |>
  ggplot() +
  aes(redCards) +
  geom_histogram(fill = "red") +
  ggtitle("Distribution of red cards given specific player*ref interaction") +
  xlab("N red cards")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

And there are relatively few red cards, too.

n_red_cards <- red_cards |>
  dplyr::filter(redCards > 0)

xtabs(formula = ~ redCards, data = red_cards)

redCards
     0      1      2 
144219   1784     25

Specifically, there are $n=$ 1809 red cards, or $p=$ 0.012 of all player/referee interactions. So, these are rare events in the database. But there are some interactions with two red cards.

n_red_cards |>
  dplyr::filter(redCards == 2) |>
  dplyr::select(playerShort, refNum) |>
  dplyr::arrange(refNum, playerShort) |>
  knitr::kable(format = 'html') |>
  kableExtra::kable_classic()

playerShort	refNum
bodipo	140
antonio-lopez_2	367
javi-martinez	432
jonas	432
phil-jagielka	487
cyril-jeunechamp	586
sergio-ramos	804
aly-cissokho	985
eugen-polanski	1114
emmanuel-adebayor	1214
jose-nunes	1349
jesus-rueda	1574
marc-planus	1696
de-lucas	1886
fallou-diagne	1931
laurent-koscielny	2080
saber-khelifa	2246
ivo-ilicevic	2370
fabien-audard	2796
cristiano-ronaldo	2803
gary-cahill	2822
lee-cattermole	2822
cyril-jeunechamp	2961
mouhamadou-dabo	2961
rafael-van-der-vaart	3099

It doesn’t look like there are three referees who gave two red cards out to more than one player, referees 432, 2822, and 2961.

n_red_cards |>
  dplyr::filter(redCards == 2) |>
  dplyr::select(playerShort, refNum) |>
  dplyr::arrange(playerShort, refNum) |>
  knitr::kable(format = 'html') |>
  kableExtra::kable_classic()

playerShort	refNum
aly-cissokho	985
antonio-lopez_2	367
bodipo	140
cristiano-ronaldo	2803
cyril-jeunechamp	586
cyril-jeunechamp	2961
de-lucas	1886
emmanuel-adebayor	1214
eugen-polanski	1114
fabien-audard	2796
fallou-diagne	1931
gary-cahill	2822
ivo-ilicevic	2370
javi-martinez	432
jesus-rueda	1574
jonas	432
jose-nunes	1349
laurent-koscielny	2080
lee-cattermole	2822
marc-planus	1696
mouhamadou-dabo	2961
phil-jagielka	487
rafael-van-der-vaart	3099
saber-khelifa	2246
sergio-ramos	804

And one player, cyril-jeunechamp, who got two red cards from different referees.

Just for fun, let’s see how red and yellow cards relate to one another. We add some jitter and use alpha=.1 to make it easier to see any pattern.

red_cards |>
  ggplot() +
  aes(x = yellowCards, y = redCards) +
  geom_jitter(alpha = .1, width = .2, height = .2) +
  geom_smooth(method = "lm") +
  ggtitle("N yellow vs. N red cards for each player by referee combination") +
  xlab("N yellow cards") +
  ylab("N red cards")

`geom_smooth()` using formula = 'y ~ x'

Number of yellow and red cards for each player/referee pair in data from (R. Silberzahn et al., 2023)

By player

Let’s group by player and see how the number of yellow and red cards appear as a function of player.

by_player <- red_cards |>
  dplyr::group_by(playerShort) |>
  dplyr::summarise(n_refs = n(), n_yellow = sum(yellowCards), n_red = sum(redCards)) |>
  dplyr::mutate(p_yellow = n_yellow/n_refs, p_red = n_red/n_refs)

by_player |>
  ggplot() +
  aes(x = n_refs) +
  geom_histogram() +
  ggtitle("Distribution of referee interactions for each player")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let’s look at these distributions.

by_player |>
  ggplot() +
  aes(x = p_yellow) +
  geom_histogram(fill = "yellow") +
  ggtitle("Proportion of referee interactions with yellow cards")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by_player |>
  ggplot() +
  aes(x = p_red) +
  geom_histogram(fill = "red") +
  ggtitle("Proportion of referee interactions with red cards")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So, there’s something odd going on with p_yellow. This should be in [0,1].

by_player |> dplyr::filter(p_yellow > 1)

# A tibble: 63 × 6
   playerShort   n_refs n_yellow n_red p_yellow  p_red
   <chr>          <int>    <dbl> <dbl>    <dbl>  <dbl>
 1 adam-bodzek       56       66     0     1.18 0     
 2 aduriz            63       68     4     1.08 0.0635
 3 albelda          147      197     6     1.34 0.0408
 4 alberto-botia     45       69     3     1.53 0.0667
 5 alexis            53       77     4     1.45 0.0755
 6 allan-nyom        37       42     0     1.14 0     
 7 amorebieta        85       93     4     1.09 0.0471
 8 apono             43       45     1     1.05 0.0233
 9 arribas_2         38       48     2     1.26 0.0526
10 ballesteros      110      128     5     1.16 0.0455
# ℹ 53 more rows

Wait, a player can get more than one yellow card per referee interactdion. Doh!

Let’s see if the yellow and red cards relate to one another.

by_player |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("Proportion of yellow vs. red cards for each player*referee")

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

By player adding skin tone

Let’s try to visualize the ratings of skin tone along side this.

by_player_skin <- red_cards |>
  dplyr::group_by(playerShort) |>
  dplyr::summarise(
    n_refs = n(),
    n_yellow = sum(yellowCards),
    n_red = sum(redCards),
    rater1 = mean(rater1),
    rater2 = mean(rater2)
  ) |>
  dplyr::mutate(p_yellow = n_yellow / n_refs, p_red = n_red / n_refs, skin = (rater1 + rater2)/2)

by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_yellow, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "yellow") +
  ggtitle("Distribution of player skin tone and number of yellow cards given")

Warning: Removed 468 rows containing non-finite values (`stat_ydensity()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_yellow) +
  geom_point(color = "yellow") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and number of yellow cards given")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 468 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_yellow, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "yellow") +
  ggtitle("Distribution of player skin tone and proportion of yellow cards given")

Warning: Removed 468 rows containing non-finite values (`stat_ydensity()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_yellow) +
  geom_point(color = "yellow") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and proportion of yellow cards given")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 468 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_red, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "red") +
  ggtitle("Distribution of player skin tone and number of red cards given")

Warning: Removed 468 rows containing non-finite values (`stat_ydensity()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = skin, y = n_red) +
  geom_point(color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and proportion of red cards given")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 468 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_red, group = as.factor(skin)) +
  geom_violin() +
  geom_point(color = "red") +
  ggtitle("Distribution of skin tone and red cards given")

Warning: Removed 468 rows containing non-finite values (`stat_ydensity()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = skin, y = p_red) +
  geom_point(color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("Distribution of player skin tone and proportion of red cards given")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 468 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = n_yellow, y = n_red, group = 1) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Number yellow vs. (per ref encounter/game) red cards")

Warning: Using size for a discrete variable is not advised.

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 468 rows containing missing values (`geom_point()`).

by_player_skin |>
  ggplot() +
  aes(x = p_yellow, y = p_red, group = 1) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. (per ref encounter/game) red cards")

Warning: Using size for a discrete variable is not advised.

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Warning: Removed 468 rows containing missing values (`geom_point()`).

Darker-skinned players

That’s hard to read, so let’s focus on players with skin > .5.

by_player_skin |>
  dplyr::filter(skin > .5) |>
  ggplot() +
  aes(x = n_yellow, y = n_red) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. (per ref encounter/game) red cards for darker-skinned players")

Warning: Using size for a discrete variable is not advised.

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

by_player_skin |>
  dplyr::filter(skin > .5) |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. (per ref encounter/game) red cards for darker-skinned players")

Warning: Using size for a discrete variable is not advised.

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Lighter-skinned players

And for completeness, let’s look at the lighter-skinned players.

by_player_skin |>
  dplyr::filter(skin <= .5) |>
  ggplot() +
  aes(x = n_yellow, y = n_red) +
  geom_point(aes(size = as.factor(skin))) +
  geom_smooth() +
  ggtitle("Yellow vs. red cards (per ref encounter/game) for lighter skinned players")

Warning: Using size for a discrete variable is not advised.

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

by_player_skin |>
  dplyr::filter(skin <= .5) |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("Yellow vs. red cards (per ref encounter/game) for lighter skinned players")

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Referees

Now let’s focus on the referees.

by_ref <- red_cards |>
  dplyr::group_by(refNum) |>
  dplyr::summarise(n_interactions = n(),
    n_yellow = sum(yellowCards),
    n_red = sum(redCards)) |>
  dplyr::mutate(p_yellow = n_yellow / n_interactions, p_red = n_red / n_interactions)

by_ref |>
  ggplot() +
  aes(x = n_interactions) +
  geom_histogram() +
  ggtitle("Player interactions per referee")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by_ref |>
  ggplot() +
  aes(x = n_yellow) +
  geom_histogram(fill = "yellow") +
  ggtitle("N yellow cards given by referees")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by_ref |>
  ggplot() +
  aes(x = p_yellow) +
  geom_histogram(fill = "yellow") +
  ggtitle("Yellow cards per player interaction given by referees")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by_ref |>
  ggplot() +
  aes(x = n_red) +
  geom_histogram(fill = "red")  +
  ggtitle("N red cards given by referees")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by_ref |>
  ggplot() +
  aes(x = p_red) +
  geom_histogram(fill = "red")  +
  ggtitle("Red cards per player interaction given by referees")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

by_ref |>
  ggplot() +
  aes(x = n_yellow, y = n_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("N yellow vs. red cards by referee*player interaction")

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

by_ref |>
  ggplot() +
  aes(x = p_yellow, y = p_red) +
  geom_point() +
  geom_smooth() +
  ggtitle("Proportion yellow vs. red cards by referee*player interaction")

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

References

Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E. C., … al., et. (2023, May). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. OSF. Retrieved from osf.io/gvm2z

Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., … Nosek, B. A. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337–356. https://doi.org/10.1177/2515245917747646

Wolen, A. R., Hartgerink, C. H. J., Hafen, R., Richards, B. G., Soderberg, C. K., & York, T. P. (2020). osfr: An R interface to the Open Science Framework. Journal of Open Source Software, 5(46), 2071. https://doi.org/10.21105/joss.02071