The overall goal is to review ANOVA methods in R, as well as analyses of contingency tables (categorical data).
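The examples below draw on a number of add-on packages, and no library() calls appear in this section. A plausible setup block follows (the package-to-function mapping is standard; CrossTable() is assumed to come from gmodels, though descr provides a version as well):
library(car)       #Anova() with Type II/III sums of squares; loads carData (Moore data)
library(afex)      #aov_ez(); the sk2011.1 data
library(emmeans)   #emmeans(), emmip(), as.glht()
library(multcomp)  #adjusted() p-value methods
library(vcd)       #Arthritis data, assocstats(), mosaic(), structable()
library(vcdExtra)  #ships the tv.dat example file
library(dplyr)     #count(), group_by(), summarize()
library(tidyr)     #spread()
library(GGally)    #ggpairs()
library(gmodels)   #CrossTable()
library(xtable)    #xtable()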
For between-subjects designs, the `aov` function in R gives you most of what you need to compute standard ANOVA statistics, but it requires a fairly detailed understanding of sums of squares and typically assumes a balanced design. The `car::Anova` function takes things a bit further by allowing you to specify Type II or Type III sums of squares.
Consider the Moore (1971) dataset on conformity responses as a function of partner status (high/low) and level on an authoritarianism scale (high, medium, low). This is a between-subjects design with two factors (2 x 3).
#fit the two-way factorial model, then get Type III tests from car
mod <- lm(conformity ~ fcategory*partner.status, data=Moore)
Anova(mod, type="III")
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
(Intercept) | 984.1 | 1 | 46.935 | 0.000 |
fcategory | 89.7 | 2 | 2.138 | 0.131 |
partner.status | 2.2 | 1 | 0.105 | 0.748 |
fcategory:partner.status | 175.5 | 2 | 4.185 | 0.023 |
Residuals | 817.8 | 39 | NA | NA |
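One caveat that the output above does not flag: Type III tests are only meaningful with sum-to-zero contrasts, and R's default contr.treatment coding makes the main-effect rows above depend on the chosen reference levels. A minimal sketch of the refit with sum-to-zero contrasts:
#refit with sum-to-zero contrasts so the Type III main effects are interpretable
mod_sum <- lm(conformity ~ fcategory*partner.status, data=Moore,
              contrasts=list(fcategory="contr.sum", partner.status="contr.sum"))
Anova(mod_sum, type="III")
This is exactly what afex does for you automatically, which is why the afex table below reports a different partner.status test (note its 'Contrasts set to contr.sum' message).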
Even better is the `afex` package, which handles most of the complexity of ANOVA designs.
#aov_ez requires a subject identifier; each row is one subject here
Moore$id <- 1:nrow(Moore)
res <- aov_ez(id="id", dv = "conformity", between = c("partner.status", "fcategory"), data=Moore)
## Contrasts set to contr.sum for the following variables: partner.status, fcategory
print(res)
## Anova Table (Type 3 tests)
##
## Response: conformity
## Effect df MSE F ges p.value
## 1 partner.status 1, 39 20.97 11.42 ** .23 .002
## 2 fcategory 2, 39 20.97 0.86 .04 .43
## 3 partner.status:fcategory 2, 39 20.97 4.18 * .18 .02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
More detailed coverage of this example: https://cran.r-project.org/web/packages/afex/vignettes/afex_anova_example.html
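Given the significant fcategory-by-partner.status interaction, a natural follow-up (a sketch using emmeans, not part of the original output) is to test the partner-status effect within each authoritarianism level:
#simple effects of partner status within each fcategory level
pairs(emmeans(res, ~partner.status | fcategory))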
In general, the `aov_ez` function from the `afex` package is an ideal tool for ANOVA analysis because it computes the expected ANOVA table as well as an effect size (generalized eta squared).
Let's consider the experiment of Singmann and Klauer (2011), who examined individuals' conditional reasoning as a function of instruction type (between-subjects: deductive versus probabilistic), inference validity (within-subjects: valid versus invalid problems), and plausibility (within-subjects: plausible versus implausible). In short, this is a 2 x 2 x 2 mixed ANOVA design that can be tested without much difficulty.
str(sk2011.1)
## 'data.frame': 640 obs. of 9 variables:
## $ id : Factor w/ 40 levels "8","9","10","12",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ instruction : Factor w/ 2 levels "deductive","probabilistic": 2 2 2 2 2 2 2 2 2 2 ...
## $ plausibility: Factor w/ 2 levels "plausible","implausible": 1 2 2 1 2 1 1 2 1 2 ...
## $ inference : Factor w/ 4 levels "MP","MT","AC",..: 4 2 1 3 4 2 1 3 4 2 ...
## $ validity : Factor w/ 2 levels "valid","invalid": 2 1 1 2 2 1 1 2 2 1 ...
## $ what : Factor w/ 2 levels "affirmation",..: 2 2 1 1 2 2 1 1 2 2 ...
## $ type : Factor w/ 2 levels "original","reversed": 2 2 2 2 1 1 1 1 2 2 ...
## $ response : int 100 60 94 70 100 99 98 49 82 50 ...
## $ content : Factor w/ 4 levels "C1","C2","C3",..: 1 1 1 1 2 2 2 2 3 3 ...
#mixed design: instruction varies between subjects; inference and plausibility within
a1 <- aov_ez("id", "response", sk2011.1, between = "instruction",
             within = c("inference", "plausibility"))
## Warning: More than one observation per cell, aggregating the data using
## mean (i.e, fun_aggregate = mean)!
## Contrasts set to contr.sum for the following variables: instruction
summary(a1)
##
## Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
##
## Sum Sq num Df Error SS den Df F value
## (Intercept) 2004236 1 77042 38 988.56
## instruction 622 1 77042 38 0.31
## inference 14827 3 96988 114 5.81
## instruction:inference 15308 3 96988 114 6.00
## plausibility 16046 1 17815 38 34.23
## instruction:plausibility 5001 1 17815 38 10.67
## inference:plausibility 2096 3 27779 114 2.87
## instruction:inference:plausibility 2911 3 27779 114 3.98
## Pr(>F)
## (Intercept) < 2e-16 ***
## instruction 0.58302
## inference 0.00099 ***
## instruction:inference 0.00078 ***
## plausibility 9.1e-07 ***
## instruction:plausibility 0.00232 **
## inference:plausibility 0.03970 *
## instruction:inference:plausibility 0.00971 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Mauchly Tests for Sphericity
##
## Test statistic p-value
## inference 0.837 0.2582
## instruction:inference 0.837 0.2582
## inference:plausibility 0.652 0.0078
## instruction:inference:plausibility 0.652 0.0078
##
##
## Greenhouse-Geisser and Huynh-Feldt Corrections
## for Departure from Sphericity
##
## GG eps Pr(>F[GG])
## inference 0.887 0.0016 **
## instruction:inference 0.887 0.0013 **
## inference:plausibility 0.764 0.0551 .
## instruction:inference:plausibility 0.764 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## HF eps Pr(>F[HF])
## inference 0.960 0.001185
## instruction:inference 0.960 0.000947
## inference:plausibility 0.816 0.051231
## instruction:inference:plausibility 0.816 0.015478
A prettier printout of the ANOVA results (the fractional degrees of freedom arise because `nice()` applies the Greenhouse-Geisser correction to within-subject effects by default):
knitr::kable(nice(a1))
Effect | df | MSE | F | ges | p.value |
---|---|---|---|---|---|
instruction | 1, 38 | 2027.42 | 0.31 | .003 | .58 |
inference | 2.66, 101.12 | 959.12 | 5.81 ** | .06 | .002 |
instruction:inference | 2.66, 101.12 | 959.12 | 6.00 ** | .07 | .001 |
plausibility | 1, 38 | 468.82 | 34.23 *** | .07 | <.0001 |
instruction:plausibility | 1, 38 | 468.82 | 10.67 ** | .02 | .002 |
inference:plausibility | 2.29, 87.11 | 318.91 | 2.87 + | .009 | .06 |
instruction:inference:plausibility | 2.29, 87.11 | 318.91 | 3.98 * | .01 | .02 |
What about the estimated marginal means of the design?
emmeans(a1, ~instruction * inference * plausibility)
## instruction inference plausibility emmean SE df lower.CL upper.CL
## deductive MP plausible 99.5 6.01 190 87.6 111.3
## probabilistic MP plausible 95.3 6.01 190 83.4 107.2
## deductive MT plausible 70.5 6.01 190 58.7 82.4
## probabilistic MT plausible 93.0 6.01 190 81.1 104.8
## deductive AC plausible 66.9 6.01 190 55.1 78.8
## probabilistic AC plausible 90.5 6.01 190 78.7 102.4
## deductive DA plausible 86.5 6.01 190 74.7 98.4
## probabilistic DA plausible 87.4 6.01 190 75.6 99.3
## deductive MP implausible 95.1 6.01 190 83.2 107.0
## probabilistic MP implausible 60.2 6.01 190 48.3 72.0
## deductive MT implausible 70.2 6.01 190 58.4 82.1
## probabilistic MT implausible 72.9 6.01 190 61.1 84.8
## deductive AC implausible 56.0 6.01 190 44.2 67.9
## probabilistic AC implausible 64.1 6.01 190 52.3 76.0
## deductive DA implausible 77.1 6.01 190 65.2 89.0
## probabilistic DA implausible 80.7 6.01 190 68.9 92.6
##
## Confidence level used: 0.95
What about the 'main effects'?
em1 <- emmeans(a1, ~inference)
## NOTE: Results may be misleading due to involvement in interactions
And pairwise difference among levels of inference validity?
pairs(em1)
## contrast estimate SE df t.ratio p.value
## MP - MT 10.83 4.61 114 2.349 0.0933
## MP - AC 18.10 4.61 114 3.925 0.0008
## MP - DA 4.56 4.61 114 0.988 0.7566
## MT - AC 7.27 4.61 114 1.576 0.3963
## MT - DA -6.28 4.61 114 -1.361 0.5266
## AC - DA -13.54 4.61 114 -2.937 0.0206
##
## Results are averaged over the levels of: instruction, plausibility
## P value adjustment: tukey method for comparing a family of 4 estimates
Note that `emmeans` plays well with the `multcomp` package, which provides a broader suite of multiple-comparison corrections:
summary(as.glht(pairs(em1)), test=adjusted("free"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## MP - MT == 0 10.83 4.61 2.35 0.06892 .
## MP - AC == 0 18.10 4.61 3.92 0.00086 ***
## MP - DA == 0 4.56 4.61 0.99 0.32527
## MT - AC == 0 7.27 4.61 1.58 0.28168
## MT - DA == 0 -6.28 4.61 -1.36 0.29693
## AC - DA == 0 -13.54 4.61 -2.94 0.01813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- free method)
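You can also swap the correction without leaving emmeans, since summary() on an emmGrid accepts an adjust argument (a sketch, here with Holm):
#Holm-adjusted pairwise comparisons, staying within emmeans
summary(pairs(em1), adjust="holm")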
emmip(a1, instruction ~ inference|plausibility, CIs = TRUE)
The `emmeans` package provides the `emmip` function, which is very useful for plotting the results of an `aov_ez` object.
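Because emmip returns a ggplot object, the plot can be customized with ordinary ggplot2 layers (a minimal sketch; the theme and label choices are illustrative):
#emmip output is a ggplot, so standard layers apply
emmip(a1, instruction ~ inference|plausibility, CIs = TRUE) +
  ggplot2::theme_bw() +
  ggplot2::labs(y = "estimated marginal mean response")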
The `vcd` and `vcdExtra` packages are particularly handy for working with categorical data.
data("Arthritis")
# attach(Arthritis)
# ###
# Treatment
# detach(Arthritis)
#table(Arthritis$Treatment, Arthritis$Sex)
with(Arthritis, table(Treatment, Sex))
## Sex
## Treatment Female Male
## Placebo 32 11
## Treated 27 14
tb <- with(Arthritis, table(Treatment, Sex))
#index cells by their dimnames
tb["Placebo", "Female"]
## [1] 32
with(Arthritis, table(Treatment))
## Treatment
## Placebo Treated
## 43 41
#cell proportions
round(prop.table(tb), 3)
## Sex
## Treatment Female Male
## Placebo 0.381 0.131
## Treated 0.321 0.167
#row proportions
round(prop.table(tb, 1), 3)
## Sex
## Treatment Female Male
## Placebo 0.744 0.256
## Treated 0.659 0.341
#column proportions
round(prop.table(tb, 2), 3)
## Sex
## Treatment Female Male
## Placebo 0.542 0.440
## Treated 0.458 0.560
#sum over rows
margin.table(tb, 1)
## Treatment
## Placebo Treated
## 43 41
#add row margins
addmargins(tb, 1)
## Sex
## Treatment Female Male
## Placebo 32 11
## Treated 27 14
## Sum 59 25
addmargins(tb, 2)
## Sex
## Treatment Female Male Sum
## Placebo 32 11 43
## Treated 27 14 41
addmargins(tb, c(1,2))
## Sex
## Treatment Female Male Sum
## Placebo 32 11 43
## Treated 27 14 41
## Sum 59 25 84
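These helpers compose naturally; for example, row proportions with an explicit margin column (a sketch, reusing tb from above):
#row proportions, then add a Sum column over the second margin
addmargins(round(prop.table(tb, 1), 3), 2)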
#3-way table with formula syntax
#tb3 <- xtabs(~Treatment + Improved + Sex, Arthritis)
tb3 <- xtabs(~Treatment + Sex + Improved, Arthritis)
tb3
## , , Improved = None
##
## Sex
## Treatment Female Male
## Placebo 19 10
## Treated 6 7
##
## , , Improved = Some
##
## Sex
## Treatment Female Male
## Placebo 7 0
## Treated 5 2
##
## , , Improved = Marked
##
## Sex
## Treatment Female Male
## Placebo 6 1
## Treated 16 5
ftable(tb3)
## Improved None Some Marked
## Treatment Sex
## Placebo Female 19 7 6
## Male 10 0 1
## Treated Female 6 5 16
## Male 7 2 5
#structable formula: column variables ~ row variables
structable(Sex + Treatment ~ Improved, Arthritis)
## Sex Female Male
## Treatment Placebo Treated Placebo Treated
## Improved
## None 19 6 10 7
## Some 7 5 0 2
## Marked 6 16 1 5
xtable(xtabs(~ Sex + Treatment, Arthritis))
| | Placebo | Treated |
|---|---|---|
Female | 32 | 27 |
Male | 11 | 14 |
#sum frequencies across sexes
freqdf <- Arthritis %>% count(Improved, Treatment) %>% ungroup()
#proportions
Arthritis %>% count(Improved, Treatment) %>% mutate(prop = prop.table(n))
Improved | Treatment | n | prop |
---|---|---|---|
None | Placebo | 29 | 0.345 |
None | Treated | 13 | 0.155 |
Some | Placebo | 7 | 0.083 |
Some | Treated | 7 | 0.083 |
Marked | Placebo | 7 | 0.083 |
Marked | Treated | 21 | 0.250 |
#can obtain frequency tables from dplyr, too!
Arthritis %>%
count(Improved, Treatment) %>%
spread(Treatment, n)
Improved | Placebo | Treated |
---|---|---|
None | 29 | 13 |
Some | 7 | 7 |
Marked | 7 | 21 |
Arthritis %>%
count(Improved, Treatment, Sex) %>%
spread(Treatment, n)
Improved | Sex | Placebo | Treated |
---|---|---|---|
None | Female | 19 | 6 |
None | Male | 10 | 7 |
Some | Female | 7 | 5 |
Some | Male | NA | 2 |
Marked | Female | 6 | 16 |
Marked | Male | 1 | 5 |
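Note that spread() has since been superseded in tidyr by pivot_wider(); an equivalent call would be roughly:
#modern tidyr equivalent of the spread() call above
Arthritis %>%
  count(Improved, Treatment, Sex) %>%
  tidyr::pivot_wider(names_from = Treatment, values_from = n)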
#same as
xtabs(~Improved + Treatment, Arthritis)
## Treatment
## Improved Placebo Treated
## None 29 13
## Some 7 7
## Marked 7 21
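The reverse direction works too: a frequency-form data frame can be rebuilt into a contingency table by weighting xtabs() with the count column (a sketch using freqdf from above):
#back from frequency form to a contingency table
xtabs(n ~ Improved + Treatment, data=freqdf)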
#and can collapse over factors in frequency form
freqdf %>% count(Improved, wt=n) #tell it to weight the counts by frequency variable n
Improved | nn |
---|---|
None | 42 |
Some | 14 |
Marked | 28 |
freqdf %>% group_by(Improved) %>% dplyr::summarize(freq=sum(n)) #more explicit syntax
Improved | freq |
---|---|
None | 42 |
Some | 14 |
Marked | 28 |
freqdf %>% count(Treatment, wt=n)
Treatment | nn |
---|---|
Placebo | 43 |
Treated | 41 |
#2-way chi-square test
chisq.test(xtabs(~Treatment + Improved, Arthritis))
##
## Pearson's Chi-squared test
##
## data: xtabs(~Treatment + Improved, Arthritis)
## X-squared = 13.055, df = 2, p-value = 0.001463
#association statistics
assocstats(xtabs(~Treatment + Improved, Arthritis))
## X^2 df P(> X^2)
## Likelihood Ratio 13.530 2 0.0011536
## Pearson 13.055 2 0.0014626
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.367
## Cramer's V : 0.394
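As a sanity check on assocstats, Cramér's V can be computed by hand from the Pearson statistic: V = sqrt(X^2 / (n * (k - 1))), where n is the total count and k the smaller table dimension. A sketch:
#Cramer's V by hand, matching assocstats() (~0.394)
tb2 <- xtabs(~Treatment + Improved, Arthritis)
X2 <- unname(chisq.test(tb2)$statistic)
sqrt(X2 / (sum(tb2) * (min(dim(tb2)) - 1)))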
#mosaic(structable(Treatment+Improved ~ Sex, Arthritis))
mosaic(structable(Sex+Treatment~Improved, Arthritis))
ggpairs(dplyr::select(Arthritis, -ID))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#TV viewing data distributed with vcdExtra (frequency form)
tv.data <- read.table(system.file("doc", "extdata", "tv.dat", package="vcdExtra"))
TV <- xtabs(V5 ~ ., data=tv.data)
dimnames(TV) <- list(Day=c("Monday","Tuesday","Wednesday","Thursday","Friday"),
Time=c("8:00","8:15","8:30","8:45","9:00","9:15","9:30",
"9:45","10:00","10:15","10:30"),
Network=c("ABC","CBS","NBC","Fox","Other"),
State=c("Off","Switch","Persist"))
ggpairs(TV)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
TV <- TV[,,c("ABC", "CBS", "NBC"),] # keep only ABC, CBS, NBC
TV <- TV[,,,"Persist"] #people who keep watching
structable(TV)
## Time 8:00 8:15 8:30 8:45 9:00 9:15 9:30 9:45 10:00 10:15 10:30
## Day Network
## Monday ABC 146 151 156 83 325 350 386 340 352 280 278
## CBS 337 293 304 233 311 251 241 164 252 265 272
## NBC 263 219 236 140 226 235 239 246 279 263 283
## Tuesday ABC 244 181 231 205 385 283 345 192 329 351 364
## CBS 173 180 184 109 218 235 256 250 274 263 261
## NBC 315 254 280 241 370 214 195 111 188 190 210
## Wednesday ABC 233 161 194 156 339 264 279 140 237 228 203
## CBS 158 126 207 59 98 103 122 86 109 105 110
## NBC 134 146 166 66 194 230 264 143 274 289 306
## Thursday ABC 174 183 197 181 187 198 211 86 110 122 117
## CBS 196 185 195 104 106 116 116 47 102 84 84
## NBC 515 463 472 477 590 473 446 349 649 705 747
## Friday ABC 294 281 305 239 278 246 245 138 246 232 233
## CBS 130 144 154 81 129 153 136 126 138 136 152
## NBC 195 220 248 160 172 164 169 85 183 198 204
ggpairs(TV)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#more SPSS-like
CrossTable(TV[,,1],prop.t=FALSE,prop.r=FALSE,prop.c=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## |-------------------------|
##
##
## Total Observations in Table: 12894
##
##
## | Time
## Day | 8:00 | 8:15 | 8:30 | 8:45 | 9:00 | 9:15 | 9:30 | 9:45 | 10:00 | 10:15 | 10:30 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Monday | 146 | 151 | 156 | 83 | 325 | 350 | 386 | 340 | 352 | 280 | 278 | 2847 |
## | 37.381 | 17.211 | 28.897 | 60.883 | 0.258 | 9.814 | 11.993 | 102.156 | 17.769 | 0.553 | 0.758 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Tuesday | 244 | 181 | 231 | 205 | 385 | 283 | 345 | 192 | 329 | 351 | 364 | 3110 |
## | 1.393 | 10.755 | 3.495 | 0.055 | 1.077 | 5.058 | 0.209 | 2.690 | 1.534 | 11.668 | 19.918 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Wednesday | 233 | 161 | 194 | 156 | 339 | 264 | 279 | 140 | 237 | 228 | 203 | 2434 |
## | 3.553 | 2.138 | 0.533 | 0.309 | 9.904 | 0.466 | 0.019 | 5.020 | 0.051 | 0.004 | 2.260 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Thursday | 174 | 183 | 197 | 181 | 187 | 198 | 211 | 86 | 110 | 122 | 117 | 1766 |
## | 4.041 | 20.571 | 15.969 | 33.183 | 1.999 | 1.118 | 0.519 | 10.987 | 23.835 | 11.725 | 13.308 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Friday | 294 | 281 | 305 | 239 | 278 | 246 | 245 | 138 | 246 | 232 | 233 | 2737 |
## | 16.821 | 29.841 | 24.542 | 16.855 | 5.854 | 5.249 | 14.077 | 14.323 | 2.207 | 2.522 | 1.683 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1091 | 957 | 1083 | 864 | 1514 | 1341 | 1466 | 896 | 1274 | 1213 | 1195 | 12894 |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
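The per-cell chi-square contributions printed above sum to the omnibus test statistic; the corresponding overall test (a sketch) is simply:
#omnibus chi-square test for the Day x Time table (ABC viewers)
chisq.test(TV[,,1])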
We won't cover logistic regression here, but check out this nice tutorial: https://stats.idre.ucla.edu/r/dae/logit-regression/
Likewise, for multinomial outcomes, see: https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/