First, load the required packages for this walkthrough.
pkg_list <- c("tidyverse", "psych", "rcompanion", "knitr", "car", "afex", "ez",
"ggfortify", "Hmisc", "emmeans", "jtools", "apaTables", "dplyr")
# purrr::walk(pkg_list, require, quietly = TRUE, character.only = TRUE)
pacman::p_load(pkg_list, character.only = TRUE)
Optional package:
# devtools::install_github("ropensci/skimr")
library("skimr")
The data set used for this walkthrough is loaded automatically when psych is loaded. You can examine what each of the columns represents by looking at the data set’s help page with ?sat.act.
Next, just to get a snapshot of the data, we can use head and str. Notice that sex and education are integer columns despite being categorical variables.
data("sat.act", package="psych")
sat.act <- sat.act %>% dplyr::rename(sex=gender) %>%
mutate(sex = factor(sex, levels = c(1,2), labels = c("male", "female")))
# Alternatively...
# source("R/load_sat_act.R")
head(sat.act)
## sex education age ACT SATV SATQ
## 29442 female 3 19 24 500 500
## 29457 female 3 23 35 600 500
## 29498 female 3 20 21 480 470
## 29503 male 4 27 26 550 520
## 29504 male 2 33 31 600 550
## 29518 male 5 26 28 640 640
str(sat.act)
## 'data.frame': 700 obs. of 6 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
## $ education: int 3 3 3 4 2 5 5 3 4 5 ...
## $ age : int 19 23 20 27 33 26 30 19 23 40 ...
## $ ACT : int 24 35 21 26 31 28 36 22 22 35 ...
## $ SATV : int 500 600 480 550 600 640 610 520 400 730 ...
## $ SATQ : int 500 500 470 520 550 640 500 560 600 800 ...
Since some of the analyses we will be running require categorical variables, we first need to preprocess some of the integer columns into categories/factors.
labels <- c("none", "some_hs", "high_school",
"some_college", "college", "graduate")
sat.act <- sat.act %>%
mutate(.,
education = factor(education, levels = (0:5), labels = labels)
# education = case_when(education == 0 ~ "none",
# education == 1 ~ "some_hs",
# education == 2 ~ "high_school",
# education == 3 ~ "some_college",
# education == 4 ~ "college",
# education == 5 ~ "graduate")
)
Before analysis, we can obtain descriptive statistics quickly by utilizing the describe() function in the psych package.
describe(sat.act) %>%
dplyr::select(-vars, -trimmed, -mad, -range, -se)
## n mean sd median min max skew kurtosis
## sex* 700 1.65 0.48 2 1 2 -0.61 -1.62
## education* 700 4.16 1.43 4 1 6 -0.68 -0.07
## age 700 25.59 9.50 22 13 65 1.64 2.42
## ACT 700 28.55 4.82 29 3 36 -0.66 0.53
## SATV 700 612.23 112.90 620 200 800 -0.64 0.33
## SATQ 687 610.22 115.64 620 200 800 -0.59 -0.02
The describe() function also allows us to get descriptive statistics by a grouping variable using the partner function describeBy(). If we wanted descriptive statistics by sex, we simply pass that column name to the group argument.
describeBy(sat.act, group = c("sex"), mat = TRUE) %>%
dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
## group1 n mean sd median min max skew kurtosis
## sex*1 male 247 1.00 0.00 1 1 1 NaN NaN
## sex*2 female 453 2.00 0.00 2 2 2 NaN NaN
## education*1 male 247 4.00 1.54 4 1 6 -0.542 -0.598
## education*2 female 453 4.26 1.35 4 1 6 -0.737 0.272
## age1 male 247 25.86 9.74 22 14 58 1.428 1.435
## age2 female 453 25.45 9.37 22 13 65 1.766 3.029
## ACT1 male 247 28.79 5.06 30 3 36 -1.057 1.888
## ACT2 female 453 28.42 4.69 29 15 36 -0.392 -0.416
## SATV1 male 247 615.11 114.16 630 200 800 -0.630 0.130
## SATV2 female 453 610.66 112.31 620 200 800 -0.651 0.423
## SATQ1 male 245 635.87 116.02 660 300 800 -0.722 -0.122
## SATQ2 female 442 596.00 113.07 600 200 800 -0.577 0.129
We can add multiple grouping variables in the same manner.
describeBy(sat.act, group = c("sex", "education"), mat = TRUE) %>%
dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
## group1 group2 n mean sd median min max skew kurtosis
## sex*1 male none 27 1.0 0.00 1.0 1 1 NaN NaN
## sex*2 female none 30 2.0 0.00 2.0 2 2 NaN NaN
## sex*3 male some_hs 20 1.0 0.00 1.0 1 1 NaN NaN
## sex*4 female some_hs 25 2.0 0.00 2.0 2 2 NaN NaN
## sex*5 male high_school 23 1.0 0.00 1.0 1 1 NaN NaN
## sex*6 female high_school 21 2.0 0.00 2.0 2 2 NaN NaN
## sex*7 male some_college 80 1.0 0.00 1.0 1 1 NaN NaN
## sex*8 female some_college 195 2.0 0.00 2.0 2 2 NaN NaN
## sex*9 male college 51 1.0 0.00 1.0 1 1 NaN NaN
## sex*10 female college 87 2.0 0.00 2.0 2 2 NaN NaN
## sex*11 male graduate 46 1.0 0.00 1.0 1 1 NaN NaN
## sex*12 female graduate 95 2.0 0.00 2.0 2 2 NaN NaN
## education*1 male none 27 1.0 0.00 1.0 1 1 NaN NaN
## education*2 female none 30 1.0 0.00 1.0 1 1 NaN NaN
## education*3 male some_hs 20 2.0 0.00 2.0 2 2 NaN NaN
## education*4 female some_hs 25 2.0 0.00 2.0 2 2 NaN NaN
## education*5 male high_school 23 3.0 0.00 3.0 3 3 NaN NaN
## education*6 female high_school 21 3.0 0.00 3.0 3 3 NaN NaN
## education*7 male some_college 80 4.0 0.00 4.0 4 4 NaN NaN
## education*8 female some_college 195 4.0 0.00 4.0 4 4 NaN NaN
## education*9 male college 51 5.0 0.00 5.0 5 5 NaN NaN
## education*10 female college 87 5.0 0.00 5.0 5 5 NaN NaN
## education*11 male graduate 46 6.0 0.00 6.0 6 6 NaN NaN
## education*12 female graduate 95 6.0 0.00 6.0 6 6 NaN NaN
## age1 male none 27 16.9 1.04 17.0 14 18 -0.86185 0.33852
## age2 female none 30 17.0 1.07 17.0 13 18 -1.75193 4.12663
## age3 male some_hs 20 19.6 6.12 18.0 17 45 3.54747 11.77598
## age4 female some_hs 25 19.3 4.62 18.0 17 37 2.85534 7.26619
## age5 male high_school 23 25.3 8.68 22.0 18 55 1.94253 3.63012
## age6 female high_school 21 30.1 12.22 26.0 18 57 1.15981 0.00534
## age7 male some_college 80 20.8 3.06 20.0 17 34 1.99765 4.54881
## age8 female some_college 195 21.1 4.75 20.0 17 46 3.41196 12.83174
## age9 male college 51 32.2 9.03 29.0 23 57 1.20022 0.63050
## age10 female college 87 29.1 7.76 26.0 21 52 1.26002 0.69961
## age11 male graduate 46 35.8 10.00 35.5 22 58 0.47241 -0.67127
## age12 female graduate 95 34.3 10.67 30.0 22 65 1.17791 0.61024
## ACT1 male none 27 29.0 5.00 29.0 20 36 -0.30135 -1.12511
## ACT2 female none 30 26.1 5.06 26.0 15 36 0.07822 -0.55970
## ACT3 male some_hs 20 26.7 7.11 28.0 15 35 -0.29718 -1.51093
## ACT4 female some_hs 25 28.1 5.13 27.0 18 36 -0.21443 -0.78102
## ACT5 male high_school 23 26.7 6.39 28.0 3 32 -2.13714 5.38880
## ACT6 female high_school 21 27.3 5.23 28.0 15 36 -0.31605 -0.33928
## ACT7 male some_college 80 28.6 5.03 30.0 17 36 -0.45493 -0.91876
## ACT8 female some_college 195 28.2 4.78 29.0 16 36 -0.46005 -0.47245
## ACT9 male college 51 28.9 4.42 29.0 16 36 -0.73812 0.11815
## ACT10 female college 87 29.4 4.32 30.0 19 36 -0.27442 -0.66751
## ACT11 male graduate 46 30.8 3.11 32.0 25 36 -0.38460 -0.81302
## ACT12 female graduate 95 29.0 4.19 29.0 18 36 -0.30707 -0.72671
## SATV1 male none 27 640.1 132.24 670.0 400 800 -0.28857 -1.39638
## SATV2 female none 30 595.3 123.46 595.0 350 800 -0.08653 -0.81239
## SATV3 male some_hs 20 603.0 141.24 600.0 300 780 -0.38530 -1.12127
## SATV4 female some_hs 25 597.0 119.38 610.0 350 799 -0.31101 -0.95287
## SATV5 male high_school 23 560.0 152.29 600.0 200 800 -0.53228 -0.59245
## SATV6 female high_school 21 593.6 115.34 600.0 375 770 -0.44018 -0.91187
## SATV7 male some_college 80 617.4 111.79 630.0 300 800 -0.61837 -0.06488
## SATV8 female some_college 195 610.0 119.78 620.0 200 800 -0.80541 0.65541
## SATV9 male college 51 620.3 81.72 620.0 430 800 -0.25952 -0.29384
## SATV10 female college 87 615.0 106.62 620.0 300 800 -0.58308 0.28315
## SATV11 male graduate 46 623.5 99.58 645.0 390 770 -0.61371 -0.43122
## SATV12 female graduate 95 620.4 95.72 620.0 300 800 -0.45595 0.42801
## SATQ1 male none 27 642.7 127.90 660.0 400 800 -0.23835 -1.36130
## SATQ2 female none 29 599.7 123.20 600.0 333 800 -0.09034 -0.99079
## SATQ3 male some_hs 19 625.8 95.87 650.0 400 765 -0.65838 -0.46543
## SATQ4 female some_hs 24 592.5 140.83 625.0 230 799 -0.92682 0.20055
## SATQ5 male high_school 23 569.1 160.65 600.0 300 800 -0.35873 -1.43839
## SATQ6 female high_school 20 586.5 120.96 585.0 375 800 0.00849 -1.11439
## SATQ7 male some_college 79 642.6 118.28 680.0 300 800 -0.80809 -0.16871
## SATQ8 female some_college 190 590.9 114.46 600.0 200 800 -0.71909 0.38400
## SATQ9 male college 51 635.9 104.12 640.0 400 800 -0.45941 -0.45068
## SATQ10 female college 86 597.6 106.24 600.0 300 800 -0.71438 0.19853
## SATQ11 male graduate 46 657.8 89.61 680.0 475 800 -0.45170 -0.77286
## SATQ12 female graduate 93 606.7 105.55 600.0 350 800 -0.14002 -0.94239
Optional: While describe is useful for quick glimpses at the data, it does not always play nicely with the tidyverse. If you want to stick with a more “pure” tidyverse approach to descriptive statistics, you can use skimr.
sat.act %>%
skimr::skim(.)
## Skim summary statistics
## n obs: 700
## n variables: 6
##
## ── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts ordered
## education 0 700 700 6 som: 275, gra: 141, col: 138, non: 57 FALSE
## sex 0 700 700 2 fem: 453, mal: 247, NA: 0 FALSE
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ACT 0 700 700 28.55 4.82 3 25 29 32 36 ▁▁▁▁▃▅▇▇
## age 0 700 700 25.59 9.5 13 19 22 29 65 ▇▇▂▂▁▁▁▁
## SATQ 13 687 700 610.22 115.64 200 530 620 700 800 ▁▁▁▅▃▇▇▅
## SATV 0 700 700 612.23 112.9 200 550 620 700 800 ▁▁▁▃▃▇▅▃
While skim() doesn’t provide as much descriptive detail as describe, it does provide the basics and a useful visual representation of the data.
Similarly, for obtaining descriptives by grouping variables, you can utilize the group_by() function.
sat.act %>%
group_by(., sex) %>%
skimr::skim(.)
## Skim summary statistics
## n obs: 700
## n variables: 6
## group variables: sex
##
## ── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────
## sex variable missing complete n n_unique top_counts ordered
## male education 0 247 247 6 som: 80, col: 51, gra: 46, non: 27 FALSE
## female education 0 453 453 6 som: 195, gra: 95, col: 87, non: 30 FALSE
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
## sex variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## male ACT 0 247 247 28.79 5.06 3 25 30 32.5 36 ▁▁▁▁▂▃▆▇
## male age 0 247 247 25.86 9.74 14 19 22 30 58 ▇▇▃▂▁▁▁▁
## male SATQ 2 245 247 635.87 116.02 300 570 660 720 800 ▁▂▂▃▅▅▇▆
## male SATV 0 247 247 615.11 114.16 200 540 630 700 800 ▁▁▂▅▃▇▇▅
## female ACT 0 453 453 28.42 4.69 15 25 29 32 36 ▁▁▂▅▇▅▇▆
## female age 0 453 453 25.45 9.37 13 19 22 28 65 ▆▇▃▂▁▁▁▁
## female SATQ 11 442 453 596 113.07 200 500 600 683 800 ▁▁▁▅▃▇▆▃
## female SATV 0 453 453 610.66 112.31 200 550 620 700 800 ▁▁▁▃▃▇▅▃
sat.act %>%
group_by(., education) %>%
skimr::skim(.)
## Skim summary statistics
## n obs: 700
## n variables: 6
## group variables: education
##
## ── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────
## education variable missing complete n n_unique top_counts ordered
## none sex 0 57 57 2 fem: 30, mal: 27, NA: 0 FALSE
## some_hs sex 0 45 45 2 fem: 25, mal: 20, NA: 0 FALSE
## high_school sex 0 44 44 2 mal: 23, fem: 21, NA: 0 FALSE
## some_college sex 0 275 275 2 fem: 195, mal: 80, NA: 0 FALSE
## college sex 0 138 138 2 fem: 87, mal: 51, NA: 0 FALSE
## graduate sex 0 141 141 2 fem: 95, mal: 46, NA: 0 FALSE
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
## education variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## none ACT 0 57 57 27.47 5.21 15 24 28 31 36 ▁▂▅▃▇▃▃▅
## none age 0 57 57 16.95 1.04 13 17 17 18 18 ▁▁▁▁▃▁▇▆
## none SATQ 1 56 57 620.43 126.21 333 515 620 712.5 800 ▁▁▇▅▅▃▅▇
## none SATV 0 57 57 616.51 128.53 350 500 600 710 800 ▂▂▆▃▅▃▆▇
## some_hs ACT 0 45 45 27.49 6.06 15 24 27 33 36 ▂▅▁▅▇▂▅▇
## some_hs age 0 45 45 19.47 5.27 17 18 18 18 45 ▇▁▁▁▁▁▁▁
## some_hs SATQ 2 43 45 607.26 122.8 230 545 650 700 799 ▁▁▂▂▂▇▅▂
## some_hs SATV 0 45 45 599.67 128.05 300 500 600 700 799 ▂▂▃▅▇▃▇▇
## high_school ACT 0 44 44 26.98 5.81 3 25 28 31 36 ▁▁▁▁▃▇▇▅
## high_school age 0 44 44 27.57 10.68 18 20 25 28.75 57 ▇▆▁▂▁▁▁▂
## high_school SATQ 1 43 44 577.21 142.18 300 450 600 700 800 ▂▃▃▃▆▂▇▃
## high_school SATV 0 44 44 576.02 135.43 200 487.5 600 682.5 800 ▁▁▃▃▂▇▇▂
## some_college ACT 0 275 275 28.29 4.85 16 25 29 32 36 ▁▃▃▅▅▇▆▆
## some_college age 0 275 275 21.01 4.32 17 19 20 21 46 ▇▃▁▁▁▁▁▁
## some_college SATQ 6 269 275 606.07 117.76 200 525 620 700 800 ▁▁▁▅▃▇▇▅
## some_college SATV 0 275 275 612.13 117.36 200 540 625 700 800 ▁▁▁▃▃▇▆▅
## college ACT 0 138 138 29.26 4.35 16 27 30 32 36 ▁▂▂▃▆▇▆▆
## college age 0 138 138 30.24 8.36 21 24 28 34.75 57 ▇▅▂▂▁▁▁▁
## college SATQ 1 137 138 611.85 106.71 300 565 610 690 800 ▁▂▁▃▇▅▇▃
## college SATV 0 138 138 616.95 97.88 300 572.5 620 680 800 ▁▁▁▃▇▇▅▃
## graduate ACT 0 141 141 29.6 3.95 18 27 30 32 36 ▁▁▃▆▆▅▇▅
## graduate age 0 141 141 34.83 10.44 22 27 33 40 65 ▇▅▅▃▁▁▂▁
## graduate SATQ 2 139 141 623.63 103.09 350 535 640 700 800 ▁▂▆▅▆▆▇▆
## graduate SATV 0 141 141 621.4 96.65 300 570 630 700 800 ▁▁▂▅▇▇▆▃
Suppose we wanted to see if the number of males and females differed by education level. One way to accomplish this is to perform a chi-squared test of independence. In R, the chisq.test() function expects a contingency table in order to produce the appropriate statistic. A contingency table provides count data by groups.
edu_sex_table <- table(sat.act$education, sat.act$sex)
print(edu_sex_table)
##
## male female
## none 27 30
## some_hs 20 25
## high_school 23 21
## some_college 80 195
## college 51 87
## graduate 46 95
Next, we simply pass the contingency table to the chisq.test() function.
chi_test <- chisq.test(edu_sex_table)
chi_test
##
## Pearson's Chi-squared test
##
## data: edu_sex_table
## X-squared = 16, df = 5, p-value = 0.007
It appears as though the two variables are not independent. We will examine the pairwise comparisons in a moment to get a better idea of which cells are driving this result.
We can get the observed, expected, and residuals from the saved object.
chi_test$observed
##
## male female
## none 27 30
## some_hs 20 25
## high_school 23 21
## some_college 80 195
## college 51 87
## graduate 46 95
chi_test$expected
##
## male female
## none 20.1 36.9
## some_hs 15.9 29.1
## high_school 15.5 28.5
## some_college 97.0 178.0
## college 48.7 89.3
## graduate 49.8 91.2
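The residuals mentioned above are available from the same object; large absolute Pearson residuals flag the cells that depart most from independence.
chi_test$residuals # Pearson residuals: (observed - expected) / sqrt(expected)
# chi_test$stdres  # standardized residuals are also available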
We could also get the chi-squared statistic from the table using summary:
summary(edu_sex_table)
## Number of cases in table: 700
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 16, df = 5, p-value = 0.007
Aside from simply printing a table, we can visualize frequencies with a mosaic plot. See the supplemental materials for an example.
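In the meantime, here is a minimal sketch using base R's mosaicplot(); shade = TRUE colors the cells by their Pearson residuals:
mosaicplot(edu_sex_table, shade = TRUE, main = "Education level by sex")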
If we want to compare each level with the others in our contingency table, we can compute those pairwise comparisons using rcompanion.
rcompanion::pairwiseNominalIndependence(edu_sex_table,
compare = "row", # or compare = "column"
fisher = FALSE,
gtest = FALSE,
method = "holm")
## Comparison p.Chisq p.adj.Chisq
## 1 none : some_hs 0.92500 1.0000
## 2 none : high_school 0.77300 1.0000
## 3 none : some_college 0.01140 0.1600
## 4 none : college 0.23400 1.0000
## 5 none : graduate 0.07440 0.8180
## 6 some_hs : high_school 0.59800 1.0000
## 7 some_hs : some_college 0.05920 0.7100
## 8 some_hs : college 0.47200 1.0000
## 9 some_hs : graduate 0.20600 1.0000
## 10 high_school : some_college 0.00398 0.0597
## 11 high_school : college 0.10400 1.0000
## 12 high_school : graduate 0.02970 0.3860
## 13 some_college : college 0.13200 1.0000
## 14 some_college : graduate 0.52900 1.0000
## 15 college : graduate 0.52600 1.0000
Moving beyond categorical variables, we next test the relationship between numeric variables using simple correlation. See the supplemental materials for more in-depth explanations.
Correlation is done using the cor() function. Suppose we want to see if ACT scores increase with age.
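As a reminder, the Pearson correlation is simply the covariance rescaled by the two standard deviations, which is why the commented-out cov() call below is a natural companion:
\[ r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x \, s_y} \]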
# Covariance
# cov(sat.act$age, sat.act$ACT)
# Correlation
cor(sat.act$age, sat.act$ACT)
## [1] 0.111
A small correlation, but no test of significance. To obtain a significance test, you must pass your data to cor.test(). Note that this can be done by passing x and y or by using the formula method (which will be used again for linear models).
# Default method
# cor.test(sat.act$age, sat.act$ACT)
# Formula method
cor.test(~ age + ACT, data = sat.act)
##
## Pearson's product-moment correlation
##
## data: age and ACT
## t = 3, df = 698, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0367 0.1831
## sample estimates:
## cor
## 0.111
To visualize this relationship, we can pass the raw data to ggplot and get a simple regression line using stat_smooth() with the lm method. See the supplemental materials for an example.
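For reference, a minimal sketch of that plot (using the same age and ACT columns as the correlation above):
ggplot(sat.act, aes(x = age, y = ACT)) +
  geom_point(alpha = 0.4) +    # raw observations
  stat_smooth(method = "lm")   # fitted regression line (with SE ribbon)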
You can also pass a data frame of numeric columns to the cor function to get a simple correlation matrix (a la SPSS).
cor(sat.act[c("age","ACT","SATV","SATQ")], use = "pairwise.complete.obs")
## age ACT SATV SATQ
## age 1.0000 0.111 -0.0424 -0.0339
## ACT 0.1105 1.000 0.5611 0.5871
## SATV -0.0424 0.561 1.0000 0.6443
## SATQ -0.0339 0.587 0.6443 1.0000
# Or, using tidyverse verbs...
# sat.act %>%
# select(., age, ACT, SATV, SATQ) %>%
# drop_na(., SATQ) %>%
# cor()
Optionally, for output that is easier on the eyes, we can use a number of specialized functions.
Hmisc::rcorr(sat.act[c("age","ACT","SATV","SATQ")] %>% as.matrix(.))
## age ACT SATV SATQ
## age 1.00 0.11 -0.04 -0.03
## ACT 0.11 1.00 0.56 0.59
## SATV -0.04 0.56 1.00 0.64
## SATQ -0.03 0.59 0.64 1.00
##
## n
## age ACT SATV SATQ
## age 700 700 700 687
## ACT 700 700 700 687
## SATV 700 700 700 687
## SATQ 687 687 687 687
##
## P
## age ACT SATV SATQ
## age 0.0034 0.2631 0.3744
## ACT 0.0034 0.0000 0.0000
## SATV 0.2631 0.0000 0.0000
## SATQ 0.3744 0.0000 0.0000
Or go directly to APA-acceptable tables. You can pass a filename argument to apa.cor.table to save the table directly to a file.
apaTables::apa.cor.table(sat.act[c("age","ACT","SATV","SATQ")])
##
##
## Means, standard deviations, and correlations with confidence intervals
##
##
## Variable M SD 1 2 3
## 1. age 25.59 9.50
##
## 2. ACT 28.55 4.82 .11**
## [.04, .18]
##
## 3. SATV 612.23 112.90 -.04 .56**
## [-.12, .03] [.51, .61]
##
## 4. SATQ 610.22 115.64 -.03 .59** .64**
## [-.11, .04] [.54, .63] [.60, .69]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
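For example, to also write the table to a Word-compatible file (the file name here is just an illustration):
apaTables::apa.cor.table(sat.act[c("age","ACT","SATV","SATQ")],
                         filename = "sat_cor_table.doc")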
The overall goal of this section is to give you an introduction to conducting regression analyses, or linear modeling, in R.
Let’s start with ‘simple’ linear regression (one predictor, one outcome), then scale up to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the lm function in R. The lm function uses ordinary least squares (OLS), which estimates the parameters by minimizing the sum of squared residuals.
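In symbols, for the simple one-predictor case, OLS picks the intercept and slope that minimize the residual sum of squares:
\[ (\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0,\, \beta_1}{\arg\min} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 \]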
Let’s take a look at the simple case of the association between SAT math and verbal scores.
ggplot(sat.act, aes(x=SATV, y=SATQ)) +
geom_point(color='darkblue', size = 3) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
labs(x="Math Score", y="Verbal Score")
This relationship looks quite linear: the higher the verbal score, the higher the math score.
In R regression models, we use the ~ operator to denote ‘regressed on’. It’s not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think verbal is predictive of math, we’d say ‘math regressed on verbal’ or ‘verbal predicts math.’
In formula terms, this is SATQ ~ SATV, which we pass as the first argument to lm().
lm_SAT <- lm(SATQ ~ SATV, data=sat.act)
summary(lm_SAT)
##
## Call:
## lm(formula = SATQ ~ SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302.1 -46.5 2.4 51.3 282.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 207.5253 18.5725 11.2 <2e-16 ***
## SATV 0.6576 0.0298 22.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.5 on 685 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.415, Adjusted R-squared: 0.414
## F-statistic: 486 on 1 and 685 DF, p-value: <2e-16
The output contains the individual parameter estimates of the model (here, just the intercept and slope), their standard errors, t-statistics, and p-values. We also get global information such as the residual standard error, the F-test for the overall model, and the coefficient of determination (\(R^2\)).
The ggfortify package provides an autoplot function that produces the standard regression diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage) in a handy ggplot-based graph.
autoplot(lm_SAT)
Using functionality from the car and boot packages, we can easily get estimates of the regression coefficients and their standard errors using nonparametric bootstrapping, which relaxes the normal-theory assumptions underlying the standard errors and, therefore, the significance tests. Likewise, bootstrapping does not assume normally distributed errors.
sat.act.na <- na.omit(sat.act)
lm_SAT.na <- lm(SATQ ~ SATV, data=sat.act.na)
system.time(lm_SAT.boot <- Boot(lm_SAT.na, R=2000))
## user system elapsed
## 1.744 0.085 1.828
summary(lm_SAT.boot, high.moments=TRUE)
##
## Number of bootstrap replications R = 2000
## original bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept) 207.525 -0.280479 20.1033 206.684 0.0365 0.150
## SATV 0.658 0.000648 0.0316 0.659 -0.0399 0.148
We can use the object to obtain 95% bootstrapped confidence intervals using the bias-corrected and accelerated method (aka BCa).
confint(lm_SAT.boot, level=.95, type="bca")
## Bootstrap bca confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 169.714 248.015
## SATV 0.594 0.717
And we can easily compare the bootstrapped and standard OLS models:
hist(lm_SAT.boot, legend="separate")
We can easily extend to larger regression models by adding terms to the right side of the formula. For example, in the sat.act dataset, we could examine the extent to which math scores (SATQ) are a function of verbal scores (SATV), age (age), and sex (sex, recoded above from gender; 1 = male, 2 = female).
asv_model <- lm(SATQ ~ age + sex + SATV, sat.act)
summary(asv_model)
##
## Call:
## lm(formula = SATQ ~ age + sex + SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.34 -50.32 5.81 51.51 295.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 236.6596 21.2572 11.13 < 2e-16 ***
## age -0.1267 0.3498 -0.36 0.72
## sexfemale -36.8580 6.9202 -5.33 1.4e-07 ***
## SATV 0.6541 0.0293 22.33 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.8 on 683 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.438, Adjusted R-squared: 0.436
## F-statistic: 178 on 3 and 683 DF, p-value: <2e-16
Note that the broom package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?
We can use the glance function to get the global model statistics.
broom::glance(asv_model)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.438 0.436 86.8 178. 3.52e-85 4 -4040. 8089. 8112. 5150990. 683
And the tidy function yields the parameter table.
broom::tidy(asv_model)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 237. 21.3 11.1 1.43e-26
## 2 age -0.127 0.350 -0.362 7.17e- 1
## 3 sexfemale -36.9 6.92 -5.33 1.36e- 7
## 4 SATV 0.654 0.0293 22.3 2.51e-83
We can use the * operator in R to ask that both the constituent variables and their interaction(s) be entered into the model. For example:
int_model <- lm(SATQ ~ sex*age*SATV, sat.act)
summary(int_model)
##
## Call:
## lm(formula = SATQ ~ sex * age * SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.41 -50.48 4.05 52.30 291.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.01000 86.97955 -0.26 0.791
## sexfemale 216.92703 108.13088 2.01 0.045 *
## age 9.54204 3.39075 2.81 0.005 **
## SATV 1.02899 0.13875 7.42 3.6e-13 ***
## sexfemale:age -8.86490 4.15128 -2.14 0.033 *
## sexfemale:SATV -0.33756 0.17252 -1.96 0.051 .
## age:SATV -0.01395 0.00546 -2.55 0.011 *
## sexfemale:age:SATV 0.01155 0.00668 1.73 0.084 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.2 on 679 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.45, Adjusted R-squared: 0.444
## F-statistic: 79.4 on 7 and 679 DF, p-value: <2e-16
This model includes the individual effects of sex and age, their two-way interaction (sex:age), and the remaining interactions with SATV. This highlights that the asterisk operator * will compute all possible interactions among the specified predictors. For example, a*b*c*d will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., a:b). The downside of the colon operator is that it doesn’t guarantee that the corresponding lower-order effects are included, and including them is usually the sane default position. As a reminder, you should essentially never include an interaction without including its lower-order effects, because this can misassign the variance.
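To make the expansion concrete, these two calls fit identical models (a sketch using variables already in sat.act; m1 and m2 are arbitrary names):
# The * operator expands to main effects plus the interaction...
m1 <- lm(SATQ ~ sex*age, data = sat.act)
# ...which is equivalent to spelling everything out with + and :
m2 <- lm(SATQ ~ sex + age + sex:age, data = sat.act)
all.equal(coef(m1), coef(m2)) # the two fits have identical coefficients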
What do we see? Males tend to have higher math scores and maintain these scores with age. Females have lower math scores and show a decreasing trend as they get older.
(Some of the code and text here has been adapted from Russell Lenth’s excellent emmeans documentation: https://cran.r-project.org/web/packages/emmeans/)
One of the handiest packages in the R regression universe is emmeans, which can provide the ‘estimated marginal means’ (EM means), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with multcomp to handle correction for multiple comparisons. See the vignettes at the link above for longer documentation.
Let’s look how age and verbal scores may interact to predict math scores.
# creating age groups for pairwise comparisons
sat.act$agef[sat.act$age < 25] <- 1
sat.act$agef[sat.act$age >= 25 & sat.act$age <= 50] <- 2
sat.act$agef[sat.act$age > 50] <- 3
# setting as factors
sat.act$agef <- as.factor(sat.act$agef)
sat.act$gender <- as.factor(sat.act$sex)
# running the model
sat.lm <- lm(SATQ ~ agef + SATV, data = sat.act)
summary(sat.lm)
##
## Call:
## lm(formula = SATQ ~ agef + SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -301.62 -45.92 2.07 51.81 283.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 206.9031 18.8894 10.95 <2e-16 ***
## agef2 -0.0946 7.1350 -0.01 0.99
## agef3 15.5437 18.9726 0.82 0.41
## SATV 0.6579 0.0299 22.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.6 on 683 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.416, Adjusted R-squared: 0.413
## F-statistic: 162 on 3 and 683 DF, p-value: <2e-16
This output is hard to look at because there are many dummy codes and we have to infer the reference condition for each factor (by default, the first level, which is alphabetical unless set otherwise). Also, we do not have an intuitive sense of the expected means in each condition, because they depend on the sum of the intercept and the specific dummy code for the condition of interest, averaging over the other factor.
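If you are unsure which level is the reference for a factor, you can check directly; under R's default treatment coding, the first level listed is the reference:
levels(sat.act$agef)    # first level is the reference category
contrasts(sat.act$agef) # dummy-coding matrix used by lm()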
We can obtain the expected means for each condition.
sat.emm.s <- emmeans(sat.lm, "agef")
print(sat.emm.s)
## agef emmean SE df lower.CL upper.CL
## 1 610 4.32 683 601 618
## 2 610 5.67 683 598 621
## 3 625 18.47 683 589 662
##
## Confidence level used: 0.95
sat.emm.p <- emmeans(sat.lm, "SATV")
print(sat.emm.p)
## SATV emmean SE df lower.CL upper.CL
## 612 615 4.32 683 606 623
##
## Results are averaged over the levels of: agef
## Confidence level used: 0.95
print(emmeans(sat.lm, ~agef*SATV))
## agef SATV emmean SE df lower.CL upper.CL
## 1 612 610 4.32 683 601 618
## 2 612 610 5.67 683 598 621
## 3 612 625 18.47 683 589 662
##
## Confidence level used: 0.95
If we wanted to compare the pairwise differences in the effect of age group on math scores while controlling for verbal scores (and potentially other variables we add to the model), we could use the pairs function:
sat_pairs <- pairs(sat.emm.s)
print(sat_pairs)
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.09 7.13 683 0.013 1.0000
## 1 - 3 -15.54 18.97 683 -0.819 0.6910
## 2 - 3 -15.64 19.32 683 -0.809 0.6970
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Note that you can get a sense of the contrasts being tested by emmeans by examining the @linfct slot of the object. I’ve learned a lot by examining these contrast matrices and thinking about how to set up a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey’s HSD method).
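If you prefer a different correction, the summary method on the contrast object accepts an adjust argument (a quick sketch):
summary(sat_pairs, adjust = "bonferroni") # or "holm", "none", ...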
Here are the linear functions for the predicted mean math score in each age group, holding verbal score at its sample mean:
sat.emm.s@linfct
## (Intercept) agef2 agef3 SATV
## [1,] 1 0 0 612
## [2,] 1 1 0 612
## [3,] 1 0 1 612
And here are the linear functions for the pairwise contrasts between age groups:
sat_pairs@linfct
## (Intercept) agef2 agef3 SATV
## [1,] 0 -1 0 0
## [2,] 0 0 -1 0
## [3,] 0 1 -1 0
Consider a model in which we examine the association between SATQ and SATV across age groups. Here, we regress SATQ on SATV, age group (three levels), and their interaction.
fit_sat <- lm(SATQ ~ SATV*agef, data = sat.act)
summary(fit_sat)
##
## Call:
## lm(formula = SATQ ~ SATV * agef, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302.51 -50.21 2.09 54.61 256.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.8432 22.1853 8.06 3.4e-15 ***
## SATV 0.7034 0.0354 19.90 < 2e-16 ***
## agef2 109.8710 42.6641 2.58 0.0102 *
## agef3 17.2861 95.7898 0.18 0.8568
## SATV:agef2 -0.1804 0.0690 -2.61 0.0091 **
## SATV:agef3 -0.0022 0.1547 -0.01 0.9887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.3 on 681 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.422, Adjusted R-squared: 0.417
## F-statistic: 99.3 on 5 and 681 DF, p-value: <2e-16
car::Anova(fit_sat, type="III") #overall effects of predictors in the model
## Anova Table (Type III tests)
##
## Response: SATQ
## Sum Sq Df F value Pr(>F)
## (Intercept) 506336 1 64.99 3.4e-15 ***
## SATV 3084321 1 395.85 < 2e-16 ***
## agef 51806 2 3.32 0.037 *
## SATV:agef 53928 2 3.46 0.032 *
## Residuals 5306050 681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this yields a categorical (age group) x continuous (verbal scores) interaction. The output from car::Anova indicates that the interaction is significant, but we need more detailed guidance on how the slope for verbal scores is moderated by age group.
In a simple slopes test, we might wish to know whether the slope for SATV is non-zero in each age group individually. Let’s start by getting the estimated marginal means for each age group.
emmeans(fit_sat, ~agef)
## NOTE: Results may be misleading due to involvement in interactions
## agef emmean SE df lower.CL upper.CL
## 1 610 4.31 681 601 618
## 2 609 5.66 681 598 620
## 3 626 18.43 681 589 662
##
## Confidence level used: 0.95
And pairwise differences between age groups:
pairs(emmeans(fit_sat, ~agef))
## NOTE: Results may be misleading due to involvement in interactions
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.62 7.11 681 0.087 0.9960
## 1 - 3 -15.94 18.92 681 -0.842 0.6770
## 2 - 3 -16.56 19.28 681 -0.859 0.6660
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Turning to SATV, because we are interested in its linear effect (slope), we use the emtrends function to estimate the slope in each age group individually. In terms of simple slopes, we test whether the SATV slope is non-zero within each age group. The infer argument in the summary of emtrends requests t-tests and p-values for the slopes.
summary(emtrends(model = fit_sat, ~agef, var="SATV"), infer=TRUE)
## agef SATV.trend SE df lower.CL upper.CL t.ratio p.value
## 1 0.703 0.0354 681 0.634 0.773 19.900 <.0001
## 2 0.523 0.0593 681 0.407 0.639 8.820 <.0001
## 3 0.701 0.1506 681 0.406 0.997 4.660 <.0001
##
## Confidence level used: 0.95
Finally, we could examine pairwise differences between slopes among age groups.
pairs(emtrends(model = fit_sat, ~agef, var="SATV"))
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.1804 0.069 681 2.614 0.0250
## 1 - 3 0.0022 0.155 681 0.014 1.0000
## 2 - 3 -0.1782 0.162 681 -1.101 0.5140
##
## P value adjustment: tukey method for comparing a family of 3 estimates
There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.
Further, sometimes we want to test the relationship between categorical variables and numeric variables. One way to do this is analysis of variance (ANOVA), which provides a statistical test of whether two or more population means are equal. Here, ANOVA is done using the aov_ez function from the afex package.
Suppose we want to see if ACT scores differ by education level and gender. A two-way between-subjects ANOVA is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. The model is written out below; after that, let’s see the summary of each condition.
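In symbols, with \(\alpha_i\) for education level, \(\beta_j\) for gender, and \(\varepsilon_{ijk}\) for subject-level error, the two-way model decomposes each score into two main effects and an interaction:
\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]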
str(sat.act)
## 'data.frame': 700 obs. of 8 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
## $ education: Factor w/ 6 levels "none","some_hs",..: 4 4 4 5 3 6 6 4 5 6 ...
## $ age : int 19 23 20 27 33 26 30 19 23 40 ...
## $ ACT : int 24 35 21 26 31 28 36 22 22 35 ...
## $ SATV : int 500 600 480 550 600 640 610 520 400 730 ...
## $ SATQ : int 500 500 470 520 550 640 500 560 600 800 ...
## $ agef : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
describeBy(sat.act$ACT, list(sat.act$education, sat.act$gender), mat = TRUE, digits = 2) %>%
dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
## group1 group2 n mean sd median min max skew kurtosis
## X11 none male 27 29.0 5.00 29 20 36 -0.30 -1.13
## X12 some_hs male 20 26.7 7.11 28 15 35 -0.30 -1.51
## X13 high_school male 23 26.6 6.39 28 3 32 -2.14 5.39
## X14 some_college male 80 28.6 5.03 30 17 36 -0.45 -0.92
## X15 college male 51 28.9 4.42 29 16 36 -0.74 0.12
## X16 graduate male 46 30.8 3.11 32 25 36 -0.38 -0.81
## X17 none female 30 26.1 5.06 26 15 36 0.08 -0.56
## X18 some_hs female 25 28.1 5.13 27 18 36 -0.21 -0.78
## X19 high_school female 21 27.3 5.23 28 15 36 -0.32 -0.34
## X110 some_college female 195 28.2 4.78 29 16 36 -0.46 -0.47
## X111 college female 87 29.4 4.32 30 19 36 -0.27 -0.67
## X112 graduate female 95 29.0 4.19 29 18 36 -0.31 -0.73
# aov_ez requires a unique subject identifier
sat.act$id <- 1:nrow(sat.act)
sat.act$education <- factor(sat.act$education) # already a factor; harmless safeguard
# Compute the analysis of variance
res.aov <- aov_ez(id="id", dv="ACT", data = sat.act, between=c("education","gender"))
## Contrasts set to contr.sum for the following variables: education, gender
# Summary of the analysis
print(res.aov)
## Anova Table (Type 3 tests)
##
## Response: ACT
## Effect df MSE F ges p.value
## 1 education 5, 688 22.56 4.53 *** .03 .0004
## 2 gender 1, 688 22.56 0.87 .001 .35
## 3 education:gender 5, 688 22.56 2.04 + .01 .07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
We can conclude that education is statistically significant: ACT scores are positively related to education. Also, there is no significant interaction effect between education and gender.
ACT scores generally increase with education. If we want to compare each level of the education factor with the others, we can compute those pairwise comparisons using emmeans.
sat.emm.aov <- emmeans(res.aov, "education")
## NOTE: Results may be misleading due to involvement in interactions
sat_pairs_aov <- pairs(sat.emm.aov)
print(sat_pairs_aov)
## contrast estimate SE df t.ratio p.value
## none - some_hs 0.142 0.951 688 0.150 1.0000
## none - high_school 0.559 0.954 688 0.590 0.9920
## none - some_college -0.822 0.705 688 -1.170 0.8530
## none - college -1.643 0.757 688 -2.170 0.2520
## none - graduate -2.366 0.761 688 -3.110 0.0240
## some_hs - high_school 0.417 1.011 688 0.410 0.9980
## some_hs - some_college -0.964 0.779 688 -1.240 0.8190
## some_hs - college -1.785 0.826 688 -2.160 0.2580
## some_hs - graduate -2.508 0.830 688 -3.020 0.0310
## high_school - some_college -1.381 0.783 688 -1.760 0.4900
## high_school - college -2.202 0.830 688 -2.650 0.0860
## high_school - graduate -2.926 0.834 688 -3.510 0.0060
## some_college - college -0.821 0.524 688 -1.570 0.6210
## some_college - graduate -1.545 0.530 688 -2.910 0.0430
## college - graduate -0.724 0.598 688 -1.210 0.8320
##
## Results are averaged over the levels of: gender
## P value adjustment: tukey method for comparing a family of 6 estimates
Repeated-measures ANOVA is commonly used in data analysis, for example when participants complete more than one task or condition, or in longitudinal studies where data are collected from each participant more than once. Assume there are five time points in this dataset. After introducing a time-point variable into the data (simulated below, with set.seed for reproducibility), let’s check whether ACT scores vary by time point and education.
set.seed(1999) # Needed for getting same results
sat.act <- sat.act %>%
mutate(education = factor(rep(sample(1:5, 140, replace = TRUE), each=5)),
id = rep(1:140, each = 5),
time = rep(1:5, 140)
)
res.aov2 <- aov_ez(
id = "id",
dv = "ACT",
data = sat.act,
between = c("education"),
within = c("time")
)
## Contrasts set to contr.sum for the following variables: education
summary(res.aov2)
##
## Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
##
## Sum Sq num Df Error SS den Df F value Pr(>F)
## (Intercept) 531504 1 3619 135 19829.43 <2e-16 ***
## education 154 4 3619 135 1.44 0.225
## time 194 4 11905 540 2.20 0.067 .
## education:time 387 16 11905 540 1.10 0.355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Mauchly Tests for Sphericity
##
## Test statistic p-value
## time 0.874 0.0362
## education:time 0.874 0.0362
##
##
## Greenhouse-Geisser and Huynh-Feldt Corrections
## for Departure from Sphericity
##
## GG eps Pr(>F[GG])
## time 0.936 0.072 .
## education:time 0.936 0.356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## HF eps Pr(>F[HF])
## time 0.966 0.0698
## education:time 0.966 0.3556
Aside from simply printing a table, we can visualize the mean and sd of each condition with a line plot. See the supplemental materials for an example.
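In the meantime, a minimal sketch of such a line plot with tidyverse verbs (using the education and time columns created above):
sat.act %>%
  group_by(education, time) %>%
  summarise(mean_ACT = mean(ACT), sd_ACT = sd(ACT)) %>%  # per-condition mean and sd
  ungroup() %>%
  ggplot(aes(x = time, y = mean_ACT, color = education)) +
  geom_line() +
  geom_pointrange(aes(ymin = mean_ACT - sd_ACT, ymax = mean_ACT + sd_ACT))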
t-test: for a simpler comparison of means between two groups in a sample (the t.test function)
Linear regression: comparing the effects of one or more predictors in a sample (the lm, mlm, or lmer functions)
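For instance, a quick sketch of the two-group comparison using the sex factor from earlier:
t.test(ACT ~ sex, data = sat.act) # Welch two-sample t-test by default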
Download the practice RMarkdown file and open it. We will be using the spi data set from the psych package for the practice questions.