Load and preprocess data

First, load the required packages for this walkthrough.

pkg_list <- c("tidyverse", "psych", "rcompanion", "knitr", "car", "afex", "ez",
              "ggfortify", "Hmisc", "emmeans", "jtools", "apaTables", "dplyr")
# purrr::walk(pkg_list, require, quietly = TRUE, character.only = TRUE)
pacman::p_load(pkg_list, character.only = TRUE)

Optional package

# devtools::install_github("ropensci/skimr")
library("skimr")

The data set used in this walkthrough is automatically available once psych is loaded. You can examine what each of the columns represents by looking at the data set’s help page with ?sat.act.

Next, just to get a snapshot of the data, we can use head() and str(). Notice that education is an integer column despite being a categorical variable (gender was one too, before we recoded it into the sex factor in the code below).

data("sat.act", package="psych")
sat.act <- sat.act %>%
  dplyr::rename(sex = gender) %>%
  mutate(sex = factor(sex, levels = c(1, 2), labels = c("male", "female")))

# Alternatively...
# source("R/load_sat_act.R")
head(sat.act)
##          sex education age ACT SATV SATQ
## 29442 female         3  19  24  500  500
## 29457 female         3  23  35  600  500
## 29498 female         3  20  21  480  470
## 29503   male         4  27  26  550  520
## 29504   male         2  33  31  600  550
## 29518   male         5  26  28  640  640
str(sat.act)
## 'data.frame':    700 obs. of  6 variables:
##  $ sex      : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
##  $ education: int  3 3 3 4 2 5 5 3 4 5 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...

Preprocessing integers into factors

Since some of the analyses we will be running require categorical variables, we first need to preprocess some of the integer columns into categories/factors.

labels <- c("none", "some_hs", "high_school",
            "some_college", "college", "graduate")

sat.act <- sat.act %>%
  mutate(.,
         education = factor(education, levels = 0:5, labels = labels)
         # education = case_when(education == 0 ~ "none",
         #                       education == 1 ~ "some_hs",
         #                       education == 2 ~ "high_school",
         #                       education == 3 ~ "some_college",
         #                       education == 4 ~ "college",
         #                       education == 5 ~ "graduate")
  )

Describe data

Before analysis, we can obtain descriptive statistics quickly by utilizing the describe() function in the psych package.

describe(sat.act) %>%
  dplyr::select(-vars, -trimmed, -mad, -range, -se)
##              n   mean     sd median min max  skew kurtosis
## sex*       700   1.65   0.48      2   1   2 -0.61    -1.62
## education* 700   4.16   1.43      4   1   6 -0.68    -0.07
## age        700  25.59   9.50     22  13  65  1.64     2.42
## ACT        700  28.55   4.82     29   3  36 -0.66     0.53
## SATV       700 612.23 112.90    620 200 800 -0.64     0.33
## SATQ       687 610.22 115.64    620 200 800 -0.59    -0.02

The describe() function also allows us to get descriptive statistics by a grouping variable using its partner function describeBy(). If we want descriptive statistics by sex, we simply pass that column name to the group argument.

describeBy(sat.act, group = c("sex"), mat = TRUE) %>%
  dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
##             group1   n   mean     sd median min max   skew kurtosis
## sex*1         male 247   1.00   0.00      1   1   1    NaN      NaN
## sex*2       female 453   2.00   0.00      2   2   2    NaN      NaN
## education*1   male 247   4.00   1.54      4   1   6 -0.542   -0.598
## education*2 female 453   4.26   1.35      4   1   6 -0.737    0.272
## age1          male 247  25.86   9.74     22  14  58  1.428    1.435
## age2        female 453  25.45   9.37     22  13  65  1.766    3.029
## ACT1          male 247  28.79   5.06     30   3  36 -1.057    1.888
## ACT2        female 453  28.42   4.69     29  15  36 -0.392   -0.416
## SATV1         male 247 615.11 114.16    630 200 800 -0.630    0.130
## SATV2       female 453 610.66 112.31    620 200 800 -0.651    0.423
## SATQ1         male 245 635.87 116.02    660 300 800 -0.722   -0.122
## SATQ2       female 442 596.00 113.07    600 200 800 -0.577    0.129

We can add multiple grouping variables in the same manner.

describeBy(sat.act, group = c("sex", "education"), mat = TRUE) %>%
  dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
##              group1       group2   n  mean     sd median min max     skew kurtosis
## sex*1          male         none  27   1.0   0.00    1.0   1   1      NaN      NaN
## sex*2        female         none  30   2.0   0.00    2.0   2   2      NaN      NaN
## sex*3          male      some_hs  20   1.0   0.00    1.0   1   1      NaN      NaN
## sex*4        female      some_hs  25   2.0   0.00    2.0   2   2      NaN      NaN
## sex*5          male  high_school  23   1.0   0.00    1.0   1   1      NaN      NaN
## sex*6        female  high_school  21   2.0   0.00    2.0   2   2      NaN      NaN
## sex*7          male some_college  80   1.0   0.00    1.0   1   1      NaN      NaN
## sex*8        female some_college 195   2.0   0.00    2.0   2   2      NaN      NaN
## sex*9          male      college  51   1.0   0.00    1.0   1   1      NaN      NaN
## sex*10       female      college  87   2.0   0.00    2.0   2   2      NaN      NaN
## sex*11         male     graduate  46   1.0   0.00    1.0   1   1      NaN      NaN
## sex*12       female     graduate  95   2.0   0.00    2.0   2   2      NaN      NaN
## education*1    male         none  27   1.0   0.00    1.0   1   1      NaN      NaN
## education*2  female         none  30   1.0   0.00    1.0   1   1      NaN      NaN
## education*3    male      some_hs  20   2.0   0.00    2.0   2   2      NaN      NaN
## education*4  female      some_hs  25   2.0   0.00    2.0   2   2      NaN      NaN
## education*5    male  high_school  23   3.0   0.00    3.0   3   3      NaN      NaN
## education*6  female  high_school  21   3.0   0.00    3.0   3   3      NaN      NaN
## education*7    male some_college  80   4.0   0.00    4.0   4   4      NaN      NaN
## education*8  female some_college 195   4.0   0.00    4.0   4   4      NaN      NaN
## education*9    male      college  51   5.0   0.00    5.0   5   5      NaN      NaN
## education*10 female      college  87   5.0   0.00    5.0   5   5      NaN      NaN
## education*11   male     graduate  46   6.0   0.00    6.0   6   6      NaN      NaN
## education*12 female     graduate  95   6.0   0.00    6.0   6   6      NaN      NaN
## age1           male         none  27  16.9   1.04   17.0  14  18 -0.86185  0.33852
## age2         female         none  30  17.0   1.07   17.0  13  18 -1.75193  4.12663
## age3           male      some_hs  20  19.6   6.12   18.0  17  45  3.54747 11.77598
## age4         female      some_hs  25  19.3   4.62   18.0  17  37  2.85534  7.26619
## age5           male  high_school  23  25.3   8.68   22.0  18  55  1.94253  3.63012
## age6         female  high_school  21  30.1  12.22   26.0  18  57  1.15981  0.00534
## age7           male some_college  80  20.8   3.06   20.0  17  34  1.99765  4.54881
## age8         female some_college 195  21.1   4.75   20.0  17  46  3.41196 12.83174
## age9           male      college  51  32.2   9.03   29.0  23  57  1.20022  0.63050
## age10        female      college  87  29.1   7.76   26.0  21  52  1.26002  0.69961
## age11          male     graduate  46  35.8  10.00   35.5  22  58  0.47241 -0.67127
## age12        female     graduate  95  34.3  10.67   30.0  22  65  1.17791  0.61024
## ACT1           male         none  27  29.0   5.00   29.0  20  36 -0.30135 -1.12511
## ACT2         female         none  30  26.1   5.06   26.0  15  36  0.07822 -0.55970
## ACT3           male      some_hs  20  26.7   7.11   28.0  15  35 -0.29718 -1.51093
## ACT4         female      some_hs  25  28.1   5.13   27.0  18  36 -0.21443 -0.78102
## ACT5           male  high_school  23  26.7   6.39   28.0   3  32 -2.13714  5.38880
## ACT6         female  high_school  21  27.3   5.23   28.0  15  36 -0.31605 -0.33928
## ACT7           male some_college  80  28.6   5.03   30.0  17  36 -0.45493 -0.91876
## ACT8         female some_college 195  28.2   4.78   29.0  16  36 -0.46005 -0.47245
## ACT9           male      college  51  28.9   4.42   29.0  16  36 -0.73812  0.11815
## ACT10        female      college  87  29.4   4.32   30.0  19  36 -0.27442 -0.66751
## ACT11          male     graduate  46  30.8   3.11   32.0  25  36 -0.38460 -0.81302
## ACT12        female     graduate  95  29.0   4.19   29.0  18  36 -0.30707 -0.72671
## SATV1          male         none  27 640.1 132.24  670.0 400 800 -0.28857 -1.39638
## SATV2        female         none  30 595.3 123.46  595.0 350 800 -0.08653 -0.81239
## SATV3          male      some_hs  20 603.0 141.24  600.0 300 780 -0.38530 -1.12127
## SATV4        female      some_hs  25 597.0 119.38  610.0 350 799 -0.31101 -0.95287
## SATV5          male  high_school  23 560.0 152.29  600.0 200 800 -0.53228 -0.59245
## SATV6        female  high_school  21 593.6 115.34  600.0 375 770 -0.44018 -0.91187
## SATV7          male some_college  80 617.4 111.79  630.0 300 800 -0.61837 -0.06488
## SATV8        female some_college 195 610.0 119.78  620.0 200 800 -0.80541  0.65541
## SATV9          male      college  51 620.3  81.72  620.0 430 800 -0.25952 -0.29384
## SATV10       female      college  87 615.0 106.62  620.0 300 800 -0.58308  0.28315
## SATV11         male     graduate  46 623.5  99.58  645.0 390 770 -0.61371 -0.43122
## SATV12       female     graduate  95 620.4  95.72  620.0 300 800 -0.45595  0.42801
## SATQ1          male         none  27 642.7 127.90  660.0 400 800 -0.23835 -1.36130
## SATQ2        female         none  29 599.7 123.20  600.0 333 800 -0.09034 -0.99079
## SATQ3          male      some_hs  19 625.8  95.87  650.0 400 765 -0.65838 -0.46543
## SATQ4        female      some_hs  24 592.5 140.83  625.0 230 799 -0.92682  0.20055
## SATQ5          male  high_school  23 569.1 160.65  600.0 300 800 -0.35873 -1.43839
## SATQ6        female  high_school  20 586.5 120.96  585.0 375 800  0.00849 -1.11439
## SATQ7          male some_college  79 642.6 118.28  680.0 300 800 -0.80809 -0.16871
## SATQ8        female some_college 190 590.9 114.46  600.0 200 800 -0.71909  0.38400
## SATQ9          male      college  51 635.9 104.12  640.0 400 800 -0.45941 -0.45068
## SATQ10       female      college  86 597.6 106.24  600.0 300 800 -0.71438  0.19853
## SATQ11         male     graduate  46 657.8  89.61  680.0 475 800 -0.45170 -0.77286
## SATQ12       female     graduate  93 606.7 105.55  600.0 350 800 -0.14002 -0.94239

Optional: While describe() is useful for quick glimpses of the data, it does not always play nicely with the tidyverse. If you want to stick with a more “pure” tidyverse approach to descriptive statistics, you can use skimr.

sat.act %>% 
  skimr::skim(.)
## Skim summary statistics
##  n obs: 700 
##  n variables: 6 
## 
## ── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────
##   variable missing complete   n n_unique                            top_counts ordered
##  education       0      700 700        6 som: 275, gra: 141, col: 138, non: 57   FALSE
##        sex       0      700 700        2             fem: 453, mal: 247, NA: 0   FALSE
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n   mean     sd  p0 p25 p50 p75 p100     hist
##       ACT       0      700 700  28.55   4.82   3  25  29  32   36 ▁▁▁▁▃▅▇▇
##       age       0      700 700  25.59   9.5   13  19  22  29   65 ▇▇▂▂▁▁▁▁
##      SATQ      13      687 700 610.22 115.64 200 530 620 700  800 ▁▁▁▅▃▇▇▅
##      SATV       0      700 700 612.23 112.9  200 550 620 700  800 ▁▁▁▃▃▇▅▃

While skim() doesn’t provide as much descriptive detail as describe(), it does provide the basics, along with a useful visual representation of the data.

Similarly, to obtain descriptives by grouping variables, you can use the group_by() function.

sat.act %>% 
  group_by(., sex) %>% 
  skimr::skim(.)
## Skim summary statistics
##  n obs: 700 
##  n variables: 6 
##  group variables: sex 
## 
## ── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────
##     sex  variable missing complete   n n_unique                          top_counts ordered
##    male education       0      247 247        6  som: 80, col: 51, gra: 46, non: 27   FALSE
##  female education       0      453 453        6 som: 195, gra: 95, col: 87, non: 30   FALSE
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
##     sex variable missing complete   n   mean     sd  p0 p25 p50   p75 p100     hist
##    male      ACT       0      247 247  28.79   5.06   3  25  30  32.5   36 ▁▁▁▁▂▃▆▇
##    male      age       0      247 247  25.86   9.74  14  19  22  30     58 ▇▇▃▂▁▁▁▁
##    male     SATQ       2      245 247 635.87 116.02 300 570 660 720    800 ▁▂▂▃▅▅▇▆
##    male     SATV       0      247 247 615.11 114.16 200 540 630 700    800 ▁▁▂▅▃▇▇▅
##  female      ACT       0      453 453  28.42   4.69  15  25  29  32     36 ▁▁▂▅▇▅▇▆
##  female      age       0      453 453  25.45   9.37  13  19  22  28     65 ▆▇▃▂▁▁▁▁
##  female     SATQ      11      442 453 596    113.07 200 500 600 683    800 ▁▁▁▅▃▇▆▃
##  female     SATV       0      453 453 610.66 112.31 200 550 620 700    800 ▁▁▁▃▃▇▅▃
sat.act %>% 
  group_by(., education) %>% 
  skimr::skim(.)
## Skim summary statistics
##  n obs: 700 
##  n variables: 6 
##  group variables: education 
## 
## ── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────
##     education variable missing complete   n n_unique               top_counts ordered
##          none      sex       0       57  57        2  fem: 30, mal: 27, NA: 0   FALSE
##       some_hs      sex       0       45  45        2  fem: 25, mal: 20, NA: 0   FALSE
##   high_school      sex       0       44  44        2  mal: 23, fem: 21, NA: 0   FALSE
##  some_college      sex       0      275 275        2 fem: 195, mal: 80, NA: 0   FALSE
##       college      sex       0      138 138        2  fem: 87, mal: 51, NA: 0   FALSE
##      graduate      sex       0      141 141        2  fem: 95, mal: 46, NA: 0   FALSE
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
##     education variable missing complete   n   mean     sd  p0   p25 p50    p75 p100     hist
##          none      ACT       0       57  57  27.47   5.21  15  24    28  31      36 ▁▂▅▃▇▃▃▅
##          none      age       0       57  57  16.95   1.04  13  17    17  18      18 ▁▁▁▁▃▁▇▆
##          none     SATQ       1       56  57 620.43 126.21 333 515   620 712.5   800 ▁▁▇▅▅▃▅▇
##          none     SATV       0       57  57 616.51 128.53 350 500   600 710     800 ▂▂▆▃▅▃▆▇
##       some_hs      ACT       0       45  45  27.49   6.06  15  24    27  33      36 ▂▅▁▅▇▂▅▇
##       some_hs      age       0       45  45  19.47   5.27  17  18    18  18      45 ▇▁▁▁▁▁▁▁
##       some_hs     SATQ       2       43  45 607.26 122.8  230 545   650 700     799 ▁▁▂▂▂▇▅▂
##       some_hs     SATV       0       45  45 599.67 128.05 300 500   600 700     799 ▂▂▃▅▇▃▇▇
##   high_school      ACT       0       44  44  26.98   5.81   3  25    28  31      36 ▁▁▁▁▃▇▇▅
##   high_school      age       0       44  44  27.57  10.68  18  20    25  28.75   57 ▇▆▁▂▁▁▁▂
##   high_school     SATQ       1       43  44 577.21 142.18 300 450   600 700     800 ▂▃▃▃▆▂▇▃
##   high_school     SATV       0       44  44 576.02 135.43 200 487.5 600 682.5   800 ▁▁▃▃▂▇▇▂
##  some_college      ACT       0      275 275  28.29   4.85  16  25    29  32      36 ▁▃▃▅▅▇▆▆
##  some_college      age       0      275 275  21.01   4.32  17  19    20  21      46 ▇▃▁▁▁▁▁▁
##  some_college     SATQ       6      269 275 606.07 117.76 200 525   620 700     800 ▁▁▁▅▃▇▇▅
##  some_college     SATV       0      275 275 612.13 117.36 200 540   625 700     800 ▁▁▁▃▃▇▆▅
##       college      ACT       0      138 138  29.26   4.35  16  27    30  32      36 ▁▂▂▃▆▇▆▆
##       college      age       0      138 138  30.24   8.36  21  24    28  34.75   57 ▇▅▂▂▁▁▁▁
##       college     SATQ       1      137 138 611.85 106.71 300 565   610 690     800 ▁▂▁▃▇▅▇▃
##       college     SATV       0      138 138 616.95  97.88 300 572.5 620 680     800 ▁▁▁▃▇▇▅▃
##      graduate      ACT       0      141 141  29.6    3.95  18  27    30  32      36 ▁▁▃▆▆▅▇▅
##      graduate      age       0      141 141  34.83  10.44  22  27    33  40      65 ▇▅▅▃▁▁▂▁
##      graduate     SATQ       2      139 141 623.63 103.09 350 535   640 700     800 ▁▂▆▅▆▆▇▆
##      graduate     SATV       0      141 141 621.4   96.65 300 570   630 700     800 ▁▁▂▅▇▇▆▃

Chi-squared

Suppose we want to see whether the number of males and females differs by education level. One way to accomplish this is to perform a chi-squared test of independence. In R, the chisq.test() function expects a contingency table in order to produce the appropriate statistic. A contingency table provides count data by groups.

edu_sex_table <- table(sat.act$education, sat.act$sex)
print(edu_sex_table)
##               
##                male female
##   none           27     30
##   some_hs        20     25
##   high_school    23     21
##   some_college   80    195
##   college        51     87
##   graduate       46     95

Next, we simply pass the contingency table to the chisq.test() function.

chi_test <- chisq.test(edu_sex_table)
chi_test
## 
##  Pearson's Chi-squared test
## 
## data:  edu_sex_table
## X-squared = 16, df = 5, p-value = 0.007

It appears that the two variables are not independent. We will examine the pairwise comparisons in a moment to get a better idea of which cells are driving this result.

We can get the observed, expected, and residuals from the saved object.

chi_test$observed
##               
##                male female
##   none           27     30
##   some_hs        20     25
##   high_school    23     21
##   some_college   80    195
##   college        51     87
##   graduate       46     95
chi_test$expected
##               
##                male female
##   none         20.1   36.9
##   some_hs      15.9   29.1
##   high_school  15.5   28.5
##   some_college 97.0  178.0
##   college      48.7   89.3
##   graduate     49.8   91.2
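The residuals are especially handy for seeing which cells depart most from independence. A self-contained sketch, rebuilding the same counts shown in the printed table above:

```r
# Rebuild the education-by-sex contingency table from the counts above
edu_sex <- matrix(c(27, 30,  20, 25,  23, 21,  80, 195,  51, 87,  46, 95),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("none", "some_hs", "high_school",
                                    "some_college", "college", "graduate"),
                                  c("male", "female")))
chi <- chisq.test(edu_sex)

# Pearson residuals: (observed - expected) / sqrt(expected)
round(chi$residuals, 2)

# Standardized residuals, adjusted for the row/column margins
round(chi$stdres, 2)
```

Cells with large positive residuals (e.g., female some_college) occur more often than independence predicts; cells with large negative residuals occur less often.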

We can also get the chi-squared statistic directly from the table using summary().

summary(edu_sex_table)
## Number of cases in table: 700 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 16, df = 5, p-value = 0.007

Visualizing Chi-Squared

Aside from simply printing a table, we can visualize frequencies with a mosaic plot. See the supplemental materials for an example.
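As a quick preview, base R’s mosaicplot() can draw one directly from a contingency table; with shade = TRUE, cells are colored by the size of their Pearson residuals. A self-contained sketch using the same counts as the table above:

```r
# Same counts as the printed contingency table above
edu_sex <- matrix(c(27, 30, 20, 25, 23, 21, 80, 195, 51, 87, 46, 95),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("none", "some_hs", "high_school",
                                    "some_college", "college", "graduate"),
                                  c("male", "female")))
# Tile areas are proportional to cell counts; shading flags large residuals
mosaicplot(edu_sex, shade = TRUE, main = "Education level by sex")
```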

Pairwise comparisons

If we want to compare each level of our contingency table against the others, we can compute pairwise comparisons using rcompanion.

rcompanion::pairwiseNominalIndependence(edu_sex_table,
                                        compare = "row", # or compare = "column"
                                        fisher = FALSE,
                                        gtest = FALSE,
                                        method = "holm")
##                    Comparison p.Chisq p.adj.Chisq
## 1              none : some_hs 0.92500      1.0000
## 2          none : high_school 0.77300      1.0000
## 3         none : some_college 0.01140      0.1600
## 4              none : college 0.23400      1.0000
## 5             none : graduate 0.07440      0.8180
## 6       some_hs : high_school 0.59800      1.0000
## 7      some_hs : some_college 0.05920      0.7100
## 8           some_hs : college 0.47200      1.0000
## 9          some_hs : graduate 0.20600      1.0000
## 10 high_school : some_college 0.00398      0.0597
## 11      high_school : college 0.10400      1.0000
## 12     high_school : graduate 0.02970      0.3860
## 13     some_college : college 0.13200      1.0000
## 14    some_college : graduate 0.52900      1.0000
## 15         college : graduate 0.52600      1.0000
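Under the hood, this amounts to running a chi-squared test on each pair of rows and then adjusting the p-values for multiple comparisons. A base-R sketch of the same idea (the exact p-values may differ slightly from the table above, e.g. because chisq.test() applies a continuity correction to 2x2 tables by default):

```r
# Same counts as the contingency table above
edu_sex <- matrix(c(27, 30, 20, 25, 23, 21, 80, 195, 51, 87, 46, 95),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("none", "some_hs", "high_school",
                                    "some_college", "college", "graduate"),
                                  c("male", "female")))

# All 15 pairs of education levels; chi-squared test on each 2x2 sub-table
pairs <- combn(rownames(edu_sex), 2)
p_raw <- apply(pairs, 2, function(rows) chisq.test(edu_sex[rows, ])$p.value)

# Holm-adjust the raw p-values, as rcompanion does with method = "holm"
data.frame(comparison = paste(pairs[1, ], ":", pairs[2, ]),
           p = round(p_raw, 4),
           p_holm = round(p.adjust(p_raw, method = "holm"), 4))
```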

Correlations

Moving beyond categorical variables, we next test the relationship between numeric variables using simple correlation. See the supplemental materials for more in-depth explanations.

Correlation is done using the cor() function. Suppose we want to see if the ACT scores increase with age.

# Covariance
# cov(sat.act$age, sat.act$ACT)

# Correlation
cor(sat.act$age, sat.act$ACT)
## [1] 0.111

A small correlation, but no test of significance. To obtain significance values, you must pass your data to cor.test(). Note that this can be done by passing x and y or using the formula method (which will be used for linear models).

# Default method
# cor.test(sat.act$age, sat.act$ACT)

# Formula method
cor.test(~ age + ACT, data = sat.act)
## 
##  Pearson's product-moment correlation
## 
## data:  age and ACT
## t = 3, df = 698, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0367 0.1831
## sample estimates:
##   cor 
## 0.111

Visualizing Correlations

To visualize this relationship, we can pass the raw data to ggplot and get a simple regression line using stat_smooth() with the lm method. See supplemental materials for an example.
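For reference, a minimal version of that plot might look like the following sketch (assuming ggplot2 is attached, as it is via tidyverse above):

```r
library(ggplot2)

data("sat.act", package = "psych")

# Scatterplot of ACT against age with an OLS regression line
p <- ggplot(sat.act, aes(x = age, y = ACT)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  stat_smooth(method = "lm", se = TRUE, color = "black") +
  labs(x = "Age", y = "ACT score")
print(p)
```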

Additional Correlation Functions

You can also pass a data frame of values to the cor() function to get a simple correlation matrix (à la SPSS).

cor(sat.act[c("age","ACT","SATV","SATQ")], use = "pairwise.complete.obs")
##          age   ACT    SATV    SATQ
## age   1.0000 0.111 -0.0424 -0.0339
## ACT   0.1105 1.000  0.5611  0.5871
## SATV -0.0424 0.561  1.0000  0.6443
## SATQ -0.0339 0.587  0.6443  1.0000
# Or, using tidyverse verbs...

# sat.act %>% 
#   select(., age, ACT, SATV, SATQ) %>% 
#   drop_na(., SATQ) %>% 
#   cor()

Or, for easier-on-the-eyes output, we can use a number of specialized functions.

Hmisc::rcorr(sat.act[c("age","ACT","SATV","SATQ")] %>% as.matrix(.))
##        age  ACT  SATV  SATQ
## age   1.00 0.11 -0.04 -0.03
## ACT   0.11 1.00  0.56  0.59
## SATV -0.04 0.56  1.00  0.64
## SATQ -0.03 0.59  0.64  1.00
## 
## n
##      age ACT SATV SATQ
## age  700 700  700  687
## ACT  700 700  700  687
## SATV 700 700  700  687
## SATQ 687 687  687  687
## 
## P
##      age    ACT    SATV   SATQ  
## age         0.0034 0.2631 0.3744
## ACT  0.0034        0.0000 0.0000
## SATV 0.2631 0.0000        0.0000
## SATQ 0.3744 0.0000 0.0000

Or we can go directly to APA-style tables. You can pass a filename argument to apa.cor.table() to save the table directly to file.

apaTables::apa.cor.table(sat.act[c("age","ACT","SATV","SATQ")])
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable M      SD     1           2          3         
##   1. age   25.59  9.50                                    
##                                                           
##   2. ACT   28.55  4.82   .11**                            
##                          [.04, .18]                       
##                                                           
##   3. SATV  612.23 112.90 -.04        .56**                
##                          [-.12, .03] [.51, .61]           
##                                                           
##   4. SATQ  610.22 115.64 -.03        .59**      .64**     
##                          [-.11, .04] [.54, .63] [.60, .69]
##                                                           
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
## 

Linear models

Introduction

The overall goal is to give you an introduction to conducting regression analyses or linear modeling in R.

Single-predictor (simple) regression

Let’s turn to ‘simple’ linear regression (one predictor, one outcome), then scale up to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the lm() function in R. The lm() function uses ordinary least squares (OLS), which estimates the parameters by minimizing the sum of squared residuals.
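To make the OLS idea concrete: the coefficients lm() returns are the solution to the normal equations, \(\hat{\beta} = (X'X)^{-1}X'y\). A quick sketch with simulated data (the variable names and true values here are made up for illustration):

```r
set.seed(1)
# Simulate a predictor and an outcome with a known intercept and slope
x <- rnorm(100, mean = 600, sd = 100)
y <- 200 + 0.65 * x + rnorm(100, sd = 80)

fit <- lm(y ~ x)

# Solve the normal equations by hand: beta = (X'X)^{-1} X'y
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The two estimates agree to machine precision
cbind(lm = coef(fit), normal_equations = as.vector(beta_hat))
```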

Let’s take a look at the simple case of the association between SAT math and verbal scores.

ggplot(sat.act, aes(x=SATV, y=SATQ)) + 
  geom_point(color='darkblue', size = 3) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
  labs(x="Verbal Score", y="Math Score")

This relationship looks quite linear. The higher the verbal score, the higher the math score.

In R regression models, we use the ~ operator to denote ‘regressed on’. It’s not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think verbal is predictive of math, we’d say ‘math regressed on verbal’ or ‘verbal predicts math.’

In formula terms, this is SATQ ~ SATV, which we pass as the first argument to lm().

lm_SAT <- lm(SATQ ~ SATV, data=sat.act)
summary(lm_SAT)
## 
## Call:
## lm(formula = SATQ ~ SATV, data = sat.act)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -302.1  -46.5    2.4   51.3  282.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 207.5253    18.5725    11.2   <2e-16 ***
## SATV          0.6576     0.0298    22.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.5 on 685 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.415,  Adjusted R-squared:  0.414 
## F-statistic:  486 on 1 and 685 DF,  p-value: <2e-16

The output contains the individual parameter estimates of the model (here, just the intercept and slope), their standard errors, t statistics, and p-values. We also get global information such as the residual standard error, the omnibus F test, and the coefficient of determination (\(R^2\)).

Regression diagnostics

The ggfortify package provides an autoplot() function that produces the standard regression diagnostics (residuals vs. fitted values, normal Q-Q, scale-location, and residuals vs. leverage) in a handy ggplot-based graph.

autoplot(lm_SAT)

Bootstrap estimates and confidence intervals

Using functionality from the car and boot packages, we can easily get estimates of the regression coefficients and their standard errors using nonparametric bootstrapping, which relaxes the normal-theory assumptions underlying the standard errors and, therefore, the significance tests.

sat.act.na <- na.omit(sat.act)
lm_SAT.na <- lm(SATQ ~ SATV, data=sat.act.na)
system.time(lm_SAT.boot <- Boot(lm_SAT.na, R=2000))
##    user  system elapsed 
##   1.744   0.085   1.828
summary(lm_SAT.boot, high.moments=TRUE)
## 
## Number of bootstrap replications R = 2000 
##             original  bootBias  bootSE bootMed bootSkew bootKurtosis
## (Intercept)  207.525 -0.280479 20.1033 206.684   0.0365        0.150
## SATV           0.658  0.000648  0.0316   0.659  -0.0399        0.148

We can use the object to obtain 95% bootstrapped confidence intervals using the ‘bias corrected, accelerated’ method (aka bca).

confint(lm_SAT.boot, level=.95, type="bca")
## Bootstrap bca confidence intervals
## 
##               2.5 %  97.5 %
## (Intercept) 169.714 248.015
## SATV          0.594   0.717

And we can easily compare the bootstrapped and standard OLS models:

hist(lm_SAT.boot, legend="separate")

Multiple regression

We can easily extend to larger regression models by adding terms to the right side of the formula. For example, in the sat.act dataset, we can examine the extent to which math scores (SATQ) are a function of verbal scores (SATV), age (age), and sex (sex; 1 = male, 2 = female).

asv_model <- lm(SATQ ~ age + sex + SATV, sat.act)
summary(asv_model)
## 
## Call:
## lm(formula = SATQ ~ age + sex + SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -290.34  -50.32    5.81   51.51  295.57 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 236.6596    21.2572   11.13  < 2e-16 ***
## age          -0.1267     0.3498   -0.36     0.72    
## sexfemale   -36.8580     6.9202   -5.33  1.4e-07 ***
## SATV          0.6541     0.0293   22.33  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.8 on 683 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.438,  Adjusted R-squared:  0.436 
## F-statistic:  178 on 3 and 683 DF,  p-value: <2e-16

Getting results into a tidy, useful format

Note that the broom package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?

We can use the glance function to get the global model statistics.

broom::glance(asv_model)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1     0.438         0.436  86.8      178. 3.52e-85     4 -4040. 8089. 8112. 5150990.         683

And the tidy() function yields the parameter table.

broom::tidy(asv_model)
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  237.      21.3       11.1   1.43e-26
## 2 age           -0.127    0.350     -0.362 7.17e- 1
## 3 sexfemale    -36.9      6.92      -5.33  1.36e- 7
## 4 SATV           0.654    0.0293    22.3   2.51e-83

Modeling interactions

We can use the * operator in R to ask that both the constituent variables and their interaction(s) be entered into the model. For example:

int_model <- lm(SATQ ~ sex*age*SATV, sat.act)
summary(int_model)
## 
## Call:
## lm(formula = SATQ ~ sex * age * SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.41  -50.48    4.05   52.30  291.29 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -23.01000   86.97955   -0.26    0.791    
## sexfemale          216.92703  108.13088    2.01    0.045 *  
## age                  9.54204    3.39075    2.81    0.005 ** 
## SATV                 1.02899    0.13875    7.42  3.6e-13 ***
## sexfemale:age       -8.86490    4.15128   -2.14    0.033 *  
## sexfemale:SATV      -0.33756    0.17252   -1.96    0.051 .  
## age:SATV            -0.01395    0.00546   -2.55    0.011 *  
## sexfemale:age:SATV   0.01155    0.00668    1.73    0.084 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.2 on 679 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.45,   Adjusted R-squared:  0.444 
## F-statistic: 79.4 on 7 and 679 DF,  p-value: <2e-16

This model includes the individual effects of sex, age, and SATV, as well as all of their interactions (sex:age, sex:SATV, age:SATV, and sex:age:SATV). This highlights that the asterisk operator * will compute all possible interactions among the specified predictors. For example, a*b*c*d will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., a:b). The downside of the colon operator is that it doesn’t guarantee that the corresponding lower-order effects are included, and including them is usually the sane default. As a reminder, you should essentially never include an interaction without its lower-order effects, because doing so can misassign the variance.
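You can check exactly which terms a formula expands to, without fitting anything, by inspecting it with terms():

```r
# The * operator expands to all main effects plus every interaction
attr(terms(~ sex * age * SATV), "term.labels")
# "sex" "age" "SATV" "sex:age" "sex:SATV" "age:SATV" "sex:age:SATV"

# ...whereas : adds only the single interaction term named
attr(terms(~ sex + age + sex:age), "term.labels")
# "sex" "age" "sex:age"
```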

What do we see? Males tend to have higher math scores and to maintain them with age, whereas females tend to have lower math scores that decline as they get older.

Contrasts in regression

(Some of the code and text here has been adapted from Russell Lenth’s excellent emmeans documentation: https://cran.r-project.org/web/packages/emmeans/)

One of the handiest packages in the R regression universe is emmeans, which provides ‘estimated marginal means’ (EMMs), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with multcomp to handle correction for multiple comparisons. See the longer documentation here.

Let’s look how age and verbal scores may interact to predict math scores.

# creating age groups for pairwise comparisons
sat.act$agef[sat.act$age < 25] <- 1
sat.act$agef[sat.act$age >= 25 & sat.act$age <= 50] <- 2
sat.act$agef[sat.act$age > 50] <- 3

# setting as factors
sat.act$agef <- as.factor(sat.act$agef)
sat.act$gender <- as.factor(sat.act$sex) # sex is already a factor; this copies it to the gender column used below
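The three assignments above can also be collapsed into a single cut() call. Here is an equivalent sketch, with breaks chosen to reproduce the manual grouping (this works because age is integer-valued):

```r
# (-Inf,24] -> "1", (24,50] -> "2", (50,Inf] -> "3"
# matches age < 25, 25 <= age <= 50, and age > 50 for integer ages
sat.act$agef <- cut(sat.act$age,
                    breaks = c(-Inf, 24, 50, Inf),
                    labels = c("1", "2", "3"))
```

A bonus of cut() is that it returns a factor directly, so the as.factor() step is unnecessary.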

# running the model
sat.lm <- lm(SATQ ~ agef + SATV, data = sat.act)
summary(sat.lm)
## 
## Call:
## lm(formula = SATQ ~ agef + SATV, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -301.62  -45.92    2.07   51.81  283.47 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 206.9031    18.8894   10.95   <2e-16 ***
## agef2        -0.0946     7.1350   -0.01     0.99    
## agef3        15.5437    18.9726    0.82     0.41    
## SATV          0.6579     0.0299   22.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.6 on 683 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.416,  Adjusted R-squared:  0.413 
## F-statistic:  162 on 3 and 683 DF,  p-value: <2e-16

This output is hard to read because there are many dummy codes and we have to infer the reference condition for each factor (usually the first level alphabetically). Also, we do not have an intuitive sense of the expected mean in each condition because it depends on the sum of the intercept and the specific dummy code for the condition of interest, averaging over the other factors.
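If you want to see exactly how R dummy-coded a factor, contrasts() returns the coding matrix and levels() reveals the reference level (the first one, which is absorbed into the intercept):

```r
contrasts(sat.act$agef)  # treatment coding: rows are levels, columns are dummy variables
levels(sat.act$agef)[1]  # the reference level
```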

We can obtain the expected means for each condition.

Expected means for age group

sat.emm.s <- emmeans(sat.lm, "agef")
print(sat.emm.s)
##  agef emmean    SE  df lower.CL upper.CL
##  1       610  4.32 683      601      618
##  2       610  5.67 683      598      621
##  3       625 18.47 683      589      662
## 
## Confidence level used: 0.95

Expected means for verbal scores

sat.emm.p <- emmeans(sat.lm, "SATV")
print(sat.emm.p)
##  SATV emmean   SE  df lower.CL upper.CL
##   612    615 4.32 683      606      623
## 
## Results are averaged over the levels of: agef 
## Confidence level used: 0.95

Means in each cell of the factorial design

print(emmeans(sat.lm, ~agef*SATV))
##  agef SATV emmean    SE  df lower.CL upper.CL
##  1     612    610  4.32 683      601      618
##  2     612    610  5.67 683      598      621
##  3     612    625 18.47 683      589      662
## 
## Confidence level used: 0.95

Pairwise comparisons

If we wanted to compare the pairwise differences in the effect of age group on math scores while controlling for verbal scores (and potentially other variables we add to the model), we could use the pairs function:

sat_pairs <- pairs(sat.emm.s)
print(sat_pairs)
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2        0.09  7.13 683  0.013  1.0000 
##  1 - 3      -15.54 18.97 683 -0.819  0.6910 
##  2 - 3      -15.64 19.32 683 -0.809  0.6970 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Note that you can get a sense of the contrasts being tested by emmeans by examining the @linfct slot of the object. I’ve learned a lot by examining these contrast matrices and thinking about how to set up a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey’s HSD method).
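If Tukey’s method is not what you want, the summary method for emmeans objects accepts an adjust argument with the usual alternatives (e.g., "bonferroni", "holm", or "none"):

```r
summary(sat_pairs, adjust = "bonferroni")  # more conservative family-wise correction
summary(sat_pairs, adjust = "none")        # unadjusted p-values
```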

Contrast matrix for the predicted mean math score in each age group, holding verbal score at its sample mean:

sat.emm.s@linfct
##      (Intercept) agef2 agef3 SATV
## [1,]           1     0     0  612
## [2,]           1     1     0  612
## [3,]           1     0     1  612

And what contrasts underlie the pairwise age group comparisons?

sat_pairs@linfct
##      (Intercept) agef2 agef3 SATV
## [1,]           0    -1     0    0
## [2,]           0     0    -1    0
## [3,]           0     1    -1    0

Pairwise differences and simple slopes in regression

Consider a model in which we examine the association between SATQ and SATV across age. Here, we regress SATQ on SATV, age group (three levels), and their interaction.

fit_sat <- lm(SATQ ~ SATV*agef, data = sat.act)
summary(fit_sat)
## 
## Call:
## lm(formula = SATQ ~ SATV * agef, data = sat.act)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -302.51  -50.21    2.09   54.61  256.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 178.8432    22.1853    8.06  3.4e-15 ***
## SATV          0.7034     0.0354   19.90  < 2e-16 ***
## agef2       109.8710    42.6641    2.58   0.0102 *  
## agef3        17.2861    95.7898    0.18   0.8568    
## SATV:agef2   -0.1804     0.0690   -2.61   0.0091 ** 
## SATV:agef3   -0.0022     0.1547   -0.01   0.9887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88.3 on 681 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.422,  Adjusted R-squared:  0.417 
## F-statistic: 99.3 on 5 and 681 DF,  p-value: <2e-16
car::Anova(fit_sat, type="III") #overall effects of predictors in the model
## Anova Table (Type III tests)
## 
## Response: SATQ
##              Sum Sq  Df F value  Pr(>F)    
## (Intercept)  506336   1   64.99 3.4e-15 ***
## SATV        3084321   1  395.85 < 2e-16 ***
## agef          51806   2    3.32   0.037 *  
## SATV:agef     53928   2    3.46   0.032 *  
## Residuals   5306050 681                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that this yields a categorical (age group) x continuous (verbal scores) interaction. The output from car::Anova indicates that the interaction is significant, but we need more detailed guidance on how the slope for verbal scores is moderated by age group.

In a simple slopes test, we might wish to know whether the slope for SATV is non-zero in each age group individually. Let’s start by getting the estimated marginal means for each age group.

emmeans(fit_sat, ~agef)
## NOTE: Results may be misleading due to involvement in interactions
##  agef emmean    SE  df lower.CL upper.CL
##  1       610  4.31 681      601      618
##  2       609  5.66 681      598      620
##  3       626 18.43 681      589      662
## 
## Confidence level used: 0.95

And pairwise differences between age groups:

pairs(emmeans(fit_sat, ~agef))
## NOTE: Results may be misleading due to involvement in interactions
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2        0.62  7.11 681  0.087  0.9960 
##  1 - 3      -15.94 18.92 681 -0.842  0.6770 
##  2 - 3      -16.56 19.28 681 -0.859  0.6660 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Turning to SATV, because we are interested in its linear effect (slope), we use the emtrends function to estimate the slope in each age group individually. In terms of simple slopes, we test whether the SATV slope is non-zero within each age group. The infer argument in the summary of emtrends requests t-tests and p-values for the slopes.

summary(emtrends(model = fit_sat, ~agef, var="SATV"), infer=TRUE)
##  agef SATV.trend     SE  df lower.CL upper.CL t.ratio p.value
##  1         0.703 0.0354 681    0.634    0.773 19.900  <.0001 
##  2         0.523 0.0593 681    0.407    0.639  8.820  <.0001 
##  3         0.701 0.1506 681    0.406    0.997  4.660  <.0001 
## 
## Confidence level used: 0.95

Finally, we could examine pairwise differences between slopes among age groups.

pairs(emtrends(model = fit_sat, ~agef, var="SATV"))
##  contrast estimate    SE  df t.ratio p.value
##  1 - 2      0.1804 0.069 681  2.614  0.0250 
##  1 - 3      0.0022 0.155 681  0.014  1.0000 
##  2 - 3     -0.1782 0.162 681 -1.101  0.5140 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.

ANOVA

Sometimes we want to test the relationship between a categorical variable and a numeric variable. One way to do this is analysis of variance (ANOVA), which provides a statistical test of whether two or more population means are equal. Here, ANOVA is run with the aov_ez function from the afex package.

Suppose we want to see whether ACT scores differ by education level and gender.
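For comparison, the same two-way design can be sketched with base R’s aov. One caveat: aov reports Type I (sequential) sums of squares, whereas afex::aov_ez defaults to Type III, so results can differ with unbalanced data like these:

```r
# Base-R sketch (Type I SS; the order of terms matters)
base_fit <- aov(ACT ~ education * gender, data = sat.act)
summary(base_fit)
```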

Between-subjects factorial ANOVA

A two-way between-subjects ANOVA is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. First, let’s see the summary of each condition.

str(sat.act)
## 'data.frame':    700 obs. of  8 variables:
##  $ sex      : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
##  $ education: Factor w/ 6 levels "none","some_hs",..: 4 4 4 5 3 6 6 4 5 6 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...
##  $ agef     : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
##  $ gender   : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
describeBy(sat.act$ACT,list(sat.act$education,sat.act$gender), mat=TRUE,digits=2) %>% 
  dplyr::select(-vars, -item, -trimmed, -mad, -range, -se)
##            group1 group2   n mean   sd median min max  skew kurtosis
## X11          none   male  27 29.0 5.00     29  20  36 -0.30    -1.13
## X12       some_hs   male  20 26.7 7.11     28  15  35 -0.30    -1.51
## X13   high_school   male  23 26.6 6.39     28   3  32 -2.14     5.39
## X14  some_college   male  80 28.6 5.03     30  17  36 -0.45    -0.92
## X15       college   male  51 28.9 4.42     29  16  36 -0.74     0.12
## X16      graduate   male  46 30.8 3.11     32  25  36 -0.38    -0.81
## X17          none female  30 26.1 5.06     26  15  36  0.08    -0.56
## X18       some_hs female  25 28.1 5.13     27  18  36 -0.21    -0.78
## X19   high_school female  21 27.3 5.23     28  15  36 -0.32    -0.34
## X110 some_college female 195 28.2 4.78     29  16  36 -0.46    -0.47
## X111      college female  87 29.4 4.32     30  19  36 -0.27    -0.67
## X112     graduate female  95 29.0 4.19     29  18  36 -0.31    -0.73
sat.act$id<-1:nrow(sat.act)
sat.act$education<-factor(sat.act$education)
# Compute the analysis of variance
res.aov <- aov_ez(id="id", dv="ACT", data = sat.act, between=c("education","gender"))
## Contrasts set to contr.sum for the following variables: education, gender
# Summary of the analysis
print(res.aov)
## Anova Table (Type 3 tests)
## 
## Response: ACT
##             Effect     df   MSE        F  ges p.value
## 1        education 5, 688 22.56 4.53 ***  .03   .0004
## 2           gender 1, 688 22.56     0.87 .001     .35
## 3 education:gender 5, 688 22.56   2.04 +  .01     .07
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

We can conclude that the effect of education is statistically significant: ACT scores tend to increase with education. There is no significant main effect of gender, and no significant interaction between education and gender.

Pairwise comparison among education levels

ACT scores generally increase with education. If we want to compare each level in education factor with the other, we can compute those with pairwise comparisons using emmeans.

sat.emm.aov <- emmeans(res.aov, "education")
## NOTE: Results may be misleading due to involvement in interactions
sat_pairs_aov <- pairs(sat.emm.aov)
print(sat_pairs_aov)
##  contrast                   estimate    SE  df t.ratio p.value
##  none - some_hs                0.142 0.951 688  0.150  1.0000 
##  none - high_school            0.559 0.954 688  0.590  0.9920 
##  none - some_college          -0.822 0.705 688 -1.170  0.8530 
##  none - college               -1.643 0.757 688 -2.170  0.2520 
##  none - graduate              -2.366 0.761 688 -3.110  0.0240 
##  some_hs - high_school         0.417 1.011 688  0.410  0.9980 
##  some_hs - some_college       -0.964 0.779 688 -1.240  0.8190 
##  some_hs - college            -1.785 0.826 688 -2.160  0.2580 
##  some_hs - graduate           -2.508 0.830 688 -3.020  0.0310 
##  high_school - some_college   -1.381 0.783 688 -1.760  0.4900 
##  high_school - college        -2.202 0.830 688 -2.650  0.0860 
##  high_school - graduate       -2.926 0.834 688 -3.510  0.0060 
##  some_college - college       -0.821 0.524 688 -1.570  0.6210 
##  some_college - graduate      -1.545 0.530 688 -2.910  0.0430 
##  college - graduate           -0.724 0.598 688 -1.210  0.8320 
## 
## Results are averaged over the levels of: gender 
## P value adjustment: tukey method for comparing a family of 6 estimates

Repeated measures ANOVA

Repeated measures ANOVA is commonly used when each participant provides more than one measurement — for example, completing multiple tasks or conditions, or being measured at several time points in a longitudinal study. To illustrate, we introduce a simulated time point variable into the data and then check whether ACT scores vary by time point and education.

set.seed(1999) # Needed for getting same results

sat.act <- sat.act %>% 
  mutate(education = factor(rep(sample(1:5, 140, replace = TRUE), each=5)),
         id = rep(1:140, each = 5),
         time = rep(1:5, 140)
         )

res.aov2 <- aov_ez(
  id = "id",
  dv = "ACT",
  data = sat.act,
  between = c("education"),
  within = c("time")
)
## Contrasts set to contr.sum for the following variables: education
summary(res.aov2)
## 
## Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
## 
##                Sum Sq num Df Error SS den Df  F value Pr(>F)    
## (Intercept)    531504      1     3619    135 19829.43 <2e-16 ***
## education         154      4     3619    135     1.44  0.225    
## time              194      4    11905    540     2.20  0.067 .  
## education:time    387     16    11905    540     1.10  0.355    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Mauchly Tests for Sphericity
## 
##                Test statistic p-value
## time                    0.874  0.0362
## education:time          0.874  0.0362
## 
## 
## Greenhouse-Geisser and Huynh-Feldt Corrections
##  for Departure from Sphericity
## 
##                GG eps Pr(>F[GG])  
## time            0.936      0.072 .
## education:time  0.936      0.356  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                HF eps Pr(>F[HF])
## time            0.966     0.0698
## education:time  0.966     0.3556

Visualizing ANOVA

Aside from simply printing a table, we can visualize the mean and sd of each conditions with a line plot. See the supplemental materials for an example.
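A minimal sketch of such a plot with ggplot2, assuming the sat.act columns used above (error bars show ±1 SD):

```r
library(dplyr)
library(ggplot2)

sat.act %>%
  group_by(education, gender) %>%
  summarise(mean_ACT = mean(ACT), sd_ACT = sd(ACT), .groups = "drop") %>%
  ggplot(aes(x = education, y = mean_ACT, colour = gender, group = gender)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = mean_ACT - sd_ACT, ymax = mean_ACT + sd_ACT),
                width = 0.2)
```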

Alternative approaches

t-test: for a simpler comparison of means between two groups in a sample (t.test function)
Linear regression: for comparing one or more factors in a sample (lm, mlm, or lmer functions)
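For instance, a sketch of both alternatives on this dataset (the covariate in the second model is illustrative):

```r
# Welch two-sample t-test comparing ACT by gender
t.test(ACT ~ gender, data = sat.act)

# The same comparison, plus a covariate, as a linear model
summary(lm(ACT ~ gender + education, data = sat.act))
```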

Hands on practice: Data Analyses

Download the practice RMarkdown file and open it. We will be using the spi data set from the psych package for the practice questions.