1 Introduction

The overall goal is to give you a very quick introduction to conducting correlation and regression analyses in R.

2 Correlation

The Pearson product-moment correlation measures the linear association between two variables, \(x\) and \(y\), on a standardized scale ranging from \(r = -1\) to \(r = 1\).

The correlation of \(x\) and \(y\) is the covariance that has been standardized by the standard deviations of \(x\) and \(y\). This yields a scale-insensitive measure of the linear association of \(x\) and \(y\). For much more conceptual detail, see this: https://psu-psychology.github.io/psy-597-SEM/01_correlation_regression/01_Correlation_and_Regression.html.

\[r_{XY}= \frac{s_{XY}}{s_{X} s_{Y}}\]
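
As a quick sanity check, we can reproduce this formula by hand. This is just a minimal sketch using two of the mtcars variables analyzed below; both lines return the same value (-0.591, as in the matrix in the next section).

#correlation as a standardized covariance, computed by hand
with(mtcars, cov(qsec, cyl) / (sd(qsec) * sd(cyl)))
with(mtcars, cor(qsec, cyl)) #same value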

2.1 Correlation matrix

to_correlate <- mtcars %>% dplyr::select(qsec, cyl, disp, hp)
cor(to_correlate)
##        qsec    cyl   disp     hp
## qsec  1.000 -0.591 -0.434 -0.708
## cyl  -0.591  1.000  0.902  0.832
## disp -0.434  0.902  1.000  0.791
## hp   -0.708  0.832  0.791  1.000

2.2 Testing a bivariate association

Recall that the significance of a correlation is tested on \(n - 2\) degrees of freedom.

The t-test is:

\[ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \]
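
To make the formula concrete, here is a minimal sketch that computes the t statistic and its two-tailed p-value by hand for the qsec/cyl pair; it should match the cor.test() output below.

r <- cor(mtcars$qsec, mtcars$cyl)
n <- nrow(mtcars) #32 cars
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
c(t = t_stat, p = p_val)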

cor.test(mtcars$qsec, mtcars$cyl)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$qsec and mtcars$cyl
## t = -4, df = 30, p-value = 4e-04
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.779 -0.306
## sample estimates:
##    cor 
## -0.591

Note that we can use the conf.level argument to cor.test to get different levels of confidence.

cor.test(mtcars$qsec, mtcars$cyl, conf.level = 0.9)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$qsec and mtcars$cyl
## t = -4, df = 30, p-value = 4e-04
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  -0.755 -0.358
## sample estimates:
##    cor 
## -0.591

2.3 Visualizing the association

ggplot(to_correlate, aes(x=cyl, y=qsec)) + geom_jitter(width=0.1) + stat_smooth(method="lm", se=FALSE)

2.4 Testing the significance of all correlations in the matrix

Hmisc::rcorr(to_correlate %>% as.matrix())
##       qsec   cyl  disp    hp
## qsec  1.00 -0.59 -0.43 -0.71
## cyl  -0.59  1.00  0.90  0.83
## disp -0.43  0.90  1.00  0.79
## hp   -0.71  0.83  0.79  1.00
## 
## n= 32 
## 
## 
## P
##      qsec   cyl    disp   hp    
## qsec        0.0004 0.0131 0.0000
## cyl  0.0004        0.0000 0.0000
## disp 0.0131 0.0000        0.0000
## hp   0.0000 0.0000 0.0000

Notice that we now get a matrix of p-values, too…

2.5 Pretty output

stargazer(cor(to_correlate), type = "html")
       qsec    cyl   disp     hp
qsec      1 -0.591 -0.434 -0.708
cyl  -0.591      1  0.902  0.832
disp -0.434  0.902      1  0.791
hp   -0.708  0.832  0.791      1
#you can use the filename argument to write out the table as a Word doc!
apaTables::apa.cor.table(to_correlate)
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable M      SD     1            2          3         
##   1. qsec  17.85  1.79                                     
##                                                            
##   2. cyl   6.19   1.79   -.59**                            
##                          [-.78, -.31]                      
##                                                            
##   3. disp  230.72 123.94 -.43*        .90**                
##                          [-.68, -.10] [.81, .95]           
##                                                            
##   4. hp    146.69 68.56  -.71**       .83**      .79**     
##                          [-.85, -.48] [.68, .92] [.61, .89]
##                                                            
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
## 

2.6 Keeping the correlations for further analysis

Here, we store all details of the bivariate correlation test as an R object ctest.

ctest <- cor.test(mtcars$qsec, mtcars$cyl)

Let’s poke under the hood:

str(ctest)
## List of 9
##  $ statistic  : Named num -4.02
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named int 30
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.000366
##  $ estimate   : Named num -0.591
##   ..- attr(*, "names")= chr "cor"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "correlation"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Pearson's product-moment correlation"
##  $ data.name  : chr "mtcars$qsec and mtcars$cyl"
##  $ conf.int   : num [1:2] -0.779 -0.306
##   ..- attr(*, "conf.level")= num 0.95
##  - attr(*, "class")= chr "htest"

So, we can poke around and grab specific things:

ctest$p.value
## [1] 0.000366
ctest$estimate
##    cor 
## -0.591

And there are useful helper packages, especially broom, that will help you work with statistics objects as data.frame objects.

broom::glance(ctest)
estimate  statistic  p.value  parameter  conf.low  conf.high  method                                alternative
  -0.591      -4.01        0         30    -0.779     -0.306  Pearson's product-moment correlation  two.sided

2.7 Correlation method

You can use a different correlation method (e.g., Spearman) using the method argument:

cor.test(mtcars$qsec, mtcars$cyl, method = "spearman")
## Warning in cor.test.default(mtcars$qsec, mtcars$cyl, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  mtcars$qsec and mtcars$cyl
## S = 9000, p-value = 6e-04
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## -0.572
cor(to_correlate, method = "spearman")
##        qsec    cyl   disp     hp
## qsec  1.000 -0.572 -0.460 -0.667
## cyl  -0.572  1.000  0.928  0.902
## disp -0.460  0.928  1.000  0.851
## hp   -0.667  0.902  0.851  1.000

2.8 Missing data

By default, cor will return NA (missing) for every pair of variables in which at least one observation is missing. Alternatively, we can ask for each correlation to be estimated on the cases that are complete for that particular pair of variables. This is use="pairwise.complete.obs".

Here’s the difference (I introduced some missing data to make the point):

First, with ‘everything’ as the use argument (any missing value on a variable makes all of its correlations NA):

to_correlate_miss <- to_correlate
to_correlate_miss$qsec[c(1, 5)] <- NA
cor(to_correlate_miss) #implicitly use="everything"
##      qsec   cyl  disp    hp
## qsec    1    NA    NA    NA
## cyl    NA 1.000 0.902 0.832
## disp   NA 0.902 1.000 0.791
## hp     NA 0.832 0.791 1.000

Now with pairwise complete calculation:

cor(to_correlate_miss, use="pairwise.complete.obs")
##        qsec    cyl   disp     hp
## qsec  1.000 -0.596 -0.448 -0.731
## cyl  -0.596  1.000  0.902  0.832
## disp -0.448  0.902  1.000  0.791
## hp   -0.731  0.832  0.791  1.000
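
If you would rather drop every case with a missing value on any variable before computing all of the correlations (listwise deletion), you can instead pass use="complete.obs"; a minimal sketch:

cor(to_correlate_miss, use="complete.obs") #listwise deletion: cases 1 and 5 are dropped throughout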

3 Single-predictor (simple) regression

Next, let’s turn to ‘simple’ linear regression (one predictor, one outcome), then scale to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the lm function in R. The lm function uses ordinary least squares (OLS), which estimates the parameters by minimizing the sum of squared residuals.

In simple regression, we are interested in a relationship of the form:

\[ Y = B_0 + B_1 X \]

where \(Y\) is the dependent variable (criterion) and \(X\) is the predictor (covariate). The intercept is represented by \(B_0\) and the slope for the \(X\) predictor by \(B_1\).

Let’s take a look at the simple case of stopping distance (braking) as a function of car speed.

ggplot(cars, aes(x=speed, y=dist)) + 
  geom_point(color='darkblue', size = 3) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
  labs(x="Speed (mph)", y="Stopping distance (ft)")

When conducting regression, we typically try to capture linear relationships among variables. We can introduce higher-order polynomial terms (e.g., quadratic models) or splines (more flexible shapes), but these are beyond the scope here.

Fortunately, this relationship looks quite linear! The faster the car, the longer it takes to brake.

In R regression models, we use the ~ operator to denote ‘regressed on’. It’s not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think speed is a key cause of stopping distance, we’d say ‘braking distance regressed on speed’ or ‘speed predicts braking distance.’

In formula terms, this is dist ~ speed, which we pass as the first argument to lm().

lm_cars <- lm(dist ~ speed, data=cars)
summary(lm_cars)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29.07  -9.53  -2.27   9.21  43.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.579      6.758   -2.60    0.012 *  
## speed          3.932      0.416    9.46  1.5e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared:  0.651,  Adjusted R-squared:  0.644 
## F-statistic: 89.6 on 1 and 48 DF,  p-value: 1.49e-12

The output contains the individual parameter estimates of the model (here, just the intercept and slope), their standard errors, t statistics, and p-values. We also get global information such as the residual standard error and the coefficient of determination (\(R^2\)).
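
The pieces described above can also be pulled out of the fitted model object directly; a minimal sketch:

coef(lm_cars) #intercept and slope
confint(lm_cars) #95% confidence intervals for the coefficients
summary(lm_cars)$r.squared #coefficient of determination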

3.1 Regression diagnostics

We can also get useful diagnostic plots for free using the plot() function:

plot(lm_cars)

The ggfortify package also provides an autoplot function that gives similar diagnostics within a handy ggplot-based graph.

autoplot(lm_cars)

3.2 Bootstrap estimates and confidence intervals

Using functionality from the car and boot packages, we can easily get estimates of the regression coefficients and standard errors using nonparametric bootstrapping, which relaxes the normal-theory assumptions underlying the standard errors and, therefore, the significance tests. In particular, the bootstrap does not assume normally distributed errors.

Nonparametric bootstrapping approximates the sampling distribution for a statistic of interest (e.g., a slope) by resampling the existing data with replacement many times and examining the resulting density.
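
The Boot() function from car (used below) handles this for us, but the resampling logic can be sketched by hand: resample rows of the data with replacement, refit the model, and collect the slope each time. This is only an illustration of the idea, not a replacement for Boot().

set.seed(123)
boot_slopes <- replicate(2000, {
  idx <- sample(nrow(cars), replace = TRUE) #resample cases with replacement
  coef(lm(dist ~ speed, data = cars[idx, ]))["speed"] #refit and keep the slope
})
sd(boot_slopes) #bootstrap standard error of the slope
quantile(boot_slopes, c(.025, .975)) #simple percentile interval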

system.time(lm_cars.boot <- Boot(lm_cars, R=2000))
## Loading required namespace: boot
##    user  system elapsed 
##   1.356   0.011   1.375
summary(lm_cars.boot, high.moments=TRUE)
                 R  original  bootBias  bootSE  bootMed  bootSkew  bootKurtosis
(Intercept)   2000    -17.58     0.213   5.964    -17.3    -0.208         0.240
speed         2000      3.93    -0.016   0.422      3.9     0.133         0.201

We can use the object to obtain 95% bootstrapped confidence intervals using the ‘bias corrected, accelerated’ method (aka bca).

confint(lm_cars.boot, level=.95, type="bca")
## Bootstrap quantiles, type =  bca 
## 
##             2.5 % 97.5 %
## (Intercept) -31.6  -7.34
## speed         3.2   4.93

And we can easily compare the bootstrapped and standard OLS models:

hist(lm_cars.boot, legend="separate")

Notice that the speed regression coefficient is slightly positively skewed. Additional details are provided in John Fox’s useful book “An R Companion to Applied Regression” (2nd edition): https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Bootstrapping.pdf.

4 Multiple regression

We can easily extend to larger regression models by adding terms to the right side of the formula. For example, in the mtcars dataset (car performance statistics from 1974 Motor Trend), we could examine the extent to which the gas mileage (mpg) is a function of both gross horsepower (hp) and transmission (am, where 0 is ‘automatic’ and 1 is ‘manual’).

mpg_model <- lm(mpg ~ hp + am, mtcars)
summary(mpg_model)
## 
## Call:
## lm(formula = mpg ~ hp + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.384 -2.264  0.137  1.697  5.866 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.58491    1.42509   18.65  < 2e-16 ***
## hp          -0.05889    0.00786   -7.50  2.9e-08 ***
## am           5.27709    1.07954    4.89  3.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.91 on 29 degrees of freedom
## Multiple R-squared:  0.782,  Adjusted R-squared:  0.767 
## F-statistic:   52 on 2 and 29 DF,  p-value: 2.55e-10

It appears that both are strong predictors of gas mileage. We could examine the relationship graphically.

4.1 Visualizing regression data

ggplot(mtcars, aes(x=hp, y=mpg, color=factor(am))) + geom_point() + stat_smooth(method=lm, se=FALSE)

There doesn’t appear to be any meaningful interaction between horsepower and transmission type.

4.2 Getting results into a tidy, useful format

Note that the broom package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?

We can use the glance function to get the global model statistics.

broom::glance(mpg_model)
       r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC  BIC  deviance  df.residual
value      0.782          0.767   2.91         52        0   3     -78  164  170       245           29

And the tidy function yields the parameter table:

broom::tidy(mpg_model)
term         estimate  std.error  statistic  p.value
(Intercept)    26.585      1.425      18.66        0
hp             -0.059      0.008      -7.50        0
am              5.277      1.080       4.89        0

As you can imagine (and saw earlier in the functional programming overview), the ability to extract regression statistics into a tidy data.frame is a boon to scaling your analyses to multiple models and datasets.
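
For instance, here is a minimal sketch (a hypothetical example, not part of the analyses above, and assuming a reasonably recent dplyr) that fits the same model separately by transmission type and stacks the tidied coefficients into a single data.frame:

library(dplyr)
library(broom)
mtcars %>%
  group_by(am) %>%
  group_modify(~ tidy(lm(mpg ~ hp, data = .x))) %>% #fit and tidy within each group
  ungroup()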

4.3 Modeling interactions

We can use the * operator in R to ask that both the constituent variables and their interaction(s) are entered into the model. For example:

int_model <- lm(mpg ~ hp*wt + am, mtcars)
summary(int_model)
## 
## Call:
## lm(formula = mpg ~ hp * wt + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.985 -1.658 -0.741  1.436  4.527 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.45224    5.28073    9.36  5.7e-10 ***
## hp          -0.11930    0.02655   -4.49  0.00012 ***
## wt          -8.10056    1.78933   -4.53  0.00011 ***
## am           0.12511    1.33343    0.09  0.92594    
## hp:wt        0.02749    0.00847    3.24  0.00313 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.19 on 27 degrees of freedom
## Multiple R-squared:  0.885,  Adjusted R-squared:  0.868 
## F-statistic: 51.8 on 4 and 27 DF,  p-value: 2.76e-12

This model includes individual effects of horsepower (hp) and weight (wt), as well as their interaction (hp:wt). This highlights that the asterisk operator * will compute all possible interactions among the specified predictors. For example, a*b*c*d will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., a:b). The downside of the colon operator is that it doesn’t guarantee that the corresponding lower-order effects are included (including them is usually the sane default). As a reminder, you should essentially never include an interaction without also including the lower-order effects, because this can misassign the variance.
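
To make the distinction concrete, here is a minimal sketch showing that the * formulation expands to the main effects plus the interaction:

m1 <- lm(mpg ~ hp * wt, data = mtcars) #expands to hp + wt + hp:wt
m2 <- lm(mpg ~ hp + wt + hp:wt, data = mtcars) #same model, written out explicitly
all.equal(coef(m1), coef(m2)) #identical coefficients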

Note that with weight (wt) in the model, as well as the horsepower x weight interaction, the automatic/manual transmission distinction is no longer significant (not even close). Let’s take a look at this interaction. For a two-way continuous x continuous interaction, we typically separate one predictor into low (-1 SD), medium (mean), and high (+1 SD) levels, and plot separate lines for each.

#handy 2-way interaction plotting function from jtools
interact_plot(int_model, pred = "hp", modx = "wt")

What do we see? At higher horsepower, gas mileage suffers regardless of the weight of the car. At lower horsepower, car weight makes a big difference (lower weight, higher mpg).

5 Contrasts in regression

(Some of the code and text here has been adapted from Russell Lenth’s excellent emmeans documentation: https://cran.r-project.org/web/packages/emmeans/)

One of the handiest packages in the R regression universe is emmeans, which can provide the ‘estimated marginal means’ (EMMs), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with multcomp to handle correction for multiple comparisons. See the longer documentation on the emmeans CRAN page (linked above).

Let’s look at the concentration of leucine in a study of pigs that were fed differing levels of protein in the diet (9, 12, 15, and 18%) and different protein sources (fish, soybean, milk). The concentration has a long positive tail, so here we log-transform it to normalize things somewhat.

data(pigs, package="emmeans")
pigs <- pigs %>% mutate(log_conc=log(conc), percent_fac=factor(percent))
pigs.lm <- lm(log_conc ~ source + percent_fac, data = pigs)
summary(pigs.lm)
## 
## Call:
## lm(formula = log_conc ~ source + percent_fac, data = pigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1886 -0.0598 -0.0216  0.0812  0.2460 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.2203     0.0536   60.07  < 2e-16 ***
## sourcesoy       0.2728     0.0529    5.15  3.2e-05 ***
## sourceskim      0.4023     0.0542    7.43  1.5e-07 ***
## percent_fac12   0.1796     0.0561    3.20  0.00396 ** 
## percent_fac15   0.2174     0.0597    3.64  0.00137 ** 
## percent_fac18   0.2998     0.0676    4.43  0.00019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.115 on 23 degrees of freedom
## Multiple R-squared:  0.757,  Adjusted R-squared:  0.704 
## F-statistic: 14.3 on 5 and 23 DF,  p-value: 2.05e-06

This output is hard to look at because there are many dummy codes and we have to infer the reference condition for each factor (usually the first level alphabetically). Also, we do not have an intuitive sense of the expected mean in each condition because it depends on the sum of the intercept and the specific dummy code for the condition of interest, averaging over the other factor.
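
To make that dependence concrete, here is a minimal sketch computing, by hand, the expected mean for the fish source (the reference level): the intercept plus the average of the four protein-percentage effects (the 9% reference contributes 0). It matches the fish estimate from emmeans below.

b <- coef(pigs.lm)
unname(b["(Intercept)"] + mean(c(0, b["percent_fac12"], b["percent_fac15"], b["percent_fac18"])))
#approximately 3.39 on the log scale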

We can obtain the expected means for each condition.

5.1 Expected means for protein source

pigs.emm.s <- emmeans(pigs.lm, "source")
print(pigs.emm.s)
##  source emmean     SE df lower.CL upper.CL
##  fish     3.39 0.0367 23     3.32     3.47
##  soy      3.67 0.0374 23     3.59     3.74
##  skim     3.80 0.0394 23     3.72     3.88
## 
## Results are averaged over the levels of: percent_fac 
## Confidence level used: 0.95

5.2 Expected means for protein level

pigs.emm.p <- emmeans(pigs.lm, "percent_fac")
print(pigs.emm.p)
##  percent_fac emmean     SE df lower.CL upper.CL
##  9             3.45 0.0409 23     3.36     3.53
##  12            3.62 0.0384 23     3.55     3.70
##  15            3.66 0.0437 23     3.57     3.75
##  18            3.75 0.0530 23     3.64     3.85
## 
## Results are averaged over the levels of: source 
## Confidence level used: 0.95

5.3 Means in each cell of the factorial design

print(emmeans(pigs.lm, ~source*percent_fac))
##  source percent_fac emmean     SE df lower.CL upper.CL
##  fish   9             3.22 0.0536 23     3.11     3.33
##  soy    9             3.49 0.0498 23     3.39     3.60
##  skim   9             3.62 0.0501 23     3.52     3.73
##  fish   12            3.40 0.0493 23     3.30     3.50
##  soy    12            3.67 0.0489 23     3.57     3.77
##  skim   12            3.80 0.0494 23     3.70     3.90
##  fish   15            3.44 0.0548 23     3.32     3.55
##  soy    15            3.71 0.0507 23     3.61     3.82
##  skim   15            3.84 0.0549 23     3.73     3.95
##  fish   18            3.52 0.0547 23     3.41     3.63
##  soy    18            3.79 0.0640 23     3.66     3.93
##  skim   18            3.92 0.0646 23     3.79     4.06
## 
## Confidence level used: 0.95

5.4 Pairwise comparisons among protein sources

If we wanted to compare the pairwise differences in the effect of protein source on leucine concentration while controlling for protein percentage (and potentially other variables we add to the model), we could use the pairs function:

pig_pairs <- pairs(pigs.emm.s)
print(pig_pairs)
##  contrast    estimate     SE df t.ratio p.value
##  fish - soy    -0.273 0.0529 23   -5.15  0.0001
##  fish - skim   -0.402 0.0542 23   -7.43  <.0001
##  soy - skim    -0.130 0.0530 23   -2.44  0.0570
## 
## Results are averaged over the levels of: percent_fac 
## P value adjustment: tukey method for comparing a family of 3 estimates

Note that you can get a sense of the contrasts being tested by emmeans by examining the @linfct slot of the object. I’ve learned a lot by examining these contrast matrices and thinking about how to set up a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey’s HSD method).

Here are the linear functions defining the predicted mean leucine concentration for each protein source, controlling for protein percentage:

pigs.emm.s@linfct
##      (Intercept) sourcesoy sourceskim percent_fac12 percent_fac15
## [1,]           1         0          0          0.25          0.25
## [2,]           1         1          0          0.25          0.25
## [3,]           1         0          1          0.25          0.25
##      percent_fac18
## [1,]          0.25
## [2,]          0.25
## [3,]          0.25

What are the pairwise contrasts for the protein sources?

pig_pairs@linfct
##      (Intercept) sourcesoy sourceskim percent_fac12 percent_fac15
## [1,]           0        -1          0             0             0
## [2,]           0         0         -1             0             0
## [3,]           0         1         -1             0             0
##      percent_fac18
## [1,]             0
## [2,]             0
## [3,]             0

The emmeans package also provides useful plots to understand pairwise differences:

plot(pigs.emm.s, comparisons = TRUE)

The blue bars are confidence intervals for the EMMs, and the red arrows are for the comparisons among them. If an arrow from one mean overlaps an arrow from another group, the difference is not significant, based on the adjust setting (which defaults to “tukey”). (Note: Don’t ever use confidence intervals for EMMs to perform comparisons; they can be very misleading.)

5.5 Pairwise differences and simple slopes in regression

Returning to the iris dataset (from the parallel R examples), consider a model in which we examine the association between petal length and width across species. Here, we regress petal length on petal width, species (three levels), and their interaction.

fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
summary(fitiris)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width * Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8410 -0.1934 -0.0369  0.1631  1.1707 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.328      0.131   10.14  < 2e-16 ***
## Petal.Width                      0.546      0.490    1.12    0.267    
## Speciesversicolor                0.454      0.374    1.21    0.227    
## Speciesvirginica                 2.913      0.406    7.17  3.5e-11 ***
## Petal.Width:Speciesversicolor    1.323      0.555    2.38    0.019 *  
## Petal.Width:Speciesvirginica     0.101      0.525    0.19    0.848    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.361 on 144 degrees of freedom
## Multiple R-squared:  0.959,  Adjusted R-squared:  0.958 
## F-statistic:  682 on 5 and 144 DF,  p-value: <2e-16
car::Anova(fitiris, type="III") #overall effects of predictors in the model
                      Sum Sq   Df  F value  Pr(>F)
(Intercept)           13.433    1   102.81   0.000
Petal.Width            0.163    1     1.24   0.267
Species                6.747    2    25.82   0.000
Petal.Width:Species    2.018    2     7.72   0.001
Residuals             18.816  144       NA      NA

Note that this yields a categorical (species) x continuous (petal width) interaction. The output from car::Anova indicates that the interaction is significant, but we need more detailed guidance on how the slope for petal width is moderated by species. We can visualize the interaction as follows:

interact_plot(fitiris, pred = "Petal.Width", modx = "Species")

In a simple slopes test, we might wish to know whether the slope for Petal.Width is non-zero in each species individually. Let’s start by getting the estimated marginal means for each species.

emmeans(fitiris, ~Species)
## NOTE: Results may be misleading due to involvement in interactions
##  Species    emmean     SE  df lower.CL upper.CL
##  setosa       1.98 0.4699 144     1.05     2.91
##  versicolor   4.02 0.0609 144     3.90     4.14
##  virginica    5.02 0.1636 144     4.69     5.34
## 
## Confidence level used: 0.95

And pairwise differences between species:

pairs(emmeans(fitiris, ~Species))
## NOTE: Results may be misleading due to involvement in interactions
##  contrast               estimate    SE  df t.ratio p.value
##  setosa - versicolor      -2.040 0.474 144   -4.31  0.0001
##  setosa - virginica       -3.034 0.498 144   -6.10  <.0001
##  versicolor - virginica   -0.994 0.175 144   -5.69  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Turning to petal width, because we are interested in its linear effect (slope), we use the emtrends function to estimate the slope in each species individually. In terms of simple slopes, we test whether the Petal.Width slope is non-zero in each Species. The infer argument in the summary of emtrends requests t-tests and p-values for the slopes.

summary(emtrends(model = fitiris, ~Species, var="Petal.Width"), infer=TRUE)
Species     Petal.Width.trend     SE   df  lower.CL  upper.CL  t.ratio  p.value
setosa                  0.546  0.490  144    -0.422      1.51     1.11    0.267
versicolor              1.869  0.261  144     1.353      2.38     7.16    0.000
virginica               0.647  0.188  144     0.276      1.02     3.44    0.001

Finally, we could examine pairwise differences between slopes among species.

pairs(emtrends(model = fitiris, ~Species, var="Petal.Width"))
##  contrast               estimate    SE  df t.ratio p.value
##  setosa - versicolor      -1.323 0.555 144  -2.382  0.0483
##  setosa - virginica       -0.101 0.525 144  -0.192  0.9799
##  versicolor - virginica    1.222 0.322 144   3.798  0.0006
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

5.6 A few other emmeans features

In the pigs dataset, we have treated protein percentage as a factor (9, 12, 15, or 18 percent). If we keep this representation (as opposed to entering percentage as a continuous predictor), we can easily get orthogonal polynomial contrasts in emmeans. For example, is the effect of protein percentage on leucine concentration linear, or might it be quadratic or cubic?

pigs.emm.p <- emmeans(pigs.lm, "percent_fac")
ply <- contrast(pigs.emm.p, "poly")
ply
##  contrast  estimate     SE df t.ratio p.value
##  linear      0.9374 0.2106 23   4.452  0.0002
##  quadratic  -0.0971 0.0883 23  -1.099  0.2830
##  cubic       0.1863 0.1877 23   0.992  0.3313
## 
## Results are averaged over the levels of: source
coef(ply) #show the contrast coefficients
   percent_fac  c.1  c.2  c.3
9            9   -3    1   -1
12          12   -1   -1    3
15          15    1   -1   -3
18          18    3    1    1

There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.

Finally, we can examine effects in multivariate regression models (i.e., multiple DVs). Here, we can examine the sales of two varieties of oranges as a function of their prices, the day, and store. Sales for both oranges are modeled at once (sales1 and sales2) to get a sense of price interdependence.

data(oranges, package="emmeans") #load the oranges dataset from emmeans
org.int <- lm(cbind(sales1, sales2) ~ price1 * price2 + day + store, data = oranges)
summary(org.int)
## Response sales1 :
## 
## Call:
## lm(formula = sales1 ~ price1 * price2 + day + store, data = oranges)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.465 -2.823  0.463  1.543  6.849 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    67.9188    30.9751    2.19   0.0392 * 
## price1         -1.4136     0.6017   -2.35   0.0282 * 
## price2         -0.5294     0.6846   -0.77   0.4476   
## day2            0.1998     2.5723    0.08   0.9388   
## day3            7.5747     2.5207    3.01   0.0065 **
## day4            2.9048     2.5201    1.15   0.2614   
## day5            9.6828     2.5666    3.77   0.0010 **
## day6            5.5180     2.5384    2.17   0.0408 * 
## store2          1.5848     2.9159    0.54   0.5923   
## store3         -0.6138     3.0157   -0.20   0.8406   
## store4          2.5494     3.0362    0.84   0.4101   
## store5          1.8352     2.8605    0.64   0.5278   
## store6          7.0187     2.7636    2.54   0.0187 * 
## price1:price2   0.0137     0.0137    1.00   0.3275   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.21 on 22 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.62 
## F-statistic: 5.39 on 13 and 22 DF,  p-value: 0.00028
## 
## 
## Response sales2 :
## 
## Call:
## lm(formula = sales2 ~ price1 * price2 + day + store, data = oranges)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.500 -1.864 -0.167  2.783  9.976 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -47.8839    38.7060   -1.24    0.229   
## price1          1.6908     0.7519    2.25    0.035 * 
## price2          0.8937     0.8555    1.04    0.308   
## day2           -2.0322     3.2143   -0.63    0.534   
## day3           10.1099     3.1498    3.21    0.004 **
## day4            3.9669     3.1491    1.26    0.221   
## day5            7.7297     3.2072    2.41    0.025 * 
## day6            5.0629     3.1720    1.60    0.125   
## store2          6.7389     3.6437    1.85    0.078 . 
## store3          2.0551     3.7683    0.55    0.591   
## store4          5.6819     3.7940    1.50    0.148   
## store5          6.4009     3.5744    1.79    0.087 . 
## store6          5.1412     3.4533    1.49    0.151   
## price1:price2  -0.0320     0.0171   -1.87    0.074 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 22 degrees of freedom
## Multiple R-squared:  0.778,  Adjusted R-squared:  0.647 
## F-statistic: 5.94 on 13 and 22 DF,  p-value: 0.000137

Using emmeans, we can test the difference in sales of the two orange varieties as a function of the price of the first type (price1).

#test the pairwise differences in price1 slopes between varieties.
emtrends(org.int, pairwise ~ variety, var = "price1", mult.name = "variety")
## $emtrends
##  variety price1.trend    SE df lower.CL upper.CL
##  sales1        -0.749 0.171 22   -1.104   -0.394
##  sales2         0.138 0.214 22   -0.306    0.582
## 
## Results are averaged over the levels of: day, store 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast        estimate   SE df t.ratio p.value
##  sales1 - sales2   -0.887 0.24 22   -3.69  0.0013
## 
## Results are averaged over the levels of: day, store