The overall goal is to give you a very quick introduction to conducting correlation and regression analyses in R.
The Pearson product-moment correlation seeks to measure the linear association between two variables, \(x\) and \(y\), on a standardized scale ranging from \(r = -1\) to \(r = +1\).
The correlation of \(x\) and \(y\) is a covariance that has been standardized by the standard deviations of \(x\) and \(y\). This yields a scale-insensitive measure of the linear association of \(x\) and \(y\). For much more conceptual detail, see: https://psu-psychology.github.io/psy-597-SEM/01_correlation_regression/01_Correlation_and_Regression.html.
\[r_{XY}= \frac{s_{XY}}{s_{X} s_{Y}}\]
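To see this definition in action, we can compute the correlation by hand for two of the mtcars variables analyzed below; dividing the covariance by the product of the standard deviations reproduces what cor() returns.
# correlation as a standardized covariance (should match cor(): -0.591)
with(mtcars, cov(qsec, cyl) / (sd(qsec) * sd(cyl)))
cor(mtcars$qsec, mtcars$cyl)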
to_correlate <- mtcars %>% dplyr::select(qsec, cyl, disp, hp)
cor(to_correlate)
## qsec cyl disp hp
## qsec 1.000 -0.591 -0.434 -0.708
## cyl -0.591 1.000 0.902 0.832
## disp -0.434 0.902 1.000 0.791
## hp -0.708 0.832 0.791 1.000
Recall that the significance of a correlation is tested on \(n - 2\) degrees of freedom.
The corresponding t-test is:
\[ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \]
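To connect this formula to the output below, we can compute the t statistic by hand for the qsec-cyl correlation; the values should agree with cor.test().
# hand-computed t statistic and two-sided p-value for r(qsec, cyl)
r <- cor(mtcars$qsec, mtcars$cyl)
n <- nrow(mtcars)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)             # about -4.02
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)   # about 0.0004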
cor.test(mtcars$qsec, mtcars$cyl)
##
## Pearson's product-moment correlation
##
## data: mtcars$qsec and mtcars$cyl
## t = -4, df = 30, p-value = 4e-04
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.779 -0.306
## sample estimates:
## cor
## -0.591
Note that we can use the `conf.level` argument to `cor.test` to request a different level of confidence.
cor.test(mtcars$qsec, mtcars$cyl, conf.level = 0.9)
##
## Pearson's product-moment correlation
##
## data: mtcars$qsec and mtcars$cyl
## t = -4, df = 30, p-value = 4e-04
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
## -0.755 -0.358
## sample estimates:
## cor
## -0.591
ggplot(to_correlate, aes(x=cyl, y=qsec)) + geom_jitter(width=0.1) + stat_smooth(method="lm", se=FALSE)
Hmisc::rcorr(to_correlate %>% as.matrix())
## qsec cyl disp hp
## qsec 1.00 -0.59 -0.43 -0.71
## cyl -0.59 1.00 0.90 0.83
## disp -0.43 0.90 1.00 0.79
## hp -0.71 0.83 0.79 1.00
##
## n= 32
##
##
## P
## qsec cyl disp hp
## qsec 0.0004 0.0131 0.0000
## cyl 0.0004 0.0000 0.0000
## disp 0.0131 0.0000 0.0000
## hp 0.0000 0.0000 0.0000
Notice that we now get a matrix of p-values, too…
stargazer(cor(to_correlate), type = "html")
|      | qsec   | cyl    | disp   | hp     |
|------|--------|--------|--------|--------|
| qsec | 1      | -0.591 | -0.434 | -0.708 |
| cyl  | -0.591 | 1      | 0.902  | 0.832  |
| disp | -0.434 | 0.902  | 1      | 0.791  |
| hp   | -0.708 | 0.832  | 0.791  | 1      |
#you can use the filename argument to write out the table as a Word doc!
apaTables::apa.cor.table(to_correlate)
##
##
## Means, standard deviations, and correlations with confidence intervals
##
##
## Variable M SD 1 2 3
## 1. qsec 17.85 1.79
##
## 2. cyl 6.19 1.79 -.59**
## [-.78, -.31]
##
## 3. disp 230.72 123.94 -.43* .90**
## [-.68, -.10] [.81, .95]
##
## 4. hp 146.69 68.56 -.71** .83** .79**
## [-.85, -.48] [.68, .92] [.61, .89]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
Here, we store all details of the bivariate correlation test as an R object, `ctest`.
ctest <- cor.test(mtcars$qsec, mtcars$cyl)
Let’s poke under the hood:
str(ctest)
## List of 9
## $ statistic : Named num -4.02
## ..- attr(*, "names")= chr "t"
## $ parameter : Named int 30
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.000366
## $ estimate : Named num -0.591
## ..- attr(*, "names")= chr "cor"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "correlation"
## $ alternative: chr "two.sided"
## $ method : chr "Pearson's product-moment correlation"
## $ data.name : chr "mtcars$qsec and mtcars$cyl"
## $ conf.int : num [1:2] -0.779 -0.306
## ..- attr(*, "conf.level")= num 0.95
## - attr(*, "class")= chr "htest"
So, we can poke around and grab specific things:
ctest$p.value
## [1] 0.000366
ctest$estimate
## cor
## -0.591
And there are useful helper packages, especially `broom`, that will help you work with statistics objects as `data.frame` objects.
broom::glance(ctest)
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|----------|-----------|---------|-----------|----------|-----------|--------|-------------|
| -0.591   | -4.01     | 0       | 30        | -0.779   | -0.306    | Pearson's product-moment correlation | two.sided |
You can use a different correlation method (e.g., Spearman) using the `method` argument:
cor.test(mtcars$qsec, mtcars$cyl, method = "spearman")
## Warning in cor.test.default(mtcars$qsec, mtcars$cyl, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mtcars$qsec and mtcars$cyl
## S = 9000, p-value = 6e-04
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.572
cor(to_correlate, method = "spearman")
## qsec cyl disp hp
## qsec 1.000 -0.572 -0.460 -0.667
## cyl -0.572 1.000 0.928 0.902
## disp -0.460 0.928 1.000 0.851
## hp -0.667 0.902 0.851 1.000
By default, `cor` will return an `NA` (missing) for every pair involving a variable that has any missing observations. Instead, we can ask for each correlation to be estimated on the complete cases for that pair of variables. This is `use="pairwise.complete.obs"`.
Here’s the difference (I introduced some missing data to make the point):
First, with "everything" as the `use` argument (the default), any missing value on a variable yields `NA` for all of that variable's correlations.
to_correlate_miss <- to_correlate
to_correlate_miss$qsec[c(1, 5)] <- NA
cor(to_correlate_miss) #implicitly use="everything"
## qsec cyl disp hp
## qsec 1 NA NA NA
## cyl NA 1.000 0.902 0.832
## disp NA 0.902 1.000 0.791
## hp NA 0.832 0.791 1.000
Now with pairwise complete calculation:
cor(to_correlate_miss, use="pairwise.complete.obs")
## qsec cyl disp hp
## qsec 1.000 -0.596 -0.448 -0.731
## cyl -0.596 1.000 0.902 0.832
## disp -0.448 0.902 1.000 0.791
## hp -0.731 0.832 0.791 1.000
Next, let's turn to 'simple' linear regression (one predictor, one outcome), then scale up to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the `lm` function in R. The `lm` function uses ordinary least squares (OLS), which estimates the parameters by minimizing the sum of squared residuals.
In simple regression, we are interested in a relationship of the form:
\[ Y = B_0 + B_1 X \]
where \(Y\) is the dependent variable (criterion) and \(X\) is the predictor (covariate). The intercept is represented by \(B_0\) and the slope for the \(X\) predictor by \(B_1\).
Let’s take a look at the simple case of stopping distance (braking) as a function of car speed.
ggplot(cars, aes(x=speed, y=dist)) +
geom_point(color='darkblue', size = 3) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
labs(x="Speed (mph)", y="Stopping distance (ft)")
When conducting regression, we typically try to capture linear relationships among variables. We can introduce higher-order polynomial terms (e.g., quadratic models) or splines (more flexible shapes), but this is beyond our scope here.
Fortunately, this relationship looks quite linear! The faster the car, the longer it takes to brake.
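(If the scatterplot had instead suggested curvature, a quadratic term could be added via poly(); this is just a sketch of the syntax, not a model we pursue here.)
# hypothetical quadratic fit of stopping distance on speed
lm(dist ~ poly(speed, 2), data = cars)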
In R regression models, we use the `~` operator to denote 'regressed on'. It's not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think speed is a key cause of stopping distance, we'd say 'braking distance regressed on speed' or 'speed predicts braking distance.' In formula terms, this is `dist ~ speed`, which we pass as the first argument to `lm()`.
lm_cars <- lm(dist ~ speed, data=cars)
summary(lm_cars)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.07 -9.53 -2.27 9.21 43.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.579 6.758 -2.60 0.012 *
## speed 3.932 0.416 9.46 1.5e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared: 0.651, Adjusted R-squared: 0.644
## F-statistic: 89.6 on 1 and 48 DF, p-value: 1.49e-12
The output contains individual parameter estimates of the model (here, just the intercept and slope), their standard errors, t statistics, and p-values (each tested on the residual degrees of freedom, here \(n - 2 = 48\)). We also get global information such as the residual standard error and the coefficient of determination (\(R^2\)).
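As a cross-check on what 'minimizing the sum of squared residuals' means, we can reproduce the same estimates from the normal equations, \(b = (X'X)^{-1}X'y\). (This is only an illustration; lm itself uses a QR decomposition rather than an explicit matrix inverse.)
# OLS via the normal equations; matches coef(lm_cars): -17.58 and 3.93
X <- cbind(1, cars$speed)        # design matrix: intercept column plus speed
y <- cars$dist
solve(t(X) %*% X, t(X) %*% y)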
We can also get useful diagnostic plots for free using the `plot()` function:
plot(lm_cars)
The `ggfortify` package also provides an `autoplot` function that gives similar diagnostics within a handy ggplot-based graph.
autoplot(lm_cars)
Using functionality from the `car` and `boot` packages, we can easily get estimates of the regression coefficients and standard errors using nonparametric bootstrapping, which relaxes the normal-theory assumptions underlying the standard errors and, therefore, the significance tests; in particular, the bootstrap does not assume normally distributed errors.
Nonparametric bootstrapping approximates the sampling distribution for a statistic of interest (e.g., a slope) by resampling the existing data with replacement many times and examining the resulting density.
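To demystify the idea, here is a bare-bones version of the case-resampling scheme that Boot() applies to lm models by default; the car/boot machinery below does the same thing more carefully and also provides bias, SE, and BCa intervals.
# resample rows of 'cars' with replacement, refit, and keep the slope each time
set.seed(1)
boot_slopes <- replicate(2000, {
  rows <- sample(nrow(cars), replace = TRUE)
  coef(lm(dist ~ speed, data = cars[rows, ]))["speed"]
})
sd(boot_slopes)                        # bootstrap SE of the slope
quantile(boot_slopes, c(.025, .975))   # simple percentile interval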
system.time(lm_cars.boot <- Boot(lm_cars, R=2000))
## Loading required namespace: boot
## user system elapsed
## 1.356 0.011 1.375
summary(lm_cars.boot, high.moments=TRUE)
|             | R    | original | bootBias | bootSE | bootMed | bootSkew | bootKurtosis |
|-------------|------|----------|----------|--------|---------|----------|--------------|
| (Intercept) | 2000 | -17.58   | 0.213    | 5.964  | -17.3   | -0.208   | 0.240        |
| speed       | 2000 | 3.93     | -0.016   | 0.422  | 3.9     | 0.133    | 0.201        |
We can use the object to obtain 95% bootstrapped confidence intervals using the ‘bias corrected, accelerated’ method (aka bca).
confint(lm_cars.boot, level=.95, type="bca")
## Bootstrap quantiles, type = bca
##
## 2.5 % 97.5 %
## (Intercept) -31.6 -7.34
## speed 3.2 4.93
And we can easily compare the bootstrapped and standard OLS models:
hist(lm_cars.boot, legend="separate")
Notice that the `speed` regression coefficient is slightly positively skewed. Additional details are provided in John Fox's useful book "An R Companion to Applied Regression" (2nd edition): https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Bootstrapping.pdf.
We can easily extend to larger regression models by adding terms to the right-hand side of the formula. For example, in the `mtcars` dataset (car performance statistics from 1974 Motor Trend), we could examine the extent to which gas mileage (`mpg`) is a function of both gross horsepower (`hp`) and transmission (`am`, where 0 is 'automatic' and 1 is 'manual').
mpg_model <- lm(mpg ~ hp + am, mtcars)
summary(mpg_model)
##
## Call:
## lm(formula = mpg ~ hp + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.384 -2.264 0.137 1.697 5.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.58491 1.42509 18.65 < 2e-16 ***
## hp -0.05889 0.00786 -7.50 2.9e-08 ***
## am 5.27709 1.07954 4.89 3.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.91 on 29 degrees of freedom
## Multiple R-squared: 0.782, Adjusted R-squared: 0.767
## F-statistic: 52 on 2 and 29 DF, p-value: 2.55e-10
It appears that both predictors are strongly related to gas mileage. We can examine the relationship graphically.
ggplot(mtcars, aes(x=hp, y=mpg, color=factor(am))) + geom_point() + stat_smooth(method=lm, se=FALSE)
There doesn’t appear to be any meaningful interaction between horsepower and transmission type.
Note that the `broom` package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?
We can use the `glance` function to get the global model statistics.
broom::glance(mpg_model)
|       | r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|-------|-----------|---------------|-------|-----------|---------|----|--------|-----|-----|----------|-------------|
| value | 0.782     | 0.767         | 2.91  | 52        | 0       | 3  | -78    | 164 | 170 | 245      | 29          |
And the `tidy` function yields the parameter table:
broom::tidy(mpg_model)
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | 26.585   | 1.425     | 18.66     | 0       |
| hp          | -0.059   | 0.008     | -7.50     | 0       |
| am          | 5.277    | 1.080     | 4.89      | 0       |
As you can imagine (and saw earlier in the functional programming overview), the ability to extract regression statistics into a tidy data.frame is a boon to scaling your analyses to multiple models and datasets.
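For instance, here is a minimal sketch (assuming the purrr package is available, in addition to broom) of fitting the same model separately within each cylinder group and stacking the tidied coefficients into a single data.frame:
library(purrr)
# one mpg ~ hp fit per cylinder group, coefficient tables stacked with a cyl id
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ hp, data = .x)) %>%
  map_dfr(broom::tidy, .id = "cyl")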
We can use the `*` operator in R to ask that both the constituent variables and their interaction(s) be entered into the model. For example:
int_model <- lm(mpg ~ hp*wt + am, mtcars)
summary(int_model)
##
## Call:
## lm(formula = mpg ~ hp * wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.985 -1.658 -0.741 1.436 4.527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.45224 5.28073 9.36 5.7e-10 ***
## hp -0.11930 0.02655 -4.49 0.00012 ***
## wt -8.10056 1.78933 -4.53 0.00011 ***
## am 0.12511 1.33343 0.09 0.92594
## hp:wt 0.02749 0.00847 3.24 0.00313 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.19 on 27 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.868
## F-statistic: 51.8 on 4 and 27 DF, p-value: 2.76e-12
This model includes individual effects of horsepower (`hp`) and weight (`wt`), as well as their interaction (`hp:wt`). This highlights that the asterisk operator `*` will compute all possible interactions among the specified predictors. For example, `a*b*c*d` will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., `a:b`). The downside of the colon operator is that it doesn't guarantee that the corresponding lower-order effects are included, which is usually the sane default. As a reminder, you should essentially never include an interaction without also including the lower-order effects, because this can misassign the variance.
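As a quick illustration of the formula expansion, the two specifications below fit identical models; the second simply spells out what hp*wt expands to.
# '*' expands to the main effects plus their interaction; ':' adds only the product term
m_star  <- lm(mpg ~ hp * wt + am, mtcars)
m_colon <- lm(mpg ~ hp + wt + hp:wt + am, mtcars)
all.equal(coef(m_star), coef(m_colon))   # TRUE: identical fits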
Note that with weight (`wt`) in the model, as well as the horsepower x weight interaction, the automatic/manual transmission distinction is no longer significant (not even close). Let's take a look at this interaction. For a two-way continuous x continuous interaction, we typically separate one predictor into low (-1 SD), medium (mean), and high (+1 SD) levels, and plot separate lines for each.
# handy 2-way interaction plotting function from jtools
interact_plot(int_model, pred = "hp", modx = "wt")
What do we see? At higher horsepower, gas mileage suffers regardless of the weight of the car. At lower horsepower, car weight makes a big difference (lower weight, higher mpg).
(Some of the code and text here has been adapted from Russell Lenth's excellent `emmeans` documentation: https://cran.r-project.org/web/packages/emmeans/)
One of the handiest packages in the R regression universe is `emmeans`, which can provide the 'estimated marginal means' (EMMs), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with `multcomp` to handle correction for multiple comparisons. See the longer documentation at the link above.
Let’s look at the concentration of leucine in a study of pigs who were fed differing levels of protein in the diet (9, 12, 15, and 18%) and different protein sources (fish, soybean, milk). The concentration has a long positive tail, so here we log transform it to normalize things somewhat.
data(pigs, package="emmeans")
pigs <- pigs %>% mutate(log_conc=log(conc), percent_fac=factor(percent))
pigs.lm <- lm(log_conc ~ source + percent_fac, data = pigs)
summary(pigs.lm)
##
## Call:
## lm(formula = log_conc ~ source + percent_fac, data = pigs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1886 -0.0598 -0.0216 0.0812 0.2460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2203 0.0536 60.07 < 2e-16 ***
## sourcesoy 0.2728 0.0529 5.15 3.2e-05 ***
## sourceskim 0.4023 0.0542 7.43 1.5e-07 ***
## percent_fac12 0.1796 0.0561 3.20 0.00396 **
## percent_fac15 0.2174 0.0597 3.64 0.00137 **
## percent_fac18 0.2998 0.0676 4.43 0.00019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.115 on 23 degrees of freedom
## Multiple R-squared: 0.757, Adjusted R-squared: 0.704
## F-statistic: 14.3 on 5 and 23 DF, p-value: 2.05e-06
This output is hard to look at because there are many dummy codes and we have to infer the reference condition for each factor (usually the alphabetically first level). Also, we do not have an intuitive sense of the expected mean in each condition because it depends on the sum of the intercept and the specific dummy codes for the condition of interest, averaging over the other factor.
We can obtain the estimated marginal mean for each condition.
pigs.emm.s <- emmeans(pigs.lm, "source")
print(pigs.emm.s)
## source emmean SE df lower.CL upper.CL
## fish 3.39 0.0367 23 3.32 3.47
## soy 3.67 0.0374 23 3.59 3.74
## skim 3.80 0.0394 23 3.72 3.88
##
## Results are averaged over the levels of: percent_fac
## Confidence level used: 0.95
pigs.emm.p <- emmeans(pigs.lm, "percent_fac")
print(pigs.emm.p)
## percent_fac emmean SE df lower.CL upper.CL
## 9 3.45 0.0409 23 3.36 3.53
## 12 3.62 0.0384 23 3.55 3.70
## 15 3.66 0.0437 23 3.57 3.75
## 18 3.75 0.0530 23 3.64 3.85
##
## Results are averaged over the levels of: source
## Confidence level used: 0.95
print(emmeans(pigs.lm, ~source*percent_fac))
## source percent_fac emmean SE df lower.CL upper.CL
## fish 9 3.22 0.0536 23 3.11 3.33
## soy 9 3.49 0.0498 23 3.39 3.60
## skim 9 3.62 0.0501 23 3.52 3.73
## fish 12 3.40 0.0493 23 3.30 3.50
## soy 12 3.67 0.0489 23 3.57 3.77
## skim 12 3.80 0.0494 23 3.70 3.90
## fish 15 3.44 0.0548 23 3.32 3.55
## soy 15 3.71 0.0507 23 3.61 3.82
## skim 15 3.84 0.0549 23 3.73 3.95
## fish 18 3.52 0.0547 23 3.41 3.63
## soy 18 3.79 0.0640 23 3.66 3.93
## skim 18 3.92 0.0646 23 3.79 4.06
##
## Confidence level used: 0.95
If we wanted to compare the pairwise differences in the effect of protein source on leucine concentration while controlling for protein percentage (and potentially other variables we add to the model), we could use the `pairs` function:
pig_pairs <- pairs(pigs.emm.s)
print(pig_pairs)
## contrast estimate SE df t.ratio p.value
## fish - soy -0.273 0.0529 23 -5.15 0.0001
## fish - skim -0.402 0.0542 23 -7.43 <.0001
## soy - skim -0.130 0.0530 23 -2.44 0.0570
##
## Results are averaged over the levels of: percent_fac
## P value adjustment: tukey method for comparing a family of 3 estimates
Note that you can get a sense of the contrasts being tested by `emmeans` by examining the `@linfct` slot of the object. I've learned a lot by examining these contrast matrices and thinking about how to set up a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey's HSD method).
These are the linear combinations of model coefficients that define the predicted mean level of leucine for each protein source, controlling for protein percentage:
pigs.emm.s@linfct
## (Intercept) sourcesoy sourceskim percent_fac12 percent_fac15
## [1,] 1 0 0 0.25 0.25
## [2,] 1 1 0 0.25 0.25
## [3,] 1 0 1 0.25 0.25
## percent_fac18
## [1,] 0.25
## [2,] 0.25
## [3,] 0.25
What are the pairwise contrasts for the protein sources?
pig_pairs@linfct
## (Intercept) sourcesoy sourceskim percent_fac12 percent_fac15
## [1,] 0 -1 0 0 0
## [2,] 0 0 -1 0 0
## [3,] 0 1 -1 0 0
## percent_fac18
## [1,] 0
## [2,] 0
## [3,] 0
The `emmeans` package also provides useful plots for understanding pairwise differences:
plot(pigs.emm.s, comparisons = TRUE)
The blue bars are confidence intervals for the EMMs, and the red arrows are for the comparisons among them. If an arrow from one mean overlaps an arrow from another group, the difference is not significant, based on the adjust setting (which defaults to “tukey”). (Note: Don’t ever use confidence intervals for EMMs to perform comparisons; they can be very misleading.)
Returning to the iris dataset (from the parallel R examples), consider a model in which we examine the association between petal length and width across species. Here, we regress petal length on petal width, species (three levels), and their interaction.
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
summary(fitiris)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width * Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8410 -0.1934 -0.0369 0.1631 1.1707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.328 0.131 10.14 < 2e-16 ***
## Petal.Width 0.546 0.490 1.12 0.267
## Speciesversicolor 0.454 0.374 1.21 0.227
## Speciesvirginica 2.913 0.406 7.17 3.5e-11 ***
## Petal.Width:Speciesversicolor 1.323 0.555 2.38 0.019 *
## Petal.Width:Speciesvirginica 0.101 0.525 0.19 0.848
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.361 on 144 degrees of freedom
## Multiple R-squared: 0.959, Adjusted R-squared: 0.958
## F-statistic: 682 on 5 and 144 DF, p-value: <2e-16
car::Anova(fitiris, type="III") #overall effects of predictors in the model
|                     | Sum Sq | Df  | F value | Pr(>F) |
|---------------------|--------|-----|---------|--------|
| (Intercept)         | 13.433 | 1   | 102.81  | 0.000  |
| Petal.Width         | 0.163  | 1   | 1.24    | 0.267  |
| Species             | 6.747  | 2   | 25.82   | 0.000  |
| Petal.Width:Species | 2.018  | 2   | 7.72    | 0.001  |
| Residuals           | 18.816 | 144 | NA      | NA     |
Note that this yields a categorical (species) x continuous (petal width) interaction. The output from `car::Anova` indicates that the interaction is significant, but we need more detailed guidance on how the slope for petal width is moderated by species. We can visualize the interaction as follows:
interact_plot(fitiris, pred = "Petal.Width", modx = "Species")
In a simple slopes test, we might wish to know whether the slope for `Petal.Width` is non-zero in each species individually. Let's start by getting the estimated marginal means for each species.
emmeans(fitiris, ~Species)
## NOTE: Results may be misleading due to involvement in interactions
## Species emmean SE df lower.CL upper.CL
## setosa 1.98 0.4699 144 1.05 2.91
## versicolor 4.02 0.0609 144 3.90 4.14
## virginica 5.02 0.1636 144 4.69 5.34
##
## Confidence level used: 0.95
And pairwise differences between species:
pairs(emmeans(fitiris, ~Species))
## NOTE: Results may be misleading due to involvement in interactions
## contrast estimate SE df t.ratio p.value
## setosa - versicolor -2.040 0.474 144 -4.31 0.0001
## setosa - virginica -3.034 0.498 144 -6.10 <.0001
## versicolor - virginica -0.994 0.175 144 -5.69 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Turning now to petal width: because we are interested in its linear effect (slope), we use the `emtrends` function to estimate the slope in each species individually. In terms of simple slopes, we test whether the `Petal.Width` slope is non-zero within each species. The `infer` argument in the summary of `emtrends` requests t-tests and p-values for the slopes.
summary(emtrends(model = fitiris, ~Species, var="Petal.Width"), infer=TRUE)
| Species    | Petal.Width.trend | SE    | df  | lower.CL | upper.CL | t.ratio | p.value |
|------------|-------------------|-------|-----|----------|----------|---------|---------|
| setosa     | 0.546             | 0.490 | 144 | -0.422   | 1.51     | 1.11    | 0.267   |
| versicolor | 1.869             | 0.261 | 144 | 1.353    | 2.38     | 7.16    | 0.000   |
| virginica  | 0.647             | 0.188 | 144 | 0.276    | 1.02     | 3.44    | 0.001   |
Finally, we could examine pairwise differences between slopes among species.
pairs(emtrends(model = fitiris, ~Species, var="Petal.Width"))
## contrast estimate SE df t.ratio p.value
## setosa - versicolor -1.323 0.555 144 -2.382 0.0483
## setosa - virginica -0.101 0.525 144 -0.192 0.9799
## versicolor - virginica 1.222 0.322 144 3.798 0.0006
##
## P value adjustment: tukey method for comparing a family of 3 estimates
In the pigs dataset, we have treated protein percentage as a factor (9, 12, 15, or 18 percent). If we keep this representation (as opposed to entering percentage as a continuous predictor), we can easily get orthogonal polynomial contrasts in `emmeans`. For example, is the effect of protein percentage on leucine linear, or might it be quadratic or cubic?
pigs.emm.p <- emmeans(pigs.lm, "percent_fac")
ply <- contrast(pigs.emm.p, "poly")
ply
## contrast estimate SE df t.ratio p.value
## linear 0.9374 0.2106 23 4.452 0.0002
## quadratic -0.0971 0.0883 23 -1.099 0.2830
## cubic 0.1863 0.1877 23 0.992 0.3313
##
## Results are averaged over the levels of: source
coef(ply) #show the contrast coefficients
|    | percent_fac | c.1 | c.2 | c.3 |
|----|-------------|-----|-----|-----|
| 9  | 9           | -3  | 1   | -1  |
| 12 | 12          | -1  | -1  | 3   |
| 15 | 15          | 1   | -1  | -3  |
| 18 | 18          | 3   | 1   | 1   |
There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.
Finally, we can examine effects in multivariate regression models (i.e., models with multiple DVs). Here, we examine the sales of two varieties of oranges as a function of their prices, the day, and the store. Sales of both varieties are modeled at once (`sales1` and `sales2`) to get a sense of price interdependence.
org.int <- lm(cbind(sales1, sales2) ~ price1 * price2 + day + store, data = oranges)
summary(org.int)
## Response sales1 :
##
## Call:
## lm(formula = sales1 ~ price1 * price2 + day + store, data = oranges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.465 -2.823 0.463 1.543 6.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9188 30.9751 2.19 0.0392 *
## price1 -1.4136 0.6017 -2.35 0.0282 *
## price2 -0.5294 0.6846 -0.77 0.4476
## day2 0.1998 2.5723 0.08 0.9388
## day3 7.5747 2.5207 3.01 0.0065 **
## day4 2.9048 2.5201 1.15 0.2614
## day5 9.6828 2.5666 3.77 0.0010 **
## day6 5.5180 2.5384 2.17 0.0408 *
## store2 1.5848 2.9159 0.54 0.5923
## store3 -0.6138 3.0157 -0.20 0.8406
## store4 2.5494 3.0362 0.84 0.4101
## store5 1.8352 2.8605 0.64 0.5278
## store6 7.0187 2.7636 2.54 0.0187 *
## price1:price2 0.0137 0.0137 1.00 0.3275
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.21 on 22 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.62
## F-statistic: 5.39 on 13 and 22 DF, p-value: 0.00028
##
##
## Response sales2 :
##
## Call:
## lm(formula = sales2 ~ price1 * price2 + day + store, data = oranges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.500 -1.864 -0.167 2.783 9.976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -47.8839 38.7060 -1.24 0.229
## price1 1.6908 0.7519 2.25 0.035 *
## price2 0.8937 0.8555 1.04 0.308
## day2 -2.0322 3.2143 -0.63 0.534
## day3 10.1099 3.1498 3.21 0.004 **
## day4 3.9669 3.1491 1.26 0.221
## day5 7.7297 3.2072 2.41 0.025 *
## day6 5.0629 3.1720 1.60 0.125
## store2 6.7389 3.6437 1.85 0.078 .
## store3 2.0551 3.7683 0.55 0.591
## store4 5.6819 3.7940 1.50 0.148
## store5 6.4009 3.5744 1.79 0.087 .
## store6 5.1412 3.4533 1.49 0.151
## price1:price2 -0.0320 0.0171 -1.87 0.074 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 22 degrees of freedom
## Multiple R-squared: 0.778, Adjusted R-squared: 0.647
## F-statistic: 5.94 on 13 and 22 DF, p-value: 0.000137
Using `emmeans`, we can test whether the sales of the two orange varieties differ in how they depend on the price of the first variety (`price1`), i.e., compare their `price1` slopes.
#test the pairwise differences in price1 slopes between varieties.
emtrends(org.int, pairwise ~ variety, var = "price1", mult.name = "variety")
## $emtrends
## variety price1.trend SE df lower.CL upper.CL
## sales1 -0.749 0.171 22 -1.104 -0.394
## sales2 0.138 0.214 22 -0.306 0.582
##
## Results are averaged over the levels of: day, store
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## sales1 - sales2 -0.887 0.24 22 -3.69 0.0013
##
## Results are averaged over the levels of: day, store