1 Introduction

There are several freely available packages for structural equation modeling (SEM), both in and outside of R. In the R world, the three most popular are lavaan, OpenMX, and sem. I have tended to prefer lavaan because of its user-friendly syntax, which mimics key aspects of of Mplus. Although OpenMX provides a broader set of functions, the learning curve is steeper.

Here, we will consider models in which all variables are observed/manifest, as well as models with latent variables. The first is sometimes called ‘path analysis’, whereas the latter is sometimes called a ‘measurement model.’

2 Use lavaan for simple multiple regression

SEM is largely a multivariate extension of regression in which we can examine many predictors and outcomes at once. SEM also provides the innovation of examining latent structure (i.e., where some variables are not observed). More specifically, the idea of ‘structural equations’ refers to the fact that we have more than one equation representing a model of covariance structure in which we (usually) have multiple criterion variables and multiple predictors.

Let’s start with the simple demonstration that a path model in SEM can recapitulate simple single predictor-single outcome regression. We’ll examine housing price data from Boston’s 1970 census to review important concepts in correlation and regression. This is a nice dataset for regression because there are many interdependent variables: crime, pollutants, age of properties, etc.

We can estimate the same multiple regression models using the lavaan package that we’ll be using for more complex SEMs. And the syntax even has many similarities with lm().

Here’s the single-predictor regression from above, run as a path model in lavaan:

#example dataset from mlbench package with home prices in Boston by census tract
data(BostonHousing2)
BostonSmall <- BostonHousing2 %>% dplyr::select(
  cmedv, #median value of home in 1000s
  crim, #per capita crime by town
  nox, #nitric oxide concentration
  lstat, #proportion of lower status
  rad #proximity to radial highways
  ) %>% mutate(log_crim = log2(crim))

lavaan_m <- 'cmedv ~ log_crim'
mlav <- sem(lavaan_m, data=BostonSmall)
summary(mlav)
## lavaan 0.6-2 ended normally after 11 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          2
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -1.346    0.116  -11.567    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            66.549    4.184   15.906    0.000

And for comparison, the output of lm()

  summary(lm(cmedv ~ log_crim, BostonSmall))
## 
## Call:
## lm(formula = cmedv ~ log_crim, data = BostonSmall)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.31  -5.17  -2.48   2.66  33.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   21.013      0.386    54.4   <2e-16 ***
## log_crim      -1.346      0.117   -11.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.17 on 504 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.208 
## F-statistic:  133 on 1 and 504 DF,  p-value: <2e-16

The regression coefficient is identical (good!). One thing to note is that we don’t have an intercept in the lavaan output. This highlights an important difference that basic SEM often focuses on the covariance structure of the data. We can include means as well, but typically only when it’s relevant to our scientific questions. For example, do males and females differ on mean level of a depression latent factor?

2.1 Mean structure

We can ask lavaan to include the mean (intercept) in the model in this case using meanstructure=TRUE:

mlav_w_intercept <- sem(lavaan_m, data=BostonSmall, meanstructure=TRUE)
summary(mlav_w_intercept)
## lavaan 0.6-2 ended normally after 12 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          3
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -1.346    0.116  -11.567    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            21.013    0.386   54.494    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            66.549    4.184   15.906    0.000

2.2 Details about parameters in the model

It’s good to get in the habit of examining the ‘parameter’ table in lavaan, which provides an important summary of what parameters are free in the model (i.e., have to be estimated), and what parameters were requested by the user (you!) in the model syntax.

parTable(mlav)
id lhs op rhs user block group free ustart exo label plabel start est se
1 cmedv ~ log_crim 1 1 1 1 NA 0 .p1. 0.00 -1.35 0.116
2 cmedv ~~ cmedv 0 1 1 2 NA 0 .p2. 42.07 66.55 4.184
3 log_crim ~~ log_crim 0 1 1 0 NA 1 .p3. 9.71 9.71 0.000

Here, ‘user’ refers to a parameter we’ve request explicitly in the syntax, and non-zero values for the ‘free’ column denote parameters that are freely estimated by the model.

Note that we can get standardized estimates in lavaan as well. This is a more complicated topic in SEM because we can standardize with respect to the latent variables alone (std.lv) or both the observed and latent variables (std.all). The latter is usually what is reported as standardized estimates in SEM papers.

2.3 Standardized estimates

standardizedSolution(mlav, type="std.all")
lhs op rhs est.std se z pvalue ci.lower ci.upper
cmedv ~ log_crim -0.457 0.033 -13.7 0 -0.523 -0.392
cmedv ~~ cmedv 0.791 0.030 26.0 0 0.731 0.851
log_crim ~~ log_crim 1.000 0.000 NA NA 1.000 1.000

3 Path analysis on housing data

Let’s look at something more interesting. What if we believe that the level nitric oxides (nox) also predicts home prices alongside crime? We can add this as a predictor as in standard multiple regression.

Furthermore, we hypothesize that the proximity of a home to large highways (rad) predicts the concentration of nitric oxides, which predicts lower home prices?

The model syntax could be specified as:

lavaan_m2 <- '
cmedv ~ log_crim + nox #crime and nox predict lower home prices
nox ~ rad #proximity to highways predicts nox
'
mlav2 <- sem(lavaan_m2, data=BostonSmall)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, :
## lavaan WARNING: some observed variances are (at least) a factor 1000 times
## larger than others; use varTable(fit) to investigate

The model looks like this (using the handy semPaths function from semPlot):

semPaths(mlav2, what='std', nCharNodes=6, sizeMan=10,
         edge.label.cex=1.25, curvePivot = TRUE, fade=FALSE)

Here’s the text output:

summary(mlav2)
## lavaan 0.6-2 ended normally after 33 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox             -14.391    3.643   -3.950    0.000
##   nox ~                                               
##     rad               0.008    0.000   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox               0.008    0.001   15.906    0.000

A few things to note:

  1. Note the warning: ‘some observed variances are (at least) a factor 1000 times larger than others.’ This is called ill conditioning.
  2. Our hypotheses all seem to be supported.
  3. The model chi-square is highly significant, suggesting poor global model fit.

3.1 Ill conditioning

Parameter estimation can be hampered when the variances of variables in the model differ substantially (orders of magnitude). Given the above warning, let’s take a look.

varTable(mlav2)
name idx nobs type exo user mean var nlev lnam
cmedv 1 506 numeric 0 0 22.529 84.312 0
nox 3 506 numeric 0 0 0.555 0.013 0
log_crim 6 506 numeric 1 0 -1.126 9.729 0
rad 5 506 numeric 1 0 9.549 75.816 0

Looks like the scale of nox is much smaller than any other predictor, likely because it’s in parts per 10 million! We can rescale variables in this case by multiplying by a constant. This has no effect on the fit or interpretation of the model – we just have to recall what the new units represent. Also, you can always divide out the constant from the parameter estimate to recover the original units, if important.

BostonSmall <- BostonSmall %>% mutate(nox = nox*100) #not parts per 100,000 not 10 million
mlav2 <- sem(lavaan_m2, data=BostonSmall)
summary(mlav2)
## lavaan 0.6-2 ended normally after 18 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox              -0.144    0.036   -3.950    0.000
##   nox ~                                               
##     rad               0.814    0.047   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              83.910    5.275   15.906    0.000

3.2 Model fit indices

You can request more detailed global fit indices from lavaan in the model summary output using fit.measures=TRUE.

summary(mlav2, fit.measures=TRUE)
## lavaan 0.6-2 ended normally after 18 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              638.018
##   Degrees of freedom                                 5
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.570
##   Tucker-Lewis Index (TLI)                      -0.076
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3614.747
##   Loglikelihood unrestricted model (H1)      -3477.566
## 
##   Number of free parameters                          5
##   Akaike (AIC)                                7239.493
##   Bayesian (BIC)                              7260.626
##   Sample-size adjusted Bayesian (BIC)         7244.755
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.519
##   90 Percent Confidence Interval          0.468  0.571
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.090
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox              -0.144    0.036   -3.950    0.000
##   nox ~                                               
##     rad               0.814    0.047   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              83.910    5.275   15.906    0.000

You can also get just the fit measures (including additional statistics) using fitmeasures():

fitmeasures(mlav2)
##                npar                fmin               chisq 
##               5.000               0.271             274.360 
##                  df              pvalue      baseline.chisq 
##               2.000               0.000             638.018 
##         baseline.df     baseline.pvalue                 cfi 
##               5.000               0.000               0.570 
##                 tli                nnfi                 rfi 
##              -0.076              -0.076               1.000 
##                 nfi                pnfi                 ifi 
##               0.570               0.228               0.572 
##                 rni                logl   unrestricted.logl 
##               0.570           -3614.747           -3477.566 
##                 aic                 bic              ntotal 
##            7239.493            7260.626             506.000 
##                bic2               rmsea      rmsea.ci.lower 
##            7244.755               0.519               0.468 
##      rmsea.ci.upper        rmsea.pvalue                 rmr 
##               0.571               0.000               4.249 
##          rmr_nomean                srmr        srmr_bentler 
##               4.249               0.090               0.090 
## srmr_bentler_nomean         srmr_bollen  srmr_bollen_nomean 
##               0.090               0.089               0.089 
##          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.089               0.089              12.050 
##               cn_01                 gfi                agfi 
##              17.987               0.748              -0.259 
##                pgfi                 mfi                ecvi 
##               0.150               0.764               0.562

These look atrocious: CFI is < .95 (and much less than even .9), and RMSEA is much greater than the .08 level that we would consider just ‘okay.’

3.3 Model diagnostics

This suggests the need to examine the fit in more detail. First, we can look at the mismatch between the model-implied and observed covariance matrices.

Conceptually, the goal of structural equation modeling (SEM) is to test whether a theoretically motivated model of the covariance among variables provides a good approximation of the data.

More specifically, we are trying to test how well a parsimonious model (composed of measurement and/or structural components) reproduces the observed covariance matrix. Formally, we are seeking to develop a model whose model-implied covariance matrix approaches the sample (observed) covariance matrix.

\[ \mathbf{S_{XX}} \approx \mathbf{\Sigma}(\hat{\theta}) \]

We can obtain these from lavaan for further diagnosis of model misfit.

First, the model-implied covariance matrix:

fitted(mlav2)
## $cov
##          cmedv  nox    lg_crm rad   
## cmedv     81.58                     
## nox      -36.69 134.01              
## log_crim -11.69  18.82   9.71       
## rad      -30.25  61.57  23.13  75.67
## 
## $mean
##    cmedv      nox log_crim      rad 
##        0        0        0        0

We might be able to interpret this more easily in correlational (standardized) units. That is, what’s the model-implied correlation among variables? The inspect function in lavaan gives access to a number of model details, including this:

inspect(mlav2, what="cor.all")
##          cmedv  nox    lg_crm rad   
## cmedv     1.000                     
## nox      -0.351  1.000              
## log_crim -0.415  0.522  1.000       
## rad      -0.385  0.611  0.853  1.000

How does this compare to the observed correlations?

lavCor(mlav2)
##          cmedv  nox    lg_crm rad   
## cmedv     1.000                     
## nox      -0.429  1.000              
## log_crim -0.457  0.789  1.000       
## rad      -0.385  0.611  0.853  1.000

In particular, getting the misfit of the bivariate associations is very helpful. Here, we ask for residuals in correlational units, which can be more intuitive than dealing with covariances that are unstandardized. Note that this is the subtraction of the observed - model-implied matrices above. Large positive values indicate the model underpredicts the correlation; large negative values suggest overprediction of correlation. Usually values \(|r > .1|\) are worth closer consideration.

resid(mlav2, "cor")
## $type
## [1] "cor.bollen"
## 
## $cor
##          cmedv  nox    lg_crm rad   
## cmedv     0.000                     
## nox      -0.078  0.000              
## log_crim -0.042  0.267  0.000       
## rad       0.000  0.000  0.000  0.000
## 
## $mean
##    cmedv      nox log_crim      rad 
##        0        0        0        0

So the model significantly underpredicts the association between nox and log_crim.

We could visualize the problems, too:

plot_matrix <- function(matrix_toplot){
corrplot::corrplot(matrix_toplot, is.corr = FALSE,
               type = 'lower',
               order = "original",
               tl.col='black', tl.cex=.75)
}
plot_matrix(residuals(mlav2, type="cor")$cor)

3.4 Modification indices

Let’s take a look at the modification indices to see if we can fix the misfit by freeing one or more paths, particularly the relationship between nox and log_crim.

modificationindices(mlav2, minimum.value = 20) #only print MIs > 20 
lhs op rhs mi epc sepc.lv sepc.all sepc.nox
11 nox ~ cmedv 38.5 -0.748 -0.748 -0.584 -0.584
12 nox ~ log_crim 211.7 3.648 3.648 0.982 0.315
14 log_crim ~ nox 82.8 0.045 0.045 0.167 0.167
17 rad ~ nox 80.2 -0.142 -0.142 -0.189 -0.189

Here, we see that model fit would improve massively if we allowed log_crim to predict nox. Whether this makes theoretical sense is a different (and likely more important) matter. For demonstration purposes, let’s accept that this path needs to be freely estimated.

#we can use the add parameter to add a path, while leaving all other model elements the same
mlav3 <- update(mlav2, add="nox ~ log_crim")
summary(mlav3, fit.measures=TRUE)
## lavaan 0.6-2 ended normally after 25 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.080
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.777
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              638.018
##   Degrees of freedom                                 5
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.007
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3477.606
##   Loglikelihood unrestricted model (H1)      -3477.566
## 
##   Number of free parameters                          6
##   Akaike (AIC)                                6967.213
##   Bayesian (BIC)                              6992.572
##   Sample-size adjusted Bayesian (BIC)         6973.527
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.078
##   P-value RMSEA <= 0.05                          0.880
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.002
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.188   -4.924    0.000
##     nox              -0.144    0.051   -2.847    0.004
##   nox ~                                               
##     rad              -0.302    0.068   -4.403    0.000
##     log_crim          3.648    0.191   19.081    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              48.798    3.068   15.906    0.000

This looks way better in terms of fit. There is a strong positive association between crime and nox levels that we missed before. Conceptually, this suggests that the relationship between crime and home prices is partially mediated by crime’s effect on pollutant levels. By contrast, the effect of proximity to highways on home prices appears to be fully mediated by pollutant levels (as indicated by the absence of a large modification index for this path).

semPaths(mlav3, what='std', nCharNodes=6, sizeMan=10,
         edge.label.cex=1.25, curvePivot = TRUE, fade=FALSE)