1 Introduction

There are several freely available packages for structural equation modeling (SEM), both in and outside of R. In the R world, the three most popular are lavaan, OpenMx, and sem. I have tended to prefer lavaan because of its user-friendly syntax, which mimics key aspects of Mplus. Although OpenMx provides a broader set of functions, the learning curve is steeper.

Here, we will consider models in which all variables are observed/manifest, as well as models with latent variables. The former is sometimes called ‘path analysis’, whereas the latter is sometimes called a ‘measurement model.’ Strictly speaking, when we combine a measurement model with a structural component (i.e., directed regression paths), we have a structural equation model.

2 Use lavaan for simple multiple regression

SEM is largely a multivariate extension of regression in which we can examine many predictors and outcomes at once. More specifically, the idea of ‘structural equations’ refers to the fact that we have more than one equation representing a model of covariance structure in which we (usually) have multiple criterion variables and multiple predictors. In addition, SEM offers the potential to model the presence of latent variables that explain covariance among the observed variables.

Let’s start with the simple demonstration that a path model in SEM can recapitulate simple single predictor-single outcome regression. We’ll examine housing price data from Boston’s 1970 census to review important concepts in correlation and regression. This is a nice dataset for regression because there are many interdependent variables: crime, pollutants, age of properties, etc.

We can estimate the same multiple regression models using the lavaan package that we’ll be using for more complex SEMs. And the syntax even has many similarities with lm(), which is used in standard single-outcome multiple regression.

2.1 Housing prices as a function of crime rate

First, note that crime rate is not normally distributed, but instead appears to follow a log-normal distribution. Let’s log2 transform crime rate such that a unit increase in log2(crime) represents a two-fold increase in the crime rate.

#example dataset from mlbench package with home prices in Boston by census tract
data(BostonHousing2)
BostonSmall <- BostonHousing2 %>% dplyr::select(
  cmedv, #median value of home in 1000s
  crim, #per capita crime by town
  nox, #nitric oxide concentration
  lstat, #proportion of lower status
  rad #proximity to radial highways
  ) %>% mutate(log_crim = log2(crim))


g1 <- ggplot(BostonSmall, aes(x=crim)) + geom_histogram(bins=15) + 
  theme_classic() + ggtitle("Histogram of crime rate")

g2 <- ggplot(BostonSmall, aes(x=crim, y=cmedv)) + geom_point() + 
  stat_smooth(method="lm") + ggtitle("Association of crime rate and housing prices") 
plot_grid(g1, g2, nrow=1)

ggplot(BostonSmall, aes(x=log_crim)) + geom_histogram(bins=15) + 
  theme_classic() + ggtitle("Histogram of log2 crime rate")

2.1.1 Standard GLM

summary(lm(cmedv ~ log_crim, BostonSmall))
## 
## Call:
## lm(formula = cmedv ~ log_crim, data = BostonSmall)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.31  -5.17  -2.48   2.66  33.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   21.013      0.386    54.4   <2e-16 ***
## log_crim      -1.346      0.117   -11.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.17 on 504 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.208 
## F-statistic:  133 on 1 and 504 DF,  p-value: <2e-16

2.1.2 Run through lavaan

Here’s the single-predictor regression from above, run as a path model in lavaan:

lavaan_m <- 'cmedv ~ log_crim'
mlav <- sem(lavaan_m, data=BostonSmall)
summary(mlav)
## lavaan 0.6-4 ended normally after 11 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          2
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -1.346    0.116  -11.567    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            66.549    4.184   15.906    0.000


The regression coefficient is identical (good!). One thing to note is that we don’t have an intercept in the lavaan output. This highlights an important difference: basic SEM focuses on the covariance structure of the data. We can include means as well, but typically only when they are relevant to our scientific questions. For example, do males and females differ on the mean level of a depression latent factor?

2.2 Mean structure

We can ask lavaan to include the mean (intercept) in the model in this case using meanstructure=TRUE:

mlav_w_intercept <- sem(lavaan_m, data=BostonSmall, meanstructure=TRUE)
summary(mlav_w_intercept)
## lavaan 0.6-4 ended normally after 12 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          3
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -1.346    0.116  -11.567    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            21.013    0.386   54.494    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            66.549    4.184   15.906    0.000

2.3 Details about parameters in the model

It’s good to get in the habit of examining the ‘parameter table’ in lavaan, which provides an important summary of which parameters are free in the model (i.e., have to be estimated) and which parameters you (the user!) requested explicitly in the model syntax.

parTable(mlav)
 id lhs      op rhs      user block group free ustart exo label plabel start   est    se
  1 cmedv    ~  log_crim    1     1     1    1     NA   0       .p1.   0.00 -1.35 0.116
  2 cmedv    ~~ cmedv       0     1     1    2     NA   0       .p2.  42.07 66.55 4.184
  3 log_crim ~~ log_crim    0     1     1    0     NA   1       .p3.   9.71  9.71 0.000

Here, a 1 in the ‘user’ column refers to a parameter we’ve requested explicitly in the syntax. Non-zero values in the ‘free’ column denote parameters that are freely estimated by the model; these are numbered sequentially in ascending order.

Note that we can get standardized estimates in lavaan as well. This is a more complicated topic in SEM because we can standardize with respect to the latent variables alone (std.lv) or both the observed and latent variables (std.all). The latter is usually what is reported as standardized estimates in SEM papers and is sometimes called the ‘completely standardized solution.’

2.4 Standardized estimates

standardizedSolution(mlav, type="std.all")
 lhs      op rhs      est.std    se     z pvalue ci.lower ci.upper
 cmedv    ~  log_crim  -0.457 0.033 -13.7      0   -0.523   -0.392
 cmedv    ~~ cmedv      0.791 0.030  26.0      0    0.731    0.851
 log_crim ~~ log_crim   1.000 0.000    NA     NA    1.000    1.000

We can also ask that the standardized estimates be added to the same output as the overall model by adding standardized=TRUE to the summary() call:

summary(mlav_w_intercept, standardized=TRUE)
## lavaan 0.6-4 ended normally after 12 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          3
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   cmedv ~                                                               
##     log_crim         -1.346    0.116  -11.567    0.000   -1.346   -0.457
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .cmedv            21.013    0.386   54.494    0.000   21.013    2.291
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .cmedv            66.549    4.184   15.906    0.000   66.549    0.791

3 Path analysis on housing data

Let’s look at something more interesting. What if we believe that the level of nitric oxides (nox) also predicts home prices alongside crime? We can add it as a predictor, as in standard multiple regression.

Furthermore, we hypothesize that the proximity of a home to large highways (rad) predicts the concentration of nitric oxides, which in turn predicts lower home prices.

The model syntax could be specified as:

lavaan_m2 <- '
cmedv ~ log_crim + nox #crime and nox predict lower home prices
nox ~ rad #proximity to highways predicts nox
'
mlav2 <- sem(lavaan_m2, data=BostonSmall)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, :
## lavaan WARNING: some observed variances are (at least) a factor 1000 times
## larger than others; use varTable(fit) to investigate

3.1 Model output in graphical form

The model looks like this (using the handy semPaths function from semPlot):

semPaths(mlav2, what='std', nCharNodes=6, sizeMan=10,
         edge.label.cex=1.25, curvePivot = TRUE, fade=FALSE)

Here’s the text output:

summary(mlav2)
## lavaan 0.6-4 ended normally after 33 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox             -14.391    3.643   -3.950    0.000
##   nox ~                                               
##     rad               0.008    0.000   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox               0.008    0.001   15.906    0.000

A few things to note:

  1. Note the warning: ‘some observed variances are (at least) a factor 1000 times larger than others.’ This is called ill conditioning.
  2. Our hypotheses all seem to be supported.
  3. The model chi-square is highly significant, suggesting poor global model fit.

3.2 Ill conditioning

Parameter estimation can be hampered when the variances of variables in the model differ substantially (orders of magnitude). Given the above warning, let’s take a look.

varTable(mlav2)
 name     idx nobs type    exo user   mean    var nlev lnam
 cmedv      1  506 numeric   0    0 22.529 84.312    0
 nox        3  506 numeric   0    0  0.555  0.013    0
 log_crim   6  506 numeric   1    0 -1.126  9.729    0
 rad        5  506 numeric   1    0  9.549 75.816    0

Looks like the scale of nox is much smaller than that of any other variable, likely because it’s measured in parts per 10 million! We can rescale a variable in this case by multiplying it by a constant. This has no effect on the fit or interpretation of the model – we just have to recall what the new units represent. And you can always undo the rescaling arithmetically on a parameter estimate to recover the original units, if important (see the quick check after the refitted model below).

#now parts per 100,000 not 10 million
BostonSmall <- BostonSmall %>% mutate(nox = nox*100)
mlav2 <- sem(lavaan_m2, data=BostonSmall)
summary(mlav2)
## lavaan 0.6-4 ended normally after 18 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox              -0.144    0.036   -3.950    0.000
##   nox ~                                               
##     rad               0.814    0.047   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              83.910    5.275   15.906    0.000

Note: the parameter estimates involving nox have changed by two orders of magnitude because of our transformation, but the test statistics are unchanged, as we would expect. We’ve just rescaled the variable to help model estimation proceed more smoothly.
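
As a quick sanity check on the units (a sketch, assuming mlav2 is the refitted model above): when a predictor is multiplied by a constant, its slope shrinks by that factor; when an outcome is multiplied by a constant, slopes predicting it grow by that factor.

coef(mlav2)[["cmedv~nox"]] * 100 #nox as predictor: multiply to recover the original slope, ~ -14.39
coef(mlav2)[["nox~rad"]] / 100   #nox as outcome: divide to recover the original slope, ~ 0.008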

3.3 Model fit indices

One of the major advantages of standard SEM over simpler regression techniques is that there are many global fit indices that can help us judge how far our model’s predictions lie from the observed data. Although this is a big literature, and there is much more to be said on this topic, below are the most commonly reported global fit indices and their typical ‘cutoffs’ for good fit.

Fit index                                          Goodness or badness index?   Range   Cutoff for good fit
\(\chi^2_M\)                                       Badness                      > 0     p > .05 (p < .05 indicates misfit)
Comparative fit index (CFI)                        Goodness                     0–1     > .95
Root mean square error of approximation (RMSEA)    Badness                      > 0     < .06
Standardized root mean square residual (SRMR)      Badness                      > 0     < .08

Conceptually, the goal of structural equation modeling (SEM) is to test whether a theoretically motivated model of the covariance among variables provides a good approximation of the data.

More specifically, we are trying to test how well a parsimonious model (composed of measurement and/or structural components) reproduces the observed covariance matrix. Formally, we are seeking to develop a model whose model-implied covariance matrix approaches the sample (observed) covariance matrix.

\[ \mathbf{S_{XX}} \approx \mathbf{\Sigma}(\hat{\theta}) \]

You can request more detailed global fit indices from lavaan in the model summary output using fit.measures=TRUE in the summary() call.

summary(mlav2, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 18 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              638.018
##   Degrees of freedom                                 5
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.570
##   Tucker-Lewis Index (TLI)                      -0.076
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3614.747
##   Loglikelihood unrestricted model (H1)      -3477.566
## 
##   Number of free parameters                          5
##   Akaike (AIC)                                7239.493
##   Bayesian (BIC)                              7260.626
##   Sample-size adjusted Bayesian (BIC)         7244.755
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.519
##   90 Percent Confidence Interval          0.468  0.571
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.090
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox              -0.144    0.036   -3.950    0.000
##   nox ~                                               
##     rad               0.814    0.047   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              83.910    5.275   15.906    0.000

You can also get just the fit measures (including additional statistics) using fitmeasures():

fitmeasures(mlav2)
##                npar                fmin               chisq 
##               5.000               0.271             274.360 
##                  df              pvalue      baseline.chisq 
##               2.000               0.000             638.018 
##         baseline.df     baseline.pvalue                 cfi 
##               5.000               0.000               0.570 
##                 tli                nnfi                 rfi 
##              -0.076              -0.076               1.000 
##                 nfi                pnfi                 ifi 
##               0.570               0.228               0.572 
##                 rni                logl   unrestricted.logl 
##               0.570           -3614.747           -3477.566 
##                 aic                 bic              ntotal 
##            7239.493            7260.626             506.000 
##                bic2               rmsea      rmsea.ci.lower 
##            7244.755               0.519               0.468 
##      rmsea.ci.upper        rmsea.pvalue                 rmr 
##               0.571               0.000               4.249 
##          rmr_nomean                srmr        srmr_bentler 
##               4.249               0.090               0.090 
## srmr_bentler_nomean                crmr         crmr_nomean 
##               0.090               0.115               0.115 
##          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.089               0.089              12.050 
##               cn_01                 gfi                agfi 
##              17.987               0.748              -0.259 
##                pgfi                 mfi                ecvi 
##               0.150               0.764               0.562

3.4 Model diagnostics

The fit for this model looks atrocious: CFI is < .95 (and much less than even .9), and RMSEA is much greater than the .08 level that we would consider just ‘okay.’

This suggests the need to examine the fit in more detail. First, we can look at the mismatch between the model-implied and observed covariance matrices.

We can obtain these from lavaan for further diagnosis of model misfit.

3.4.1 Model-implied covariance matrix

fitted(mlav2)
## $cov
##          cmedv  nox    lg_crm rad   
## cmedv     81.58                     
## nox      -36.69 134.01              
## log_crim -11.69  18.82   9.71       
## rad      -30.25  61.57  23.13  75.67

3.4.2 Model-implied correlation matrix

We might be able to interpret this more easily in correlational (standardized) units. That is, what’s the model-implied correlation among variables? The inspect function in lavaan gives access to a number of model details, including this:

inspect(mlav2, what="cor.all")
##          cmedv  nox    lg_crm rad   
## cmedv     1.000                     
## nox      -0.351  1.000              
## log_crim -0.415  0.522  1.000       
## rad      -0.385  0.611  0.853  1.000

3.4.3 Observed correlation matrix

How does this compare to the observed correlations?

lavCor(mlav2)
##          cmedv  nox    lg_crm rad   
## cmedv     1.000                     
## nox      -0.429  1.000              
## log_crim -0.457  0.789  1.000       
## rad      -0.385  0.611  0.853  1.000

3.4.4 Residual correlations

In particular, let’s examine the misfit of the bivariate associations, which is very helpful. Here, we ask for residuals in correlational units, which can be more intuitive than dealing with unstandardized covariances. Note that this is the observed correlation matrix minus the model-implied correlation matrix shown above. Large positive values indicate that the model underpredicts the correlation; large negative values suggest overprediction of the correlation. Usually values \(|r| > .1\) are worth closer consideration.

resid(mlav2, "cor")
## $type
## [1] "cor.bollen"
## 
## $cov
##          cmedv  nox    lg_crm rad   
## cmedv     0.000                     
## nox      -0.078  0.000              
## log_crim -0.042  0.267  0.000       
## rad       0.000  0.000  0.000  0.000

So the model substantially underpredicts the association between nox and log_crim.
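
To confirm the ‘observed minus model-implied’ logic, we can also compute these residuals by hand from the two correlation matrices above (a sketch):

#observed correlations minus model-implied correlations
round(lavCor(mlav2) - inspect(mlav2, what="cor.all"), 3)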

We could visualize the problems, too:

ggcorrplot(residuals(mlav2, type="cor")$cov, type="lower")

3.5 Modification indices

Let’s take a look at the modification indices to see if we can fix the misfit by freeing one or more paths, particularly the relationship between nox and log_crim. Briefly, modification indices examine how much the model \(\chi^2\) would improve if we added a given path to the model.

Remember: if a path is not explicitly part of the model, it is implicitly zero in the estimation. So, if there is a strong association between two variables, but we omit a link between them in the model, our model fit will suffer!

modificationindices(mlav2, minimum.value = 20) #only print MIs > 20 
    lhs      op rhs         mi    epc sepc.lv sepc.all sepc.nox
 11 nox      ~  cmedv     38.5 -0.748  -0.748   -0.584   -0.584
 12 nox      ~  log_crim 211.7  3.648   3.648    0.982    0.315
 14 log_crim ~  nox       82.8  0.045   0.045    0.167    0.167
 17 rad      ~  nox       80.2 -0.142  -0.142   -0.189   -0.189

Here, we see that model fit would improve massively if we allowed log_crim to predict nox. Whether this makes theoretical sense is a different (and likely more important) matter. For demonstration purposes, let’s accept that this path needs to be freely estimated.

We can use the add argument to update() to add a path, while leaving all other model elements the same.

mlav3 <- update(mlav2, add="nox ~ log_crim")
summary(mlav3, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 25 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.080
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.777
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              638.018
##   Degrees of freedom                                 5
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.007
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3477.606
##   Loglikelihood unrestricted model (H1)      -3477.566
## 
##   Number of free parameters                          6
##   Akaike (AIC)                                6967.213
##   Bayesian (BIC)                              6992.572
##   Sample-size adjusted Bayesian (BIC)         6973.527
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.078
##   P-value RMSEA <= 0.05                          0.880
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.002
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.188   -4.924    0.000
##     nox              -0.144    0.051   -2.847    0.004
##   nox ~                                               
##     rad              -0.302    0.068   -4.403    0.000
##     log_crim          3.648    0.191   19.081    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              48.798    3.068   15.906    0.000

This looks way better in terms of fit. There is a strong positive association between crime and nox levels that we missed before. Conceptually, this suggests that the relationship between crime and home prices is partially mediated by crime’s effect on pollutant levels. By contrast, the effect of proximity to highways on home prices appears to be fully mediated by pollutant levels (as indicated by the absence of a large modification index for this path).

semPaths(mlav3, what='std', nCharNodes=6, sizeMan=10,
         edge.label.cex=1.25, curvePivot = TRUE, fade=FALSE)

4 Testing mediation

If the above model is conceptually reasonable and we were specifically interested in testing mediation, we would typically want to 1) test the indirect effects focally, and 2) use a method for significance testing of the mediated effects that provides trustworthy p-values. As noted some time ago (e.g., MacKinnon et al., 2007), the appropriate test for mediation in an SEM framework is based on the product of the constituent paths that comprise the mediation. Here, we have just two paths in two mediational chains:

\[ \begin{align*} \textrm{rad} &\rightarrow \textrm{nox} \rightarrow \textrm{cmedv} \\ \textrm{log_crim} &\rightarrow \textrm{nox} \rightarrow \textrm{cmedv} \\ \end{align*} \]

To test these specific indirect effects, we need to define new parameters in the lavaan model that are products of the individual paths. This can be accomplished using the := operator (‘defined as’). Note that this does not change the number of free parameters in the model because the new quantities are just products of existing parameters. To tell lavaan which estimates to multiply, we use ‘parameter labels’, assigned by pre-multiplying a variable in the syntax by an arbitrary label. Here I’ve used ‘a1’ and ‘a2’ for the X -> M paths, and ‘b1’ for the M -> Y path (in Baron-Kenny terms).

m_update <- '
cmedv ~ log_crim + b1*nox #crime and nox predict lower home prices
nox ~ a1*rad + a2*log_crim #proximity to highways predicts nox

i_1 := a1*b1
i_2 := a2*b1
'

mlav4 <- sem(m_update, BostonSmall)
summary(mlav4)
## lavaan 0.6-4 ended normally after 25 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.080
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.777
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.188   -4.924    0.000
##     nox       (b1)   -0.144    0.051   -2.847    0.004
##   nox ~                                               
##     rad       (a1)   -0.302    0.068   -4.403    0.000
##     log_crim  (a2)    3.648    0.191   19.081    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              48.798    3.068   15.906    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     i_1               0.043    0.018    2.391    0.017
##     i_2              -0.525    0.186   -2.816    0.005

This looks promising, but as I noted above, this ‘delta method’ for testing mediation (based on asymptotic normal theory) is known to be problematic because the sampling distribution of the indirect path product term is not normal. Bootstrapping is a common workaround for this dilemma that does not make strong assumptions about the distribution of the coefficient of interest (i.e., the sampling distributions of the two mediated paths). We can implement this using the argument se = "bootstrap". This will re-estimate the standard errors of the parameter estimates using 1000 nonparametric bootstrap samples by default. You can change the number of bootstrap samples using the bootstrap argument.

mlav4_boot <- sem(m_update, BostonSmall, se = "bootstrap")
summary(mlav4_boot)
## lavaan 0.6-4 ended normally after 25 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.080
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.777
## 
## Parameter Estimates:
## 
##   Standard Errors                            Bootstrap
##   Number of requested bootstrap draws             1000
##   Number of successful bootstrap draws            1000
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.161   -5.728    0.000
##     nox       (b1)   -0.144    0.037   -3.857    0.000
##   nox ~                                               
##     rad       (a1)   -0.302    0.102   -2.946    0.003
##     log_crim  (a2)    3.648    0.281   12.962    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    7.064    9.273    0.000
##    .nox              48.798    3.931   12.414    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     i_1               0.043    0.019    2.278    0.023
##     i_2              -0.525    0.141   -3.730    0.000

As we suspected, both indirect paths are significant, supporting statistical mediation.
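
For reporting, bootstrap confidence intervals on the indirect effects are often preferred to z-tests computed from bootstrap standard errors. We can request these from the fitted object; here is a sketch asking for percentile intervals:

#percentile bootstrap CIs for all parameters, including the defined indirect effects
parameterEstimates(mlav4_boot, boot.ci.type="perc", level=0.95)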

5 SEM with latent variables

What about when we are interested in testing models with latent variables? Typically this will be a ‘reflective latent variable’ model in which we think that a putative latent variable is measured by several (usually 3+) manifest indicators. Such variables are often called ‘factors’ or ‘latent traits.’ In an SEM world, confirmatory factor analysis is the most common reflective latent variable model.

Such models are specified in lavaan using the =~ operator (‘measured by’).

Let’s take the Holzinger-Swineford example from the lavaan tutorial, where there were 9 items measuring putatively different facets of intelligence: visual, textual, and speed. The observed variables are x1-x9.

From the ‘lavaan’ tutorial: This is a ‘classic’ dataset that is used in many papers and books on Structural Equation Modeling (SEM), including some manuals of commercial SEM software packages. The data consists of mental ability test scores of seventh- and eighth-grade children from two different schools (Pasteur and Grant-White). In our version of the dataset, only 9 out of the original 26 tests are included. A CFA model that is often proposed for these 9 variables consists of three latent variables (or factors), each with three indicators:

A 3 factor CFA example

5.1 Specifying a factor model

The corresponding lavaan syntax for specifying this model is as follows:

     visual =~ x1 + x2 + x3
    textual =~ x4 + x5 + x6
      speed =~ x7 + x8 + x9

In this example, the model syntax only contains three ‘latent variable definitions’. Each formula has the following format:

    latent variable =~ indicator1 + indicator2 + indicator3

5.2 Typical CFA output

By default, the first indicator has a fixed loading of 1 to scale the underlying factor (a ‘unit loading identification’). Let’s take a look:
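
(The fit object summarized below is produced by passing the three-factor syntax to cfa(); a minimal sketch:)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

fit <- cfa(HS.model, data=HolzingerSwineford1939)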

summary(fit, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 35 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         21
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      85.306
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              918.852
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.931
##   Tucker-Lewis Index (TLI)                       0.896
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3737.745
##   Loglikelihood unrestricted model (H1)      -3695.092
## 
##   Number of free parameters                         21
##   Akaike (AIC)                                7517.490
##   Bayesian (BIC)                              7595.339
##   Sample-size adjusted Bayesian (BIC)         7528.739
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.092
##   90 Percent Confidence Interval          0.071  0.114
##   P-value RMSEA <= 0.05                          0.001
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.554    0.100    5.554    0.000
##     x3                0.729    0.109    6.685    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.113    0.065   17.014    0.000
##     x6                0.926    0.055   16.703    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.180    0.165    7.152    0.000
##     x9                1.082    0.151    7.155    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.408    0.074    5.552    0.000
##     speed             0.262    0.056    4.660    0.000
##   textual ~~                                          
##     speed             0.173    0.049    3.518    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.549    0.114    4.833    0.000
##    .x2                1.134    0.102   11.146    0.000
##    .x3                0.844    0.091    9.317    0.000
##    .x4                0.371    0.048    7.779    0.000
##    .x5                0.446    0.058    7.642    0.000
##    .x6                0.356    0.043    8.277    0.000
##    .x7                0.799    0.081    9.823    0.000
##    .x8                0.488    0.074    6.573    0.000
##    .x9                0.566    0.071    8.003    0.000
##     visual            0.809    0.145    5.564    0.000
##     textual           0.979    0.112    8.737    0.000
##     speed             0.384    0.086    4.451    0.000
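
As an aside, the unit loading identification is not the only way to scale the factors: we could instead fix each latent variance to 1 and estimate all loadings freely, which lavaan supports via the std.lv argument. A sketch:

#'unit variance identification': latent variances fixed to 1, all loadings free
fit_uvi <- cfa(HS.model, data=HolzingerSwineford1939, std.lv=TRUE)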

5.3 Modification indices for CFA

modificationIndices(fit, minimum.value = 10)
    lhs    op rhs   mi    epc sepc.lv sepc.all sepc.nox
 28 visual =~ x7  18.6 -0.422  -0.380   -0.349   -0.349
 30 visual =~ x9  36.4  0.577   0.519    0.515    0.515
 76 x7     ~~ x8  34.1  0.536   0.536    0.859    0.859
 78 x8     ~~ x9  14.9 -0.423  -0.423   -0.805   -0.805

The modification indices suggest that x9 may load on the visual factor, or that there are unique residual correlations among the speed indicators. This, again, is a question for theory, but for demonstration purposes we can test a modified model that allows the residuals of x7 and x9 to correlate. We use the ~~ operator to specify a (residual) variance or covariance parameter in the model.

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 

x7 ~~ x9 #this specifies a variance or covariance parameter
'

fit2 <- cfa(HS.model, data=HolzingerSwineford1939)
summary(fit2, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 38 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         22
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      79.803
##   Degrees of freedom                                23
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              918.852
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.936
##   Tucker-Lewis Index (TLI)                       0.899
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3734.994
##   Loglikelihood unrestricted model (H1)      -3695.092
## 
##   Number of free parameters                         22
##   Akaike (AIC)                                7513.987
##   Bayesian (BIC)                              7595.544
##   Sample-size adjusted Bayesian (BIC)         7525.772
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.091
##   90 Percent Confidence Interval          0.069  0.113
##   P-value RMSEA <= 0.05                          0.001
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.062
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.554    0.099    5.574    0.000
##     x3                0.735    0.109    6.766    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.112    0.065   17.030    0.000
##     x6                0.925    0.055   16.707    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                0.865    0.186    4.664    0.000
##     x9                1.140    0.157    7.278    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .x7 ~~                                               
##    .x9               -0.216    0.113   -1.920    0.055
##   visual ~~                                           
##     textual           0.407    0.073    5.543    0.000
##     speed             0.293    0.060    4.902    0.000
##   textual ~~                                          
##     speed             0.201    0.053    3.821    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.553    0.112    4.944    0.000
##    .x2                1.135    0.102   11.163    0.000
##    .x3                0.840    0.090    9.310    0.000
##    .x4                0.370    0.048    7.757    0.000
##    .x5                0.447    0.058    7.651    0.000
##    .x6                0.357    0.043    8.301    0.000
##    .x7                0.666    0.129    5.180    0.000
##    .x8                0.635    0.087    7.275    0.000
##    .x9                0.343    0.134    2.560    0.010
##     visual            0.805    0.144    5.593    0.000
##     textual           0.981    0.112    8.748    0.000
##     speed             0.517    0.141    3.662    0.000

The fit still isn’t great. We could re-examine the modification indices:

modificationIndices(fit2, minimum.value = 20)
    lhs    op rhs   mi    epc sepc.lv sepc.all sepc.nox
 29 visual =~ x7  32.7 -0.662  -0.594   -0.546   -0.546
 31 visual =~ x9  30.1  0.715   0.641    0.637    0.637
 77 x7     ~~ x8  27.2  0.559   0.559    0.859    0.859
 78 x8     ~~ x9  27.2 -0.637  -0.637   -1.366   -1.366

Hmm, even more possibilities have cropped up – again, this would be the time to consult your prediction or theory about what the latent structure should be. This is a model building and model comparison problem, which is largely beyond the scope of this tutorial. Minimally, however, we can test the global fit differences between these models. These are nested models (because the x7 ~~ x9 residual covariance is fixed to zero in the simpler model), which allows us to use a likelihood ratio test (also called a chi-square difference test):

anova(fit, fit2)
      Df  AIC  BIC Chisq Chisq diff Df diff Pr(>Chisq)
 fit2 23 7514 7596  79.8         NA      NA         NA
 fit  24 7517 7595  85.3        5.5       1      0.019

The anova function tests the overall fit difference using an LRT approach. The degrees of freedom for the LRT equal the difference in the number of free parameters between the models (here, 1).
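
Equivalently, we can compute the test by hand from the two model chi-square statistics and degrees of freedom reported above:

#chi-square difference: 85.306 - 79.803 = 5.503 on 24 - 23 = 1 df
pchisq(85.306 - 79.803, df = 24 - 23, lower.tail = FALSE) #p ~ .019, matching anova()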

5.4 Seeing the model specification in detail

We can use the inspect function in lavaan to see where the free parameters are in the matrix specification. The free parameters are numbered sequentially, and zeros denote possible parameters that are fixed to zero (i.e., not estimated). lavaan uses the LISREL ‘all-y’ matrix specification to print the output. If you don’t know about this notation, see here: https://psu-psychology.github.io/psy-597-sem-sp2019/02_sem_basics/sem_notation.html#model-representations. And take a class on SEM to get a better sense.

inspect(fit)
## $lambda
##    visual textul speed
## x1      0      0     0
## x2      1      0     0
## x3      2      0     0
## x4      0      0     0
## x5      0      3     0
## x6      0      4     0
## x7      0      0     0
## x8      0      0     5
## x9      0      0     6
## 
## $theta
##    x1 x2 x3 x4 x5 x6 x7 x8 x9
## x1  7                        
## x2  0  8                     
## x3  0  0  9                  
## x4  0  0  0 10               
## x5  0  0  0  0 11            
## x6  0  0  0  0  0 12         
## x7  0  0  0  0  0  0 13      
## x8  0  0  0  0  0  0  0 14   
## x9  0  0  0  0  0  0  0  0 15
## 
## $psi
##         visual textul speed
## visual  16                 
## textual 19     17          
## speed   20     21     18

We can also see the parameter estimates in the matrix form:

inspect(fit, "est")
## $lambda
##    visual textul speed
## x1  1.000  0.000  0.00
## x2  0.554  0.000  0.00
## x3  0.729  0.000  0.00
## x4  0.000  1.000  0.00
## x5  0.000  1.113  0.00
## x6  0.000  0.926  0.00
## x7  0.000  0.000  1.00
## x8  0.000  0.000  1.18
## x9  0.000  0.000  1.08
## 
## $theta
##    x1    x2    x3    x4    x5    x6    x7    x8    x9   
## x1 0.549                                                
## x2 0.000 1.134                                          
## x3 0.000 0.000 0.844                                    
## x4 0.000 0.000 0.000 0.371                              
## x5 0.000 0.000 0.000 0.000 0.446                        
## x6 0.000 0.000 0.000 0.000 0.000 0.356                  
## x7 0.000 0.000 0.000 0.000 0.000 0.000 0.799            
## x8 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.488      
## x9 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.566
## 
## $psi
##         visual textul speed
## visual  0.809              
## textual 0.408  0.979       
## speed   0.262  0.173  0.384

5.5 What about a structural model?

The CFA above contains only a measurement model – a three-factor model with correlations among the factors. What if we also want to know the extent to which grade in school predicts levels on the intelligence factors (visual, textual, speed)?

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9

visual ~ grade
textual ~ grade
speed ~ grade
'

fit_structural <- cfa(HS.model, data=HolzingerSwineford1939)
summary(fit_structural, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 36 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         24
## 
##                                                   Used       Total
##   Number of observations                           300         301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      99.363
##   Degrees of freedom                                30
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              979.033
##   Degrees of freedom                                45
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.926
##   Tucker-Lewis Index (TLI)                       0.889
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3702.602
##   Loglikelihood unrestricted model (H1)      -3652.920
## 
##   Number of free parameters                         24
##   Akaike (AIC)                                7453.203
##   Bayesian (BIC)                              7542.094
##   Sample-size adjusted Bayesian (BIC)         7465.980
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.088
##   90 Percent Confidence Interval          0.069  0.107
##   P-value RMSEA <= 0.05                          0.001
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.064
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.554    0.100    5.547    0.000
##     x3                0.726    0.109    6.634    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.108    0.065   17.008    0.000
##     x6                0.921    0.055   16.670    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.118    0.143    7.831    0.000
##     x9                0.947    0.126    7.534    0.000
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~                                            
##     grade             0.428    0.125    3.428    0.001
##   textual ~                                           
##     grade             0.422    0.120    3.501    0.000
##   speed ~                                             
##     grade             0.571    0.099    5.795    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .visual ~~                                           
##    .textual           0.367    0.070    5.230    0.000
##    .speed             0.203    0.051    3.987    0.000
##  .textual ~~                                          
##    .speed             0.119    0.046    2.595    0.009
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.545    0.115    4.734    0.000
##    .x2                1.135    0.102   11.112    0.000
##    .x3                0.845    0.091    9.286    0.000
##    .x4                0.368    0.048    7.697    0.000
##    .x5                0.448    0.058    7.666    0.000
##    .x6                0.360    0.043    8.330    0.000
##    .x7                0.743    0.079    9.440    0.000
##    .x8                0.463    0.069    6.695    0.000
##    .x9                0.620    0.067    9.213    0.000
##    .visual            0.771    0.142    5.431    0.000
##    .textual           0.942    0.108    8.718    0.000
##    .speed             0.363    0.076    4.799    0.000
semPaths(fit_structural)

As one might expect, children in higher grades score higher on the latent intelligence factors.

Lastly, what if we want to use the general versus specific (residual) variance in a structural model? To get these to play appropriately in the same parameter matrices, we create a single-indicator latent variable for the item residual of interest.

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 

x1d =~ 1*x1 #define disturbance factor with loading 1.0 onto indicator (as in RAM notation)
x1 ~~ 0*x1 #zero residual variance of indicator (all loads onto disturbance)

#under standard model, disturbances are uncorrelated with factors
x1d ~~ 0*visual
x1d ~~ 0*textual
x1d ~~ 0*speed

#we can now examine whether the x1-specific variance and the visual factor uniquely predict a person's age
ageyr ~ x1d + visual
'

fit_unique <- cfa(HS.model, data=HolzingerSwineford1939)
summary(fit_unique, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 37 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         24
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     125.923
##   Degrees of freedom                                31
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              961.228
##   Degrees of freedom                                45
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.896
##   Tucker-Lewis Index (TLI)                       0.850
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -4178.263
##   Loglikelihood unrestricted model (H1)      -4115.302
## 
##   Number of free parameters                         24
##   Akaike (AIC)                                8404.527
##   Bayesian (BIC)                              8493.497
##   Sample-size adjusted Bayesian (BIC)         8417.383
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.101
##   90 Percent Confidence Interval          0.083  0.120
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.084
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.557    0.100    5.563    0.000
##     x3                0.743    0.110    6.736    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.114    0.066   17.003    0.000
##     x6                0.926    0.056   16.686    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.185    0.166    7.147    0.000
##     x9                1.101    0.154    7.142    0.000
##   x1d =~                                              
##     x1                1.000                           
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ageyr ~                                             
##     x1d              -0.154    0.123   -1.247    0.212
##     visual            0.016    0.093    0.171    0.865
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     x1d               0.000                           
##   textual ~~                                          
##     x1d               0.000                           
##   speed ~~                                            
##     x1d               0.000                           
##   visual ~~                                           
##     textual           0.391    0.073    5.373    0.000
##     speed             0.270    0.056    4.778    0.000
##   textual ~~                                          
##     speed             0.173    0.049    3.529    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.000                           
##    .x2                1.134    0.102   11.150    0.000
##    .x3                0.834    0.091    9.215    0.000
##    .x4                0.372    0.048    7.777    0.000
##    .x5                0.445    0.058    7.617    0.000
##    .x6                0.356    0.043    8.272    0.000
##    .x7                0.806    0.081    9.910    0.000
##    .x8                0.493    0.073    6.707    0.000
##    .x9                0.558    0.071    7.907    0.000
##    .ageyr             1.086    0.090   12.042    0.000
##     visual            0.799    0.144    5.547    0.000
##     textual           0.979    0.112    8.731    0.000
##     speed             0.377    0.085    4.418    0.000
##     x1d               0.559    0.112    4.972    0.000

No dice here (bad fit, nonsig structural estimates), but you get the idea.

6 Categorical data

In ‘vanilla’ SEM using maximum likelihood estimation, we assume that the data are distributed as multivariate normal. This means, in part, that we assume that each variable is univariate normal. Like so:

Importantly, however, many data in social and behavioral sciences are dichotomous or polytomous in nature. These are distributed on an ordinal, or perhaps interval, scale. Furthermore, interval scaling is an assumption we often make that may not be supported in the data. For example, on a 1-5 Likert scale, what reassurance do we have that a one-point increase between 1 and 2 is the same as 4 and 5? Instead, we often have data that look more like this:

In short, we often do not know whether the spacing between anchors is equal, a topic that gets expanded on substantially in item response theory (IRT). Furthermore, ordinal data are often a poor approximation of a normal distribution, as in the simulated item above. That distribution is quite skewed and also ‘chunky’ in the sense that there are only integer-valued responses (e.g., no responses of 1.5 or 2.2). Altogether, treating such data as normal is often a tricky proposition.

Rule of thumb: If we have 5+ anchors for an item, we can probably treat it as continuous without major missteps. If we have < 5 anchors, the assumption of normality can yield biased parameter estimates and fit statistics. See Rhemtulla et al., 2012, Psychological Methods for further details. If we assume normality, we should also examine the distribution of items to see whether they are approximately normal. If not, this may be a further nudge in favor of modeling them as ordinal, not continuous.

6.1 Modeling categorical data with a formal threshold structure

What to do? lavaan supports the formal treatment of endogenous categorical data using a threshold structure. This emerges from the view that the underlying distribution of an item is continuous (Gaussian), but our discretization (e.g., binary or polytomous) cuts this dimension at particular points. Consider an example in the spirit of Samejima’s graded response model with 4 anchors per item.

We have 4 levels of the variable (1, 2, 3, 4), but only three thresholds – each specifies the boundary between two adjacent levels (anchors). If we are motivated to account for this structure, these thresholds can be specified as free parameters in the model. This is essentially estimating where the \(\tau\) parameters fall along the continuum; they need not be evenly spaced!
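To make this concrete, here’s a small simulation in which three thresholds carve a latent standard normal into four observed anchors (the \(\tau\) values are arbitrary, chosen only for illustration):

set.seed(60)
taus <- c(-1.5, -0.2, 1.1) #three thresholds along the latent continuum (not evenly spaced)
latent <- rnorm(10000) #the putatively Gaussian latent dimension
observed <- cut(latent, breaks=c(-Inf, taus, Inf), labels=1:4) #discretize into 4 anchors
table(observed)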

Note that in our initial CFA, we treated x1 - x9 as normally/continuously distributed. It turns out that this is defensible (i.e., the items are not highly discretized):

ggplot(HolzingerSwineford1939, aes(x=x1)) + geom_histogram(bins=9)

But if we have data that have 2, 3, or 4 values, treating the variables as continuous is usually inappropriate and can lead to biased, inaccurate results.

Typically, models with a threshold structure are estimated using a ‘weighted least squares’ (WLS) estimator, rather than maximum likelihood (ML; the typical estimator in SEM). The mean and covariance adjusted WLS (aka ‘WLSMV’) is usually the way to go because it can handle nonnormality of the multivariate distribution better than the typical WLS. See DiStefano and Morgan 2014, Structural Equation Modeling, for details.

Here’s a quick demo – what if we had only trichotomous items for each of our intelligence test items?
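The code that builds HSbinary isn’t shown above; one plausible recipe (an assumption on my part, and the original cut points may differ) is to slice each item into three ordered categories:

#hypothetical construction: cut x1-x9 into three ordered levels (1, 2, 3)
HSbinary <- as.data.frame(lapply(HolzingerSwineford1939[, paste0("x", 1:9)],
                                 function(x) cut(x, breaks=3, labels=FALSE)))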

ggplot(HSbinary, aes(x=x1)) + geom_bar()

We tell lavaan which items are ordered categorical using the ordered argument.

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 
'
fit_cat <- cfa(HS.model, HSbinary, ordered=names(HSbinary))
summary(fit_cat)
## lavaan 0.6-4 ended normally after 30 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         30
## 
##   Number of observations                           301
## 
##   Estimator                                       DWLS      Robust
##   Model Fit Test Statistic                      45.123      63.573
##   Degrees of freedom                                24          24
##   P-value (Chi-square)                           0.006       0.000
##   Scaling correction factor                                  0.757
##   Shift parameter                                            3.971
##     for simple second-order correction (Mplus variant)
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model        Unstructured
##   Standard Errors                           Robust.sem
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.553    0.120    4.626    0.000
##     x3                0.643    0.125    5.121    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.085    0.067   16.275    0.000
##     x6                0.972    0.057   17.010    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.200    0.215    5.576    0.000
##     x9                1.479    0.258    5.735    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.331    0.055    6.052    0.000
##     speed             0.195    0.051    3.797    0.000
##   textual ~~                                          
##     speed             0.148    0.043    3.415    0.001
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.000                           
##    .x2                0.000                           
##    .x3                0.000                           
##    .x4                0.000                           
##    .x5                0.000                           
##    .x6                0.000                           
##    .x7                0.000                           
##    .x8                0.000                           
##    .x9                0.000                           
##     visual            0.000                           
##     textual           0.000                           
##     speed             0.000                           
## 
## Thresholds:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     x1|t1            -1.363    0.103  -13.239    0.000
##     x1|t2             0.844    0.083   10.224    0.000
##     x2|t1            -1.556    0.115  -13.508    0.000
##     x2|t2             0.741    0.080    9.259    0.000
##     x3|t1            -0.353    0.074   -4.766    0.000
##     x3|t2             0.626    0.078    8.047    0.000
##     x4|t1            -0.797    0.081   -9.799    0.000
##     x4|t2             0.956    0.086   11.151    0.000
##     x5|t1            -0.820    0.082  -10.012    0.000
##     x5|t2             0.527    0.076    6.925    0.000
##     x6|t1             0.188    0.073    2.588    0.010
##     x6|t2             1.529    0.113   13.498    0.000
##     x7|t1            -0.820    0.082  -10.012    0.000
##     x7|t2             1.024    0.088   11.641    0.000
##     x8|t1            -0.129    0.073   -1.783    0.075
##     x8|t2             1.882    0.145   12.989    0.000
##     x9|t1            -0.425    0.075   -5.678    0.000
##     x9|t2             1.835    0.140   13.131    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.266                           
##    .x2                0.775                           
##    .x3                0.697                           
##    .x4                0.267                           
##    .x5                0.136                           
##    .x6                0.308                           
##    .x7                0.684                           
##    .x8                0.545                           
##    .x9                0.308                           
##     visual            0.734    0.165    4.438    0.000
##     textual           0.733    0.060   12.216    0.000
##     speed             0.316    0.081    3.883    0.000
## 
## Scales y*:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     x1                1.000                           
##     x2                1.000                           
##     x3                1.000                           
##     x4                1.000                           
##     x5                1.000                           
##     x6                1.000                           
##     x7                1.000                           
##     x8                1.000                           
##     x9                1.000

Note that we now have threshold estimates for each item, where higher values denote a higher estimate for the boundary between one category and the next along the latent continuum that putatively underlies the item. If you want to examine the thresholds directly, you can filter the parameter table for the threshold operator |:
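#threshold parameters are denoted by the "|" operator in lavaan's parameter table
parameterEstimates(fit_cat) %>% filter(op == "|")

Additional details about categorical data are beyond the scope here!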

7 Estimators

lavaan gives you a number of different estimators for the parameters of the model. “ML” is the default for continuous data; for (partly) categorical endogenous data, the default is the WLS family (as we saw above, lavaan reported a DWLS estimator with robust corrections).

‘Robust’ variants of these estimators protect the overall model chi-square test and the standard errors (and, therefore, the significance tests) against non-normality, and potentially against other complications such as clustering. Having statistics that are robust to non-normality is usually a good thing, so many people use ‘MLR’ as their go-to estimator for continuous data and ‘WLSMV’ for categorical data.

You can specify this using the estimator argument.

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 
'
fit_mlr <- cfa(HS.model, HolzingerSwineford1939, estimator="MLR")
summary(fit_mlr, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 35 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         21
## 
##   Number of observations                           301
## 
##   Estimator                                         ML      Robust
##   Model Fit Test Statistic                      85.306      87.132
##   Degrees of freedom                                24          24
##   P-value (Chi-square)                           0.000       0.000
##   Scaling correction factor                                  0.979
##     for the Yuan-Bentler correction (Mplus variant)
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              918.852     880.082
##   Degrees of freedom                                36          36
##   P-value                                        0.000       0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.931       0.925
##   Tucker-Lewis Index (TLI)                       0.896       0.888
## 
##   Robust Comparative Fit Index (CFI)                         0.930
##   Robust Tucker-Lewis Index (TLI)                            0.895
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3737.745   -3737.745
##   Scaling correction factor                                  1.133
##     for the MLR correction
##   Loglikelihood unrestricted model (H1)      -3695.092   -3695.092
##   Scaling correction factor                                  1.051
##     for the MLR correction
## 
##   Number of free parameters                         21          21
##   Akaike (AIC)                                7517.490    7517.490
##   Bayesian (BIC)                              7595.339    7595.339
##   Sample-size adjusted Bayesian (BIC)         7528.739    7528.739
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.092       0.093
##   90 Percent Confidence Interval          0.071  0.114       0.073  0.115
##   P-value RMSEA <= 0.05                          0.001       0.001
## 
##   Robust RMSEA                                               0.092
##   90 Percent Confidence Interval                             0.072  0.114
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065       0.065
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Observed information based on                Hessian
##   Standard Errors                   Robust.huber.white
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.554    0.132    4.191    0.000
##     x3                0.729    0.141    5.170    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.113    0.066   16.946    0.000
##     x6                0.926    0.061   15.089    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.180    0.130    9.046    0.000
##     x9                1.082    0.266    4.060    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.408    0.099    4.110    0.000
##     speed             0.262    0.060    4.366    0.000
##   textual ~~                                          
##     speed             0.173    0.056    3.081    0.002
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.549    0.156    3.509    0.000
##    .x2                1.134    0.112   10.135    0.000
##    .x3                0.844    0.100    8.419    0.000
##    .x4                0.371    0.050    7.382    0.000
##    .x5                0.446    0.057    7.870    0.000
##    .x6                0.356    0.047    7.658    0.000
##    .x7                0.799    0.097    8.222    0.000
##    .x8                0.488    0.120    4.080    0.000
##    .x9                0.566    0.119    4.768    0.000
##     visual            0.809    0.180    4.486    0.000
##     textual           0.979    0.121    8.075    0.000
##     speed             0.384    0.107    3.596    0.000

Note that we now have a column of ‘robust’ global fit indices, as well as a note that standard errors are estimated using a Huber-White sandwich estimator (robust to non-normality and clustering).

8 Missing data

By default, lavaan will usually apply listwise deletion, such that people missing any modeled variable are dropped. This is usually not a great idea, both because you can lose a ton of data and because it can give a biased view of the data (for details, see Schafer and Graham 2002, Psychological Methods). Although well beyond this tutorial, it’s usually best to use so-called full-information maximum likelihood (FIML) under an assumption that the data are missing at random (MAR), a technical term meaning that missingness on a given variable can be related to other variables, but not to the variable itself. Under FIML, each parameter is estimated using all cases that have data on the variables involved. If you want to understand this better, read Craig Enders’ great 2010 book, Applied Missing Data Analysis.

In lavaan, add missing='ml' to the cfa, sem, or lavaan function calls.

I’m going to ‘miss out’ a few data points for illustration.
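The code that creates HolzingerSwineford1939_miss isn’t shown here; one hypothetical version (the affected variables, amount of missingness, and seed are my own choices) is:

#knock out some values of x5 and x9 at random so that FIML has work to do
set.seed(1001)
HolzingerSwineford1939_miss <- HolzingerSwineford1939
HolzingerSwineford1939_miss$x5[sample(nrow(HolzingerSwineford1939_miss), 40)] <- NA
HolzingerSwineford1939_miss$x9[sample(nrow(HolzingerSwineford1939_miss), 40)] <- NA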

8.1 Listwise deletion

Here’s what happens with the default:

fit_listwise <- cfa(HS.model, HolzingerSwineford1939_miss, estimator="ML")
summary(fit_listwise, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 37 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         21
## 
##                                                   Used       Total
##   Number of observations                           241         301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      70.910
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              742.755
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.934
##   Tucker-Lewis Index (TLI)                       0.900
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2963.801
##   Loglikelihood unrestricted model (H1)      -2928.346
## 
##   Number of free parameters                         21
##   Akaike (AIC)                                5969.602
##   Bayesian (BIC)                              6042.782
##   Sample-size adjusted Bayesian (BIC)         5976.217
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.090
##   90 Percent Confidence Interval          0.066  0.115
##   P-value RMSEA <= 0.05                          0.004
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.068
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.586    0.117    5.021    0.000
##     x3                0.703    0.120    5.836    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.116    0.069   16.073    0.000
##     x6                0.917    0.060   15.303    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.234    0.216    5.722    0.000
##     x9                1.254    0.219    5.721    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.402    0.080    5.007    0.000
##     speed             0.258    0.059    4.376    0.000
##   textual ~~                                          
##     speed             0.177    0.052    3.391    0.001
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.507    0.116    4.355    0.000
##    .x2                1.129    0.114    9.934    0.000
##    .x3                0.872    0.098    8.872    0.000
##    .x4                0.380    0.053    7.150    0.000
##    .x5                0.376    0.061    6.215    0.000
##    .x6                0.372    0.048    7.747    0.000
##    .x7                0.817    0.088    9.270    0.000
##    .x8                0.510    0.077    6.599    0.000
##    .x9                0.539    0.081    6.681    0.000
##     visual            0.744    0.149    4.983    0.000
##     textual           1.023    0.129    7.902    0.000
##     speed             0.286    0.082    3.472    0.001


Note the output:

                                                 Used       Total
  Number of observations                           241         301

This should be very worrisome!!

8.2 FIML-based output under a MAR assumption

And here is the model using the ML-based treatment of missing data under a missing at random (MAR) assumption.

fit_fiml <- cfa(HS.model, HolzingerSwineford1939_miss, estimator="ML", missing="ml")
summary(fit_fiml, fit.measures=TRUE)
## lavaan 0.6-4 ended normally after 54 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         30
## 
##   Number of observations                           301
##   Number of missing patterns                         3
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      82.309
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              870.958
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.930
##   Tucker-Lewis Index (TLI)                       0.895
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3663.262
##   Loglikelihood unrestricted model (H1)      -3622.107
## 
##   Number of free parameters                         30
##   Akaike (AIC)                                7386.524
##   Bayesian (BIC)                              7497.737
##   Sample-size adjusted Bayesian (BIC)         7402.594
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.090
##   90 Percent Confidence Interval          0.069  0.111
##   P-value RMSEA <= 0.05                          0.001
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.059
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Observed information based on                Hessian
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.545    0.112    4.854    0.000
##     x3                0.697    0.119    5.837    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.123    0.068   16.536    0.000
##     x6                0.924    0.059   15.626    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.190    0.151    7.857    0.000
##     x9                1.096    0.197    5.575    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.413    0.080    5.151    0.000
##     speed             0.279    0.057    4.941    0.000
##   textual ~~                                          
##     speed             0.175    0.049    3.571    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                4.985    0.068   73.253    0.000
##    .x2                6.088    0.068   89.855    0.000
##    .x3                2.250    0.065   34.579    0.000
##    .x4                3.061    0.067   45.694    0.000
##    .x5                4.341    0.074   58.452    0.000
##    .x6                2.191    0.065   33.860    0.000
##    .x7                4.186    0.063   66.766    0.000
##    .x8                5.527    0.058   94.854    0.000
##    .x9                5.374    0.058   92.546    0.000
##     visual            0.000                           
##     textual           0.000                           
##     speed             0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.509    0.119    4.258    0.000
##    .x2                1.148    0.106   10.879    0.000
##    .x3                0.893    0.097    9.254    0.000
##    .x4                0.380    0.050    7.528    0.000
##    .x5                0.435    0.061    7.134    0.000
##    .x6                0.375    0.047    8.003    0.000
##    .x7                0.806    0.087    9.223    0.000
##    .x8                0.488    0.091    5.357    0.000
##    .x9                0.562    0.090    6.275    0.000
##     visual            0.786    0.152    5.169    0.000
##     textual           0.971    0.113    8.596    0.000
##     speed             0.377    0.091    4.139    0.000

And this is more reassuring:

  Number of observations                           301
  Number of missing patterns                         3

Again, the theory and formal approach to missing data are beyond this tutorial, but I hope this gives a sense of how to handle missingness in lavaan. Note that WLS and WLSMV are complete-case estimators, so they don’t allow for the FIML approach (they’re not ML estimators, after all). You can pass missing='pairwise' for the WLS family, which estimates parameters based on the cases that are pairwise available, though in a much less elegant way (I think).
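For instance, a sketch (assuming your ordered data actually contain missing values; HSbinary as constructed above is complete, so this is purely illustrative):

#pairwise-present estimation for the WLS/WLSMV family
fit_pairwise <- cfa(HS.model, HSbinary, ordered=names(HSbinary), missing="pairwise")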