If you have followed the class up to this point, you now have all of the core ingredients necessary to conduct basic SEM analyses! Now, as we move forward with using SEM to characterize a variety of datasets, there are a few odds and ends that are important to know.

1 Ill conditioning

Let’s return to our example from Week 1 of the class, which focused on housing prices in Boston. What if we believe that nitric oxide levels (nox) and crime predict home prices? We can add these as predictors to a path analysis, just as in standard multiple regression.

Furthermore, let’s hypothesize that the proximity of a home to large highways (rad) predicts the concentration of nitric oxides, which predicts lower home prices.

The model syntax could be specified as:

#example dataset from mlbench package with home prices in Boston by census tract
library(mlbench) #provides BostonHousing2
library(dplyr)   #select(), mutate(), and the pipe
library(lavaan)
data(BostonHousing2)
BostonSmall <- BostonHousing2 %>% dplyr::select(
  cmedv, #median value of home in 1000s
  crim, #per capita crime by town
  nox, #nitric oxide concentration
  lstat, #percentage of lower status of the population
  rad #index of accessibility to radial highways
  ) %>% mutate(log_crim = log2(crim))

lavaan_m2 <- '
cmedv ~ log_crim + nox #crime and nox predict lower home prices
nox ~ rad #proximity to highways predicts nox
'
mlav2 <- sem(lavaan_m2, data=BostonSmall)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, :
## lavaan WARNING: some observed variances are (at least) a factor 1000 times
## larger than others; use varTable(fit) to investigate

The model looks like this (using the handy semPaths function from semPlot):

library(semPlot) #provides semPaths()
semPaths(mlav2, what='std', nCharNodes=6, sizeMan=10,
         edge.label.cex=1.25, curvePivot = TRUE, fade=FALSE)

Here’s the text output:

summary(mlav2)
## lavaan 0.6-3 ended normally after 33 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox             -14.391    3.643   -3.950    0.000
##   nox ~                                               
##     rad               0.008    0.000   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox               0.008    0.001   15.906    0.000

A few things to note:

  1. Note the warning: ‘some observed variances are (at least) a factor 1000 times larger than others.’ This is called ill conditioning.
  2. Our hypotheses all seem to be supported.
  3. The model chi-square is highly significant, suggesting poor global model fit.

Parameter estimation can be hampered when the variances of variables in the model differ substantially (orders of magnitude). The most common consequences are that the model does not converge or that you get warnings about problems estimating the standard errors. Here, the model did not fail in these ways, but did object to the variances. Given the above warning, let’s take a look.

varTable(mlav2)
name      idx  nobs  type     exo  user    mean      var  nlev  lnam
cmedv     1    506   numeric  0    0     22.529   84.312     0
nox       3    506   numeric  0    0      0.555    0.013     0
log_crim  6    506   numeric  1    0     -1.126    9.729     0
rad       5    506   numeric  1    0      9.549   75.816     0

Looks like the scale of nox is much smaller than any other predictor, likely because it’s in parts per 10 million! We can rescale variables in this case by multiplying by a constant. This has no effect on the fit or interpretation of the model – we just have to recall what the new units represent. Also, you can always divide out the constant from the parameter estimate to recover the original units, if important.

BostonSmall <- BostonSmall %>% mutate(nox = nox*100) #rescale by 100 so the variance of nox is on a similar scale to the other variables
mlav2 <- sem(lavaan_m2, data=BostonSmall)
summary(mlav2)
## lavaan 0.6-3 ended normally after 18 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           506
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     274.360
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cmedv ~                                             
##     log_crim         -0.925    0.135   -6.831    0.000
##     nox              -0.144    0.036   -3.950    0.000
##   nox ~                                               
##     rad               0.814    0.047   17.382    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cmedv            65.499    4.118   15.906    0.000
##    .nox              83.910    5.275   15.906    0.000

Note that the fit statistics do not change at all: rescaling a variable by a constant is a linear transformation, so the model reproduces the (rescaled) covariance matrix exactly as well as before. The parameter estimates involving nox change only by the expected factors: the cmedv ~ nox coefficient is divided by 100 (each unit of the new nox is 100 times larger), the nox ~ rad coefficient is multiplied by 100, and the residual variance of nox is multiplied by 100^2 = 10,000.
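If it is important to recover an estimate in the original units, you can rescale the coefficient by hand. A minimal sketch (coef() returns lavaan’s free parameters as a named vector, and "cmedv~nox" is lavaan’s default label for that path):

#the refit model's slope is per unit of the rescaled nox, so multiply by 100 to undo the rescaling
coef(mlav2)[["cmedv~nox"]] * 100 #approximately -14.39, the slope in the original nox units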

2 Estimators

Estimators are algorithms that seek to fit a given model (for us, variants of SEM) to a dataset. By ‘fit a model,’ I mean that we try to identify parameter estimates that minimize the discrepancy between the observed covariance matrix, \(\boldsymbol{S}_{XX}\) and the model-implied covariance matrix, \(\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})\). The algorithm searches iteratively for parameter estimates that would reduce the discrepancy, and the algorithm converges when the discrepancy cannot be further reduced by alternative parameter estimates.
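For reference, the discrepancy that ML minimizes (for \(p\) observed variables, ignoring the mean structure) can be written as:

\[ F_{ML} = \ln|\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})| + \text{tr}\left[ \boldsymbol{S}_{XX} \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})^{-1} \right] - \ln|\boldsymbol{S}_{XX}| - p \]

\(F_{ML}\) is zero when the model-implied and observed covariance matrices match exactly, and the model \(\chi^2\) reported by lavaan is this minimized discrepancy scaled by the sample size.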

Remember this: the estimates \(\hat{\boldsymbol{\theta}}\) are good to the extent that \(\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})\) closely reproduces \(\boldsymbol{S}_{XX}\).

In addition to ‘ML’, our trusty default maximum likelihood estimator, there are many other estimators for SEMs. Here’s the synopsis from the lavaan tutorial (http://lavaan.ugent.be/tutorial/est.html):

If all data is continuous, the default estimator in the lavaan package is maximum likelihood (estimator = “ML”). Alternative estimators available in lavaan include:

  • “GLS”: generalized least squares (complete data only)
  • “WLS”: weighted least squares, sometimes called ADF estimation (complete data only)
  • “DWLS”: diagonally weighted least squares
  • “ULS”: unweighted least squares

Many estimators have ‘robust’ variants, meaning that they provide robust standard errors and a scaled test statistic. For example, for the maximum likelihood estimator, lavaan provides the following robust variants:

  • “MLM”: robust standard errors and a Satorra-Bentler scaled test statistic (complete data only)
  • “MLMVS”: robust standard errors and a mean- and variance-adjusted test statistic (Satterthwaite approach; complete data only)
  • “MLMV”: robust standard errors and a mean- and variance-adjusted test statistic using a scale-shifted approach (complete data only)
  • “MLF”: standard errors based on first-order derivatives and a conventional test statistic
  • “MLR”: ‘Huber-White’ robust standard errors and a scaled test statistic that is (asymptotically) equal to the Yuan-Bentler test statistic; works with complete and incomplete data

For the DWLS and ULS estimators, lavaan also provides ‘robust’ variants: WLSM, WLSMVS, WLSMV, ULSM, ULSMVS, ULSMV. Note that for the robust WLS variants, we use the diagonal of the weight matrix for estimation, but we use the full weight matrix to correct the standard errors and to compute the test statistic.

3 Missing data

By default, lavaan will usually apply listwise deletion, such that people missing any variable in the model are dropped. This is usually not a great idea, both because you can lose a ton of data and because it can give a biased view of the data (for details, see Schafer and Graham, 2002, Psychological Methods). Although the details are beyond this overview, it’s usually best to use so-called full information maximum likelihood (FIML) under an assumption that the data are missing at random (MAR), a technical term meaning that missingness on a given variable can be related to other observed variables, but not to the unobserved values of the variable itself. Using FIML, the estimation tries to estimate all parameters based on the cases that have available data. If you want to understand this better, read Craig Enders’s great 2010 book, Applied Missing Data Analysis.

In lavaan, add missing='ml' to the cfa, sem, or lavaan function calls in order to request that lavaan use all available data (not just complete data) in estimating model parameters. Importantly, assuming that the data are MAR, full information estimation should improve the accuracy of parameter estimates, likely making the estimates more similar to what we would obtain if we had no missing data at all. Said more formally, FIML will produce unbiased parameter estimates and standard errors if the data are MAR or MCAR (see below).
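For example (model and mydata are hypothetical placeholders for your model syntax and dataset):

fit <- sem(model, data=mydata, missing="ml") #FIML: use all available data rather than listwise deletion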

3.1 Missing data mechanisms

From Schafer and Graham, 2002, Psychological Methods, the three standard missing data mechanisms are:

  • MCAR (missing completely at random): missingness is unrelated to either the observed data or the unobserved values.
  • MAR (missing at random): missingness may depend on other observed variables, but not on the unobserved values of the variable itself.
  • MNAR (missing not at random): missingness depends on the unobserved values themselves, even after accounting for the observed data.

3.2 Estimators versus full information estimation

Please note that the estimator used to fit a SEM is different from FIML or the idea of full information. Specifically, the estimator is the algorithm by which we estimate model parameters based on the data. Up to this point, we have focused on maximum likelihood (ML) as the primary estimation approach. FIML is not a different estimator from ML; rather, ‘full information’ describes how the ML likelihood is computed (casewise, using whatever data each case provides) when some data are missing.

At a basic level, the FIML approach changes the individual likelihood function – that is, the likelihood of observing an individual’s data given a model with a set of parameter estimates. More specifically, FIML does not require that every case have the same number of observed variables when estimating the individual likelihood. Rather, the individual likelihood is based on all available data, and the individual only contributes to parameter estimates for which he/she has relevant data. For example, if I lack a score on an item underlying a latent factor, my data will not contribute to the corresponding factor loading estimate.
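Concretely, under FIML the log-likelihood for case \(i\) is computed from only the \(k_i\) variables observed for that case, where \(\boldsymbol{\mu}_i\) and \(\boldsymbol{\Sigma}_i\) are the model-implied mean vector and covariance matrix with the rows and columns for case \(i\)’s missing variables removed:

\[ \ell_i(\boldsymbol{\theta}) = -\frac{1}{2}\left[ k_i \ln(2\pi) + \ln|\boldsymbol{\Sigma}_i| + (\boldsymbol{y}_i - \boldsymbol{\mu}_i)^\top \boldsymbol{\Sigma}_i^{-1} (\boldsymbol{y}_i - \boldsymbol{\mu}_i) \right] \]

The sample log-likelihood is the sum of these casewise contributions.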

As a final technical note, the model chi-square and corresponding fit statistics are computed a bit differently under a FIML approach. Specifically, FIML always estimates two models: the H1 model, which is the ‘unrestricted’ or saturated model in which all variables are freely correlated, and the H0 model, which is the model specified by you (these match the ‘user model (H0)’ and ‘unrestricted model (H1)’ labels in the lavaan output). Twice the difference between the two log-likelihoods is used to derive the chi-square under FIML, rather than just using the fit function from ML alone.
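As a quick check against the FIML output in the demo below:

\[ \chi^2 = 2\left[\ell_{H1} - \ell_{H0}\right] = 2\left[-3622.107 - (-3663.262)\right] \approx 82.31, \]

which matches the reported Model Fit Test Statistic of 82.309.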

Returning to ‘full information,’ any estimator that uses all available information to estimate every parameter in a model is a ‘full information estimator.’ For example, Bayesian estimators of SEMs (aka BSEM) also provide full information estimates of parameters.

3.3 Illustration of FIML approach to SEM analysis

Let’s take the Holzinger-Swineford example from the lavaan tutorial, where there were 9 items that measured putatively different facets of intelligence: visual, textual, and speed. The observed variables are x1-x9.

From the ‘lavaan’ tutorial:

This is a ‘classic’ dataset that is used in many papers and books on Structural Equation Modeling (SEM), including some manuals of commercial SEM software packages. The data consists of mental ability test scores of seventh- and eighth-grade children from two different schools (Pasteur and Grant-White). In our version of the dataset, only 9 out of the original 26 tests are included. A CFA model that is often proposed for these 9 variables consists of three latent variables (or factors), each with three indicators:

  • a visual factor measured by 3 variables: x1, x2 and x3
  • a textual factor measured by 3 variables: x4, x5 and x6
  • a speed factor measured by 3 variables: x7, x8 and x9

I’m going to ‘miss out’ a few data points for illustration.
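The code that created the missing-data version isn’t shown here, so the following is a minimal sketch: the variables affected and the amount of missingness are illustrative assumptions and will not reproduce the exact output below.

library(lavaan)
data(HolzingerSwineford1939)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

#copy the data and punch out some values (hypothetical missingness pattern)
HolzingerSwineford1939_miss <- HolzingerSwineford1939
set.seed(1001) #arbitrary seed so the illustration is reproducible
HolzingerSwineford1939_miss$x5[sample(nrow(HolzingerSwineford1939_miss), 40)] <- NA
HolzingerSwineford1939_miss$x9[sample(nrow(HolzingerSwineford1939_miss), 30)] <- NA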

Here’s what happens with the default:

fit_listwise <- cfa(HS.model, HolzingerSwineford1939_miss, estimator="ML")
summary(fit_listwise, fit.measures=TRUE)
## lavaan 0.6-3 ended normally after 37 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         21
## 
##                                                   Used       Total
##   Number of observations                           241         301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      70.910
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              742.755
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.934
##   Tucker-Lewis Index (TLI)                       0.900
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2963.801
##   Loglikelihood unrestricted model (H1)      -2928.346
## 
##   Number of free parameters                         21
##   Akaike (AIC)                                5969.602
##   Bayesian (BIC)                              6042.782
##   Sample-size adjusted Bayesian (BIC)         5976.217
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.090
##   90 Percent Confidence Interval          0.066  0.115
##   P-value RMSEA <= 0.05                          0.004
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.068
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.586    0.117    5.021    0.000
##     x3                0.703    0.120    5.836    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.116    0.069   16.073    0.000
##     x6                0.917    0.060   15.303    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.234    0.216    5.722    0.000
##     x9                1.254    0.219    5.721    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.402    0.080    5.007    0.000
##     speed             0.258    0.059    4.376    0.000
##   textual ~~                                          
##     speed             0.177    0.052    3.391    0.001
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.507    0.116    4.355    0.000
##    .x2                1.129    0.114    9.934    0.000
##    .x3                0.872    0.098    8.872    0.000
##    .x4                0.380    0.053    7.150    0.000
##    .x5                0.376    0.061    6.215    0.000
##    .x6                0.372    0.048    7.747    0.000
##    .x7                0.817    0.088    9.270    0.000
##    .x8                0.510    0.077    6.599    0.000
##    .x9                0.539    0.081    6.681    0.000
##     visual            0.744    0.149    4.983    0.000
##     textual           1.023    0.129    7.902    0.000
##     speed             0.286    0.082    3.472    0.001

Note the output:

                                                 Used       Total
  Number of observations                           241         301

This should be very worrisome!!

Okay, here’s the same model under FIML using missing="ml".

fit_fiml <- cfa(HS.model, HolzingerSwineford1939_miss, estimator="ML", missing="ml")
summary(fit_fiml, fit.measures=TRUE)
## lavaan 0.6-3 ended normally after 54 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         30
## 
##   Number of observations                           301
##   Number of missing patterns                         3
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      82.309
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              870.958
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.930
##   Tucker-Lewis Index (TLI)                       0.895
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3663.262
##   Loglikelihood unrestricted model (H1)      -3622.107
## 
##   Number of free parameters                         30
##   Akaike (AIC)                                7386.524
##   Bayesian (BIC)                              7497.737
##   Sample-size adjusted Bayesian (BIC)         7402.594
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.090
##   90 Percent Confidence Interval          0.069  0.111
##   P-value RMSEA <= 0.05                          0.001
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.059
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Observed information based on                Hessian
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.545    0.112    4.854    0.000
##     x3                0.697    0.119    5.837    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.123    0.068   16.536    0.000
##     x6                0.924    0.059   15.626    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.190    0.151    7.857    0.000
##     x9                1.096    0.197    5.575    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.413    0.080    5.151    0.000
##     speed             0.279    0.057    4.941    0.000
##   textual ~~                                          
##     speed             0.175    0.049    3.571    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                4.985    0.068   73.253    0.000
##    .x2                6.088    0.068   89.855    0.000
##    .x3                2.250    0.065   34.579    0.000
##    .x4                3.061    0.067   45.694    0.000
##    .x5                4.341    0.074   58.452    0.000
##    .x6                2.191    0.065   33.860    0.000
##    .x7                4.186    0.063   66.766    0.000
##    .x8                5.527    0.058   94.854    0.000
##    .x9                5.374    0.058   92.546    0.000
##     visual            0.000                           
##     textual           0.000                           
##     speed             0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.509    0.119    4.258    0.000
##    .x2                1.148    0.106   10.879    0.000
##    .x3                0.893    0.097    9.254    0.000
##    .x4                0.380    0.050    7.528    0.000
##    .x5                0.435    0.061    7.134    0.000
##    .x6                0.375    0.047    8.003    0.000
##    .x7                0.806    0.087    9.223    0.000
##    .x8                0.488    0.091    5.357    0.000
##    .x9                0.562    0.090    6.275    0.000
##     visual            0.786    0.152    5.169    0.000
##     textual           0.971    0.113    8.596    0.000
##     speed             0.377    0.091    4.139    0.000

And this is more reassuring:

  Number of observations                           301
  Number of missing patterns                         3

Again, the theory and formal approach to missing data are beyond this tutorial, but I hope this gives a sense of how to handle missingness in lavaan. Note that WLS and WLSMV are complete-case estimators, so they don’t allow for the FIML approach (they are not ML estimators, after all). You can pass missing='pairwise' for the WLS family, which estimates parameters based on the pairwise availability of data, though in a much less elegant way (I think).
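For example (model, mydata, and ord_items are hypothetical placeholders for your model syntax, data, and the names of the ordered items):

fit <- cfa(model, data=mydata, ordered=ord_items, estimator="WLSMV", missing="pairwise")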

4 Categorical data

In ‘vanilla’ SEM using maximum likelihood estimation, we assume that the data are distributed as multivariate normal. This means, in part, that we assume that each variable is univariate normal. Like so:

library(ggplot2) #for the plots below
p1 <- ggplot(data = data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, n = 500, args = list(mean = 0, sd = 1)) + ylab("Density") + xlab("Item distribution\n<---- Individual Differences ---->")
p1

Importantly, however, many data in the social and behavioral sciences are dichotomous or polytomous in nature. These are distributed on an ordinal, or perhaps interval, scale. Furthermore, interval scaling is an assumption we often make that may not be supported by the data. For example, on a 1-5 Likert scale, what reassurance do we have that a one-point increase between 1 and 2 reflects the same difference as a one-point increase between 4 and 5? Instead, we often have data that look more like this:

df <- data.frame(x=c(rep(1, 50), rep(2, 25), rep(3, 15), rep(4, 10), rep(5, 3)))
ggplot(df, aes(x=x)) + geom_bar() + ylab("Frequency (n)") + xlab("Response")

In short, we often do not know whether the spacing between anchors is equal, a topic that gets expanded on substantially in item response theory (IRT). Furthermore, ordinal data are often a poor approximation of a normal distribution, as in the figure above. This distribution is quite skewed and also ‘chunky’ in the sense that there are only integer-valued responses (e.g., no responses of 1.5 or 2.2). Altogether, treating such data as normal is often a tricky proposition.

Rule of thumb: If we have 5+ anchors for an item, we can probably treat it as continuous without major missteps. If we have < 5 anchors, the assumption of normality can yield biased parameter estimates and fit statistics. See Rhemtulla et al., 2012, Psychological Methods for further details. If we assume normality, we should also examine the distribution of items to see whether they are approximately normal. If not, this may be a further nudge in favor of modeling them as ordinal, not continuous.
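For example, a quick way to eyeball the item distributions, including skew and kurtosis (this assumes the psych package is installed; describe() is its descriptive-statistics function):

#skew and kurtosis for the nine Holzinger-Swineford items
psych::describe(HolzingerSwineford1939[, paste0("x", 1:9)])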

4.1 Modeling categorical data with a formal threshold structure

What to do? lavaan supports the formal treatment of endogenous categorical data using a threshold structure. This emerges from the view that the underlying distribution of an item is continuous (Gaussian), but our discretization (e.g., binary or polytomous) cuts this dimension at particular points. Here’s an example in the spirit of Samejima’s graded response model with 4 anchors per item:

df <- data.frame(x=c(-1.5, -0.5, 1.2), xend=c(-1.5, -0.5, 1.2), y=c(0, 0, 0), yend=c(0.45, 0.45, 0.45), label=c("tau[1]", "tau[2]", "tau[3]"))
df_labs <- data.frame(x=c(-2.0, -1, .35, 1.7), y=c(0, 0, 0, 0), label=c("Pr(y=1)", "Pr(y=2)", "Pr(y=3)", "Pr(y=4)"))
p1 <- ggplot(data = data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, n = 500, args = list(mean = 0, sd = 1)) + ylab("Density") + xlab("Item distribution\n<---- Individual Differences ---->") +
  geom_segment(aes(x=x, xend=xend, y=y, yend=yend), data=df) +
  geom_text(aes(label=label, x=x, y=yend+0.02), data=df, parse=TRUE, size=8) +
  geom_text(aes(label=label, x=x, y=y), data=df_labs, parse=FALSE, size=6)
p1

We have 4 levels of the variable (1, 2, 3, 4), but only three thresholds – each specifies the boundary between two adjacent levels (anchors). If we are motivated to account for this structure, these thresholds can be specified as free parameters in the model. This is essentially estimating where the \(\tau\) parameters fall along the continuum; they need not be evenly spaced!
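If you ever need to refer to these thresholds in lavaan model syntax (for example, to label or constrain them), they use the | operator. The snippet below is a syntax sketch only; y1-y3 are hypothetical 4-category items.

#thresholds use the '|' operator and lavaan's built-in labels t1, t2, ...
cat_syntax <- '
  latent =~ y1 + y2 + y3
  y1 | t1 + t2 + t3  #the three thresholds of y1 (freely estimated by default)
'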

4.2 Demo of categorical data in CFA

Typically, models with a threshold structure are estimated using a ‘weighted least squares’ (WLS) estimator, rather than maximum likelihood (ML; the typical estimator in SEM). Note that there are Bayesian and ML approaches to estimating a model with explicitly categorical data, but these are less common, especially in lavaan. The mean- and variance-adjusted WLS (aka ‘WLSMV’) is usually the way to go because it handles non-normality of the multivariate distribution better than the typical WLS. See DiStefano and Morgan, 2014, Structural Equation Modeling, for details.

Here’s a quick demo – what if we had only trichotomous items for each of our intelligence test items?
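The discretized dataset HSbinary is not created in the text above, so here is a minimal sketch of one way to build it; the choice of three equal-width bins per item is an assumption and will not exactly reproduce the thresholds reported below.

#cut each of the nine items into three ordered categories (1, 2, 3)
HSbinary <- as.data.frame(lapply(HolzingerSwineford1939[, paste0("x", 1:9)],
                                 function(x) cut(x, breaks = 3, labels = FALSE)))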

lattice::histogram(HSbinary$x1, xlab="Discrete x1")

We tell lavaan which items are ordered categorical using the ordered argument.

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 
'
fit_cat <- cfa(HS.model, HSbinary, ordered=names(HSbinary))
summary(fit_cat)
## lavaan 0.6-3 ended normally after 30 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         30
## 
##   Number of observations                           301
## 
##   Estimator                                       DWLS      Robust
##   Model Fit Test Statistic                      45.123      63.573
##   Degrees of freedom                                24          24
##   P-value (Chi-square)                           0.006       0.000
##   Scaling correction factor                                  0.757
##   Shift parameter                                            3.971
##     for simple second-order correction (Mplus variant)
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model        Unstructured
##   Standard Errors                           Robust.sem
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.553    0.120    4.626    0.000
##     x3                0.643    0.125    5.121    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.085    0.067   16.275    0.000
##     x6                0.972    0.057   17.010    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.200    0.215    5.576    0.000
##     x9                1.479    0.258    5.735    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.331    0.055    6.052    0.000
##     speed             0.195    0.051    3.797    0.000
##   textual ~~                                          
##     speed             0.148    0.043    3.415    0.001
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.000                           
##    .x2                0.000                           
##    .x3                0.000                           
##    .x4                0.000                           
##    .x5                0.000                           
##    .x6                0.000                           
##    .x7                0.000                           
##    .x8                0.000                           
##    .x9                0.000                           
##     visual            0.000                           
##     textual           0.000                           
##     speed             0.000                           
## 
## Thresholds:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     x1|t1            -1.363    0.103  -13.239    0.000
##     x1|t2             0.844    0.083   10.224    0.000
##     x2|t1            -1.556    0.115  -13.508    0.000
##     x2|t2             0.741    0.080    9.259    0.000
##     x3|t1            -0.353    0.074   -4.766    0.000
##     x3|t2             0.626    0.078    8.047    0.000
##     x4|t1            -0.797    0.081   -9.799    0.000
##     x4|t2             0.956    0.086   11.151    0.000
##     x5|t1            -0.820    0.082  -10.012    0.000
##     x5|t2             0.527    0.076    6.925    0.000
##     x6|t1             0.188    0.073    2.588    0.010
##     x6|t2             1.529    0.113   13.498    0.000
##     x7|t1            -0.820    0.082  -10.012    0.000
##     x7|t2             1.024    0.088   11.641    0.000
##     x8|t1            -0.129    0.073   -1.783    0.075
##     x8|t2             1.882    0.145   12.989    0.000
##     x9|t1            -0.425    0.075   -5.678    0.000
##     x9|t2             1.835    0.140   13.131    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.266                           
##    .x2                0.775                           
##    .x3                0.697                           
##    .x4                0.267                           
##    .x5                0.136                           
##    .x6                0.308                           
##    .x7                0.684                           
##    .x8                0.545                           
##    .x9                0.308                           
##     visual            0.734    0.165    4.438    0.000
##     textual           0.733    0.060   12.216    0.000
##     speed             0.316    0.081    3.883    0.000
## 
## Scales y*:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     x1                1.000                           
##     x2                1.000                           
##     x3                1.000                           
##     x4                1.000                           
##     x5                1.000                           
##     x6                1.000                           
##     x7                1.000                           
##     x8                1.000                           
##     x9                1.000

Note that we now have threshold estimates for each item, where higher values denote a higher estimate for the boundary between one category and the next along the latent continuum that putatively underlies the item. Additional details about categorical data are beyond the scope here!

5 Non-normality

Categorical/ordinal data are one source of violations of multivariate normality. But we may have continuously distributed data that are also not especially normal, as in the case of distributions with large skew or kurtosis. If the univariate distributions are not normal, then the joint multivariate distribution cannot be normal either.

If we have continuous data that violate normality, typically our chi-square will be too large, so we will tend to reject models more often than we should. In addition, violations of normality will result in standard errors that are too small, so we will tend to interpret too many parameters as significant (a kind of Type I error).

In SEM, there are separate workarounds for the model \(\chi^2\) and standard error problems, but these are often packaged together as ‘robust’ estimation approaches. Long story short, there is a corrected variant of the \(\chi^2\) attributable to Satorra and Bentler, and later developed by Yuan and Bentler. This approach estimates a scaling correction factor that is based on estimates of multivariate kurtosis. The model chi-square is divided by the correction factor to yield a \(\chi^2\) test that is robust to non-normality in the data.
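In equation form, with \(c\) denoting the estimated scaling correction factor:

\[ \chi^2_{robust} = \frac{\chi^2_{ML}}{c} \]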

If the data are multivariate normal, the correction factor will be close to 1.0, and the robust and traditional \(\chi^2\) statistics will be very similar. But if the data are highly non-normal, the correction factor will move above 1.0 (e.g., 1.1 or 1.6). In general, it’s probably a good idea to use the Satorra-Bentler \(\chi^2\) to guard against the problem of rejecting too many well-fitting SEMs of non-normal data.

Correcting the standard errors for non-normality relies on what is called a Huber-White sandwich estimator. Completely skipping the details, the important point is that this estimator corrects for non-normality (as well as clustering/nesting of observations).

Putting these things together – Satorra-Bentler \(\chi^2\) and Huber-White SEs – we arrive at what has been called the ‘MLR’ estimator in Mplus and lavaan. In general, this is probably a good ‘go to’ as your default estimator because of its attractive properties with non-normal data.
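As a quick check against the MLR output below, the reported scaling correction factor is 0.979, and dividing the conventional statistic by it recovers the robust statistic (up to rounding of the printed correction factor):

\[ \frac{85.306}{0.979} \approx 87.13 \]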

5.1 Estimate the CFA with the MLR approach

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data=HolzingerSwineford1939, estimator="MLR")
summary(fit, fit.measures=TRUE)
## lavaan 0.6-3 ended normally after 35 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         21
## 
##   Number of observations                           301
## 
##   Estimator                                         ML      Robust
##   Model Fit Test Statistic                      85.306      87.132
##   Degrees of freedom                                24          24
##   P-value (Chi-square)                           0.000       0.000
##   Scaling correction factor                                  0.979
##     for the Yuan-Bentler correction (Mplus variant)
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              918.852     880.082
##   Degrees of freedom                                36          36
##   P-value                                        0.000       0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.931       0.925
##   Tucker-Lewis Index (TLI)                       0.896       0.888
## 
##   Robust Comparative Fit Index (CFI)                         0.930
##   Robust Tucker-Lewis Index (TLI)                            0.895
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3737.745   -3737.745
##   Scaling correction factor                                  1.133
##     for the MLR correction
##   Loglikelihood unrestricted model (H1)      -3695.092   -3695.092
##   Scaling correction factor                                  1.051
##     for the MLR correction
## 
##   Number of free parameters                         21          21
##   Akaike (AIC)                                7517.490    7517.490
##   Bayesian (BIC)                              7595.339    7595.339
##   Sample-size adjusted Bayesian (BIC)         7528.739    7528.739
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.092       0.093
##   90 Percent Confidence Interval          0.071  0.114       0.073  0.115
##   P-value RMSEA <= 0.05                          0.001       0.001
## 
##   Robust RMSEA                                               0.092
##   90 Percent Confidence Interval                             0.072  0.114
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065       0.065
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Observed information based on                Hessian
##   Standard Errors                   Robust.huber.white
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual =~                                           
##     x1                1.000                           
##     x2                0.554    0.132    4.191    0.000
##     x3                0.729    0.141    5.170    0.000
##   textual =~                                          
##     x4                1.000                           
##     x5                1.113    0.066   16.946    0.000
##     x6                0.926    0.061   15.089    0.000
##   speed =~                                            
##     x7                1.000                           
##     x8                1.180    0.130    9.046    0.000
##     x9                1.082    0.266    4.060    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   visual ~~                                           
##     textual           0.408    0.099    4.110    0.000
##     speed             0.262    0.060    4.366    0.000
##   textual ~~                                          
##     speed             0.173    0.056    3.081    0.002
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x1                0.549    0.156    3.509    0.000
##    .x2                1.134    0.112   10.135    0.000
##    .x3                0.844    0.100    8.419    0.000
##    .x4                0.371    0.050    7.382    0.000
##    .x5                0.446    0.057    7.870    0.000
##    .x6                0.356    0.047    7.658    0.000
##    .x7                0.799    0.097    8.222    0.000
##    .x8                0.488    0.120    4.080    0.000
##    .x9                0.566    0.119    4.768    0.000
##     visual            0.809    0.180    4.486    0.000
##     textual           0.979    0.121    8.075    0.000
##     speed             0.384    0.107    3.596    0.000