Question: “Regarding measurement invariance, can you say more about how to conduct measurement invariance when the comparison group is the original validated measure? For example, if in my study, I am trying to use a previously validated construct that is validated in a US sample, and using it for the first time in an Indian population, what are the guidelines for demonstrating measurement invariance here? We discussed model comparisons in the context of our own data (e.g., in a scenario where we have data from US and Indian populations) but not in the context of when we only have one of those (e.g., non-validated sample = Indian) and want to compare to previously validated samples that we did not collect, but that are available in the literature.”
Answer: If you have the original mean vector and covariance matrix, as well as the sample size, from the published paper, you may be able to hack together something workable. Remember that SEM models a mean vector and covariance matrix as its target, and you can provide a covariance matrix to lavaan as an input.
# Brown book, Chapter 5
# input: the lower triangle of a correlation matrix; see p. 169, Table 5.2 (the LISREL syntax)
library(lavaan)
cormat <- '
1.000
0.300 1.000
0.229 0.261 1.000
0.411 0.406 0.429 1.000
0.172 0.252 0.218 0.481 1.000
0.214 0.268 0.267 0.579 0.484 1.000
0.200 0.214 0.241 0.543 0.426 0.492 1.000
0.185 0.230 0.185 0.545 0.463 0.548 0.522 1.000
0.134 0.146 0.108 0.186 0.122 0.131 0.108 0.151 1.000
0.134 0.099 0.061 0.223 0.133 0.188 0.105 0.170 0.448 1.000
0.160 0.131 0.158 0.161 0.044 0.124 0.066 0.061 0.370 0.350 1.000
0.087 0.088 0.101 0.198 0.077 0.177 0.128 0.112 0.356 0.359 0.507 1.000'
#convert it to a square matrix
Cmat <- getCov(cormat)
# the standard deviations of the variables (Table 5.2)
neodat.sdev <- c(2.06, 1.52, 1.92, 1.41, 1.73, 1.77, 2.49, 2.27, 2.68, 1.75, 2.57, 2.66)
# convert the correlation matrix to a covariance matrix (lavaan needs a covariance matrix as input)
covmat <- lavaan::cor2cov(Cmat, sds=neodat.sdev)
# assign row and column names to the covariance matrix
rownames(covmat) <- paste0("X", 1:12)
colnames(covmat) <- rownames(covmat)
#model set up - this is the model presented in Figure 1, p.168
# you can play around with this model following the text of Ch.5
new.model <- '
# X4 loads on both the coping and social factors
coping  =~ X1 + X2 + X3 + X4
social  =~ X5 + X6 + X7 + X8 + X4
enhance =~ X9 + X10 + X11 + X12
# residual correlation
X11 ~~ X12
'
#model fit
fit <- cfa(new.model, sample.cov=covmat, sample.nobs=500)
summary(fit)
## lavaan 0.6-3 ended normally after 60 iterations
##
## Optimization method NLMINB
## Number of free parameters 29
##
## Number of observations 500
##
## Estimator ML
## Model Fit Test Statistic 44.955
## Degrees of freedom 49
## P-value (Chi-square) 0.638
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## coping =~
## X1 1.000
## X2 0.740 0.094 7.909 0.000
## X3 0.933 0.118 7.903 0.000
## X4 0.719 0.118 6.070 0.000
## social =~
## X5 1.000
## X6 1.209 0.093 13.049 0.000
## X7 1.572 0.127 12.354 0.000
## X8 1.518 0.118 12.870 0.000
## X4 0.565 0.087 6.485 0.000
## enhance =~
## X9 1.000
## X10 0.648 0.070 9.293 0.000
## X11 0.776 0.093 8.340 0.000
## X12 0.802 0.096 8.327 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .X11 ~~
## .X12 1.460 0.300 4.873 0.000
## coping ~~
## social 0.704 0.108 6.531 0.000
## enhance 0.669 0.145 4.613 0.000
## social ~~
## enhance 0.566 0.128 4.412 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .X1 3.117 0.230 13.546 0.000
## .X2 1.694 0.125 13.527 0.000
## .X3 2.705 0.200 13.536 0.000
## .X4 0.454 0.070 6.502 0.000
## .X5 1.794 0.130 13.835 0.000
## .X6 1.384 0.115 12.015 0.000
## .X7 3.240 0.248 13.089 0.000
## .X8 2.393 0.194 12.352 0.000
## .X9 3.958 0.400 9.895 0.000
## .X10 1.710 0.170 10.063 0.000
## .X11 4.657 0.371 12.545 0.000
## .X12 4.997 0.398 12.561 0.000
## coping 1.118 0.217 5.158 0.000
## social 1.193 0.164 7.293 0.000
## enhance 3.210 0.490 6.550 0.000
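With the published covariance matrix in hand, one way to approach the invariance question itself is to fit a multi-group CFA in which the published US sample and your own Indian sample are the two groups; lavaan accepts a list of covariance matrices (and, if reported, mean vectors) in place of raw data. Below is a minimal sketch, assuming you have computed a covariance matrix covmat_india and sample size n_india from your own Indian data (those two object names are placeholders):
# minimal sketch: multi-group CFA against a published covariance matrix
# covmat is the published (US) matrix reconstructed above (n = 500);
# covmat_india and n_india are assumed to come from your own raw Indian data,
# with row/column names matching X1-X12
fit.config <- cfa(new.model,
                  sample.cov  = list(US = covmat, India = covmat_india),
                  sample.nobs = c(500, n_india))

# metric (weak) invariance: constrain factor loadings to equality across groups
fit.metric <- cfa(new.model,
                  sample.cov  = list(US = covmat, India = covmat_india),
                  sample.nobs = c(500, n_india),
                  group.equal = "loadings")

anova(fit.config, fit.metric)  # chi-square difference test of the loading constraints
Scalar (intercept) invariance would additionally require the published item means, which you could pass via the sample.mean argument if the paper reports them.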
The downside of this approach is that you cannot examine higher moments of the data that can affect fit. In particular, information about skewness and kurtosis is lost when we fit a model based only on the mean (first moment) and covariance (second moment) structure. Thus, we cannot use robust estimators such as MLR to adjust our test statistics for non-normality.
An alternative, and probably the better approach in general, is to contact the authors of the published paper to ask if they would share the data for this purpose. They may wish to be co-author on the paper, which is up for discussion. But in the era of open science, people are increasingly willing to share their data without insisting upon authorship.
Question: “I am interested in learning more about how to apply SEM in a longitudinal multilevel context. For example, in a daily diary cross-lag panel design. What are the advantages to using SEM for these types of studies as compared to regular multilevel?”
Answer: The main advantage of SEM for multilevel questions is that it allows you to look at latent variables in nested data (e.g., longitudinal). Just as in GLM, if we have nested observations and do not account for this, our test statistics will be invalid.
Many multilevel models can be fit as-is within an SEM framework; that is, SEM software can estimate models that are identical to those you would fit in the multilevel case using the lme4 or nlme packages. This means that if you are more comfortable thinking in SEM terms, it may be to your advantage to stay within lavaan or Mplus solely for consistency of representation.
The other main advantage of multilevel SEM is that it uses a latent decomposition to separate a so-called ‘level 1’ (lower-level) variable into between- versus within-cluster components. In conventional MLM, this decomposition is on your shoulders, usually requiring you to within-cluster center an L1 variable and also enter the cluster means as an L2 predictor. Without partitioning the L1 variable in this way, your coefficients will represent the total effect, which blends between- and within-cluster effects. This is an often overlooked step in MLM that can drastically alter interpretations: people often assume that a significant effect of an L1 variable represents a within-cluster (e.g., within-person) effect, but this need not be the case. Read Curran & Bauer (2011), Annual Review of Psychology, for more on disaggregating variance across levels.
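As a concrete illustration of that centering step, here is a minimal sketch assuming a daily diary data frame called diary with columns id (person), stress (an L1 predictor), and mood (the outcome); all of these names are placeholders:
library(dplyr)
library(lme4)

diary <- diary %>%
  group_by(id) %>%
  mutate(stress_pm = mean(stress, na.rm = TRUE),   # person mean: between-person (L2) component
         stress_pc = stress - stress_pm) %>%       # person-centered: within-person (L1) component
  ungroup()

# between- and within-person effects are now estimated by separate coefficients
m_disagg <- lmer(mood ~ stress_pc + stress_pm + (1 + stress_pc | id), data = diary)
summary(m_disagg)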
For reasons that go beyond this class, MSEM provides a more accurate partitioning of variance between the levels. At the between-person level, L1 (within) variables are treated as latent factors that capture the person/cluster tendencies in the L1 process.
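In lavaan (version 0.6 or later), a simple two-level version of this idea can be specified with level: blocks and a cluster argument; the between-person parts of the variables are then handled latently rather than by manual centering. This is a sketch under the same assumed diary data frame as above:
msem.model <- '
level: 1
  mood ~ stress        # within-person (day-level) effect
level: 2
  mood ~ stress        # between-person effect, based on the latent cluster means
'
fit.msem <- sem(msem.model, data = diary, cluster = "id")
summary(fit.msem)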
Question: “I’m interested in exploring a good example of a paper that uses SEM. We’ve learned several steps in the SEM process, and I’m wondering what the norm is in how much you report in the paper. Are there tests you run, but don’t actually report in the paper?”
This paper from Rogier Kievit has a reasonably good level of detail in terms of explaining the models and results: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5641983/
“It’s also to the point where I’ve now learned several methods/techniques (e.g., regression, MLM, growth modeling, SEM), and I’m now starting to wonder how they’re all connected. And I agree with Natalia - it would be cool to discuss some longitudinal aspects.”
And Griselda: “Like Blake and Natalia, I am also interested in learning more about longitudinal SEM and the connection between the different statistical methods that Blake has mentioned.”
Let’s discuss!
Question: “I’d be interested in having a brief discussion about how to conceptualize hypotheses and alternative models before going into the model testing. I told Cynthia what I’m doing for my project and she said I should be anticipating a different relationship between my latent variables. I said no you’re wrong. But now I’m stuck between two different (not necessarily opposed, but conflicting) primary hypotheses. I also anticipate playing around a little bit with factor loadings, especially because I have a lot of item level data for some of the factors, but only a few loadings for other factors. So conceptually, there’s not much room for movement on some of my factors, but lots of room to remove or rearrange exogenous variables for other factors (I’m especially torn between using established questionnaire subscales that don’t exactly reflect what I want versus creating my own latent factor from questions I select, just based on my own hypothesis).”
Overall, this sounds like a case of building different SEM models to test different hypotheses, but let’s discuss!
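As one concrete way to set this up, here is a hedged sketch of two competing structural hypotheses about the same latent variables, compared with a chi-square difference test because the models are nested; all factor, indicator, and data frame names below are placeholders:
library(lavaan)

model.full <- '
  stress   =~ s1 + s2 + s3
  coping   =~ c1 + c2 + c3
  distress =~ y1 + y2 + y3
  coping ~ stress
  distress ~ coping            # hypothesis 1: coping fully mediates
'
model.partial <- '
  stress   =~ s1 + s2 + s3
  coping   =~ c1 + c2 + c3
  distress =~ y1 + y2 + y3
  coping ~ stress
  distress ~ coping + stress   # hypothesis 2: a direct effect of stress remains
'
fit.full    <- sem(model.full, data = mydata)     # mydata is an assumed data frame
fit.partial <- sem(model.partial, data = mydata)
anova(fit.full, fit.partial)   # does freeing the direct path improve fit?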
Question: “I was hoping we could talk more about some useful visualization tools in R, including scatterplots for several variables (as many as five). It seems like ggplot2 is a good package, but I’m not sure how best to use it.”
You can get Hadley Wickham’s ggplot2: Elegant graphics for data analysis (2nd ed.) for free from our library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-24277-4
And you can also access his book R for Data Science here: https://r4ds.had.co.nz
For scatterplot matrices in R using ggplot, I would recommend the ggpairs function in the GGally package. Here’s an example:
library(GGally)
library(dplyr)
data(diamonds, package="ggplot2")
ggpairs(diamonds %>% select(carat, depth, price))
This function is sensitive to the data structure for each variable, so it will adapt the plot for nominal and ordinal data. For example, for nominal data, you’ll get a bar plot that shows a graphical depiction of the corresponding contingency table.
ggpairs(diamonds %>% select(color, clarity))
Question: “I am interested in learning more about parceling. In what context is parceling used and is this a recommended strategy?”
Parceling is a technique in which a set of ordinal or binary items are averaged or summed to produce a new item. Each parcel should consist of items that are very similar in content and that tap into the same basic construct.
Parcels are usually treated as continuous items in the factor model and usually have better reliabilities than the individual items alone (because we are combining information that is alike).
Advantages of parceling
Problems with parceling
Take home
Overall, if you have ordinal data and a large sample size, you’re better off using a threshold model in which the thresholds are estimated as part of the SEM for each item. But if you are pressed for sample size or have many indicators to contend with, parceling may help.
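To make that trade-off concrete, here is a minimal sketch contrasting the two options, assuming a data frame df with nine ordinal items i1-i9 measuring one construct (all names are placeholders):
library(lavaan)
library(dplyr)

# Option 1: threshold (ordinal) model -- items declared as ordered, estimated with WLSMV
ord.model <- 'f =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9'
fit.ord <- cfa(ord.model, data = df, ordered = paste0("i", 1:9), estimator = "WLSMV")

# Option 2: parceling -- average content-similar items into three parcels,
# then treat the parcels as (approximately) continuous indicators
df <- df %>%
  mutate(p1 = rowMeans(across(c(i1, i2, i3))),
         p2 = rowMeans(across(c(i4, i5, i6))),
         p3 = rowMeans(across(c(i7, i8, i9))))
par.model <- 'f =~ p1 + p2 + p3'
fit.par <- cfa(par.model, data = df)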
Question: “I am also wondering if we can talk about the use of count data in SEM models. If the count variables are not severely skewed based on the threshold provided by Kline, then is it ok to use ML as the estimator? If the count data does present a normality issue, then what would be the best estimator to use when you have this type of data in your model?”
Answer:
There are a few challenges with count data.
Although this is not supported in lavaan yet, in Mplus we can use alternative statistical distributions to capture count variables. In particular, Mplus supports Poisson-distributed outcomes.
library(ggplot2)
# Poisson probability mass function for counts 0-16, with lambda = 6
x <- 0:16
pdf <- dpois(x, lambda=6)
df <- data.frame(x=x, dens=pdf)
ggplot(df, aes(x=x, y=dens)) + geom_bar(stat = "identity") + ylab("density") + xlab("Count") + ggtitle("mean = 6, sd = sqrt(6)")
# lower rate parameter (lambda = 2): the distribution becomes right-skewed
pdf <- dpois(x, lambda=2)
df <- data.frame(x=x, dens=pdf)
ggplot(df, aes(x=x, y=dens)) + geom_bar(stat = "identity") + ylab("density") + xlab("Count") + ggtitle("mean = 2, sd = sqrt(2)")
The Poisson distribution is a discrete probability distribution that captures counts, often with an emphasis on the rate at which an event occurs in a given interval of time. The distribution is controlled by a single parameter, \(\lambda\), which determines both the mean and the variance:
\[ \begin{align*} M &= \lambda \\ SD &= \sqrt{\lambda} \end{align*} \]
At higher values of \(\lambda\), the Poisson distribution is well approximated by a Gaussian (normal) distribution. Thus, if your counts are high, the distribution may look reasonably normal and a normal assumption on the variable may hold.
But if you have a low mean, or a high level of dispersion in the counts, modeling the variable as Poisson is the better choice. Moreover, if the variance is considerably larger than the mean (overdispersion), you may need the negative binomial distribution, which extends the Poisson by allowing the variance to exceed the mean.
There are examples of treating variables as counts (Poisson or negative binomial) in the Mplus User’s Guide if you go down this path.
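If you are unsure which distribution fits your counts, a quick descriptive check of the mean and variance can guide the choice. Here is a small sketch using simulated data (the variable y is just for illustration):
set.seed(1)
y <- rnbinom(500, mu = 2, size = 1)   # simulated overdispersed counts
mean(y)   # roughly 2
var(y)    # much larger than the mean, suggesting negative binomial rather than Poisson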