mu <- rep(0,3)
Sigma <- tribble(
~a, ~b, ~c,
1, 0.2, 0.3,
0.2, 1, 0.7,
0.3, 0.7, 1
)
abc <- data.frame(MASS::mvrnorm(1000, mu=mu, Sigma=Sigma, empirical=TRUE))
colnames(abc) <- LETTERS[1:3]
psych::describe(abc)
## vars n mean sd median trimmed mad min max range skew kurtosis
## A 1 1000 0 1 -0.03 -0.01 1.02 -3.50 3.10 6.60 0.04 0.15
## B 2 1000 0 1 0.00 0.00 1.01 -3.54 4.01 7.55 -0.02 0.16
## C 3 1000 0 1 0.01 0.00 0.97 -3.65 3.15 6.80 -0.01 0.21
## se
## A 0.03
## B 0.03
## C 0.03
cov(abc)
## A B C
## A 1.0 0.2 0.3
## B 0.2 1.0 0.7
## C 0.3 0.7 1.0
syntax <- '
B ~ A
C ~ B
'
m <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m)
## lavaan (0.5-23.1097) converged normally after 11 iterations
##
## Number of observations 1000
##
## Estimator ML
## Minimum Function Test Statistic 53.704
## Degrees of freedom 1
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## B ~
## A 0.200 0.031 6.455 0.000
## C ~
## B 0.700 0.023 30.997 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .B 0.959 0.043 22.361 0.000
## .C 0.509 0.023 22.361 0.000
semPaths(m)
Note that the residual variance estimates are also estimates of \(1 - R^2\) because we have standardized variables.
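As a quick check (a minimal sketch), the residual variances line up with \(1 - R^2\), and lavInspect can report the model \(R^2\) for the endogenous variables directly:
# With standardized variables, each residual variance should be about 1 - R^2
1 - 0.200^2  # ~0.96, matches the residual variance of .B
1 - 0.700^2  # ~0.51, matches the residual variance of .C
lavInspect(m, "rsquare")  # model R^2 for B and C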
parTable(m)
## id lhs op rhs user block group free ustart exo label plabel start est
## 1 1 B ~ A 1 1 1 1 NA 0 .p1. 0.000 0.200
## 2 2 C ~ B 1 1 1 2 NA 0 .p2. 0.000 0.700
## 3 3 B ~~ B 0 1 1 3 NA 0 .p3. 0.499 0.959
## 4 4 C ~~ C 0 1 1 4 NA 0 .p4. 0.499 0.509
## 5 5 A ~~ A 0 1 1 0 NA 1 .p5. 0.999 0.999
## se
## 1 0.031
## 2 0.023
## 3 0.043
## 4 0.023
## 5 0.000
parameterestimates(m)
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 B ~ A 0.200 0.031 6.46 0 0.139 0.261
## 2 C ~ B 0.700 0.023 31.00 0 0.656 0.744
## 3 B ~~ B 0.959 0.043 22.36 0 0.875 1.043
## 4 C ~~ C 0.509 0.023 22.36 0 0.465 0.554
## 5 A ~~ A 0.999 0.000 NA NA 0.999 0.999
Note that the variance of A is exogenous and is not freely estimated: with lavaan's default fixed.x = TRUE, it is fixed at the sample value. As a result, the variance of A is not part of the likelihood function and does not contribute to any fit statistics.
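To make this concrete (a minimal sketch; m_free is a name introduced here), compare the sample variance of A with the fixed value, and note that refitting with fixed.x = FALSE would free it instead:
var(abc$A)  # exactly 1 here (empirical=TRUE); lavaan fixes A ~~ A at the n-denominator value, 999/1000 = 0.999
m_free <- sem(syntax, data=abc, meanstructure=FALSE, fixed.x=FALSE)  # hypothetical refit freeing the exogenous variance
parameterestimates(m_free)  # A ~~ A now appears as a free parameter with a nonzero standard error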
fitted(m)
## $cov
## B C A
## B 0.999
## C 0.699 0.999
## A 0.200 0.140 0.999
##
## $mean
## B C A
## 0 0 0
Notice how low the model-implied covariance between A and C (0.14) is compared to our ground-truth value of 0.3. That is, the association between A and C is not fully explained by the indirect association of A with C via B.
We know from our simulation that there is a relationship between A and C that is not entirely attributable to the indirect effect of A on C via B. Thus, we have an omitted path problem in which a key covariance term is not captured by our model.
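The arithmetic behind this is simple (a minimal sketch using lavaan's default coefficient names):
# Model-implied cov(A, C) is the product of the two directed paths
coef(m)[["B~A"]] * coef(m)[["C~B"]]  # 0.2 * 0.7 = 0.14
cov(abc)["A", "C"]                   # observed value, 0.3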
We will cover modification indices in more detail soon, but for now, know that the modificationIndices function in lavaan estimates how much the model fit (\(\chi^2\)) would improve if a specific parameter were added to the model.
modificationIndices(m)
## lhs op rhs mi epc sepc.lv sepc.all sepc.nox
## 6 B ~~ C 52.3 -0.799 -0.799 -0.800 -0.800
## 7 B ~ C 52.3 -1.569 -1.569 -1.569 -1.569
## 8 C ~ A 52.3 0.167 0.167 0.167 0.167
## 10 A ~ C 52.3 0.327 0.327 0.327 0.327
We will return to the issue of models that fit the data equally well, but it’s important to note that in this case, there is only one degree of freedom and four equally good possibilities for capturing the relationship between A and C. This highlights the importance of having clear theoretical predictions about which variables are correlated versus which are likely to be (weak) causes or effects. As in regression, one can reverse predictor versus criterion with minimal change in syntax, but the choice to do so should be theoretically motivated. Likewise, in SEM, if one does not have clear predictions about the relationship between two variables, sometimes an undirected covariance between them is the appropriately weak parameterization (an example is sketched below, after the just-identified model).
syn2 <- '
B ~ A
C ~ B
A ~ C
'
m2 <- sem(syn2, abc, meanstructure=FALSE)
summary(m2)
## lavaan (0.5-23.1097) converged normally after 16 iterations
##
## Number of observations 1000
##
## Estimator ML
## Minimum Function Test Statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## B ~
## A -0.022 0.047 -0.460 0.646
## C ~
## B 0.703 0.024 29.427 0.000
## A ~
## C 0.314 0.043 7.357 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .B 1.008 0.050 20.277 0.000
## .C 0.510 0.023 22.360 0.000
## .A 0.909 0.041 22.351 0.000
Note that we now have a just-identified model (0 degrees of freedom): \(\mathbf{S_{XX}}\) contains 3 variances and 3 covariances (6 unique elements), and we estimate 6 free parameters here.
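As a sketch of the undirected-covariance alternative flagged by the modification indices (syn_cov and m_cov are names introduced here), we could instead free the residual covariance B ~~ C; this parameterization is also just-identified:
syn_cov <- '
B ~ A
C ~ B
B ~~ C  # residual covariance instead of a directed path
'
m_cov <- sem(syn_cov, data=abc, meanstructure=FALSE)
fitMeasures(m_cov, c("chisq", "df"))  # 0 degrees of freedom: another equally well-fitting model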
We could also conceive of a situation in which the model that lacks an \(A-C\) relationship is nevertheless reasonably good.
mu <- rep(0,3)
Sigma <- tribble(
~a, ~b, ~c,
1, 0.2, 0.16,
0.2, 1, 0.7,
0.16, 0.7, 1
)
abc <- data.frame(MASS::mvrnorm(1000, mu=mu, Sigma=Sigma, empirical=TRUE))
colnames(abc) <- LETTERS[1:3]
psych::describe(abc)
## vars n mean sd median trimmed mad min max range skew kurtosis
## A 1 1000 0 1 -0.02 0 0.97 -3.23 3.31 6.54 0.03 -0.01
## B 2 1000 0 1 0.00 0 0.96 -3.20 3.76 6.96 0.04 0.18
## C 3 1000 0 1 0.01 0 0.99 -2.75 2.93 5.68 0.05 -0.06
## se
## A 0.03
## B 0.03
## C 0.03
cov(abc)
## A B C
## A 1.00 0.2 0.16
## B 0.20 1.0 0.70
## C 0.16 0.7 1.00
syntax <- '
B ~ A
C ~ B
'
m <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m)
## lavaan (0.5-23.1097) converged normally after 12 iterations
##
## Number of observations 1000
##
## Estimator ML
## Minimum Function Test Statistic 0.817
## Degrees of freedom 1
## P-value (Chi-square) 0.366
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## B ~
## A 0.200 0.031 6.455 0.000
## C ~
## B 0.700 0.023 30.997 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .B 0.959 0.043 22.361 0.000
## .C 0.509 0.023 22.361 0.000
#semPaths(m)
fitted(m)
## $cov
## B C A
## B 0.999
## C 0.699 0.999
## A 0.200 0.140 0.999
##
## $mean
## B C A
## 0 0 0
We can see that the model-implied A-C covariance is 0.14, close to the ground-truth value of 0.16. Thus, we are largely capturing their association by modeling the indirect effect alone. As an aside, because the variables are standardized, cov(abc) is also a correlation matrix; below we rescale it with cor2cov to standard deviations of 1, 10, and 100, which changes the covariances (the units) but not the underlying correlational structure.
sds <- c(1, 10, 100)
abc_rescale <- cor2cov(cov(abc), sds)
print(abc_rescale)
## A B C
## A 1 2 16
## B 2 100 700
## C 16 700 10000
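As a sketch (m_rescale is a name introduced here), the same two-path model can be fit directly to this rescaled covariance matrix via sample.cov and sample.nobs; the unstandardized paths change with the units, but the standardized solution matches the standardized-data fit above:
m_rescale <- sem(syntax, sample.cov=abc_rescale, sample.nobs=1000, meanstructure=FALSE)
standardizedSolution(m_rescale)  # std.all estimates are 0.2 and 0.7, as before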
syntax <- '
B ~ b1*A
C ~ b2*B
med := b1*b2
'
m_ind <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m_ind)
## lavaan (0.5-23.1097) converged normally after 12 iterations
##
## Number of observations 1000
##
## Estimator ML
## Minimum Function Test Statistic 0.817
## Degrees of freedom 1
## P-value (Chi-square) 0.366
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## B ~
## A (b1) 0.200 0.031 6.455 0.000
## C ~
## B (b2) 0.700 0.023 30.997 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .B 0.959 0.043 22.361 0.000
## .C 0.509 0.023 22.361 0.000
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|)
## med 0.140 0.022 6.319 0.000
#semPaths(m_ind)
parTable(m_ind)
## id lhs op rhs user block group free ustart exo label plabel start
## 1 1 B ~ A 1 1 1 1 NA 0 b1 .p1. 0.000
## 2 2 C ~ B 1 1 1 2 NA 0 b2 .p2. 0.000
## 3 3 B ~~ B 0 1 1 3 NA 0 .p3. 0.500
## 4 4 C ~~ C 0 1 1 4 NA 0 .p4. 0.500
## 5 5 A ~~ A 0 1 1 0 NA 1 .p5. 0.999
## 6 6 med := b1*b2 1 0 0 0 NA 0 med 0.000
## est se
## 1 0.200 0.031
## 2 0.700 0.023
## 3 0.959 0.043
## 4 0.509 0.023
## 5 0.999 0.000
## 6 0.140 0.022
Notice that an additional parameter appears in the table, but it is not free: it is simply the product of b1 and b2.
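As a quick check (sketch), the defined parameter equals the product of the two labeled coefficients:
coef(m_ind)[["b1"]] * coef(m_ind)[["b2"]]  # 0.200 * 0.700 = 0.140, matching 'med' above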