1 Simulate data

mu <- rep(0,3)
Sigma <- tribble(
  ~a, ~b, ~c,
  1, 0.2, 0.3,
  0.2, 1, 0.7,
  0.3, 0.7, 1
)

abc = data.frame(MASS::mvrnorm(1000, mu=mu, Sigma=Sigma, empirical=TRUE))
colnames(abc) = LETTERS[1:3]
psych::describe(abc)
vars n mean sd median trimmed mad min max range skew kurtosis se
A 1 1000 0 1 -0.022 -0.017 1.081 -2.82 3.52 6.34 0.190 -0.071 0.032
B 2 1000 0 1 0.029 0.010 1.056 -2.96 2.81 5.78 -0.071 -0.296 0.032
C 3 1000 0 1 0.023 0.003 0.969 -3.50 3.61 7.11 -0.002 0.035 0.032
cov(abc)
##     A   B   C
## A 1.0 0.2 0.3
## B 0.2 1.0 0.7
## C 0.3 0.7 1.0

2 Simple model

syntax <- '
B ~ A
C ~ B
'

m <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m)
## lavaan 0.6-3 ended normally after 11 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          4
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      53.704
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A                 0.200    0.031    6.455    0.000
##   C ~                                                 
##     B                 0.700    0.023   30.997    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 0.959    0.043   22.361    0.000
##    .C                 0.509    0.023   22.361    0.000
semPaths_default(m, what="std", edge.label.cex=1.5)

Note that the residual variance estimates are also estimates of \(1 - R^2\) because we have standardized variables.

3 Under the hood: parameter table

parTable(m)
id lhs op rhs user block group free ustart exo label plabel start est se
1 B ~ A 1 1 1 1 NA 0 .p1. 0.000 0.200 0.031
2 C ~ B 1 1 1 2 NA 0 .p2. 0.000 0.700 0.023
3 B ~~ B 0 1 1 3 NA 0 .p3. 0.499 0.959 0.043
4 C ~~ C 0 1 1 4 NA 0 .p4. 0.499 0.509 0.023
5 A ~~ A 0 1 1 0 NA 1 .p5. 0.999 0.999 0.000

4 Parameter estimates

parameterestimates(m)
lhs op rhs est se z pvalue ci.lower ci.upper
B ~ A 0.200 0.031 6.46 0 0.139 0.261
C ~ B 0.700 0.023 31.00 0 0.656 0.744
B ~~ B 0.959 0.043 22.36 0 0.875 1.043
C ~~ C 0.509 0.023 22.36 0 0.465 0.554
A ~~ A 0.999 0.000 NA NA 0.999 0.999

Note that the variance of A is exogenous, but estimated. More specifically, it will be set at the sample estimate, meaning that the variance of A is not part of the likelihood function and does not contribute to any fit statistics.

5 Model-implied covariance

Use fitted() to obtain the model-implied covariance matrix, \(\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})\):

fitted(m)
## $cov
##   B     C     A    
## B 0.999            
## C 0.699 0.999      
## A 0.200 0.140 0.999

Notice how low the covariance is between A and C compared to our groundtruth value of 0.7. That is, the association between A and C is not fully explained by the indirect association of A with C via B. Also note that the matrix order is not necessarily the same as the sample covariance matrix. This is just a trivial rearrangement of the row/column order, not something meaningful.

6 Sample covariance

Use cov() to obtain the sample covariance matrix, \(\mathbf{S_{XX}}\):

cov(abc)
##     A   B   C
## A 1.0 0.2 0.3
## B 0.2 1.0 0.7
## C 0.3 0.7 1.0

7 Model modification

We know from our simulation that there is a relationship between A and C that is not entirely attributable to the indirect effect of A on B via B. Thus, we have an omitted path problem in which a key covariance term is not captured by our model.

We will cover modification indices in more detail soon, but for now, know that the modificationIndices function in lavaan provides estimates of how much the model fit (\(\chi^2\)) would fit if a specific parameter were added to the model.

modificationIndices(m)
lhs op rhs mi epc sepc.lv sepc.all sepc.nox
6 B ~~ C 52.3 -0.799 -0.799 -1.143 -1.143
7 B ~ C 52.3 -1.569 -1.569 -1.569 -1.569
8 C ~ A 52.3 0.167 0.167 0.167 0.167
10 A ~ C 52.3 0.327 0.327 0.327 0.327

7.1 Equivalent models

We will return to the issue of models that fit the data equally well, but it’s important to note that in this case, there is only one degree of freedom, and four equally good possibilities for capturing the relationship between A and C. This highlights the importance of having clear theoretical predictions about which variables are correlated versus which are likely to be (weak) causes or effects. As in regression, one can reverse predictor versus criterion with minimal change in syntax, but the choice to do so should be theoretically motivated. Likewise, in SEM, if one does now have clear predictions about the relationship between two variables, sometimes an undirected covariance between them is the appropriately weak parameterization.

syn2 <- '
B ~ A
C ~ B
A ~ C
'

m2 <- sem(syn2, abc, meanstructure=FALSE)
summary(m2)
## lavaan 0.6-3 ended normally after 16 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
##   Minimum Function Value               0.0000000000000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A                -0.022    0.047   -0.460    0.646
##   C ~                                                 
##     B                 0.703    0.024   29.427    0.000
##   A ~                                                 
##     C                 0.314    0.043    7.357    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 1.008    0.050   20.277    0.000
##    .C                 0.510    0.023   22.360    0.000
##    .A                 0.909    0.041   22.351    0.000

Note that we now have a just-identified model (0 degrees of freedom) because we have 3 variances and 3 covariances in \(\mathbf{S_{XX}}\) and we have 6 free parameters estimated here.

7.2 Making the remaining covariance trivial

We could also conceive of a situation in which the model that lacks an \(A-C\) relationship is nevertheless reasonably good.

mu <- rep(0,3)
Sigma <- tribble(
  ~a, ~b, ~c,
  1, 0.2, 0.16,
  0.2, 1, 0.7,
  0.16, 0.7, 1
)

abc = data.frame(MASS::mvrnorm(1000, mu=mu, Sigma=Sigma, empirical=TRUE))
colnames(abc) = LETTERS[1:3]
psych::describe(abc)
vars n mean sd median trimmed mad min max range skew kurtosis se
A 1 1000 0 1 0.020 0.010 1.026 -3.49 2.99 6.48 -0.110 0.037 0.032
B 2 1000 0 1 -0.019 0.000 0.969 -2.88 3.12 6.00 0.032 0.023 0.032
C 3 1000 0 1 -0.025 -0.007 0.987 -3.02 2.92 5.94 0.065 -0.127 0.032
cov(abc)
##      A   B    C
## A 1.00 0.2 0.16
## B 0.20 1.0 0.70
## C 0.16 0.7 1.00
syntax <- '
B ~ A
C ~ B
'

m <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m)
## lavaan 0.6-3 ended normally after 12 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          4
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.817
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.366
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A                 0.200    0.031    6.455    0.000
##   C ~                                                 
##     B                 0.700    0.023   30.997    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 0.959    0.043   22.361    0.000
##    .C                 0.509    0.023   22.361    0.000
#semPaths(m)
fitted(m)
## $cov
##   B     C     A    
## B 0.999            
## C 0.699 0.999      
## A 0.200 0.140 0.999

We can see that the A-C relationship is 0.14 in the output and 0.16 in the groundtruth. Thus, we are capturing their association by modeling the indirect effect alone.

8 Ill conditioning

sds <- c(1, 10, 100)
abc_rescale <- cor2cov(cov(abc), sds)
print(abc_rescale)
##    A   B     C
## A  1   2    16
## B  2 100   700
## C 16 700 10000

9 Indirect effects

syntax <- '
B ~ b1*A
C ~ b2*B
med := b1*b2
'

m_ind <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m_ind)
## lavaan 0.6-3 ended normally after 12 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          4
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.817
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.366
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A         (b1)    0.200    0.031    6.455    0.000
##   C ~                                                 
##     B         (b2)    0.700    0.023   30.997    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 0.959    0.043   22.361    0.000
##    .C                 0.509    0.023   22.361    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     med               0.140    0.022    6.319    0.000
#semPaths(m_ind)

9.1 parTable for indirect effects

parTable(m_ind)
id lhs op rhs user block group free ustart exo label plabel start est se
1 B ~ A 1 1 1 1 NA 0 b1 .p1. 0.000 0.200 0.031
2 C ~ B 1 1 1 2 NA 0 b2 .p2. 0.000 0.700 0.023
3 B ~~ B 0 1 1 3 NA 0 .p3. 0.499 0.959 0.043
4 C ~~ C 0 1 1 4 NA 0 .p4. 0.500 0.509 0.023
5 A ~~ A 0 1 1 0 NA 1 .p5. 0.999 0.999 0.000
6 med := b1*b2 1 0 0 0 NA 0 med 0.000 0.140 0.022

Notice how there is an additional parameter estimated, but it is not free because it is a simple product of b1 and b2.

9.2 Full details for the model with indirect effects

Use inspect()

inspect(m_ind)
## $lambda
##   B C A
## B 0 0 0
## C 0 0 0
## A 0 0 0
## 
## $theta
##   B C A
## B 0    
## C 0 0  
## A 0 0 0
## 
## $psi
##   B C A
## B 3    
## C 0 4  
## A 0 0 0
## 
## $beta
##   B C A
## B 0 0 1
## C 2 0 0
## A 0 0 0