Simulate data

library(tibble)   # tribble()
library(lavaan)   # sem(), parTable(), modificationIndices(), cor2cov()
library(semPlot)  # semPaths()

mu <- rep(0,3)
Sigma <- tribble(
  ~a, ~b, ~c,
  1, 0.2, 0.3,
  0.2, 1, 0.7,
  0.3, 0.7, 1
)
Sigma <- as.matrix(Sigma)  # mvrnorm() expects a matrix, not a tibble

abc <- data.frame(MASS::mvrnorm(1000, mu=mu, Sigma=Sigma, empirical=TRUE))
colnames(abc) <- LETTERS[1:3]
psych::describe(abc)
##   vars    n mean sd median trimmed  mad   min  max range  skew kurtosis
## A    1 1000    0  1  -0.03   -0.01 1.02 -3.50 3.10  6.60  0.04     0.15
## B    2 1000    0  1   0.00    0.00 1.01 -3.54 4.01  7.55 -0.02     0.16
## C    3 1000    0  1   0.01    0.00 0.97 -3.65 3.15  6.80 -0.01     0.21
##     se
## A 0.03
## B 0.03
## C 0.03
cov(abc)
##     A   B   C
## A 1.0 0.2 0.3
## B 0.2 1.0 0.7
## C 0.3 0.7 1.0

Simple model

syntax <- '
B ~ A
C ~ B
'

m <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m)
## lavaan (0.5-23.1097) converged normally after  11 iterations
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               53.704
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A                 0.200    0.031    6.455    0.000
##   C ~                                                 
##     B                 0.700    0.023   30.997    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 0.959    0.043   22.361    0.000
##    .C                 0.509    0.023   22.361    0.000
semPaths(m)

Note that the residual variance estimates are also estimates of \(1 - R^2\) because we have standardized variables.
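
As a quick check of this claim (re-simulating the data so the snippet stands alone), the OLS \(R^2\) for B ~ A recovers the same quantity; the small discrepancy from the .B estimate of 0.959 reflects lavaan's ML (n rather than n - 1) variance denominator:

```r
# with standardized variables, residual variance = 1 - R^2
Sigma <- matrix(c(1,   0.2, 0.3,
                  0.2, 1,   0.7,
                  0.3, 0.7, 1), nrow = 3)
abc <- data.frame(MASS::mvrnorm(1000, mu = rep(0, 3), Sigma = Sigma,
                                empirical = TRUE))
colnames(abc) <- LETTERS[1:3]
1 - summary(lm(B ~ A, data = abc))$r.squared  # 0.96; lavaan reports 0.96 * 999/1000 = 0.959
```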

Under the hood: parameter table

parTable(m)
##   id lhs op rhs user block group free ustart exo label plabel start   est
## 1  1   B  ~   A    1     1     1    1     NA   0         .p1. 0.000 0.200
## 2  2   C  ~   B    1     1     1    2     NA   0         .p2. 0.000 0.700
## 3  3   B ~~   B    0     1     1    3     NA   0         .p3. 0.499 0.959
## 4  4   C ~~   C    0     1     1    4     NA   0         .p4. 0.499 0.509
## 5  5   A ~~   A    0     1     1    0     NA   1         .p5. 0.999 0.999
##      se
## 1 0.031
## 2 0.023
## 3 0.043
## 4 0.023
## 5 0.000

Parameter estimates

parameterestimates(m)
##   lhs op rhs   est    se     z pvalue ci.lower ci.upper
## 1   B  ~   A 0.200 0.031  6.46      0    0.139    0.261
## 2   C  ~   B 0.700 0.023 31.00      0    0.656    0.744
## 3   B ~~   B 0.959 0.043 22.36      0    0.875    1.043
## 4   C ~~   C 0.509 0.023 22.36      0    0.465    0.554
## 5   A ~~   A 0.999 0.000    NA     NA    0.999    0.999

Note that the variance of A is exogenous: it appears in the parameter table, but it is not freely estimated. Instead, it is fixed at the sample estimate, meaning that the variance of A is not part of the likelihood function and does not contribute to any fit statistics.
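
The fixed value of 0.999 is just the ML (denominator n) version of the sample variance, which we can verify with one line:

```r
# lavaan's sample statistics use the n (ML) denominator, so the unbiased
# sample variance of 1 (enforced by empirical = TRUE) becomes (n - 1)/n
n <- 1000
(n - 1) / n * 1  # 0.999, matching A ~~ A in the parameter table
```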

Model-implied covariance

fitted(m)
## $cov
##   B     C     A    
## B 0.999            
## C 0.699 0.999      
## A 0.200 0.140 0.999
## 
## $mean
## B C A 
## 0 0 0

Notice how low the model-implied covariance between A and C (0.140) is compared to our ground truth value of 0.3. That is, the association between A and C is not fully explained by the indirect association of A with C via B.
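
The implied A-C covariance follows from the path tracing rules: with standardized variables, the covariance transmitted along the chain A → B → C is simply the product of the two path coefficients.

```r
b1 <- 0.200  # A -> B
b2 <- 0.700  # B -> C
b1 * b2      # 0.14, the model-implied A-C covariance, versus the true 0.3
```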

Model modification

We know from our simulation that there is a relationship between A and C that is not entirely attributable to the indirect effect of A on C via B. Thus, we have an omitted-path problem in which a key covariance is not captured by our model.

We will cover modification indices in more detail soon, but for now, know that the modificationIndices function in lavaan provides estimates of how much the model fit (\(\chi^2\)) would improve if a specific parameter were added to the model.

modificationIndices(m)
##    lhs op rhs   mi    epc sepc.lv sepc.all sepc.nox
## 6    B ~~   C 52.3 -0.799  -0.799   -0.800   -0.800
## 7    B  ~   C 52.3 -1.569  -1.569   -1.569   -1.569
## 8    C  ~   A 52.3  0.167   0.167    0.167    0.167
## 10   A  ~   C 52.3  0.327   0.327    0.327    0.327
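
Note that the MI is a quadratic approximation to the actual change in \(\chi^2\). Here, freeing any one of these parameters yields a just-identified (0-df) model with a \(\chi^2\) of 0, so the exact drop is the full 53.704; the MI of 52.3 slightly understates it:

```r
# exact chi-square drop vs. the quadratic MI approximation
chisq_current <- 53.704  # 1-df model above
mi <- 52.3               # MI for any of the listed parameters
chisq_current - mi       # ~1.4: how much the MI understates the true drop
```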

Equivalent models

We will return to the issue of models that fit the data equally well, but it’s important to note that in this case, there is only one degree of freedom and four equally good possibilities for capturing the relationship between A and C. This highlights the importance of having clear theoretical predictions about which variables are merely correlated versus which are likely to be (weak) causes or effects. As in regression, one can swap predictor and criterion with minimal change in syntax, but the choice to do so should be theoretically motivated. Likewise, in SEM, if one does not have clear predictions about the relationship between two variables, an undirected covariance between them is sometimes the appropriately weak parameterization.
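
For example, the first modification index above (B ~~ C) corresponds to a residual covariance rather than a directed path; it could be specified as follows (syn_cov is a hypothetical name, and fitting it to these data would likewise yield a just-identified model):

```r
# residual covariance between B and C instead of a directed path
syn_cov <- '
B ~ A
C ~ B
B ~~ C
'
# m_cov <- sem(syn_cov, data = abc)  # requires lavaan and the abc data above
```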

syn2 <- '
B ~ A
C ~ B
A ~ C
'

m2 <- sem(syn2, abc, meanstructure=FALSE)
summary(m2)
## lavaan (0.5-23.1097) converged normally after  16 iterations
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A                -0.022    0.047   -0.460    0.646
##   C ~                                                 
##     B                 0.703    0.024   29.427    0.000
##   A ~                                                 
##     C                 0.314    0.043    7.357    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 1.008    0.050   20.277    0.000
##    .C                 0.510    0.023   22.360    0.000
##    .A                 0.909    0.041   22.351    0.000

Note that we now have a just-identified model (0 degrees of freedom) because we have 3 variances and 3 covariances in \(\mathbf{S_{XX}}\) and we have 6 free parameters estimated here.
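
The degrees-of-freedom arithmetic can be written out explicitly:

```r
p <- 3
moments <- p * (p + 1) / 2  # 6 unique elements of S_XX (3 variances + 3 covariances)
free    <- 6                # 3 regression paths + 3 residual variances in m2
moments - free              # 0 degrees of freedom: just-identified
```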

Making the remaining covariance trivial

We could also conceive of a situation in which the model that lacks an \(A-C\) relationship is nevertheless reasonably good.

mu <- rep(0,3)
Sigma <- tribble(
  ~a, ~b, ~c,
  1, 0.2, 0.16,
  0.2, 1, 0.7,
  0.16, 0.7, 1
)
Sigma <- as.matrix(Sigma)  # mvrnorm() expects a matrix, not a tibble

abc <- data.frame(MASS::mvrnorm(1000, mu=mu, Sigma=Sigma, empirical=TRUE))
colnames(abc) <- LETTERS[1:3]
psych::describe(abc)
##   vars    n mean sd median trimmed  mad   min  max range skew kurtosis
## A    1 1000    0  1  -0.02       0 0.97 -3.23 3.31  6.54 0.03    -0.01
## B    2 1000    0  1   0.00       0 0.96 -3.20 3.76  6.96 0.04     0.18
## C    3 1000    0  1   0.01       0 0.99 -2.75 2.93  5.68 0.05    -0.06
##     se
## A 0.03
## B 0.03
## C 0.03
cov(abc)
##      A   B    C
## A 1.00 0.2 0.16
## B 0.20 1.0 0.70
## C 0.16 0.7 1.00
syntax <- '
B ~ A
C ~ B
'

m <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m)
## lavaan (0.5-23.1097) converged normally after  12 iterations
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.817
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.366
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A                 0.200    0.031    6.455    0.000
##   C ~                                                 
##     B                 0.700    0.023   30.997    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 0.959    0.043   22.361    0.000
##    .C                 0.509    0.023   22.361    0.000
#semPaths(m)
fitted(m)
## $cov
##   B     C     A    
## B 0.999            
## C 0.699 0.999      
## A 0.200 0.140 0.999
## 
## $mean
## B C A 
## 0 0 0

We can see that the model-implied A-C covariance is 0.14 in the output, versus 0.16 in the ground truth. Thus, the indirect effect alone captures nearly all of their association.

Ill conditioning

When variables are measured on very different scales, the covariance matrix becomes ill conditioned, which can produce numerical problems in ML estimation (e.g., convergence failures or inaccurate estimates). To illustrate, we rescale our standardized variables to standard deviations of 1, 10, and 100:

sds <- c(1, 10, 100)
abc_rescale <- cor2cov(cov(abc), sds)
print(abc_rescale)
##    A   B     C
## A  1   2    16
## B  2 100   700
## C 16 700 10000
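
cor2cov (from lavaan) simply pre- and post-multiplies the correlation matrix by a diagonal matrix of standard deviations; a self-contained base-R sketch of the same operation:

```r
# cov = D %*% R %*% D, where D = diag(sds)
R <- matrix(c(1,    0.2, 0.16,
              0.2,  1,   0.7,
              0.16, 0.7, 1), nrow = 3)
sds <- c(1, 10, 100)
diag(sds) %*% R %*% diag(sds)  # matches the cor2cov() output above
```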

Indirect effects

syntax <- '
B ~ b1*A
C ~ b2*B
med := b1*b2
'

m_ind <- sem(syntax, data=abc, meanstructure=FALSE)
summary(m_ind)
## lavaan (0.5-23.1097) converged normally after  12 iterations
## 
##   Number of observations                          1000
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.817
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.366
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   B ~                                                 
##     A         (b1)    0.200    0.031    6.455    0.000
##   C ~                                                 
##     B         (b2)    0.700    0.023   30.997    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .B                 0.959    0.043   22.361    0.000
##    .C                 0.509    0.023   22.361    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     med               0.140    0.022    6.319    0.000
#semPaths(m_ind)

parTable for indirect effects

parTable(m_ind)
##   id lhs op   rhs user block group free ustart exo label plabel start
## 1  1   B  ~     A    1     1     1    1     NA   0    b1   .p1. 0.000
## 2  2   C  ~     B    1     1     1    2     NA   0    b2   .p2. 0.000
## 3  3   B ~~     B    0     1     1    3     NA   0         .p3. 0.500
## 4  4   C ~~     C    0     1     1    4     NA   0         .p4. 0.500
## 5  5   A ~~     A    0     1     1    0     NA   1         .p5. 0.999
## 6  6 med := b1*b2    1     0     0    0     NA   0   med        0.000
##     est    se
## 1 0.200 0.031
## 2 0.700 0.023
## 3 0.959 0.043
## 4 0.509 0.023
## 5 0.999 0.000
## 6 0.140 0.022

Notice that an additional parameter (med) now appears in the table, but it is not free: it is defined as the simple product of b1 and b2.
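
The standard error for med comes from the first-order delta method; for a product of two (here, asymptotically independent) estimates, this reduces to the familiar Sobel formula, which we can check against the output above:

```r
# Sobel / delta-method SE for the product b1 * b2:
# SE(b1*b2) ~ sqrt(b2^2 * SE(b1)^2 + b1^2 * SE(b2)^2)
b1 <- 0.200; se_b1 <- 0.031
b2 <- 0.700; se_b2 <- 0.023
sqrt(b2^2 * se_b1^2 + b1^2 * se_b2^2)  # ~0.022, matching med's SE above
```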