Tomarken and Waller (2003) note the following problems with models that appear to fit well according to global statistics:
They also offer the following advice for avoiding overreliance on global fit statistics:
Equivalent models have the same degrees of freedom but specify different relationships among variables. Furthermore, equivalent models fit the data equally well, providing identical fit statistics and the same sample likelihood. They pose a problem in SEM when our core conclusions rest on the directionality of a relationship, because reversing cause and effect may yield an equivalent model with identical fit.
In general, directed (causal) relations cannot be inferred from covariance data without strong assumptions about the processes that generated the data. For example, if one variable causes another, the cause must precede the effect in time. Likewise, causal relationships are more plausible if one variable is manipulated experimentally and its effects on dependent variables are measured. Thus, the problem of equivalent models largely points to a limitation of relying on observational techniques and correlational measures.
Within a block of variables at the beginning of a structural model that is just identified and with unidirectional relations to subsequent variables, direct effects, correlated disturbances, and equality-constrained reciprocal effects are interchangeable. This means we can replace \(Y1 \rightarrow Y2\) with \(Y1 \leftarrow Y2\), \(Y2 \leftrightharpoons Y1\), or \(D1 \leftrightarrow D2\).
At subsequent places in the model where two endogenous variables have the same causes and their relations are unidirectional, all of the following may be substituted for one another: \(Y1 \rightarrow Y2\), \(Y1 \leftarrow Y2\), \(Y2 \leftrightharpoons Y1\) (equality-constrained), or \(D1 \leftrightarrow D2\).
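As a quick illustration of these substitutions, here is a minimal sketch using a toy simulation (the variables X, Y1, Y2, and Y3 are hypothetical and separate from the Money/Love example below): because Y1 and Y2 share the same cause, X, the Y1-Y2 relation can be a directed path in either direction or a disturbance correlation without changing fit.
#hypothetical sketch (variables X, Y1, Y2, Y3 are illustrative only): Y1 and Y2 share
#the same cause (X), so the Y1-Y2 relation can be swapped among the forms listed above
sim.syn <- '
Y1 ~ 0.4*X
Y2 ~ 0.4*X + 0.3*Y1
Y3 ~ 0.5*Y2
'
dsim <- simulateData(sim.syn, sample.nobs=500)
rA <- sem('Y1 ~ X
           Y2 ~ X + Y1
           Y3 ~ Y2', dsim)   #Y1 -> Y2
rB <- sem('Y2 ~ X
           Y1 ~ X + Y2
           Y3 ~ Y2', dsim)   #Y1 <- Y2
rC <- sem('Y1 ~ X
           Y2 ~ X
           Y1 ~~ Y2
           Y3 ~ Y2', dsim)   #D1 <-> D2
sapply(list(forward=rA, reversed=rB, corr_dist=rC), fitmeasures, fit.measures="chisq")  #identical chi-square values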
Limitations of replacing rules:
From Mulaik (2009)
From Hershberger (2006)
Use an index for judging model quality that is not based on global goodness-of-fit or likelihood. For example, the information complexity criterion (ICOMP) approach tries to quantify model misfit (residual covariances) relative to model complexity (correlations among parameter estimates, where greater correlation indicates complexity); see the semGOF package. Among equivalent models, the model with the lower ICOMP would be preferred.
Eliminate equivalent models that have problems during estimation, such as requiring many iterations, negative variance estimates, or parameters that must be constrained to avoid improper solutions. This is a heuristic approach.
Combine \(R^2\) measures for endogenous variables into a single measure (e.g., mean) and prefer the model with the highest value. This is also heuristic and relatively untested.
Examine extended individual case residuals (EICR). Although the residual covariance matrix is identical among equivalent models, the residuals in terms of predicted versus observed scores on manifest variables are not. Thus, according to Raykov and Penev (2001), a model that minimizes individual case residuals should be preferred over equivalents.
Consider this example:
m <- '
Happiness ~ 0.3*Money + 0.3*Love
Travel ~ 0.4*Happiness + 0.3*Money
'
d <- simulateData(m, sample.nobs=500)
#groundtruth
f1.syn <- '
Happiness ~ Money + Love
Travel ~ Happiness + Money
'
f1 <- sem(f1.syn, d)
semPaths_default(f1)
fitmeasures(f1)["chisq"]
## chisq
## 0.87
is not equivalent to this:
#not equivalent
f2.syn <- '
Happiness ~ Money + Love + Travel
Travel ~ Money
'
f2 <- sem(f2.syn, d)
semPaths_default(f2)
fitmeasures(f2)["chisq"]
## chisq
## 4.89
Why not? Love is a cause of Happiness but not of Travel, so the two endogenous variables do not have the same causes (rule 2 is not satisfied).
This, on the other hand, would be equivalent according to rule 1. The relations among exogenous variables are just identified and have unidirectional effects on other variables. Thus, we can shift the \(Money \leftrightarrow Love\) relationship from an unanalyzed association to a directed effect.
#equivalent
f3.syn <- '
Love ~ Money
Happiness ~ Money + Love
Travel ~ Happiness + Money
'
f3 <- sem(f3.syn, d, conditional.x=FALSE)
semPaths_default(f3)
fitmeasures(f3)["chisq"]
## chisq
## 0.87
#not equivalent: reversing Money -> Happiness leaves the Money-Love association unmodeled, so this model is more restrictive (note the different chi-square)
f4.syn <- '
Happiness ~ Love
Travel ~ Happiness + Money
Money ~ Happiness
'
f4 <- sem(f4.syn, d, conditional.x=FALSE)
semPaths_default(f4)
fitmeasures(f4)["chisq"]
## chisq
## 3.53
The closer the model is to saturated, the larger the number of equivalent models. When possible, try to test a relatively sparse structural model. More specifically, in saturated blocks, there is little information to adjudicate directionality. Thus, it is useful to include structural variables that are thought to be unrelated to others.
Models with many paths suffer from many equivalents. Can you test your key hypotheses in models with fewer variables?
Consider including ancillary variables that are not related to the original set in the model. In particular, consider adding antecedents (predictors) of variables in a source block such that they do not have any relationship to the focal (destination) block.
When we are considering two or more alternative models, we need to understand their relationships to each other. Models can be nested, non-nested, or equivalent.
Conceptually, model A is nested in model B if model A contains a subset of the parameters in model B. For example, if we remove a correlation between factors in one model, but estimate it in the other, then the restricted model (zero covariance between factors) is nested in the full model. Thus, it is no surprise that the correlated factors model can capture the predictions of the uncorrelated factors model because it can estimate the correlation of zero.
Non-nested models contain parameters that do not represent a part versus whole kind of relationship – that is, the parameters of one model are not a subset of the other.
And equivalent models, as we’ve seen above, have the same sample likelihood and degrees of freedom, even though they may have substantively different interpretations (e.g., reversing the direction of a directed path).
Bentler and Satorra (2010) defined a useful nesting and equivalence testing (NET) procedure:
Conduct typical SEM on Model \(M1\). Save the \(df_{M1}\) and model-implied covariance \(\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})\).
Conduct a SEM on an alternative model \(M2\) using the \(\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})\) as data (not raw). Save the \(\chi^2_{M2}\) and \(df_{M2}\).
Compute \(d = df_{M1} - df_{M2}\) and set a small criterion, \(\epsilon\), for judging whether \(\chi^2_{M2}\) is essentially zero (e.g., .001).
The primary idea is that when there are more degrees of freedom in \(M1\) than \(M2\) and the model-implied covariance matrix can be reproduced by \(M2\), then \(M1\) is nested in \(M2\).
From Brown Ch. 8, Fig 8.1. This is a measure of coping styles administered to 275 college students. There are four first-order latent variables, which reflect different ways of coping with stress: Problem Solving, Cognitive Restructuring, Express Emotions, and Social Support. The authors also anticipated that there are two higher-order factors that account for correlations among the first-order factors: Problem-Focused Coping and Emotion-Focused Coping.
rawdat <- '
1.00
0.78 1.00
0.80 0.77 1.00
0.56 0.51 0.48 1.00
0.52 0.51 0.46 0.78 1.00
0.59 0.51 0.51 0.80 0.79 1.00
0.16 0.15 0.17 0.14 0.18 0.16 1.00
0.19 0.13 0.18 0.14 0.16 0.16 0.81 1.00
0.12 0.17 0.17 0.17 0.20 0.16 0.75 0.80 1.00
0.16 0.13 0.17 0.15 0.16 0.18 0.56 0.52 0.50 1.00
0.16 0.14 0.18 0.15 0.16 0.18 0.51 0.58 0.51 0.81 1.00
0.16 0.15 0.14 0.16 0.16 0.14 0.52 0.57 0.52 0.80 0.79 1.00
'
#standard deviations of variables
sds <- as.numeric(strsplit('1.40 2.10 1.30 2.30 2.40 1.90 2.00 2.90 2.30 3.10 2.20 1.20', '\\s+', perl=TRUE)[[1]])
cormat <- getCov(rawdat)
#this is the covariance matrix to be analyzed
covmat <- lavaan::cor2cov(cormat, sds)
rownames(covmat) <- colnames(covmat) <- c("P1", "P2", "P3", "C1", "C2", "C3", "E1", "E2", "E3", "S1", "S2", "S3")
#to get Vuong test to work as expected, need raw data, not just a covmat
df <- mvrnorm(n=275, mu=rep(0, 12), Sigma=covmat, empirical = TRUE)
bmodel <- '
ProbSol =~ P1 + P2 + P3
CogRest =~ C1 + C2 + C3
ExpEmo =~ E1 + E2 + E3
SocSupp =~ S1 + S2 + S3
#explicitly name factor correlations
ProbSol ~~ CogRest
ProbSol ~~ ExpEmo
ProbSol ~~ SocSupp
CogRest ~~ ExpEmo
CogRest ~~ SocSupp
ExpEmo ~~ SocSupp
'
#l1fac_m <- sem(bmodel, sample.cov=covmat, sample.nobs = 275)
l1fac_m <- sem(bmodel, df)
summary(l1fac_m, fit.measures=TRUE, standardized=TRUE)
## lavaan 0.6-3 ended normally after 65 iterations
##
## Optimization method NLMINB
## Number of free parameters 30
##
## Number of observations 275
##
## Estimator ML
## Model Fit Test Statistic 82.970
## Degrees of freedom 48
## P-value (Chi-square) 0.001
##
## Model test baseline model:
##
## Minimum Function Test Statistic 2761.514
## Degrees of freedom 66
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.987
## Tucker-Lewis Index (TLI) 0.982
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -5642.801
## Loglikelihood unrestricted model (H1) -5601.316
##
## Number of free parameters 30
## Akaike (AIC) 11345.602
## Bayesian (BIC) 11454.106
## Sample-size adjusted Bayesian (BIC) 11358.981
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.051
## 90 Percent Confidence Interval 0.032 0.070
## P-value RMSEA <= 0.05 0.426
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.017
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ProbSol =~
## P1 1.000 1.273 0.911
## P2 1.424 0.071 20.173 0.000 1.812 0.864
## P3 0.896 0.043 20.778 0.000 1.140 0.879
## CogRest =~
## C1 1.000 2.038 0.888
## C2 1.027 0.051 19.964 0.000 2.094 0.874
## C3 0.842 0.040 21.117 0.000 1.715 0.904
## ExpEmo =~
## E1 1.000 1.743 0.873
## E2 1.542 0.072 21.395 0.000 2.688 0.929
## E3 1.133 0.059 19.043 0.000 1.974 0.860
## SocSupp =~
## S1 1.000 2.787 0.901
## S2 0.706 0.033 21.679 0.000 1.969 0.896
## S3 0.381 0.018 21.230 0.000 1.062 0.886
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ProbSol ~~
## CogRest 1.718 0.209 8.217 0.000 0.662 0.662
## ExpEmo 0.453 0.149 3.042 0.002 0.204 0.204
## SocSupp 0.695 0.237 2.928 0.003 0.196 0.196
## CogRest ~~
## ExpEmo 0.712 0.238 2.991 0.003 0.201 0.201
## SocSupp 1.146 0.381 3.011 0.003 0.202 0.202
## ExpEmo ~~
## SocSupp 3.255 0.396 8.218 0.000 0.670 0.670
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .P1 0.333 0.051 6.526 0.000 0.333 0.170
## .P2 1.111 0.130 8.560 0.000 1.111 0.253
## .P3 0.384 0.048 8.043 0.000 0.384 0.228
## .C1 1.118 0.143 7.812 0.000 1.118 0.212
## .C2 1.356 0.162 8.351 0.000 1.356 0.236
## .C3 0.654 0.093 7.010 0.000 0.654 0.182
## .E1 0.946 0.112 8.425 0.000 0.946 0.237
## .E2 1.155 0.205 5.638 0.000 1.155 0.138
## .E3 1.372 0.155 8.862 0.000 1.372 0.260
## .S1 1.805 0.242 7.448 0.000 1.805 0.189
## .S2 0.947 0.124 7.655 0.000 0.947 0.196
## .S3 0.308 0.038 8.103 0.000 0.308 0.215
## ProbSol 1.620 0.169 9.558 0.000 1.000 1.000
## CogRest 4.152 0.452 9.186 0.000 1.000 1.000
## ExpEmo 3.039 0.339 8.958 0.000 1.000 1.000
## SocSupp 7.770 0.823 9.435 0.000 1.000 1.000
semPaths_default(l1fac_m, sizeMan=7)
What about a model with a higher-order factor structure: Problem-Focused versus Emotion-Focused?
hcfa <- '
ProbSol =~ P1 + P2 + P3
CogRest =~ C1 + C2 + C3
ExpEmo =~ E1 + E2 + E3
SocSupp =~ S1 + S2 + S3
PFCope =~ ProbSol + CogRest
EFCope =~ ExpEmo + SocSupp
'
#hcfa_m <- sem(hcfa, sample.cov=covmat, sample.nobs = 275)
hcfa_m <- sem(hcfa, df)
summary(hcfa_m, fit.measures=TRUE, standardized=TRUE)
## lavaan 0.6-3 ended normally after 63 iterations
##
## Optimization method NLMINB
## Number of free parameters 29
##
## Number of observations 275
##
## Estimator ML
## Model Fit Test Statistic 83.005
## Degrees of freedom 49
## P-value (Chi-square) 0.002
##
## Model test baseline model:
##
## Minimum Function Test Statistic 2761.514
## Degrees of freedom 66
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.987
## Tucker-Lewis Index (TLI) 0.983
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -5642.819
## Loglikelihood unrestricted model (H1) -5601.316
##
## Number of free parameters 29
## Akaike (AIC) 11343.637
## Bayesian (BIC) 11448.524
## Sample-size adjusted Bayesian (BIC) 11356.570
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.050
## 90 Percent Confidence Interval 0.031 0.068
## P-value RMSEA <= 0.05 0.469
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.017
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ProbSol =~
## P1 1.000 1.273 0.911
## P2 1.424 0.071 20.174 0.000 1.812 0.864
## P3 0.896 0.043 20.777 0.000 1.140 0.879
## CogRest =~
## C1 1.000 2.038 0.888
## C2 1.028 0.051 19.966 0.000 2.094 0.874
## C3 0.842 0.040 21.115 0.000 1.715 0.904
## ExpEmo =~
## E1 1.000 1.743 0.873
## E2 1.542 0.072 21.393 0.000 2.687 0.928
## E3 1.133 0.059 19.047 0.000 1.975 0.860
## SocSupp =~
## S1 1.000 2.787 0.901
## S2 0.706 0.033 21.678 0.000 1.969 0.896
## S3 0.381 0.018 21.230 0.000 1.062 0.886
## PFCope =~
## ProbSol 1.000 0.812 0.812
## CogRest 1.609 0.422 3.813 0.000 0.816 0.816
## EFCope =~
## ExpEmo 1.000 0.826 0.826
## SocSupp 1.572 0.407 3.865 0.000 0.812 0.812
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## PFCope ~~
## EFCope 0.448 0.146 3.062 0.002 0.301 0.301
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .P1 0.333 0.051 6.525 0.000 0.333 0.170
## .P2 1.111 0.130 8.560 0.000 1.111 0.253
## .P3 0.384 0.048 8.043 0.000 0.384 0.228
## .C1 1.119 0.143 7.812 0.000 1.119 0.212
## .C2 1.355 0.162 8.350 0.000 1.355 0.236
## .C3 0.655 0.093 7.012 0.000 0.655 0.182
## .E1 0.946 0.112 8.423 0.000 0.946 0.237
## .E2 1.157 0.205 5.644 0.000 1.157 0.138
## .E3 1.372 0.155 8.859 0.000 1.372 0.260
## .S1 1.806 0.242 7.448 0.000 1.806 0.189
## .S2 0.947 0.124 7.655 0.000 0.947 0.196
## .S3 0.308 0.038 8.102 0.000 0.308 0.215
## ProbSol 0.552 0.279 1.982 0.047 0.341 0.341
## CogRest 1.388 0.721 1.925 0.054 0.334 0.334
## ExpEmo 0.968 0.530 1.826 0.068 0.319 0.319
## SocSupp 2.652 1.314 2.018 0.044 0.341 0.341
## PFCope 1.068 0.308 3.467 0.001 1.000 1.000
## EFCope 2.071 0.593 3.491 0.000 1.000 1.000
semPaths_default(hcfa_m, sizeMan=7)
We can test whether models are nested, equivalent, or not nested using Bentler and Satorra’s nesting and equivalence testing (NET) procedure. As mentioned above, this involves fitting model 1 as a candidate SEM, saving its model-implied covariance matrix, then fitting model 2 using that matrix as the input. We then examine the difference in degrees of freedom. If \(df_1 - df_2 > 0\) and the \(\chi^2\) of model 2 is essentially zero (usually < .001), then we say model 1 is nested in model 2. Note that this is an asymmetric test: model 1 may not be nested in model 2, but the converse (model 2 nested in model 1) could still hold.
Following Bentler and Satorra’s steps:
df1 <- fitmeasures(l1fac_m)["df"] #degrees of freedom for saturated factor model (no higher-order structure)
mimplied_l1fac <- fitted(l1fac_m)$cov #model implied cov
#now use the model-implied covariance as the data to analyze for hcfa_m
hcfa_m_fromcov <- sem(hcfa, sample.cov=mimplied_l1fac, sample.nobs = 275)
df2 <- fitmeasures(hcfa_m_fromcov)["df"]
d <- df1 - df2
cat("df for NET: ", d, "\n")
## df for NET: -1
chi2 <- fitmeasures(hcfa_m_fromcov)["chisq"]
cat("Chi-square for NET test: ", chi2, "\n")
## Chi-square for NET test: 0.0349
We see that the degrees of freedom difference is negative. Thus, the saturated model is not nested in the hierarchical model. This makes sense intuitively in that the hierarchical model imposes some structure on the factor model, whereas the saturated model estimates unique covariances among factors.
We can reverse the NET test (in terms of what is \(M1\) versus \(M2\)). To test whether the hierarchical CFA is nested in the saturated model, we first fit the saturated CFA, then test the hierarchical model using the model-implied covariance matrix from the saturated model.
#is the hierarchical CFA nested in the first-order model (saturated factor covariances)?
df1 <- fitmeasures(hcfa_m)["df"]
mimplied_hcfa <- fitted(hcfa_m)$cov
#now fit the first-order model (bmodel) to the model-implied covariance from hcfa_m
bmodel_fromcov <- sem(bmodel, sample.cov=mimplied_hcfa, sample.nobs = 275)
df2 <- fitmeasures(bmodel_fromcov)["df"]
d <- df1 - df2
cat("df for NET: ", d, "\n")
## df for NET: 1
chi2 <- fitmeasures(bmodel_fromcov)["chisq"]
cat("Chi-square for NET test: ", chi2, "\n")
## Chi-square for NET test: 0
Here, we see the expected result for a nested model: the df difference is positive (1) and the \(\chi^2\) of the second model (using the model-implied covariance of the first) is 0. Thus, the hierarchical CFA model is nested in the saturated first-order factor model.
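The steps above can be wrapped into a small convenience function. This is just a sketch of the manual procedure (net_manual is a hypothetical helper, not part of lavaan or semTools); it assumes the second argument is lavaan model syntax.
#sketch of a reusable wrapper for the manual NET steps above (net_manual is a
#hypothetical helper, not a lavaan or semTools function); m2_syntax is model syntax
net_manual <- function(m1_fit, m2_syntax, eps=.001) {
  sigma1 <- fitted(m1_fit)$cov                        #model-implied covariance of M1
  n <- lavInspect(m1_fit, "nobs")                     #sample size of M1
  m2_fit <- sem(m2_syntax, sample.cov=sigma1, sample.nobs=n)
  d <- fitmeasures(m1_fit)["df"] - fitmeasures(m2_fit)["df"]
  chi2 <- fitmeasures(m2_fit)["chisq"]                #~0 if M2 reproduces M1's implied covariance
  c(df_diff=unname(d), chisq_m2=unname(chi2), m1_nested_in_m2=unname(d > 0 & chi2 < eps))
}
net_manual(hcfa_m, bmodel)   #hierarchical CFA is nested in the first-order model
net_manual(l1fac_m, hcfa)    #but not the reverse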
The semTools package provides a net() function that will compute nesting/equivalence tests for all pairs of models passed in (as separate arguments). This package also provides the moreFitIndices() function, which reports the \(AIC_C\) as aic.smallN (see below for its role in model evidence).
#the easy way
semTools::net(l1fac_m, hcfa_m)
##
## If cell [R, C] is TRUE, the model in row R is nested within column C.
##
## If the models also have the same degrees of freedom, they are equivalent.
##
## NA indicates the model in column C did not converge when fit to the
## implied means and covariance matrix from the model in row R.
##
## The hidden diagonal is TRUE because any model is equivalent to itself.
## The upper triangle is hidden because for models with the same degrees
## of freedom, cell [C, R] == cell [R, C]. For all models with different
## degrees of freedom, the upper diagonal is all FALSE because models with
## fewer degrees of freedom (i.e., more parameters) cannot be nested
## within models with more degrees of freedom (i.e., fewer parameters).
##
## l1fac_m hcfa_m
## l1fac_m (df = 48)
## hcfa_m (df = 49) TRUE
semTools::moreFitIndices(l1fac_m)
## Warning in semTools::moreFitIndices(l1fac_m): AICc (aic.smallN) was
## developed for univariate linear models. It is probably not appropriate to
## use AICc to compare SEMs.
## gammaHat adjGammaHat baseline.rmsea aic.smallN bic.priorN
## 9.79e-01 9.66e-01 3.85e-01 1.14e+04 1.15e+04
## hqc sic
## 1.14e+04 1.14e+04
“All models are wrong but some are useful” - George Box
Having established whether two models are nested, not nested, or equivalent, we can move on to the important consideration of model evidence. In the case of equivalent models, relative evidence cannot be based on global fit given that both models will have the same likelihood and same number of parameters (see above for potential solutions to this problem). But in the case of non-nested and nested models, we would like to be able to make a statement such as, “Model A fits the data significantly better than Model B, p < .01,” or, “The evidence for Model A is ten times stronger than the evidence for Model B.”
The first statement concerns a binary judgment about model fit – that is, we are interested in testing the null hypothesis that two models fit the data equally well. We have seen this approach in previous weeks when using the likelihood ratio test (LRT) – aka the model chi-square difference test. In the LRT approach, we can compare two nested models (using the anova function in lavaan) to test the null hypothesis that the model with fewer parameters (the nested model) fits the data as well as the full model. If we reject this null (typically \(p < .05\)), then we conclude that the nested model fits significantly worse than the full model.
In our example above, we learned from NET (and could figure out intuitively) that the models are nested. Does the hierarchical factor model (nested) fit significantly worse than the first-order factor model (full)?
anova(l1fac_m, hcfa_m)
| | Df | AIC | BIC | Chisq | Chisq diff | Df diff | Pr(>Chisq) |
|---|---|---|---|---|---|---|---|
| l1fac_m | 48 | 11346 | 11454 | 83 | NA | NA | NA |
| hcfa_m | 49 | 11344 | 11449 | 83 | 0.035 | 1 | 0.852 |
No, we do not reject the null, meaning that we conclude that the nested model (hierarchical factors) does not fit significantly worse than the full model (saturated associations among first-order factors), \(p = .85\). Thus, even though the hierarchical factor model has one less free parameter (and thus, one additional degree of freedom), it does not fit significantly worse. Typically, in this case, we would retain the simpler model given that its fit is not worse.
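For reference, the same test can be computed by hand: the LRT statistic is the difference in \(-2\) log-likelihood (equivalently, the difference in model chi-square values), referred to a chi-square distribution with degrees of freedom equal to the difference in free parameters.
#LRT by hand (this is what anova() just computed): difference in -2*logLik, referred
#to a chi-square distribution with df equal to the difference in free parameters
lrt <- as.numeric(-2 * (logLik(hcfa_m) - logLik(l1fac_m)))
df_diff <- as.numeric(fitmeasures(hcfa_m)["df"] - fitmeasures(l1fac_m)["df"])
pchisq(lrt, df=df_diff, lower.tail=FALSE)   #matches the p-value from anova() above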
If we are interested in binary comparisons of non-nested models, we can usually use a Vuong test, which provides a test of the null hypothesis that one model fits the data as well as an alternative. See this paper from Ed Merkle: https://arxiv.org/pdf/1402.6720.pdf, and the corresponding nonnest2 package in R, which plays well with lavaan. Although we know that our models are nested, we can still apply the Vuong test (using nested=TRUE), which yields this output. Note that if we were comparing non-nested models, we would keep nested=FALSE!
nonnest2::vuongtest(l1fac_m, hcfa_m, nested=TRUE)
##
## Model 1
## Class: lavaan
## Call: lavaan::lavaan(model = bmodel, data = df, model.type = "sem", ...
##
## Model 2
## Class: lavaan
## Call: lavaan::lavaan(model = hcfa, data = df, model.type = "sem", int.ov.free = TRUE, ...
##
## Variance test
## H0: Model 1 and Model 2 are indistinguishable
## H1: Model 1 and Model 2 are distinguishable
## w2 = 0.000, p = 0.847
##
## Robust likelihood ratio test of distinguishable models
## H0: Model 2 fits as well as Model 1
## H1: Model 1 fits better than Model 2
## LR = 0.035, p = 0.862
Here, the binary judgments are the same as the LRT above – the hierarchical factor model does not fit significantly worse than the first-order model.
These, however, are binary judgments. What about the evidence in favor of the hierarchical model versus the first-order model? This is a dimensional question and could be framed in terms of the probability that one model is the best in the set. For example, is the evidence in favor of the hierarchical factor model three times greater than that for the first-order model?
To make judgments about relative evidence, we need a measure of how well one model fits compared to another (or many others).
The goal of most statistical modeling is to approximate the key factors that generated our empirical data. The data generation processes are usually complex and largely unknown, so our models will almost always be deficient in many ways. Nevertheless, we may wish to know how close one model is to the unknown truth relative to a comparator model. This is the substance of model selection theory – i.e., how one selects among models based on relative evidence.
Kullback-Leibler information quantifies the information lost when a simpler model, g, is used to approximate the true model, f:
\[ I(f,g) = \int{f(x) \textrm{log} \left( \frac{f(x)}{g(x|\theta)} \right) dx} \] This can also be thought of as the distance between our model, g, and the unknown population model, f.
In the simplest case, we can think of trying to identify a probability distribution (e.g., Gaussian) that best describes a given empirical sample. Recall from the M-L lecture our attempt to fit Gaussian data using a Gamma distribution. The sample likelihood for the Gamma, even at its optimized (fitted) parameters, can never be as high as that of a Gaussian fit.
ML Gamma fit
X <- rnorm(5000, mean=50, sd=sqrt(50))
gfit <- suppressWarnings(fitdistr(X, "gamma"))
gfit$loglik
## [1] -16981
datafromgam <- rgamma(n=5000, shape=gfit$estimate["shape"], rate=gfit$estimate["rate"])
ML Gaussian fit
gfit <- suppressWarnings(fitdistr(X, "normal"))
gfit$loglik
## [1] -16959
datafromgauss <- rnorm(n=5000, mean=gfit$estimate["mean"], sd=gfit$estimate["sd"])
df <- data.frame(model=rep(c("groundtruth", "gamma", "gauss"), each=5000),
v=c(X, datafromgam, datafromgauss))
ggplot(df, aes(x=v, color=model)) + geom_density()
The sample LL is an approximation of a K-L distance between a model and ‘truth’, but read on to AIC below…
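Because both densities are known in this toy example, the K-L information can also be computed directly by numerical integration. In the sketch below, g is taken to be a Gamma distribution moment-matched to the generating Gaussian (mean 50 and variance 50 imply shape = 50, rate = 1); this is an illustrative choice, not the ML-fitted Gamma from above.
#numerical K-L information between the generating Gaussian f and a moment-matched
#Gamma approximation g (shape=50, rate=1 gives mean 50 and variance 50)
f <- function(x) dnorm(x, mean=50, sd=sqrt(50))
g <- function(x) dgamma(x, shape=50, rate=1)
kl_integrand <- function(x) f(x) * log(f(x) / g(x))
#restrict the range to positive x (the Gamma density is zero at or below 0)
integrate(kl_integrand, lower=1, upper=150)   #small positive value: information lost using g for f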
In general, we do not know the true model, which means we cannot compute the exact K-L distance. Fortunately, however, we are often interested in relative distance – that is, how much better is model A than model B? In this circumstance, the K-L distance derivation includes two terms: one is based on the expectation under the ‘true’ model (can’t compute that!), the other is based on the relative distance between the model and the truth.
Since we can’t compute the first term, it means we do not have a measure of absolute fit using likelihood-based and K-L methods. But, because we can compute the relative distance, we can also compute the relative distance between two or more models. As a result, we can compare relative evidence across a set of candidate models. All may be bad, or all may be good, by some absolute criterion, but we can ascertain which is relatively best. Global fit criteria in SEM often give some information about the absolute fit (esp. residual-based methods). Likelihood methods (K-L distance) are more about relative evidence.
K-L distances are on a true ratio scale (i.e., a meaningful zero point and equal interval spacing). As a result, differences in distances can be compared, and a zero K-L difference implies identical relative fit. Moreover, because they are estimated from the sample log-likelihood, which sums over observations, K-L (and thus AIC) differences among models tend to be larger in large samples than in small ones. This does not mean that the scaling or interpretation changes! Rather, it underscores that discriminating among models becomes easier in large samples.
The derivation of the AIC was a major breakthrough in information theory and statistics because it provides a statistical link between K-L distance (a more theoretical notion) and maximum likelihood-based methods.
The AIC provides an approximation of the relative distance of the fitted model and the unknown (true) mechanisms that generated the data. Importantly, it is a large sample criterion (based on asymptotic properties, so 1000s, ideally) and is also not especially meaningful if the models in question are simply inappropriate or poor (e.g., fitting a binary outcome with a continuous model).
\[ AIC = -2 \textrm{log} (\mathcal{L}(\hat{\theta}|y)) + 2k \] where \(k\) is the number of free parameters, and \(\mathcal{L}(\hat{\theta}|y)\) is the sample likelihood.
Note that by multiplying the LL by -2, the criterion is now scaled such that lower values are better. By contrast, for LL, higher values indicate better fit (greater likelihood of the data given the model).
Because AIC is based on a large sample derivation (i.e., its statistical properties are ideal as the sample size goes to infinity), it may select overly complex models in smaller samples (think hundreds, not thousands). This led to the derivation of a variant of AIC, usually denoted \(AIC_C\) (Sugiura, 1978), that includes an additional penalty for sample size:
\[ AIC_C = -2 \textrm{log} (\mathcal{L}(\hat{\theta}|y)) + 2k + \frac{2k(k+1)}{n - k -1} \]
When your cases:parameters ratio is below 40 (which it usually is!), Burnham and Anderson recommend this over AIC. But this correction was derived for univariate (one-DV) regression, not SEM. There is no SEM-specific equivalent (that I know of), though there is also no specific reason to think the correction above is badly off. Nevertheless, there is (to my knowledge) no consensus on whether \(AIC_C\) should be preferred to the standard \(AIC\) in SEM.
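As a sanity check, both criteria are easy to compute by hand from a fitted lavaan model. The sketch below applies the univariate \(AIC_C\) formula to the first-order factor model, treating \(n\) as the number of observations; as noted, its appropriateness for SEM is debatable.
#AIC and AICc by hand for the first-order factor model (a sketch using the formulas above)
ll <- as.numeric(logLik(l1fac_m))              #sample log-likelihood
k <- as.numeric(fitmeasures(l1fac_m)["npar"])  #number of free parameters
n <- as.numeric(lavInspect(l1fac_m, "nobs"))   #number of observations
aic <- -2*ll + 2*k                             #matches AIC(l1fac_m) and the summary output above
aicc <- aic + (2*k*(k + 1)) / (n - k - 1)      #small-sample correction, as in the univariate formula
c(AIC=aic, AICc=aicc)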
One of the most important considerations when comparing models is that the set of endogenous variables is the same and on the same scale! This is often misunderstood in applied papers and can lead to bizarre circumstances where AICs differ by thousands of points. That’s not impossible, but more likely to reflect differences in the likelihood function that render models non-comparable using LL-based methods (incl. AIC/BIC). For example, if you log-transform some of the variables, their contribution to the sample likelihood will be remarkably different, rendering models with and without that transformation incomparable (at least in likelihood terms).
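A tiny illustration of this point, using hypothetical data and plain lm() for simplicity: the likelihood (and therefore the AIC) is defined on the scale of the outcome, so a model for y and a model for log(y) are not comparable.
#illustration with hypothetical data: transforming the outcome changes the likelihood
#scale, so the two AICs below cannot be compared to choose between the models
set.seed(1001)
x <- rnorm(300)
y <- exp(1 + 0.5*x + rnorm(300, sd=0.4))   #positive, skewed outcome
AIC(lm(y ~ x))        #likelihood computed for y
AIC(lm(log(y) ~ x))   #likelihood computed for log(y): a different endogenous variable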
Recall that the sample likelihood is computed on the endogenous variables. Thus, if the endogenous set changes (e.g., changing the direction of an arrow), LL methods cannot necessarily be compared. In lavaan, this relates to the conditional.x option (see ?lavOptions). If it is FALSE, then model-implied statistics, including the LL, reflect both exogenous and endogenous variables. This is likely what you’d want in cases where you’re testing alternative models with path direction changes that would make an endogenous variable exogenous, or vice versa.
If you’re not sure, check the unrestricted (saturated) LL for the two candidate models. If they are not identical, you cannot use AIC/BIC to compare them. You might also check your setting of conditional.x in lavaan. Fortunately, lavaan will usually provide a warning: “WARNING: some models are based on a different set of observed variables.” If you see that, watch out! Your models are almost certainly not comparable, either in terms of the LRT (nested) or AIC/BIC (non-nested) comparisons, since both are based on the sample likelihood.
Note that you will come across many “IC” (Information Criterion) measures in SEM and statistics, including the Bayesian information criterion (BIC). A discussion of these measures is beyond the scope of the class, but it’s worth mentioning that they have different statistical properties as relates to model selection (i.e., choosing which model fits best). Virtually all of the IC-based approaches build on the sample log-likelihood and are concerned with the fit-parsimony tradeoff. That is, we are interested in identifying a model that is as parsimonious as possible, while also fitting the data well.
We can always improve fit – that is, the sample likelihood – by adding additional free parameters. Even if the parameters are not useful (e.g., adding a predictor whose p-value is .5), they can only help fit, even if trivially. This is akin to the motivation for adjusted \(R^2\) in regression: an added predictor can only increase the unadjusted \(R^2\), so we penalize for model complexity. Thus, we do not want to prioritize the sample likelihood alone. Instead, we would like to prefer models that have the fewest free parameters possible. But if we delete a free parameter that has a huge effect in the dataset (e.g., a big regression path), the fit will worsen substantially. This is the dilemma of fit versus parsimony, and the motivation for measures such as AIC and BIC that try to balance these considerations.
(Some quotes from Hallquist & Pilkonis, 2010)
AIC is an asymptotically efficient criterion for model selection, which means that it tends to select the model that minimizes prediction error as sample size increases. More generally, AIC is an information criterion primarily concerned with minimizing the relative Kullback-Leibler (K-L) distance between a statistical model and the unknown mechanisms giving rise to the observed data (Claeskens & Hjort, 2008).
An underlying assumption of AIC is that the “truth” of the data is very complex (i.e., many causal factors of varying strengths) and that no one “true model” exists (Burnham & Anderson, 2002). Thus, AIC is useful for selecting the best approximate model using a finite number of parameters when “truth” is countably infinite.
By contrast, the BIC originates from the Bayesian tradition in statistics and is concerned with the statistical property of consistency, which refers to the one “true model” being selected with increasing probability as sample size increases (Yang, 2005). Underlying the property of consistency are several assumptions worth examining: (1) a single “true” model exists that best represents the mechanisms that gave rise to the observed data, (2) one of the models in the set of models tested is the “true” model, and (3) the focus of model selection is to identify the “true” model rather than to provide a close approximation to the truth. Given these assumptions, Burnham and Anderson (2002) have argued against the use of consistent model selection criteria in the biological, medical, and social sciences because such data are highly complex, there is a low likelihood of including the “true model” in any set of models considered, and it is unlikely that a single “true model” exists.
Summarizing extensive simulation work, Burnham and Anderson (2002) offer useful rules-of-thumb for inferring the degree of certainty afforded to a particular model relative to the best fitting model in the set:
Moreover, AIC values for a set of models can be subtracted from the model with the lowest AIC value and these difference scores can be transformed into a set of Akaike weights that represent the evidential weight of a particular model relative to all other models in the set (Burnham & Anderson, 2002). The important general point is that conclusions drawn from latent variable models would benefit from being contextualized in terms of model selection uncertainty and the degree of support for one model over alternatives.
For details on Akaike weights or evidence ratios, I would recommend checking out the AICcmodavg package and Burnham and Anderson’s 2002 book. Briefly, the Akaike weight is the probability that model \(i\) is the Kullback-Leibler best model in the set of models we’ve tested. A K-L best model just means that its relative distance from the unknown truth is shortest, and thus its relative evidence is strongest; the weight conveys the uncertainty in this judgment.
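A by-hand sketch of the weights, using plain AIC for simplicity (aictab() below uses \(AIC_C\), so its numbers will differ slightly): subtract the minimum AIC, exponentiate, and normalize.
#Akaike weights by hand, using plain AIC (aictab() below reports AICc, so its numbers differ slightly)
aics <- c(first_order=AIC(l1fac_m), higher_order=AIC(hcfa_m))
delta <- aics - min(aics)                  #difference from the best (lowest-AIC) model
w <- exp(-delta/2) / sum(exp(-delta/2))    #probability that each model is the K-L best in the set
w
w["higher_order"] / w["first_order"]       #evidence ratio for the higher-order model
The aictab() function automates this (its output below is based on \(AIC_C\)):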
aa <- aictab(list(first_order=l1fac_m, higher_order=hcfa_m))
print(aa)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## higher_order 29 11351 0.00 0.78 0.78 -5643
## first_order 30 11353 2.49 0.22 1.00 -5643
Returning to our first-order versus hierarchical factor model example, we see that the difference in \(AIC_C\) between these models is 2.49, which falls into the weak range of Burnham and Anderson’s rules. The Akaike weights sum to 1.0 and reveal that the probability that the hierarchical model is the best in the set of tested models (here, two) is .78. A 78% probability isn’t terrible, but it’s not definitive either – we should have some uncertainty about whether the hierarchical model is best. The fact that the hierarchical model is also more parsimonious may tilt our conclusions further in that direction.
Furthermore, note that AIC weights do not consider the important possibility that there is another, untested model that would capture the data far better. Thus, we must always remember that these measures provide relative evidence of one model versus a set of tested alternatives, but not versus alternatives we failed to consider!
Finally, we can take the ratios of AIC weights as an evidence ratio, which quantifies the ratio in favor of one model (with lower AIC) compared to another (higher AIC).
evidence(aa)
##
## Evidence ratio between models 'higher_order' and 'first_order':
## 3.47
Here, the evidence in favor of the higher-order factor model is 3.47 times that for the first-order factor model. Many statisticians would prefer that such ratios (closely related to Bayes factors) be 10 or higher, minimally, to make strong claims that one model is better than another. But here, we could contextualize the evidence for the hierarchical model as “moderately” stronger than that for the alternative, first-order model.