Tomarken and Waller (2003) note the following problems with models that appear to fit well according to global statistics:
They also offer the following advice for avoiding overreliance on global fit statistics:
Equivalent models have the same degrees of freedom but specify different relationships among variables. Furthermore, equivalent models fit the data equally well, providing identical fit statistics and the same sample likelihood. They pose a problem in SEM when our core conclusions rest on the directionality of a relationship, because reversing cause and effect may yield an equivalent model with identical fit.
In general, directed (causal) relations cannot be inferred from covariance data without strong assumptions about the processes that generated the data. For example, if one variable causes another, the cause must precede the effect in time. Likewise, causal relationships are more plausible if one variable is manipulated experimentally and its effects on dependent variables are measured. Thus, the problem of equivalent models largely points to a limitation of relying on observational techniques and correlational measures.
Within a block of variables at the beginning of a structural model that is just identified and with unidirectional relations to subsequent variables, direct effects, correlated disturbances, and equality-constrained reciprocal effects are interchangeable. This means we can replace \(Y1 \rightarrow Y2\) with \(Y1 \leftarrow Y2\), \(Y2 \leftrightharpoons Y1\), or \(D1 \leftrightarrow D2\).
At subsequent places in the model where two endogenous variables have the same causes and their relations are unidirectional, all of the following may be substituted for one another: \(Y1 \rightarrow Y2\), \(Y1 \leftarrow Y2\), \(Y2 \leftrightharpoons Y1\) (equality-constrained), or \(D1 \leftrightarrow D2\).
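As a quick illustration of these substitutions, here is a minimal sketch using a toy simulation (the variables X, Y1, Y2, and Y3 are hypothetical and separate from the Money/Love example below): because Y1 and Y2 share the same cause, X, the Y1-Y2 relation can be a directed path in either direction or a disturbance correlation without changing fit.
#hypothetical sketch (variables X, Y1, Y2, Y3 are illustrative only): Y1 and Y2 share
#the same cause (X), so the Y1-Y2 relation can be swapped among the forms listed above
sim.syn <- '
Y1 ~ 0.4*X
Y2 ~ 0.4*X + 0.3*Y1
Y3 ~ 0.5*Y2
'
dsim <- simulateData(sim.syn, sample.nobs=500)
rA <- sem('Y1 ~ X
           Y2 ~ X + Y1
           Y3 ~ Y2', dsim)   #Y1 -> Y2
rB <- sem('Y2 ~ X
           Y1 ~ X + Y2
           Y3 ~ Y2', dsim)   #Y1 <- Y2
rC <- sem('Y1 ~ X
           Y2 ~ X
           Y1 ~~ Y2
           Y3 ~ Y2', dsim)   #D1 <-> D2
sapply(list(forward=rA, reversed=rB, corr_dist=rC), fitmeasures, fit.measures="chisq")  #identical chi-square values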
Limitations of replacing rules:
From Mulaik (2009)
From Hershberger (2006)
Use an index for judging model quality that is not based on global goodness-of-fit or likelihood. For example, the information complexity criterion (ICOMP) approach tries to quantify model misfit (residual covariances) relative to model complexity (correlations among parameter estimates, where greater correlation indicates complexity); see the semGOF package. Among equivalent models, the model with the lower ICOMP would be preferred.
Eliminate equivalent models that have problems during estimation, such as requiring many iterations, negative variance estimates, or parameters that must be constrained to avoid improper solutions. This is a heuristic approach.
Combine \(R^2\) measures for endogenous variables into a single measure (e.g., mean) and prefer the model with the highest value. This is also heuristic and relatively untested.
Examine extended individual case residuals (EICR). Although the residual covariance matrix is identical among equivalent models, the residuals in terms of predicted versus observed scores on manifest variables are not. Thus, according to Raykov and Penev (2001), a model that minimizes individual case residuals should be preferred over equivalents.
Consider this example:
m <- '
Happiness ~ 0.3*Money + 0.3*Love
Travel ~ 0.4*Happiness + 0.3*Money
'
d <- simulateData(m, sample.nobs=500)
#groundtruth
f1.syn <- '
Happiness ~ Money + Love
Travel ~ Happiness + Money
'
f1 <- sem(f1.syn, d)
semPaths_default(f1)
fitmeasures(f1)["chisq"]
## chisq
## 0.87
is not equivalent to this:
#not equivalent
f2.syn <- '
Happiness ~ Money + Love + Travel
Travel ~ Money
'
f2 <- sem(f2.syn, d)
semPaths_default(f2)
fitmeasures(f2)["chisq"]
## chisq
## 4.89
Why not? Love is a cause of Happiness but not of Travel, so the two endogenous variables do not have the same causes (rule 2 is not satisfied).
This, on the other hand, would be equivalent according to rule 1. The relations among exogenous variables are just identified and have unidirectional effects on other variables. Thus, we can shift the \(Money \leftrightarrow Love\) relationship from an unanalyzed association to a directed effect.
#equivalent
f3.syn <- '
Love ~ Money
Happiness ~ Money + Love
Travel ~ Happiness + Money
'
f3 <- sem(f3.syn, d, conditional.x=FALSE)
semPaths_default(f3)
fitmeasures(f3)["chisq"]
## chisq
## 0.87
#not equivalent: reversing Money -> Happiness leaves the Money-Love association unmodeled, so this model is more restrictive (note the different chi-square)
f4.syn <- '
Happiness ~ Love
Travel ~ Happiness + Money
Money ~ Happiness
'
f4 <- sem(f4.syn, d, conditional.x=FALSE)
semPaths_default(f4)
fitmeasures(f4)["chisq"]
## chisq
## 3.53
The closer the model is to saturated, the larger the number of equivalent models. When possible, try to test a relatively sparse structural model. More specifically, in saturated blocks, there is little information to adjudicate directionality. Thus, it is useful to include structural variables that are thought to be unrelated to others.
Models with many paths suffer from many equivalents. Can you test your key hypotheses in models with fewer variables?
Consider including ancillary variables that are not related to the original set in the model. In particular, consider adding antecedents (predictors) of variables in a source block such that they do not have any relationship to the focal (destination) block.
When we are considering two or more alternative models, we need to understand their relationships to each other. Models can be nested, non-nested, or equivalent.
Conceptually, model A is nested in model B if model A contains a subset of the parameters in model B. For example, if we remove a correlation between factors in one model, but estimate it in the other, then the restricted model (zero covariance between factors) is nested in the full model. Thus, it is no surprise that the correlated factors model can capture the predictions of the uncorrelated factors model because it can estimate the correlation of zero.
Non-nested models contain parameters that do not represent a part versus whole kind of relationship – that is, the parameters of one model are not a subset of the other.
And equivalent models, as we’ve seen above, have the same sample likelihood and degrees of freedom, even though they may have substantively different interpretations (e.g., reversing the direction of a directed path).
Bentler and Satorra (2010) defined a useful nesting and equivalence testing (NET) procedure:
Conduct typical SEM on Model \(M1\). Save the \(df_{M1}\) and model-implied covariance \(\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})\).
Conduct a SEM on an alternative model \(M2\) using the \(\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})\) as data (not raw). Save the \(\chi^2_{M2}\) and \(df_{M2}\).
Compute \(d = df_{M1} - df_{M2}\) and set a small criterion, \(\epsilon\), for judging whether \(\chi^2_{M2}\) is essentially zero (e.g., .001).
The primary idea is that when there are more degrees of freedom in \(M1\) than \(M2\) and the model-implied covariance matrix can be reproduced by \(M2\), then \(M1\) is nested in \(M2\).
From Brown Ch. 8, Fig 8.1. This is a measure of coping styles administered to 275 college students. There are four first-order latent variables, which reflect different ways of coping with stress: Problem Solving, Cognitive Restructuring, Express Emotions, and Social Support. The authors also anticipated that there are two higher-order factors that account for correlations among the first-order factors: Problem-Focused Coping and Emotion-Focused Coping.
rawdat <- '
1.00
0.78 1.00
0.80 0.77 1.00
0.56 0.51 0.48 1.00
0.52 0.51 0.46 0.78 1.00
0.59 0.51 0.51 0.80 0.79 1.00
0.16 0.15 0.17 0.14 0.18 0.16 1.00
0.19 0.13 0.18 0.14 0.16 0.16 0.81 1.00
0.12 0.17 0.17 0.17 0.20 0.16 0.75 0.80 1.00
0.16 0.13 0.17 0.15 0.16 0.18 0.56 0.52 0.50 1.00
0.16 0.14 0.18 0.15 0.16 0.18 0.51 0.58 0.51 0.81 1.00
0.16 0.15 0.14 0.16 0.16 0.14 0.52 0.57 0.52 0.80 0.79 1.00
'
#standard deviations of variables
sds <- as.numeric(strsplit('1.40 2.10 1.30 2.30 2.40 1.90 2.00 2.90 2.30 3.10 2.20 1.20', '\\s+', perl=TRUE)[[1]])
cormat <- getCov(rawdat)
#this is the covariance matrix to be analyzed
covmat <- lavaan::cor2cov(cormat, sds)
rownames(covmat) <- colnames(covmat) <- c("P1", "P2", "P3", "C1", "C2", "C3", "E1", "E2", "E3", "S1", "S2", "S3")
#to get Vuong test to work as expected, need raw data, not just a covmat
df <- mvrnorm(n=275, mu=rep(0, 12), Sigma=covmat, empirical = TRUE)
bmodel <- '
ProbSol =~ P1 + P2 + P3
CogRest =~ C1 + C2 + C3
ExpEmo =~ E1 + E2 + E3
SocSupp =~ S1 + S2 + S3
#explicitly name factor correlations
ProbSol ~~ CogRest
ProbSol ~~ ExpEmo
ProbSol ~~ SocSupp
CogRest ~~ ExpEmo
CogRest ~~ SocSupp
ExpEmo ~~ SocSupp
'
#l1fac_m <- sem(bmodel, sample.cov=covmat, sample.nobs = 275)
l1fac_m <- sem(bmodel, df)
summary(l1fac_m, fit.measures=TRUE, standardized=TRUE)
## lavaan 0.6-3 ended normally after 65 iterations
##
## Optimization method NLMINB
## Number of free parameters 30
##
## Number of observations 275
##
## Estimator ML
## Model Fit Test Statistic 82.970
## Degrees of freedom 48
## P-value (Chi-square) 0.001
##
## Model test baseline model:
##
## Minimum Function Test Statistic 2761.514
## Degrees of freedom 66
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.987
## Tucker-Lewis Index (TLI) 0.982
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -5642.801
## Loglikelihood unrestricted model (H1) -5601.316
##
## Number of free parameters 30
## Akaike (AIC) 11345.602
## Bayesian (BIC) 11454.106
## Sample-size adjusted Bayesian (BIC) 11358.981
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.051
## 90 Percent Confidence Interval 0.032 0.070
## P-value RMSEA <= 0.05 0.426
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.017
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ProbSol =~
## P1 1.000 1.273 0.911
## P2 1.424 0.071 20.173 0.000 1.812 0.864
## P3 0.896 0.043 20.778 0.000 1.140 0.879
## CogRest =~
## C1 1.000 2.038 0.888
## C2 1.027 0.051 19.964 0.000 2.094 0.874
## C3 0.842 0.040 21.117 0.000 1.715 0.904
## ExpEmo =~
## E1 1.000 1.743 0.873
## E2 1.542 0.072 21.395 0.000 2.688 0.929
## E3 1.133 0.059 19.043 0.000 1.974 0.860
## SocSupp =~
## S1 1.000 2.787 0.901
## S2 0.706 0.033 21.679 0.000 1.969 0.896
## S3 0.381 0.018 21.230 0.000 1.062 0.886
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ProbSol ~~
## CogRest 1.718 0.209 8.217 0.000 0.662 0.662
## ExpEmo 0.453 0.149 3.042 0.002 0.204 0.204
## SocSupp 0.695 0.237 2.928 0.003 0.196 0.196
## CogRest ~~
## ExpEmo 0.712 0.238 2.991 0.003 0.201 0.201
## SocSupp 1.146 0.381 3.011 0.003 0.202 0.202
## ExpEmo ~~
## SocSupp 3.255 0.396 8.218 0.000 0.670 0.670
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .P1 0.333 0.051 6.526 0.000 0.333 0.170
## .P2 1.111 0.130 8.560 0.000 1.111 0.253
## .P3 0.384 0.048 8.043 0.000 0.384 0.228
## .C1 1.118 0.143 7.812 0.000 1.118 0.212
## .C2 1.356 0.162 8.351 0.000 1.356 0.236
## .C3 0.654 0.093 7.010 0.000 0.654 0.182
## .E1 0.946 0.112 8.425 0.000 0.946 0.237
## .E2 1.155 0.205 5.638 0.000 1.155 0.138
## .E3 1.372 0.155 8.862 0.000 1.372 0.260
## .S1 1.805 0.242 7.448 0.000 1.805 0.189
## .S2 0.947 0.124 7.655 0.000 0.947 0.196
## .S3 0.308 0.038 8.103 0.000 0.308 0.215
## ProbSol 1.620 0.169 9.558 0.000 1.000 1.000
## CogRest 4.152 0.452 9.186 0.000 1.000 1.000
## ExpEmo 3.039 0.339 8.958 0.000 1.000 1.000
## SocSupp 7.770 0.823 9.435 0.000 1.000 1.000
semPaths_default(l1fac_m, sizeMan=7)
What about a model with a higher-order factor structure: Problem-Focused versus Emotion-Focused?
hcfa <- '
ProbSol =~ P1 + P2 + P3
CogRest =~ C1 + C2 + C3
ExpEmo =~ E1 + E2 + E3
SocSupp =~ S1 + S2 + S3
PFCope =~ ProbSol + CogRest
EFCope =~ ExpEmo + SocSupp
'
#hcfa_m <- sem(hcfa, sample.cov=covmat, sample.nobs = 275)
hcfa_m <- sem(hcfa, df)
summary(hcfa_m, fit.measures=TRUE, standardized=TRUE)
## lavaan 0.6-3 ended normally after 63 iterations
##
## Optimization method NLMINB
## Number of free parameters 29
##
## Number of observations 275
##
## Estimator ML
## Model Fit Test Statistic 83.005
## Degrees of freedom 49
## P-value (Chi-square) 0.002
##
## Model test baseline model:
##
## Minimum Function Test Statistic 2761.514
## Degrees of freedom 66
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.987
## Tucker-Lewis Index (TLI) 0.983
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -5642.819
## Loglikelihood unrestricted model (H1) -5601.316
##
## Number of free parameters 29
## Akaike (AIC) 11343.637
## Bayesian (BIC) 11448.524
## Sample-size adjusted Bayesian (BIC) 11356.570
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.050
## 90 Percent Confidence Interval 0.031 0.068
## P-value RMSEA <= 0.05 0.469
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.017
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ProbSol =~
## P1 1.000 1.273 0.911
## P2 1.424 0.071 20.174 0.000 1.812 0.864
## P3 0.896 0.043 20.777 0.000 1.140 0.879
## CogRest =~
## C1 1.000 2.038 0.888
## C2 1.028 0.051 19.966 0.000 2.094 0.874
## C3 0.842 0.040 21.115 0.000 1.715 0.904
## ExpEmo =~
## E1 1.000 1.743 0.873
## E2 1.542 0.072 21.393 0.000 2.687 0.928
## E3 1.133 0.059 19.047 0.000 1.975 0.860
## SocSupp =~
## S1 1.000 2.787 0.901
## S2 0.706 0.033 21.678 0.000 1.969 0.896
## S3 0.381 0.018 21.230 0.000 1.062 0.886
## PFCope =~
## ProbSol 1.000 0.812 0.812
## CogRest 1.609 0.422 3.813 0.000 0.816 0.816
## EFCope =~
## ExpEmo 1.000 0.826 0.826
## SocSupp 1.572 0.407 3.865 0.000 0.812 0.812
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## PFCope ~~
## EFCope 0.448 0.146 3.062 0.002 0.301 0.301
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .P1 0.333 0.051 6.525 0.000 0.333 0.170
## .P2 1.111 0.130 8.560 0.000 1.111 0.253
## .P3 0.384 0.048 8.043 0.000 0.384 0.228
## .C1 1.119 0.143 7.812 0.000 1.119 0.212
## .C2 1.355 0.162 8.350 0.000 1.355 0.236
## .C3 0.655 0.093 7.012 0.000 0.655 0.182
## .E1 0.946 0.112 8.423 0.000 0.946 0.237
## .E2 1.157 0.205 5.644 0.000 1.157 0.138
## .E3 1.372 0.155 8.859 0.000 1.372 0.260
## .S1 1.806 0.242 7.448 0.000 1.806 0.189
## .S2 0.947 0.124 7.655 0.000 0.947 0.196
## .S3 0.308 0.038 8.102 0.000 0.308 0.215
## ProbSol 0.552 0.279 1.982 0.047 0.341 0.341
## CogRest 1.388 0.721 1.925 0.054 0.334 0.334
## ExpEmo 0.968 0.530 1.826 0.068 0.319 0.319
## SocSupp 2.652 1.314 2.018 0.044 0.341 0.341
## PFCope 1.068 0.308 3.467 0.001 1.000 1.000
## EFCope 2.071 0.593 3.491 0.000 1.000 1.000
semPaths_default(hcfa_m, sizeMan=7)
We can test whether models are nested, equivalent, or not nested using Bentler and Satorra’s nesting and equivalence testing (NET) procedure. As mentioned above, this involves fitting model 1 as a candidate SEM, saving its model-implied covariance matrix, then fitting model 2 using that matrix as the input. We then examine the difference in degrees of freedom. If \(df_1 - df_2 > 0\) and the \(\chi^2\) of model 2 is essentially zero (usually < .001), then we say model 1 is nested in model 2. Note that this is an asymmetric test: model 1 may not be nested in model 2, but the converse (model 2 nested in model 1) could still hold.
Following Bentler and Satorra’s steps:
df1 <- fitmeasures(l1fac_m)["df"] #degrees of freedom for saturated factor model (no higher-order structure)
mimplied_l1fac <- fitted(l1fac_m)$cov #model implied cov
#now use the model-implied covariance as the data to analyze for hcfa_m
hcfa_m_fromcov <- sem(hcfa, sample.cov=mimplied_l1fac, sample.nobs = 275)
df2 <- fitmeasures(hcfa_m_fromcov)["df"]
d <- df1 - df2
cat("df for NET: ", d, "\n")
## df for NET: -1
chi2 <- fitmeasures(hcfa_m_fromcov)["chisq"]
cat("Chi-square for NET test: ", chi2, "\n")
## Chi-square for NET test: 0.0349
We see that the degrees of freedom difference is negative. Thus, the saturated model is not nested in the hierarchical model. This makes sense intuitively in that the hierarchical model imposes some structure on the factor model, whereas the saturated model estimates unique covariances among factors.
We can reverse the NET test (in terms of what is \(M1\) versus \(M2\)). To test whether the hierarchical CFA is nested in the saturated model, we first fit the saturated CFA, then test the hierarchical model using the model-implied covariance matrix from the saturated model.
#is the hierarchical CFA nested in the first-order model (saturated factor covariances)?
df1 <- fitmeasures(hcfa_m)["df"]
mimplied_hcfa <- fitted(hcfa_m)$cov
#now fit the first-order model (bmodel) to the model-implied covariance from hcfa_m
bmodel_fromcov <- sem(bmodel, sample.cov=mimplied_hcfa, sample.nobs = 275)
df2 <- fitmeasures(bmodel_fromcov)["df"]
d <- df1 - df2
cat("df for NET: ", d, "\n")
## df for NET: 1
chi2 <- fitmeasures(bmodel_fromcov)["chisq"]
cat("Chi-square for NET test: ", chi2, "\n")
## Chi-square for NET test: 0
Here, we see the expected result for a nested model: the df difference is positive (1) and the \(\chi^2\) of the second model (using the model-implied covariance of the first) is 0. Thus, the hierarchical CFA model is nested in the saturated first-order factor model.
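The steps above can be wrapped into a small convenience function. This is just a sketch of the manual procedure (net_manual is a hypothetical helper, not part of lavaan or semTools); it assumes the second argument is lavaan model syntax.
#sketch of a reusable wrapper for the manual NET steps above (net_manual is a
#hypothetical helper, not a lavaan or semTools function); m2_syntax is model syntax
net_manual <- function(m1_fit, m2_syntax, eps=.001) {
  sigma1 <- fitted(m1_fit)$cov                        #model-implied covariance of M1
  n <- lavInspect(m1_fit, "nobs")                     #sample size of M1
  m2_fit <- sem(m2_syntax, sample.cov=sigma1, sample.nobs=n)
  d <- fitmeasures(m1_fit)["df"] - fitmeasures(m2_fit)["df"]
  chi2 <- fitmeasures(m2_fit)["chisq"]                #~0 if M2 reproduces M1's implied covariance
  c(df_diff=unname(d), chisq_m2=unname(chi2), m1_nested_in_m2=unname(d > 0 & chi2 < eps))
}
net_manual(hcfa_m, bmodel)   #hierarchical CFA is nested in the first-order model
net_manual(l1fac_m, hcfa)    #but not the reverse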
The semTools package provides a net() function that will compute nesting/equivalence tests for all pairs of models passed in (as separate arguments). This package also provides the moreFitIndices() function, which reports the \(AIC_C\) as aic.smallN (see below for its role in model evidence).
#the easy way
semTools::net(l1fac_m, hcfa_m)
##
## If cell [R, C] is TRUE, the model in row R is nested within column C.
##
## If the models also have the same degrees of freedom, they are equivalent.
##
## NA indicates the model in column C did not converge when fit to the
## implied means and covariance matrix from the model in row R.
##
## The hidden diagonal is TRUE because any model is equivalent to itself.
## The upper triangle is hidden because for models with the same degrees
## of freedom, cell [C, R] == cell [R, C]. For all models with different
## degrees of freedom, the upper diagonal is all FALSE because models with
## fewer degrees of freedom (i.e., more parameters) cannot be nested
## within models with more degrees of freedom (i.e., fewer parameters).
##
## l1fac_m hcfa_m
## l1fac_m (df = 48)
## hcfa_m (df = 49) TRUE
semTools::moreFitIndices(l1fac_m)
## Warning in semTools::moreFitIndices(l1fac_m): AICc (aic.smallN) was
## developed for univariate linear models. It is probably not appropriate to
## use AICc to compare SEMs.
## gammaHat adjGammaHat baseline.rmsea aic.smallN bic.priorN
## 9.79e-01 9.66e-01 3.85e-01 1.14e+04 1.15e+04
## hqc sic
## 1.14e+04 1.14e+04
“All models are wrong but some are useful” - George Box
Having established whether two models are nested, not nested, or equivalent, we can move on to the important consideration of model evidence. In the case of equivalent models, relative evidence cannot be based on global fit given that both models will have the same likelihood and same number of parameters (see above for potential solutions to this problem). But in the case of non-nested and nested models, we would like to be able to make a statement such as, “Model A fits the data significantly better than Model B, p < .01,” or, “The evidence for Model A is ten times stronger than the evidence for Model B.”
The first statement concerns a binary judgment about model fit – that is, we are interested in testing the null hypothesis that two models fit the data equally well. We have seen this approach in previous weeks when using the likelihood ratio test (LRT) – aka the model chi-square difference test. In the LRT approach, we can compare two nested models (using the anova function in lavaan) to test the null hypothesis that the model with fewer parameters (the nested model) fits the data as well as the full model. If we reject this null (typically \(p < .05\)), then we conclude that the nested model fits significantly worse than the full model.
In our example above, we learned from NET (and could figure out intuitively) that the models are nested. Does the hierarchical factor model (nested) fit significantly worse than the first-order factor model (full)?
anova(l1fac_m, hcfa_m)
| | Df | AIC | BIC | Chisq | Chisq diff | Df diff | Pr(>Chisq) |
|---|---|---|---|---|---|---|---|
| l1fac_m | 48 | 11346 | 11454 | 83 | NA | NA | NA |
| hcfa_m | 49 | 11344 | 11449 | 83 | 0.035 | 1 | 0.852 |
No, we do not reject the null, meaning that we conclude that the nested model (hierarchical factors) does not fit significantly worse than the full model (saturated associations among first-order factors), \(p = .85\). Thus, even though the hierarchical factor model has one less free parameter (and thus, one additional degree of freedom), it does not fit significantly worse. Typically, in this case, we would retain the simpler model given that its fit is not worse.
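For reference, the same test can be computed by hand: the LRT statistic is the difference in \(-2\) log-likelihood (equivalently, the difference in model chi-square values), referred to a chi-square distribution with degrees of freedom equal to the difference in free parameters.
#LRT by hand (this is what anova() just computed): difference in -2*logLik, referred
#to a chi-square distribution with df equal to the difference in free parameters
lrt <- as.numeric(-2 * (logLik(hcfa_m) - logLik(l1fac_m)))
df_diff <- as.numeric(fitmeasures(hcfa_m)["df"] - fitmeasures(l1fac_m)["df"])
pchisq(lrt, df=df_diff, lower.tail=FALSE)   #matches the p-value from anova() above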
If we are interested in binary comparisons of non-nested models, we can usually use a Vuong test, which provides a test of the null hypothesis that one model fits the data as well as an alternative. See this paper from Ed Merkle: https://arxiv.org/pdf/1402.6720.pdf, and the corresponding nonnest2 package in R, which plays well with lavaan. Although we know that our models are nested, we can still apply the Vuong test (using nested=TRUE), which yields this output. Note that if we were comparing non-nested models, we would keep nested=FALSE!
nonnest2::vuongtest(l1fac_m, hcfa_m, nested=TRUE)
##
## Model 1
## Class: lavaan
## Call: lavaan::lavaan(model = bmodel, data = df, model.type = "sem", ...
##
## Model 2
## Class: lavaan
## Call: lavaan::lavaan(model = hcfa, data = df, model.type = "sem", int.ov.free = TRUE, ...
##
## Variance test
## H0: Model 1 and Model 2 are indistinguishable
## H1: Model 1 and Model 2 are distinguishable
## w2 = 0.000, p = 0.847
##
## Robust likelihood ratio test of distinguishable models
## H0: Model 2 fits as well as Model 1
## H1: Model 1 fits better than Model 2
## LR = 0.035, p = 0.862
Here, the binary judgments are the same as the LRT above – the hierarchical factor model does not fit significantly worse than the first-order model.
These, however, are binary judgments. What about the evidence in favor of the hierarchical model versus the first-order model? This is a dimensional question and could be framed in terms of the probability that one model is the best in the set. For example, is the evidence in favor of the hierarchical factor model three times greater than that for the first-order model?
To make judgments about relative evidence, we need a measure of how well one model fits compared to another (or many others).
The goal of most statistical modeling is to approximate the key factors that generated our empirical data. The data generation processes are usually complex and largely unknown, so our models will almost always be deficient in many ways. Nevertheless, we may wish to know how close one model is to the unknown truth relative to a comparator model. This is the substance of model selection theory – i.e., how one selects among models based on relative evidence.
Kullback-Leibler information quantifies the information lost when a simpler model, g, is used to approximate the true model, f:
\[ I(f,g) = \int{f(x) \textrm{log} \left( \frac{f(x)}{g(x|\theta)} \right) dx} \] This can also be thought of as the distance between our model, g, and the unknown population model, f.
In the simplest case, we can think of trying to identify a probability distribution (e.g., Gaussian) that best describes a given empirical sample. Recall from the M-L lecture our attempt to fit Gaussian data using a Gamma distribution. The sample likelihood for the Gamma, even at its optimized (fitted) parameters, can never be as high as that of a Gaussian fit.
ML Gamma fit
X <- rnorm(5000, mean=50, sd=sqrt(50))
gfit <- suppressWarnings(fitdistr(X, "gamma"))
gfit$loglik
## [1] -16981
datafromgam <- rgamma(n=5000, shape=gfit$estimate["shape"], rate=gfit$estimate["rate"])
ML Gaussian fit
gfit <- suppressWarnings(fitdistr(X, "normal"))
gfit$loglik
## [1] -16959
datafromgauss <- rnorm(n=5000, mean=gfit$estimate["mean"], sd=gfit$estimate["sd"])
df <- data.frame(model=rep(c("groundtruth", "gamma", "gauss"), each=5000),
v=c(X, datafromgam, datafromgauss))
ggplot(df, aes(x=v, color=model)) + geom_density()
The sample LL is an approximation of a K-L distance between a model and ‘truth’, but read on to AIC below…
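Because both densities are known in this toy example, the K-L information can also be computed directly by numerical integration. In the sketch below, g is taken to be a Gamma distribution moment-matched to the generating Gaussian (mean 50 and variance 50 imply shape = 50, rate = 1); this is an illustrative choice, not the ML-fitted Gamma from above.
#numerical K-L information between the generating Gaussian f and a moment-matched
#Gamma approximation g (shape=50, rate=1 gives mean 50 and variance 50)
f <- function(x) dnorm(x, mean=50, sd=sqrt(50))
g <- function(x) dgamma(x, shape=50, rate=1)
kl_integrand <- function(x) f(x) * log(f(x) / g(x))
#restrict the range to positive x (the Gamma density is zero at or below 0)
integrate(kl_integrand, lower=1, upper=150)   #small positive value: information lost using g for f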
In general, we do not know the true model, which means we cannot compute the exact K-L distance. Fortunately, however, we are often interested in relative distance – that is, how much better is model A than model B? In this circumstance, the K-L distance derivation includes two terms: one is based on the expectation under the ‘true’ model (can’t compute that!), the other is based on the relative distance between the model and the truth.
Since we can’t compute the first term, it means we do not have a measure of absolute fit using likelihood-based and K-L methods. But, because we can compute the relative distance, we can also compute the relative distance between two or more models. As a result, we can compare relative evidence across a set of candidate models. All may be bad, or all may be good, by some absolute criterion, but we can ascertain which is relatively best. Global fit criteria in SEM often give some information about the absolute fit (esp. residual-based methods). Likelihood methods (K-L distance) are more about relative evidence.
K-L distances are on a true ratio scale (i.e., a meaningful zero point and equal interval spacing). As a result, differences in distances can be compared, and a zero K-L difference implies identical relative fit. Moreover, because they are estimated from the sample log-likelihood, which sums over observations, K-L (and thus AIC) differences among models tend to be larger in large samples than in small ones. This does not mean that the scaling or interpretation changes! Rather, it underscores that discriminating among models becomes easier in large samples.
The derivation of the AIC was a major breakthrough in information theory and statistics because it provides a statistical link between K-L distance (a more theoretical notion) and maximum likelihood-based methods.
The AIC provides an approximation of the relative distance of the fitted model and the unknown (true) mechanisms that generated the data. Importantly, it is a large sample criterion (based on asymptotic properties, so 1000s, ideally) and is also not especially meaningful if the models in question are simply inappropriate or poor (e.g., fitting a binary outcome with a continuous model).
\[ AIC = -2 \textrm{log} (\mathcal{L}(\hat{\theta}|y)) + 2k \] where \(k\) is the number of free parameters, and \(\mathcal{L}(\hat{\theta}|y)\) is the sample likelihood.
Note that by multiplying the LL by -2, the criterion is now scaled such that lower values are better. By contrast, for LL, higher values indicate better fit (greater likelihood of the data given the model).
Because AIC is based on a large sample derivation (i.e., its statistical properties are ideal as the sample size goes to infinity), it may select overly complex models in smaller samples (think hundreds, not thousands). This led to the derivation of a variant of AIC, usually denoted \(AIC_C\) (Sugiura, 1978), that includes an additional penalty for sample size:
\[ AIC_C = -2 \textrm{log} (\mathcal{L}(\hat{\theta}|y)) + 2k + \frac{2k(k+1)}{n - k -1} \]
When your cases:parameters ratio is below 40 (which it usually is!), Burnham and Anderson recommend this over AIC. But this correction was derived for univariate (one-DV) regression, not SEM. There is no SEM-specific equivalent (that I know of), though there is also no specific reason to think the correction above is badly off. Nevertheless, there is (to my knowledge) no consensus on whether \(AIC_C\) should be preferred to the standard \(AIC\) in SEM.
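As a sanity check, both criteria are easy to compute by hand from a fitted lavaan model. The sketch below applies the univariate \(AIC_C\) formula to the first-order factor model, treating \(n\) as the number of observations; as noted, its appropriateness for SEM is debatable.
#AIC and AICc by hand for the first-order factor model (a sketch using the formulas above)
ll <- as.numeric(logLik(l1fac_m))              #sample log-likelihood
k <- as.numeric(fitmeasures(l1fac_m)["npar"])  #number of free parameters
n <- as.numeric(lavInspect(l1fac_m, "nobs"))   #number of observations
aic <- -2*ll + 2*k                             #matches AIC(l1fac_m) and the summary output above
aicc <- aic + (2*k*(k + 1)) / (n - k - 1)      #small-sample correction, as in the univariate formula
c(AIC=aic, AICc=aicc)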
One of the most important considerations when comparing models is that the set of endogenous variables is the same and on the same scale! This is often misunderstood in applied papers and can lead to bizarre circumstances where AICs differ by thousands of points. That’s not impossible, but more likely to reflect differences in the likelihood function that render models non-comparable using LL-based methods (incl. AIC/BIC). For example, if you log-transform some of the variables, their contribution to the sample likelihood will be remarkably different, rendering models with and without that transformation incomparable (at least in likelihood terms).
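A tiny illustration of this point, using hypothetical data and plain lm() for simplicity: the likelihood (and therefore the AIC) is defined on the scale of the outcome, so a model for y and a model for log(y) are not comparable.
#illustration with hypothetical data: transforming the outcome changes the likelihood
#scale, so the two AICs below cannot be compared to choose between the models
set.seed(1001)
x <- rnorm(300)
y <- exp(1 + 0.5*x + rnorm(300, sd=0.4))   #positive, skewed outcome
AIC(lm(y ~ x))        #likelihood computed for y
AIC(lm(log(y) ~ x))   #likelihood computed for log(y): a different endogenous variable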
Recall that the sample likelihood is computed on the endogenous variables. Thus, if the endogenous set changes (e.g., changing the direction of an arrow), LL methods cannot necessarily be compared. In lavaan, this relates to the conditional.x option (see ?lavOptions). If it is FALSE, then model-implied statistics, including the LL, reflect both exogenous and endogenous variables. This is likely what you’d want in cases where you’re testing alternative models with path direction changes that would make an endogenous variable exogenous, or vice versa.
If you’re not sure, check the unrestricted (saturated) LL for the two candidate models. If they are not identical, you cannot use AIC/BIC to compare them. You might also check your setting of conditional.x in lavaan. Fortunately, lavaan will usually provide a warning: “WARNING: some models are based on a different set of observed variables.” If you see that, watch out! Your models are almost certainly not comparable, either in terms of the LRT (nested) or AIC/BIC (non-nested) comparisons, since both are based on the sample likelihood.
Note that you will come across many “IC” (Information Criterion) measures in SEM and statistics, including the Bayesian information criterion (BIC). A discussion of these measures is beyond the scope of the class, but it’s worth mentioning that they have different statistical properties as relates to model selection (i.e., choosing which model fits best). Virtually all of the IC-based approaches build on the sample log-likelihood and are concerned with the fit-parsimony tradeoff. That is, we are interested in identifying a model that is as parsimonious as possible, while also fitting the data well.
We can always improve fit – that is, the sample likelihood – by adding additional free parameters. Even if the parameters are not useful (e.g., adding a predictor whose p-value is .5), they can only help fit, even if trivially. This is akin to the motivation for adjusted \(R^2\) in regression: an added predictor can only increase the unadjusted \(R^2\), so we penalize for model complexity. Thus, we do not want to prioritize the sample likelihood alone. Instead, we would like to prefer models that have the fewest free parameters possible. But if we delete a free parameter that has a huge effect in the dataset (e.g., a big regression path), the fit will worsen substantially. This is the dilemma of fit versus parsimony, and the motivation for measures such as AIC and BIC that try to balance these considerations.
(Some quotes from Hallquist & Pilkonis, 2010)
AIC is an asymptotically efficient criterion for model selection, which means that it tends to select the model that minimizes prediction error as sample size increases. More generally, AIC is an information criterion primarily concerned with minimizing the relative Kullback-Leibler (K-L) distance between a statistical model and the unknown mechanisms giving rise to the observed data (Claeskens & Hjort, 2008).
An underlying assumption of AIC is that the “truth” of the data is very complex (i.e., many causal factors of varying strengths) and that no one “true model” exists (Burnham & Anderson, 2002). Thus, AIC is useful for selecting the best approximate model using a finite number of parameters when “truth” is countably infinite.
By contrast, the BIC originates from the Bayesian tradition in statistics and is concerned with the statistical property of consistency, which refers to the one “true model” being selected with increasing probability as sample size increases (Yang, 2005). Underlying the property of consistency are several assumptions worth examining: (1) a single “true” model exists that best represents the mechanisms that gave rise to the observed data, (2) one of the models in the set of models tested is the “true” model, and (3) the focus of model selection is to identify the “true” model rather than to provide a close approximation to the truth. Given these assumptions, Burnham and Anderson (2002) have argued against the use of consistent model selection criteria in the biological, medical, and social sciences because such data are highly complex, there is a low likelihood of including the “true model” in any set of models considered, and it is unlikely that a single “true model” exists.
Summarizing extensive simulation work, Burnham and Anderson (2002) offer useful rules-of-thumb for inferring the degree of certainty afforded to a particular model relative to the best fitting model in the set:
Moreover, AIC values for a set of models can be subtracted from the model with the lowest AIC value and these difference scores can be transformed into a set of Akaike weights that represent the evidential weight of a particular model relative to all other models in the set (Burnham & Anderson, 2002). The important general point is that conclusions drawn from latent variable models would benefit from being contextualized in terms of model selection uncertainty and the degree of support for one model over alternatives.
For details on Akaike weights or evidence ratios, I would recommend checking out the AICcmodavg package and Burnham and Anderson’s 2002 book. Briefly, the Akaike weight is the probability that model \(i\) is the Kullback-Leibler best model in the set of models we’ve tested. A K-L best model just means that its relative distance from the unknown truth is shortest, and thus its relative evidence is strongest; the weight conveys the uncertainty in this judgment.
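A by-hand sketch of the weights, using plain AIC for simplicity (aictab() below uses \(AIC_C\), so its numbers will differ slightly): subtract the minimum AIC, exponentiate, and normalize.
#Akaike weights by hand, using plain AIC (aictab() below reports AICc, so its numbers differ slightly)
aics <- c(first_order=AIC(l1fac_m), higher_order=AIC(hcfa_m))
delta <- aics - min(aics)                  #difference from the best (lowest-AIC) model
w <- exp(-delta/2) / sum(exp(-delta/2))    #probability that each model is the K-L best in the set
w
w["higher_order"] / w["first_order"]       #evidence ratio for the higher-order model
The aictab() function automates this (its output below is based on \(AIC_C\)):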
aa <- aictab(list(first_order=l1fac_m, higher_order=hcfa_m))
print(aa)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## higher_order 29 11351 0.00 0.78 0.78 -5643
## first_order 30 11353 2.49 0.22 1.00 -5643
Returning to our first-order versus hierarchical factor model example, we see that the difference in \(AIC_C\) between these models is 2.49, which falls into the weak range of Burnham and Anderson’s rules. The Akaike weights sum to 1.0 and reveal that the probability that the hierarchical model is the best in the set of tested models (here, two) is .78. A 78% probability isn’t terrible, but it’s not definitive either – we should have some uncertainty about whether the hierarchical model is best. The fact that the hierarchical model is also more parsimonious may tilt our conclusions further in that direction.
Furthermore, note that AIC weights do not consider the important possibility that there is another, untested model that would capture the data far better. Thus, we must always remember that these measures provide relative evidence of one model versus a set of tested alternatives, but not versus alternatives we failed to consider!
Finally, we can take the ratios of AIC weights as an evidence ratio, which quantifies the ratio in favor of one model (with lower AIC) compared to another (higher AIC).
evidence(aa)
##
## Evidence ratio between models 'higher_order' and 'first_order':
## 3.47
Here, the evidence in favor of the higher-order factor model is 3.47 times that for the first-order factor model. Many statisticians would prefer that such ratios (closely related to Bayes factors) be 10 or higher, minimally, to make strong claims that one model is better than another. But here, we could contextualize the evidence for the hierarchical model as “moderately” stronger than that for the alternative, first-order model.