library(haven)
data <- read_dta("parcial.dta", encoding = "UTF-8")
library(Hmisc)
library(knitr)
kable(label(data), caption = "Items and their meaning", col.names = c("Meaning"),
table.attr = "style='width:150%;'")
| Item | Meaning |
|---|---|
| facsex | Sex of the faculty member |
| facethn | Race of the faculty member |
| facnat | Nationality of the faculty member |
| facrank | Academic rank |
| salary | Faculty member's salary |
| yrsteach | Years of teaching |
| yrsut | Years at the university |
| degree | Highest academic degree attained |
| nstud | Number of students in the class |
| studrank | Academic level of the students |
Below is the distribution of the items and the count of missing values in each one:
summary(data)
facsex facethn facnat facrank
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
Median :1.000 Median :1.000 Median :1.000 Median :2.000
Mean :1.412 Mean :1.594 Mean :1.171 Mean :2.319
3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:3.000
Max. :2.000 Max. :5.000 Max. :2.000 Max. :4.000
salary yrsteach yrsut degree
Min. :16.00 Min. : 0.00 Min. : 0.000 Min. :1.00
1st Qu.:25.00 1st Qu.: 5.00 1st Qu.: 1.000 1st Qu.:3.00
Median :32.00 Median : 8.00 Median : 3.000 Median :3.00
Mean :30.93 Mean : 9.82 Mean : 5.585 Mean :3.12
3rd Qu.:34.00 3rd Qu.:11.75 3rd Qu.: 9.000 3rd Qu.:3.00
Max. :60.00 Max. :27.00 Max. :23.000 Max. :4.00
NA's :91 NA's :15
nstud studrank
Min. : 3.00 Min. :1.000
1st Qu.: 19.00 1st Qu.:2.000
Median : 43.00 Median :3.000
Mean : 76.46 Mean :2.921
3rd Qu.:103.00 3rd Qu.:4.000
Max. :256.00 Max. :5.000
NA's :1
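The missing-value counts can also be read off directly per column; a quick sketch (not part of the original output):
# Count missing values per item, complementing the summary above
colSums(is.na(data))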
boxplot(data[-1], las = 2, col = "skyblue")
title("boxplot de cada Item")
In this case we are working with mixed data. There are now specialized techniques for this situation, known as Factor Analysis for Mixed Data (FAMD); see François Husson et al. On this occasion, however, we choose to scale the data (center each variable and divide it by its standard deviation). This addresses the fact that the analysis techniques would otherwise give more weight to the variables with larger variance.
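As an aside, a minimal sketch of how FAMD could be run with the FactoMineR package is shown below; which items are treated as categorical and the restriction to complete cases are assumptions, and this is not used in the rest of the analysis.
# Sketch only: FAMD handles quantitative and qualitative variables jointly
library(FactoMineR)
mixed <- as.data.frame(data[complete.cases(data), ])  # FAMD does not accept missing values
catvars <- c("facsex", "facethn", "facnat", "facrank", "degree", "studrank")  # assumed categorical items
mixed[catvars] <- lapply(mixed[catvars], haven::as_factor)  # convert labelled codes to factors
famd_fit <- FAMD(mixed, ncp = 3, graph = FALSE)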
datos <- scale(data, center = T, scale = T)
summary(datos)
facsex facethn facnat facrank
Min. :-0.8355 Min. :-0.519 Min. :-0.4533 Min. :-1.3796
1st Qu.:-0.8355 1st Qu.:-0.519 1st Qu.:-0.4533 1st Qu.:-0.3339
Median :-0.8355 Median :-0.519 Median :-0.4533 Median :-0.3339
Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 1.1936 3rd Qu.:-0.519 3rd Qu.:-0.4533 3rd Qu.: 0.7117
Max. : 1.1936 Max. : 2.977 Max. : 2.1997 Max. : 1.7574
salary yrsteach yrsut degree
Min. :-2.1097 Min. :-1.4627 Min. :-0.9085 Min. :-3.927
1st Qu.:-0.8377 1st Qu.:-0.7179 1st Qu.:-0.7459 1st Qu.:-0.223
Median : 0.1516 Median :-0.2710 Median :-0.4205 Median :-0.223
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000
3rd Qu.: 0.4343 3rd Qu.: 0.2875 3rd Qu.: 0.5556 3rd Qu.:-0.223
Max. : 4.1090 Max. : 2.5591 Max. : 2.8331 Max. : 1.629
NA's :91 NA's :15
nstud studrank
Min. :-0.9263 Min. :-1.52461
1st Qu.:-0.7246 1st Qu.:-0.73110
Median :-0.4219 Median : 0.06241
Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.3346 3rd Qu.: 0.85592
Max. : 2.2637 Max. : 1.64943
NA's :1
boxplot(datos, las = 2, col = "skyblue")
title("boxplot de las variables escaladas")
We are asked to generate a factor that is a linear combination of the variables.
For this we decided to use Principal Component Analysis (PCA), which creates components precisely as linear combinations of the original variables. Since a single factor is requested, we extract one component directly. Regarding missing data, we keep only the complete cases.
# Principal Components Analysis entering raw data and extracting PCs from the
# correlation matrix
fit <- princomp(datos[complete.cases(datos), ], cor = TRUE)
summary(fit) # print variance accounted for
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.6990407 1.3327605 1.1730617 1.0430382 0.96240789
Proportion of Variance 0.2886739 0.1776251 0.1376074 0.1087929 0.09262289
Cumulative Proportion 0.2886739 0.4662990 0.6039063 0.7126992 0.80532211
Comp.6 Comp.7 Comp.8 Comp.9
Standard deviation 0.78195557 0.74445206 0.59761518 0.54289101
Proportion of Variance 0.06114545 0.05542089 0.03571439 0.02947307
Cumulative Proportion 0.86646756 0.92188845 0.95760284 0.98707591
Comp.10
Standard deviation 0.35950097
Proportion of Variance 0.01292409
Cumulative Proportion 1.00000000
loadings(fit) # pc loadings
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
facsex 0.555 -0.188 0.240 0.203 0.672 -0.269
facethn -0.193 -0.213 0.575 0.719 0.233
facnat 0.628 -0.288 0.454 -0.474 0.144 -0.237 0.106
facrank 0.456 -0.242 0.272 0.272 0.342 0.637
salary 0.436 -0.232 0.129 0.225 -0.220 -0.228 0.355 -0.656
yrsteach 0.513 -0.227 0.106 0.266 -0.225 -0.238 -0.212
yrsut 0.517 0.127 -0.163 0.297 -0.318 0.228
degree -0.264 0.680 0.566 -0.299 -0.202
nstud -0.163 -0.569 -0.249 -0.102 -0.263 0.219 -0.654 0.153
studrank 0.341 0.396 0.423 -0.538 -0.156 -0.287 -0.347 0.157
Comp.10
facsex 0.135
facethn
facnat
facrank 0.230
salary -0.174
yrsteach 0.667
yrsut -0.661
degree
nstud
studrank
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Proportion Var 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
Cumulative Var 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Comp.9 Comp.10
SS loadings 1.0 1.0
Proportion Var 0.1 0.1
Cumulative Var 0.9 1.0
plot(fit, type = "lines") # scree plot
abline(h = 1, lty = 2)
biplot(fit)
eig <- eigen(cor(datos[complete.cases(datos), ]))
eig$values
[1] 2.8867392 1.7762505 1.3760736 1.0879288 0.9262289 0.6114545 0.5542089
[8] 0.3571439 0.2947307 0.1292409
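For reference, the share of variance carried by the first component can be read off the eigenvalues, and the requested single factor would simply be the scores on Comp.1; a minimal sketch (assuming the princomp fit above):
# Proportion of total variance explained by the first component
round(eig$values[1] / sum(eig$values), 3)
# The single requested factor: scores on the first principal component
pc1 <- fit$scores[, 1]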
We see that the proportion of variance explained by the first principal component is barely 29%, so this solution is not advisable.
It is believed that three factors exist:
To evaluate this hypothesis we perform a confirmatory factor analysis.
Model design
library(lavaan)
model <- "
# latent variable definitions
academico =~ yrsut + degree + facrank + salary + yrsteach
demografico =~ facsex + facethn
estudiantes =~ nstud + studrank
# covariances between the latent factors are estimated freely by cfa()
# residual covariances
"
models <- list()
models$m1 <- cfa(model, data = datos)
Warning in lavaan::lavaan(model = model, data = datos, model.type =
"cfa", : lavaan WARNING: model has NOT converged!
# summary(models$m1, fit.measures = TRUE, modindices = TRUE)
# fitMeasures(models$m1) standardizedsolution(models$m1)
# Plot path diagram:
library(semPlot)
semPaths(models$m1, curvePivot = TRUE)
Warning in qgraph(Edgelist, labels = nLab, bidirectional = Bidir, directed
= Directed, : The following arguments are not documented and likely not
arguments of qgraph and thus ignored: loopRotation; residuals; residScale;
residEdge; CircleEdgeEnd
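Since the model did not converge, a quick convergence check and a common refit attempt are sketched below; the use of std.lv = TRUE is an assumption, not part of the original analysis.
# Verify the convergence status of the fitted model
lavInspect(models$m1, "converged")
# A common troubleshooting step: identify the factors by fixing their variances
# to 1 instead of fixing the first loading
models$m2 <- cfa(model, data = datos, std.lv = TRUE)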
Since the confirmatory model did not converge, we carry out an exploratory factor analysis instead. We use the principal axis method. An R-type EFA is performed on the Pearson correlation matrix of the 10 scaled variables. As for missing data, the correlations are estimated by deleting missing values pairwise rather than dropping the entire row.
Balanced samples are assumed.
MSA and KMO sampling adequacy criteria
library(REdaS)
x <- KMOS(datos, use = "pair")
print(x, stats = "both", sort = T)
Kaiser-Meyer-Olkin Statistics
Call: KMOS(x = datos, use = "pair")
Measures of Sampling Adequacy (MSA):
facethn studrank facnat facsex nstud yrsteach degree
0.2985855 0.4195465 0.4551807 0.4864868 0.4978905 0.5292224 0.6323156
yrsut facrank salary
0.6390588 0.6829275 0.6914431
KMO-Criterion: 0.575367
# x <- KMOS(data[, -c(1, 10)]); print(x, stats = "both", sort = T)
# x <- KMOS(data[, -c(1, 2, 10)]); print(x, stats = "both", sort = T)
We see that some items have an MSA index considered unacceptable (< .5).
x <- KMOS(datos[, -c(2)], use = "pair")
print(x, stats = "both", sort = T)
Kaiser-Meyer-Olkin Statistics
Call: KMOS(x = datos[, -c(2)], use = "pair")
Measures of Sampling Adequacy (MSA):
studrank facnat facsex yrsteach nstud degree yrsut
0.4102247 0.5370122 0.5748505 0.5759289 0.5838254 0.6153991 0.6471897
facrank salary
0.6736650 0.6937381
KMO-Criterion: 0.6207431
x <- KMOS(datos[, -c(2, 10)], use = "pair")
print(x, stats = "both", sort = T)
Kaiser-Meyer-Olkin Statistics
Call: KMOS(x = datos[, -c(2, 10)], use = "pair")
Measures of Sampling Adequacy (MSA):
facsex facnat yrsteach degree nstud yrsut facrank
0.5257203 0.5342228 0.5829899 0.6196436 0.6399428 0.6453008 0.6854796
salary
0.7336219
KMO-Criterion: 0.6373098
After removing the variables facethn and studrank, the MSA indices are at least acceptable, although none reaches the meritorious level. The KMO is middling. We will carry out the EFA with these 8 variables.
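A Bartlett test of sphericity could complement the adequacy check; a minimal sketch with REdaS::bart_spher (not part of the original analysis):
# Bartlett's test of sphericity on the 8 retained variables, pairwise complete observations
bart_spher(datos[, -c(2, 10)], use = "pairwise.complete.obs")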
# Determine Number of Factors to Extract
library(nFactors)
mcor <- cor(datos[, -c(2, 10)], method = "pearson", use = "pair")  # Pearson correlations, pairwise deletion
ev <- eigen(mcor) # get eigenvalues
ap <- parallel(subject = nrow(datos[, -c(2, 10)]), var = ncol(datos[, -c(2,
10)]), rep = 100, cent = 0.05, model = "factors")
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea, model = "factors")
plotnScree(nS)
abline(h = 1, col = "black", lty = 1)
print("Autovalores:")
[1] "Autovalores:"
ev$values
[1] 3.15405050 1.55008585 1.21240263 0.74478238 0.51816690 0.48861058
[7] 0.23217308 0.09972808
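The nScree object also tabulates how many factors each rule suggests; a quick check (assuming the nS object created above):
# Factors suggested by the optimal coordinates, acceleration factor,
# parallel analysis, and Kaiser rules
nS$Components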
Three eigenvalues are greater than 1, so we will use three factors.
fit <- fa(datos[, -c(2, 10)], nfactors = 3, fm = "pa", rotate = "none")
print(fit, digits = 2, cutoff = 0.4, sort = TRUE)
Factor Analysis using method = pa
Call: fa(r = datos[, -c(2, 10)], nfactors = 3, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
item PA1 PA2 PA3 h2 u2 com
yrsut 6 0.89 -0.07 0.19 0.83 0.17 1.1
yrsteach 5 0.81 0.10 0.53 0.95 0.05 1.7
facrank 3 0.80 0.18 -0.30 0.76 0.24 1.4
salary 4 0.70 0.23 -0.21 0.59 0.41 1.4
facsex 1 -0.14 -0.56 0.17 0.36 0.64 1.3
nstud 8 -0.30 0.55 -0.24 0.45 0.55 1.9
degree 7 -0.30 0.30 0.21 0.22 0.78 2.8
facnat 2 -0.36 0.42 0.53 0.59 0.41 2.7
PA1 PA2 PA3
SS loadings 2.90 0.99 0.86
Proportion Var 0.36 0.12 0.11
Cumulative Var 0.36 0.49 0.59
Proportion Explained 0.61 0.21 0.18
Cumulative Proportion 0.61 0.82 1.00
Mean item complexity = 1.8
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 28 and the objective function was 3.65 with Chi Square of 1288.14
The degrees of freedom for the model are 7 and the objective function was 0.26
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.08
The harmonic number of observations is 328 with the empirical chi square 26.09 with prob < 0.00048
The total number of observations was 357 with Likelihood Chi Square = 91.09 with prob < 7.4e-17
Tucker Lewis Index of factoring reliability = 0.732
RMSEA index = 0.185 and the 90 % confidence intervals are 0.151 0.218
BIC = 49.95
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy
PA1 PA2 PA3
Correlation of scores with factors 0.97 0.82 0.90
Multiple R square of scores with factors 0.94 0.67 0.81
Minimum correlation of possible factor scores 0.87 0.33 0.63
Although some communalities are small, there are several sizeable loadings. Assuming the three factors are correlated, we will apply an oblique "cluster" rotation.
fit <- fa(datos[, -c(2, 10)], nfactors = 3, fm = "pa", rotate = "cluster")
print(fit, digits = 2, cutoff = 0.4, sort = TRUE)
Factor Analysis using method = pa
Call: fa(r = datos[, -c(2, 10)], nfactors = 3, rotate = "cluster",
fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
item PA1 PA3 PA2 h2 u2 com
yrsteach 5 0.91 0.31 -0.15 0.95 0.05 1.3
yrsut 6 0.85 -0.08 -0.18 0.83 0.17 1.1
facrank 3 0.76 -0.29 0.26 0.76 0.24 1.5
salary 4 0.70 -0.16 0.28 0.59 0.41 1.4
facnat 2 -0.07 0.78 0.22 0.59 0.41 1.2
degree 7 -0.13 0.43 0.23 0.22 0.78 1.7
nstud 8 -0.12 0.25 0.65 0.45 0.55 1.4
facsex 1 -0.32 -0.22 -0.62 0.36 0.64 1.8
PA1 PA3 PA2
SS loadings 2.75 1.03 0.97
Proportion Var 0.34 0.13 0.12
Cumulative Var 0.34 0.47 0.59
Proportion Explained 0.58 0.22 0.20
Cumulative Proportion 0.58 0.80 1.00
With factor correlations of
PA1 PA3 PA2
PA1 1.00 -0.14 -0.18
PA3 -0.14 1.00 -0.28
PA2 -0.18 -0.28 1.00
Mean item complexity = 1.4
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 28 and the objective function was 3.65 with Chi Square of 1288.14
The degrees of freedom for the model are 7 and the objective function was 0.26
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.08
The harmonic number of observations is 328 with the empirical chi square 26.09 with prob < 0.00048
The total number of observations was 357 with Likelihood Chi Square = 91.09 with prob < 7.4e-17
Tucker Lewis Index of factoring reliability = 0.732
RMSEA index = 0.185 and the 90 % confidence intervals are 0.151 0.218
BIC = 49.95
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy
PA1 PA3 PA2
Correlation of scores with factors 0.98 0.89 0.85
Multiple R square of scores with factors 0.96 0.79 0.72
Minimum correlation of possible factor scores 0.91 0.58 0.43
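To visualize the rotated structure, the loading diagram could be drawn; a sketch using psych::fa.diagram, not part of the original output:
# Path-style diagram of the rotated three-factor solution, hiding loadings below .4
fa.diagram(fit, cut = 0.4)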
Although the communalities of the last three variables are small, their loadings are sizeable. We do not remove these variables; we only note that the factor solution does not explain them very well.
Three factors are thus defined:
PA1: Academic level and experience. The best-represented variable is years of teaching and the least-represented is salary.
PA2: Sex and number of students, indicating a preference of the students for sex = 1.
PA3: Nationality and academic degree, with a positive correlation between them and the factor.
Of the three techniques used, the third is the most advisable: it reveals an unexpected structure and best represents the relationships among the variables, for example the relationship between sex and number of students, and it discards variables that are not needed, such as race and the students' level.
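If the three factors are needed for later analyses, their scores can be taken from the psych fit object; a minimal sketch (assuming the rotated fa fit above):
# Regression-based factor scores for the retained three-factor solution
scores <- fit$scores
head(round(scores, 2))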