Descriptivo previo

library(haven)
data <- read_dta("parcial.dta", encoding = "UTF-8")

library(Hmisc)
library(knitr)
kable(label(data), caption = "Items y sus significado", col.names = c("Significado"), 
    table.attr = "style='width:150%;'")
Items y sus significado
Significado
facsex Sexo de los docentes
facethn Raza de los docentes
facnat Nacionalidad de los docentes
facrank Rango académico
salary Salario del docente
yrsteach Años de enseñanza
yrsut Años en la universidad
degree Maximo grado academico alcanzado
nstud Número de estudiantes en la clase
studrank Nivel académico de los estudiantes

A continuación se presenta la distribución de los items y el conteo de datos faltantes en cada uno:

summary(data)
     facsex         facethn          facnat         facrank     
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
 Median :1.000   Median :1.000   Median :1.000   Median :2.000  
 Mean   :1.412   Mean   :1.594   Mean   :1.171   Mean   :2.319  
 3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:3.000  
 Max.   :2.000   Max.   :5.000   Max.   :2.000   Max.   :4.000  
                                                                
     salary         yrsteach         yrsut            degree    
 Min.   :16.00   Min.   : 0.00   Min.   : 0.000   Min.   :1.00  
 1st Qu.:25.00   1st Qu.: 5.00   1st Qu.: 1.000   1st Qu.:3.00  
 Median :32.00   Median : 8.00   Median : 3.000   Median :3.00  
 Mean   :30.93   Mean   : 9.82   Mean   : 5.585   Mean   :3.12  
 3rd Qu.:34.00   3rd Qu.:11.75   3rd Qu.: 9.000   3rd Qu.:3.00  
 Max.   :60.00   Max.   :27.00   Max.   :23.000   Max.   :4.00  
                 NA's   :91      NA's   :15                     
     nstud           studrank    
 Min.   :  3.00   Min.   :1.000  
 1st Qu.: 19.00   1st Qu.:2.000  
 Median : 43.00   Median :3.000  
 Mean   : 76.46   Mean   :2.921  
 3rd Qu.:103.00   3rd Qu.:4.000  
 Max.   :256.00   Max.   :5.000  
                  NA's   :1      
boxplot(data[-1], las = 2, col = "skyblue")
title("boxplot de cada Item")

En este caso contamos con datos mixtos. Hoy en día existen técnicas avanzadas para estos conocidas comom Factor Analysis for Mixed Data (FAMD), véase Francois Husson et al. Sin embargo en esta oportunidad optaremos por escalar los datos (centrarlos y dividirlos entre su desviación estandar). De esta manera se manejan el hecho de que las técnicas de análisis darían mayor peso a las variables con mayor variabilidad.

datos <- scale(data, center = T, scale = T)
summary(datos)
     facsex           facethn           facnat           facrank       
 Min.   :-0.8355   Min.   :-0.519   Min.   :-0.4533   Min.   :-1.3796  
 1st Qu.:-0.8355   1st Qu.:-0.519   1st Qu.:-0.4533   1st Qu.:-0.3339  
 Median :-0.8355   Median :-0.519   Median :-0.4533   Median :-0.3339  
 Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 1.1936   3rd Qu.:-0.519   3rd Qu.:-0.4533   3rd Qu.: 0.7117  
 Max.   : 1.1936   Max.   : 2.977   Max.   : 2.1997   Max.   : 1.7574  
                                                                       
     salary           yrsteach           yrsut             degree      
 Min.   :-2.1097   Min.   :-1.4627   Min.   :-0.9085   Min.   :-3.927  
 1st Qu.:-0.8377   1st Qu.:-0.7179   1st Qu.:-0.7459   1st Qu.:-0.223  
 Median : 0.1516   Median :-0.2710   Median :-0.4205   Median :-0.223  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000  
 3rd Qu.: 0.4343   3rd Qu.: 0.2875   3rd Qu.: 0.5556   3rd Qu.:-0.223  
 Max.   : 4.1090   Max.   : 2.5591   Max.   : 2.8331   Max.   : 1.629  
                   NA's   :91        NA's   :15                        
     nstud            studrank       
 Min.   :-0.9263   Min.   :-1.52461  
 1st Qu.:-0.7246   1st Qu.:-0.73110  
 Median :-0.4219   Median : 0.06241  
 Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.: 0.3346   3rd Qu.: 0.85592  
 Max.   : 2.2637   Max.   : 1.64943  
                   NA's   :1         
boxplot(datos, las = 2, col = "skyblue")
title("boxplot de las variables escaladas")

1. Primer objetivo:

Se pide generar un factor que haga una combinación lineal de las variables.

Para esto decidimos utiliza el Análisis de Componentes Principales (PCA) que justamente crea factores como combinación lineal de las variables originales. Como se pide un factor extraemos directamente un factor. Con repecto a los datos faltantes por defecto imputamos la mediana.

# Pricipal Components Analysis entering raw data and extracting PCs from the
# correlation matrix
fit <- princomp(datos[complete.cases(datos), ], cor = TRUE)
summary(fit)  # print variance accounted for
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     1.6990407 1.3327605 1.1730617 1.0430382 0.96240789
Proportion of Variance 0.2886739 0.1776251 0.1376074 0.1087929 0.09262289
Cumulative Proportion  0.2886739 0.4662990 0.6039063 0.7126992 0.80532211
                           Comp.6     Comp.7     Comp.8     Comp.9
Standard deviation     0.78195557 0.74445206 0.59761518 0.54289101
Proportion of Variance 0.06114545 0.05542089 0.03571439 0.02947307
Cumulative Proportion  0.86646756 0.92188845 0.95760284 0.98707591
                          Comp.10
Standard deviation     0.35950097
Proportion of Variance 0.01292409
Cumulative Proportion  1.00000000
loadings(fit)  # pc loadings

Loadings:
         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
facsex           0.555 -0.188  0.240  0.203         0.672 -0.269       
facethn  -0.193 -0.213  0.575                0.719  0.233              
facnat                  0.628 -0.288  0.454 -0.474  0.144 -0.237  0.106
facrank   0.456 -0.242         0.272                0.272  0.342  0.637
salary    0.436 -0.232  0.129  0.225 -0.220 -0.228  0.355        -0.656
yrsteach  0.513               -0.227  0.106  0.266 -0.225 -0.238 -0.212
yrsut     0.517  0.127        -0.163         0.297        -0.318  0.228
degree          -0.264         0.680  0.566        -0.299 -0.202       
nstud    -0.163 -0.569 -0.249 -0.102 -0.263         0.219 -0.654  0.153
studrank         0.341  0.396  0.423 -0.538 -0.156 -0.287 -0.347  0.157
         Comp.10
facsex    0.135 
facethn         
facnat          
facrank   0.230 
salary   -0.174 
yrsteach  0.667 
yrsut    -0.661 
degree          
nstud           
studrank        

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings       1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
Proportion Var    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
Cumulative Var    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
               Comp.9 Comp.10
SS loadings       1.0     1.0
Proportion Var    0.1     0.1
Cumulative Var    0.9     1.0
plot(fit, type = "lines")  # scree plot
abline(h = 1, lty = 2)

biplot(fit)

eig <- eigen(cor(datos[complete.cases(datos), ]))
eig$values
 [1] 2.8867392 1.7762505 1.3760736 1.0879288 0.9262289 0.6114545 0.5542089
 [8] 0.3571439 0.2947307 0.1292409

Vemos que la proporción de variación explicada por la primera componente principal es apenas 28%. Con lo cual no es aconcejable utilizar esta solución.

2. Evaluar una estructura subyacente previa

Se cre que existen tres factores:

  • Nivel académico y experiencia
  • Demográfico
  • Estudantes

Para evaluar esta hipótesis realizamos una análisis factorial confirmatorio.

Diseño del modelo

library(lavaan)
model <- "
# latent variable definitions
academico =~ yrsut + degree + facrank + salary + yrsteach

demografico =~ facsex + facethn

estudiantes =~ nstud + studrank

# covariance between factor_1 and factor_2



# residual covariances

"
models <- list()
models$m1 <- cfa(model, data = datos)
Warning in lavaan::lavaan(model = model, data = datos, model.type =
"cfa", : lavaan WARNING: model has NOT converged!
# summary(models$m1, fit.measures = TRUE, modindices = TRUE)
# fitMeasures(models$m1) standardizedsolution(models$m1)

# Plot path diagram:

library(semPlot)
semPaths(models$m1, curvePivot = TRUE)
Warning in qgraph(Edgelist, labels = nLab, bidirectional = Bidir, directed
= Directed, : The following arguments are not documented and likely not
arguments of qgraph and thus ignored: loopRotation; residuals; residScale;
residEdge; CircleEdgeEnd

3. Identificar factores

Para esto realizamos una análisis factorial exploratorio. Utilizamos el método de ejes principales. Se realiza un AFE tipo-R sobre la matriz de correlación de Pearson de las 10 variables escaladas. En cuanto a datos faltantos la correlación es estimada borrando a pares los datos faltantes y no todo el renglón.

Supuestos

  • Se asumen muestras balanceadas

  • Criterio de adecuación de muestreo MSA y KMO

library(REdaS)
x <- KMOS(datos, use = "pair")
print(x, stats = "both", sort = T)

Kaiser-Meyer-Olkin Statistics

Call: KMOS(x = datos, use = "pair")

Measures of Sampling Adequacy (MSA):
  facethn  studrank    facnat    facsex     nstud  yrsteach    degree 
0.2985855 0.4195465 0.4551807 0.4864868 0.4978905 0.5292224 0.6323156 
    yrsut   facrank    salary 
0.6390588 0.6829275 0.6914431 

KMO-Criterion: 0.575367
# x <- KMOS(data[,-c(1,10)]) print(x, stats = 'both', sort=T) x <-
# KMOS(data[,-c(1,2,10)]) print(x, stats = 'both', sort=T)

Vemos que algunos items tiene un indice MSA considerado inaceptable (<.5)

x <- KMOS(datos[, -c(2)], use = "pair")
print(x, stats = "both", sort = T)

Kaiser-Meyer-Olkin Statistics

Call: KMOS(x = datos[, -c(2)], use = "pair")

Measures of Sampling Adequacy (MSA):
 studrank    facnat    facsex  yrsteach     nstud    degree     yrsut 
0.4102247 0.5370122 0.5748505 0.5759289 0.5838254 0.6153991 0.6471897 
  facrank    salary 
0.6736650 0.6937381 

KMO-Criterion: 0.6207431
x <- KMOS(datos[, -c(2, 10)], use = "pair")
print(x, stats = "both", sort = T)

Kaiser-Meyer-Olkin Statistics

Call: KMOS(x = datos[, -c(2, 10)], use = "pair")

Measures of Sampling Adequacy (MSA):
   facsex    facnat  yrsteach    degree     nstud     yrsut   facrank 
0.5257203 0.5342228 0.5829899 0.6196436 0.6399428 0.6453008 0.6854796 
   salary 
0.7336219 

KMO-Criterion: 0.6373098

Luego de eliminar las variables facethn y studrank los indices MSA con almenos acpetables aunque ninguno llega a ser meritorio. El KMO es regular. Realizaremos el AFE con estas 8 variables.

Número de factores

  • A priori: para que los factores sean bien identificado esperamos tres factores.
  • Regla de los autovalores y análisis en paralelo: Ambos criterios recomiendan tres factores.
# Determine Number of Factors to Extract
library(nFactors)
mcor <- cor(datos[, -c(2, 10)], method = "pearson", use = "pair")  # Cor Variables ordinale
ev <- eigen(mcor)  # get eigenvalues
ap <- parallel(subject = nrow(datos[, -c(2, 10)]), var = ncol(datos[, -c(2, 
    10)]), rep = 100, cent = 0.05, model = "factors")
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea, model = "factors")
plotnScree(nS)
abline(h = 1, col = "black", lty = 1)

print("Autovalores:")
[1] "Autovalores:"
ev$values
[1] 3.15405050 1.55008585 1.21240263 0.74478238 0.51816690 0.48861058
[7] 0.23217308 0.09972808

Utilizaremos entonces tres factores.

Ajuste del modelo

fit <- fa(datos[, -c(2, 10)], nfactors = 3, fm = "pa", rotate = "none")
print(fit, digits = 2, cutoff = 0.4, sort = TRUE)
Factor Analysis using method =  pa
Call: fa(r = datos[, -c(2, 10)], nfactors = 3, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         item   PA1   PA2   PA3   h2   u2 com
yrsut       6  0.89 -0.07  0.19 0.83 0.17 1.1
yrsteach    5  0.81  0.10  0.53 0.95 0.05 1.7
facrank     3  0.80  0.18 -0.30 0.76 0.24 1.4
salary      4  0.70  0.23 -0.21 0.59 0.41 1.4
facsex      1 -0.14 -0.56  0.17 0.36 0.64 1.3
nstud       8 -0.30  0.55 -0.24 0.45 0.55 1.9
degree      7 -0.30  0.30  0.21 0.22 0.78 2.8
facnat      2 -0.36  0.42  0.53 0.59 0.41 2.7

                       PA1  PA2  PA3
SS loadings           2.90 0.99 0.86
Proportion Var        0.36 0.12 0.11
Cumulative Var        0.36 0.49 0.59
Proportion Explained  0.61 0.21 0.18
Cumulative Proportion 0.61 0.82 1.00

Mean item complexity =  1.8
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are  28  and the objective function was  3.65 0.4 with Chi Square of  1288.14
The degrees of freedom for the model are 7  and the objective function was  0.26 
 0.4
The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.08 
 0.4
The harmonic number of observations is  328 with the empirical chi square  26.09  with prob <  0.00048 
 0.4The total number of observations was  357  with Likelihood Chi Square =  91.09  with prob <  7.4e-17 
 0.4
Tucker Lewis Index of factoring reliability =  0.732
RMSEA index =  0.185  and the 90 % confidence intervals are  0.151 0.218 0.4
BIC =  49.95
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                PA1  PA2  PA3
Correlation of scores with factors             0.97 0.82 0.90
Multiple R square of scores with factors       0.94 0.67 0.81
Minimum correlation of possible factor scores  0.87 0.33 0.63

Aunque hay algunas comunalidades pequeñas hay varias cargas significativas. Asumiendo que los tres factores están correlacionados. Haremos una rotación oblicua cluster.

fit <- fa(datos[, -c(2, 10)], nfactors = 3, fm = "pa", rotate = "cluster")
print(fit, digits = 2, cutoff = 0.4, sort = TRUE)
Factor Analysis using method =  pa
Call: fa(r = datos[, -c(2, 10)], nfactors = 3, rotate = "cluster", 
    fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
         item   PA1   PA3   PA2   h2   u2 com
yrsteach    5  0.91  0.31 -0.15 0.95 0.05 1.3
yrsut       6  0.85 -0.08 -0.18 0.83 0.17 1.1
facrank     3  0.76 -0.29  0.26 0.76 0.24 1.5
salary      4  0.70 -0.16  0.28 0.59 0.41 1.4
facnat      2 -0.07  0.78  0.22 0.59 0.41 1.2
degree      7 -0.13  0.43  0.23 0.22 0.78 1.7
nstud       8 -0.12  0.25  0.65 0.45 0.55 1.4
facsex      1 -0.32 -0.22 -0.62 0.36 0.64 1.8

                       PA1  PA3  PA2
SS loadings           2.75 1.03 0.97
Proportion Var        0.34 0.13 0.12
Cumulative Var        0.34 0.47 0.59
Proportion Explained  0.58 0.22 0.20
Cumulative Proportion 0.58 0.80 1.00

 With factor correlations of 
      PA1   PA3   PA2
PA1  1.00 -0.14 -0.18
PA3 -0.14  1.00 -0.28
PA2 -0.18 -0.28  1.00

Mean item complexity =  1.4
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are  28  and the objective function was  3.65 0.4 with Chi Square of  1288.14
The degrees of freedom for the model are 7  and the objective function was  0.26 
 0.4
The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.08 
 0.4
The harmonic number of observations is  328 with the empirical chi square  26.09  with prob <  0.00048 
 0.4The total number of observations was  357  with Likelihood Chi Square =  91.09  with prob <  7.4e-17 
 0.4
Tucker Lewis Index of factoring reliability =  0.732
RMSEA index =  0.185  and the 90 % confidence intervals are  0.151 0.218 0.4
BIC =  49.95
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                PA1  PA3  PA2
Correlation of scores with factors             0.98 0.89 0.85
Multiple R square of scores with factors       0.96 0.79 0.72
Minimum correlation of possible factor scores  0.91 0.58 0.43

Aunque la comunalidad de las ultimas tres varibles son pequeñas sus cargas son significativas. No las eliminamos estas variables y solo advertimos que la solución factorial no explica muy bien estas variables.

Resultados

Quedan definidos tres factores

  • PA1: Nivel académico y experiencia. La variable mejor representada es años de experiencia y la menos representada fue el salario.

  • PA2: Sexo y número de estudiantes: Indicando una preferencia de los estudiantes por el sexo:1.

  • PA3: Nacionalidad y grádo académico: Con una correlació poritiva entre ellos y el factor.

4. Conclusión:

De las tres técnicas utilizadas la tercera es la más recomendada ya que arroja una estructura no esperada y es la que mejor representa la relación entre las variables. Como por ejemplo la relación entre el sexo y el número de estudiantes y descarta variables que no serían necesarias como la raza y el nivel de los estudiantes.