Introductory factor analysis exercise

A company specializing in automobile design wants to analyze what the public looks for when buying a car. To do so, it designed a survey with 10 questions in which each of the 20 respondents is asked to rate, from 1 to 5, whether or not a characteristic is very important.

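The 10 characteristics rated by the respondents, as named in the data set, are: precio (price), financiación (financing), consumo (consumption), combustible (fuel), seguridad (safety), confort (comfort), capacidad (capacity), prestaciones (performance), modernidad (modernity), and aerodinámica (aerodynamics).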

The corresponding data are in the file Automoviles.sav

  1. Compute the correlation matrix of these data
  2. Determine whether the matrix is “factorable” by computing the KMO measure and Bartlett's test
  3. Perform a factorization using the principal components method
  4. Repeat the previous step using the maximum likelihood method
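
The source does not show how the data are read into R. A minimal sketch, assuming the SPSS file sits in the working directory and that the haven package is used to read it; the path and the object name `automoviles` as a plain data frame are assumptions for illustration:

```r
# Hypothetical path: the exercise names the file but not its location.
sav_path <- "Automoviles.sav"

if (requireNamespace("haven", quietly = TRUE) && file.exists(sav_path)) {
  # read_sav() returns a tibble; a plain data frame is enough here.
  automoviles <- as.data.frame(haven::read_sav(sav_path))
  # The exercise describes 20 respondents and 10 rated characteristics.
  str(automoviles)
} else {
  message("Automoviles.sav or the haven package is not available here.")
}
```

The guard only keeps the snippet from failing when the file is absent; it is not part of the analysis itself.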

Factor analysis

“Born with the 20th century, factor analysis (FA) has developed considerably over its more than 100 years of existence. The simple initial model proposed by Spearman (1904) to validate his theory of intelligence has given rise to a broad family of models that are used not only in the social sciences but also in other domains such as biology or economics.” (Ferrando and Carrasco, 2010)

“Regardless of how the investigator determines the relationships among variables under specified conditions, all scientists are united in the common goal: they seek to summarize data so that the empirical relationships can be grasped by the human mind. In doing so, constructs are built which are conceptually clearer than the a priori ideas, and then these constructs are integrated through the development of theories.” (Gorsuch, 1974)

“FA is a statistical model that represents the relationships among a set of variables. It posits that these relationships can be explained by a set of unobservable (latent) variables called factors, the number of factors being substantially smaller than the number of variables.” (Ferrando and Carrasco, 2010)

“In FA, a set of observable variables (items, subtests, or tests) is analyzed, each of which can be regarded as a criterion. Understood this way, FA consists of a system of regression equations like the one described above (one equation for each observable variable) in which the regressors, here called factors, are common to a subset (common factors) or to the whole set (general factors) of variables (see eq. 2 in the box). For each of these equations, the basic difference between FA and a conventional regression is that the factors are not observable.” (Ferrando and Carrasco, 2010)
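
The system of regression equations described in the quotation can be written out as follows. This is the standard common factor model in a notation chosen here for illustration (observed variable $x_j$, loadings $\lambda_{jk}$, factors $f_k$, unique part $e_j$); it is not necessarily the notation of the equation the quotation refers to.

```latex
x_j = \lambda_{j1} f_1 + \lambda_{j2} f_2 + \dots + \lambda_{jm} f_m + e_j,
\qquad j = 1, \dots, p, \quad m < p

% With standardized variables and uncorrelated factors, the model implies
R \approx \Lambda \Lambda^{\top} + \Psi,
\qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p)
```

Here $\Lambda$ is the $p \times m$ matrix of loadings and $\psi_j$ is the uniqueness of variable $j$, so each diagonal element of $R$ splits into a communality $\sum_k \lambda_{jk}^2$ plus $\psi_j$.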

Correlation matrix

“A correlation matrix is a symmetric matrix. Note that a symmetric matrix is its own transpose. Since the elements above the diagonal in a symmetric matrix are the same as those below the diagonal, it is common practice to give only half the matrix” (Gorsuch, 1974)
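
The symmetry property and the half-matrix display can be checked directly in R. A small sketch using three columns of the built-in `mtcars` data as a stand-in (an assumption, since the survey file is not reproduced here):

```r
# Stand-in data: three numeric columns from the built-in mtcars data set.
R <- cor(mtcars[, c("mpg", "hp", "wt")])

# A correlation matrix equals its own transpose.
isSymmetric(R)
all.equal(R, t(R))

# Show only the lower half, blanking the redundant upper triangle.
R_lower <- round(R, 2)
R_lower[upper.tri(R_lower)] <- NA
print(R_lower, na.print = "")
```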

“In the simplest of factor analyses, the first factor is assumed to be the equivalent of one variable. The extent to which this one “factor” can account for the entire correlation matrix is determined. The next factor is then set equal to a second variable and the variance that it can account for is then determined for the correlation matrix from which the variance for the first factor has been extracted. The process is continued until all the factors of interest are extracted. The procedure is referred to as diagonal factor analysis because the calculations for each factor begin with the diagonal element of the variable defining that factor. It has also been referred to as square-root factor analysis since the component version takes the square root of the diagonal element; other names for it include triangular decomposition, sweep-out method, pivotal condensation, solid-staircase analysis, analytic factor analysis (DuBois and Sullivan, 1972), and maximal decomposition (Hunter, 1972). (The procedure is often used in computing other statistics.)” (Gorsuch, 1974)

“In extracting the first diagonal factor, one variable is designated as the first factor. The problem is to determine the correlation of the variable with itself as a factor and the correlation of the other variables with the factor. The correlations are the same as the weights in the factor pattern matrix. Since the variable is considered to be error-free in the component model, the defining variable is the factor, and the weight/correlation for that variable must be: $p_{1A} = 1.0$” (Gorsuch, 1974)

“When the first factor is defined as identical with the defining variable, all of that variable’s communality will be accounted for by that factor. Since the communality is given in the diagonal of the matrix being factored, the square of the defining variable’s single loading on the first factor must reproduce the diagonal element. Therefore, the loading of the defining variable on the first factor, $p_{1A}$, [is the square root of that diagonal element: $p_{1A} = \sqrt{r_{11}}$]” (Gorsuch, 1974)

“The next step is to find the loadings of the other variables on this first factor. From equation (2.4.4) the correlation between the defining variable and any other variable is equal to the product of their two loadings, i.e., [$r_{1j} = p_{1A}\, p_{jA}$]” (Gorsuch, 1974) “To find $p_{jA}$, one would simply divide through by $p_{1A}$ to obtain: [$p_{jA} = r_{1j} / p_{1A}$]” “The loading for each variable is therefore calculated by dividing the correlation by the loading of the defining variable.” (Gorsuch, 1974)
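
The extraction steps described in these passages can be sketched in a few lines of R. The 3 x 3 correlation matrix below is made up for illustration, and `diagonal_factors()` is a hypothetical helper written for this note, not a function from any package. With unities in the diagonal, the result coincides with the Cholesky (triangular) factor returned by base R's `chol()`.

```r
# Small illustrative correlation matrix (made-up values, not the survey data).
R <- matrix(c(1.0, 0.6, 0.3,
              0.6, 1.0, 0.4,
              0.3, 0.4, 1.0), nrow = 3, byrow = TRUE)

# Hypothetical helper: extract one diagonal factor per variable, in order.
diagonal_factors <- function(R) {
  p <- nrow(R)
  loadings <- matrix(0, p, p)
  residual <- R
  for (k in seq_len(p)) {
    # Loading of the defining variable: square root of its diagonal element.
    p_kk <- sqrt(residual[k, k])
    # Loadings of all variables: residual correlation / defining loading.
    loadings[, k] <- residual[, k] / p_kk
    # Remove what this factor accounts for before extracting the next one.
    residual <- residual - tcrossprod(loadings[, k])
  }
  loadings
}

L <- diagonal_factors(R)
round(L, 3)

all.equal(L %*% t(L), R)    # the full set of factors reproduces R
all.equal(L, t(chol(R)))    # same as the triangular (Cholesky) factor
```

The first column of `L` is simply the first column of `R`, because the defining variable's loading is 1 when the diagonal holds unities.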

# Correlation matrix
library(stats)  # attached by default; loaded explicitly here for clarity
cor_auto <- cor(automoviles)
                   precio financiación      consumo  combustible    seguridad      confort    capacidad prestaciones   modernidad aerodinámica
precio          1.0000000    0.8732860    0.8227068    0.8163752   -0.5013159   -0.1937601    0.2134668   -0.6477072   -0.6446941   -0.4974610
financiación    0.8732860    1.0000000    0.7290378    0.8291310   -0.4392191   -0.0712674    0.2487646   -0.7844064   -0.7519310   -0.6969907
consumo         0.8227068    0.7290378    1.0000000    0.8123536   -0.4781461   -0.2255894    0.1915028   -0.5570735   -0.6296349   -0.5402292
combustible     0.8163752    0.8291310    0.8123536    1.0000000   -0.5496865   -0.2616171    0.1738070   -0.7367273   -0.7891501   -0.6537458
seguridad      -0.5013159   -0.4392191   -0.4781461   -0.5496865    1.0000000    0.7377945    0.1753079    0.2917811    0.3406250    0.1225791
confort        -0.1937601   -0.0712674   -0.2255894   -0.2616171    0.7377945    1.0000000    0.4206208    0.1324920    0.0547865   -0.2359846
capacidad       0.2134668    0.2487646    0.1915028    0.1738070    0.1753079    0.4206208    1.0000000   -0.3007577   -0.1803955   -0.4139927
prestaciones   -0.6477072   -0.7844064   -0.5570735   -0.7367273    0.2917811    0.1324920   -0.3007577    1.0000000    0.8864743    0.7302468
modernidad     -0.6446941   -0.7519310   -0.6296349   -0.7891501    0.3406250    0.0547865   -0.1803955    0.8864743    1.0000000    0.7845472
aerodinámica   -0.4974610   -0.6969907   -0.5402292   -0.6537458    0.1225791   -0.2359846   -0.4139927    0.7302468    0.7845472    1.0000000
library(corrplot)
## corrplot 0.90 loaded
corrplot(cor_auto, method = "square", type = "lower")

Determining whether the matrix is factorable

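“The Kaiser-Meyer-Olkin test of sampling adequacy assesses whether or not our sample size is sufficient for factor analysis. A value of less than 0.5 indicates the sample is too small, but ideally we are aiming for 0.7 or above. In this case the value is KMO = .87, which means our sample size is sufficient.” The .87 figure belongs to the quoted source's own example. For the automobile data, the output below reports an overall MSA of 0.70, which just meets the 0.7 guideline, although confort (0.32) and capacidad (0.37) fall below 0.5.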

“The second statistic is Bartlett’s test of sphericity which tells us whether we have an adequate number of correlations between our variables for factor analysis. In this case we are looking for a significance value of less than your alpha level (i.e. p<0.05), just like ANOVA. In this case the value is p < .001, which means that we have enough correlations for factor analysis.”
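
Both statistics can be computed from a correlation matrix with base R alone, which makes clear what they measure. The sketch below uses the usual textbook formulas (overall KMO from squared correlations and squared partial correlations; Bartlett's chi-square from the determinant of the correlation matrix). The function names are hypothetical helpers and `mtcars` is stand-in data, both assumptions for illustration.

```r
# Overall KMO: squared correlations relative to squared correlations plus
# squared partial correlations (off-diagonal elements only).
kmo_overall <- function(R) {
  Q <- solve(R)
  partial <- -Q / sqrt(outer(diag(Q), diag(Q)))
  off <- row(R) != col(R)
  sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))
}

# Bartlett's test of sphericity: H0 says R is an identity matrix.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  c(chisq = chisq, df = df,
    p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Stand-in data: five numeric columns from the built-in mtcars data set.
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
R <- cor(X)

kmo <- kmo_overall(R)
sph <- bartlett_sphericity(R, n = nrow(X))
kmo
sph
```

With 10 variables the degrees of freedom are 10 * 9 / 2 = 45, which matches the `$df` value reported for the automobile data further down.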

# Determine whether the matrix is factorable

library(psych)
KMO(automoviles)  # Kaiser-Meyer-Olkin measure of sampling adequacy
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = automoviles)
## Overall MSA =  0.7
## MSA for each item = 
##       precio financiación      consumo  combustible    seguridad      confort 
##         0.82         0.74         0.84         0.93         0.55         0.32 
##    capacidad prestaciones   modernidad aerodinámica 
##         0.37         0.62         0.68         0.84
# Bartlett's test of sphericity. Use psych::cortest.bartlett() for this;
# stats::bartlett.test() tests homogeneity of variances across groups and is not a sphericity test.
cortest.bartlett(automoviles)
## R was not square, finding R from data
## $chisq
## [1] 163.4656
## 
## $p.value
## [1] 2.362835e-15
## 
## $df
## [1] 45

Principal components method

“Procedures used to factor-analyze a correlation matrix can have each factor based upon the correlations among the total set of variables. The most widely used such procedure is that of principal factors. Several characteristics make the principal factor solution desirable as an initial set of factors (cf. Section 6.1, Characteristics of Principal Factor Methods). While this solution could be rotated to from another solution (Thurstone, 1947), the principal factor procedure is generally used to extract factors from a correlation matrix. The principal factors are usually rotated (cf. Chapters 9 and 10). The basic principal factor procedure can be applied in different situations to slightly different correlation matrices. If it is applied to the correlation matrix with unities in the diagonal, then principal components, Section 6.2, result. If it is applied to the correlation matrix where the diagonals have been adjusted to communality estimates, common factors result (cf. Section 6.3, Communality Estimation and Principal Axes).”

“The prime characteristic of the principal factor extraction procedure is that each factor accounts for the maximum possible amount of the variance of the variables being factored. The first factor from the correlation matrix consists of that weighted combination of all the variables which will produce the highest squared correlations between the variables and the factor since the squared correlation is a measure of the variance accounted”

“Principal component analysis is the extraction of principal factors under the component model. The principal factors are extracted from the correlation matrix with unities as diagonal elements. The factors then give the best least-squares fit to the entire correlation matrix, and each succeeding factor accounts for the maximum amount of the total correlation matrix obtainable. Since the main diagonal is unaltered, the procedure attempts to account for all the variance of each variable and it is thus assumed that all the variance is relevant. Using the full component model means that as many factors as variables are generally needed. However, the full component model is seldom used because so many of the smaller factors are trivial and do not replicate. The smaller factors are generally dropped, and a truncated component solution results. The inaccuracies in reproducing the correlations and variable scores are attributed to errors in the model in that sample.”
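
Extracting principal components from a correlation matrix with unities in the diagonal amounts to an eigendecomposition of that matrix. A minimal sketch, again with `mtcars` columns as stand-in data (an assumption); the comparison against `prcomp(..., scale. = TRUE)` shows that both routes give the same component variances.

```r
# Stand-in data: five numeric columns from the built-in mtcars data set.
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
R <- cor(X)

# Eigenvalues = variance accounted for by each component.
e <- eigen(R)
prop_var <- e$values / sum(e$values)

# Component loadings: eigenvectors scaled by the square root of the eigenvalue.
loadings <- e$vectors %*% diag(sqrt(e$values))
rownames(loadings) <- colnames(X)

# The full component model (as many components as variables) reproduces R.
full_fit <- all.equal(loadings %*% t(loadings), R, check.attributes = FALSE)

# A truncated solution keeps only the larger components and leaves a residual.
kept <- loadings[, 1:2]
residual <- R - kept %*% t(kept)

# Same component variances as prcomp() on standardized variables.
pc <- prcomp(X, scale. = TRUE)
same_variances <- all.equal(pc$sdev^2, e$values)

round(prop_var, 3)
round(residual, 3)
```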

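# Principal components method
# Note: prcomp() centers the variables but does not scale them by default, so this
# decomposes the covariance matrix; use scale. = TRUE to work from the correlation matrix.

CP_automoviles <- prcomp(automoviles)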

summary(CP_automoviles)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     3.3873 1.6833 1.12669 0.92382 0.7676 0.64138 0.53149
## Proportion of Variance 0.6329 0.1563 0.07002 0.04708 0.0325 0.02269 0.01558
## Cumulative Proportion  0.6329 0.7892 0.85922 0.90629 0.9388 0.96149 0.97707
##                            PC8     PC9    PC10
## Standard deviation     0.48217 0.35680 0.23636
## Proportion of Variance 0.01282 0.00702 0.00308
## Cumulative Proportion  0.98990 0.99692 1.00000
plot(CP_automoviles)
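
The "Proportion of Variance" row in the summary follows directly from the standard deviations. A short check, using the standard deviations copied from the printed summary above (so the results agree with it only up to rounding):

```r
# Standard deviations copied from the summary() output of CP_automoviles.
sdev <- c(3.3873, 1.6833, 1.12669, 0.92382, 0.7676,
          0.64138, 0.53149, 0.48217, 0.35680, 0.23636)

# Each component's variance is the square of its standard deviation.
variance <- sdev^2
prop <- variance / sum(variance)
cum_prop <- cumsum(prop)

# Compare with the Proportion of Variance and Cumulative Proportion rows.
round(prop, 4)
round(cum_prop, 4)
```

The first two components together account for roughly 79% of the total variance, as the Cumulative Proportion row shows.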

print(CP_automoviles$rotation)  # component weights (eigenvectors)
##                      PC1         PC2         PC3         PC4          PC5
## precio       -0.34448243 -0.13635824  0.40606595 -0.22964195 -0.259370675
## financiación -0.45047802  0.00325112  0.16116518 -0.28567643 -0.493158683
## consumo      -0.28846551 -0.12879360  0.31756359 -0.14158408  0.464189618
## combustible  -0.43233054 -0.16124636  0.07431297 -0.01207199  0.369571120
## seguridad     0.13076810  0.38028279 -0.08210427 -0.25806310 -0.084477462
## confort       0.04958828  0.61577667  0.06090624 -0.51277820  0.024778476
## capacidad    -0.10898308  0.52789369  0.50912879  0.63843779  0.002342479
## prestaciones  0.35863495 -0.09727861  0.35736427 -0.32541667  0.414448434
## modernidad    0.38735298 -0.08149580  0.48285101 -0.04220731 -0.090745948
## aerodinámica  0.31333911 -0.34674966  0.26879369  0.02536632 -0.386921593
##                      PC6          PC7         PC8         PC9        PC10
## precio       -0.20227083  0.111229681  0.21285224 -0.69128957 -0.03697669
## financiación  0.40599262  0.006677560 -0.12558833  0.42658312  0.28921977
## consumo      -0.33851321  0.523158165 -0.11951485  0.38166705 -0.12488386
## combustible  -0.06496724 -0.754833058 -0.25705337 -0.02184475 -0.06551057
## seguridad    -0.46683036 -0.005003657 -0.56303831 -0.13278593  0.45300190
## confort      -0.04969493 -0.197451622  0.30041238  0.14965729 -0.44383943
## capacidad    -0.01666854 -0.056791538  0.09746596  0.05327527  0.16367509
## prestaciones  0.26998869 -0.104502613  0.27556306 -0.03704667  0.54352825
## modernidad    0.30958975  0.005473131 -0.57075470 -0.08859922 -0.41312008
## aerodinámica -0.53481478 -0.301542282  0.20026478  0.37670771 -0.00349033
# Maximum likelihood method

FA_automoviles <- factanal(automoviles, factors = 2)

print(FA_automoviles$loadings, sort = TRUE, cutoff = 0)
## 
## Loadings:
##              Factor1 Factor2
## precio        0.849  -0.175 
## financiación  0.919  -0.018 
## consumo       0.800  -0.193 
## combustible   0.912  -0.201 
## prestaciones -0.846  -0.035 
## modernidad   -0.870  -0.057 
## aerodinámica -0.787  -0.343 
## seguridad    -0.440   0.754 
## confort      -0.090   0.905 
## capacidad     0.296   0.459 
## 
##                Factor1 Factor2
## SS loadings      5.420   1.829
## Proportion Var   0.542   0.183
## Cumulative Var   0.542   0.725
# Uniquenesses (1 - communality) of each variable

print(FA_automoviles$uniquenesses)
##       precio financiación      consumo  combustible    seguridad      confort 
##    0.2477646    0.1549725    0.3232177    0.1274528    0.2379030    0.1736279 
##    capacidad prestaciones   modernidad aerodinámica 
##    0.7017032    0.2832061    0.2396111    0.2623648
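
The loadings and uniquenesses printed above are linked: each variable's communality is the sum of its squared loadings, and the uniqueness is one minus that sum. A quick check, with the numbers copied from the printed output (rounded to three decimals there, so agreement is approximate):

```r
vars <- c("precio", "financiación", "consumo", "combustible", "prestaciones",
          "modernidad", "aerodinámica", "seguridad", "confort", "capacidad")

# Loadings copied from the factanal() output, in the printed order.
loadings <- matrix(c(
   0.849, -0.175,
   0.919, -0.018,
   0.800, -0.193,
   0.912, -0.201,
  -0.846, -0.035,
  -0.870, -0.057,
  -0.787, -0.343,
  -0.440,  0.754,
  -0.090,  0.905,
   0.296,  0.459
), ncol = 2, byrow = TRUE,
dimnames = list(vars, c("Factor1", "Factor2")))

# Uniquenesses copied from the output, reordered to match the rows above.
uniqueness <- c(0.2477646, 0.1549725, 0.3232177, 0.1274528, 0.2832061,
                0.2396111, 0.2623648, 0.2379030, 0.1736279, 0.7017032)

communality <- rowSums(loadings^2)
ss_loadings <- colSums(loadings^2)

# Communality + uniqueness should be close to 1 for every variable.
round(cbind(communality, uniqueness, total = communality + uniqueness), 3)
# Column sums of squared loadings correspond to the "SS loadings" row.
round(ss_loadings, 3)
```

The check makes the high uniqueness of capacidad (about 0.70) easy to read: the two factors account for only about 30% of its variance.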