Analisis Multivariado

Módulo 1: Introducción al Análisis Multivariado

Teoría: ¿Qué es el Análisis Multivariado?

El análisis multivariado es un conjunto de técnicas estadísticas utilizadas para analizar datos que contienen más de una variable. A diferencia del análisis univariado (una variable) o bivariado (dos variables), el análisis multivariado explora las relaciones conjuntas entre tres o más variables simultáneamente. Esto nos permite obtener una comprensión mucho más rica y realista de los fenómenos que estudiamos, ya que los problemas del mundo real raramente involucran una sola variable.

Objetivos clave:

Reducción de la dimensionalidad: Simplificar la estructura de los datos sin perder información relevante (Ej: Análisis de Componentes Principales).

Clasificación y agrupación: Agrupar observaciones o variables similares en categorías (Ej: Análisis de Clúster, Análisis Discriminante).

Investigación de la dependencia: Analizar la relación entre dos conjuntos de variables (Ej: Correlación Canónica, MANOVA).

Predicción: Construir modelos para predecir los valores de ciertas variables basándose en otras.

Conceptos Fundamentales:

Vector de Medias:

Es el análogo multivariado de la media. Para un conjunto de datos con $p$ variables, el vector de medias $\mu$ es:

\[\\ \mathbf{\mu} = \begin{pmatrix} \mu\_1 \\ \mu\_2 \\ \vdots \\ \mu\_p \end{pmatrix}\]

Donde

\[\mu_i\] es la media de la variable $i$.

Matriz de Covarianzas:

\[\\ \mathbf{\Sigma} = \begin{pmatrix} \sigma\_{11} & \sigma\_{12} & \cdots & \sigma\_{1p} \\ \sigma\_{21} & \sigma\_{22} & \cdots & \sigma\_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma\_{p1} & \sigma\_{p2} & \cdots & \sigma\_{pp} \end{pmatrix}\]

Donde

\[\sigma_{ii}\] (la diagonal) es la varianza de la variable $i$, y \[\sigma_{ij}\] (fuera de la diagonal) es la covarianza entre las variables $i$ y $j$.

Código en R Studio: Preparando el Entorno

Para empezar, vamos a cargar algunos datos clásicos en R y a calcular estas matrices básicas. Usaremos el famoso dataset iris, que contiene mediciones de 3 especies de flores.

# 1. Cargar y explorar los datos
# El dataset 'iris' ya viene cargado en R.
# Tomaremos solo las variables numéricas (las primeras 4 columnas).
data <- iris[, 1:4]
head(data)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

# 2. Calcular el Vector de Medias
# La función colMeans calcula la media de cada columna.
vector_medias <- colMeans(data)
print("Vector de Medias:")

## [1] "Vector de Medias:"

print(vector_medias)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

# 3. Calcular la Matriz de Covarianzas
# La función cov() calcula la matriz de covarianzas.
matriz_cov <- cov(data)
print("Matriz de Covarianzas:")

## [1] "Matriz de Covarianzas:"

print(matriz_cov)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
## Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
## Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
## Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063

# 4. Calcular la Matriz de Correlación
# La función cor() es útil para ver las relaciones estandarizadas.
matriz_cor <- cor(data)
print("Matriz de Correlación:")

## [1] "Matriz de Correlación:"

print(matriz_cor)

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Interpretación:

El vector de medias te da el valor promedio para cada una de las cuatro medidas (largo del sépalo, ancho del sépalo, etc.).

La matriz de covarianzas muestra en su diagonal la varianza de cada variable. Los valores fuera de la diagonal indican cómo dos variables cambian juntas. Por ejemplo, una covarianza positiva entre Sepal.Length y Petal.Length significa que a medida que una crece, la otra tiende a crecer también.

La matriz de correlación es una versión estandarizada de la matriz de covarianzas. Es más fácil de interpretar, ya que los valores van de -1 a 1. Un valor cercano a 1 indica una fuerte relación positiva, y cercano a -1, una fuerte relación negativa.

Módulo 2: Análisis de Componentes Principales (PCA)

Teoría: Reduciendo la Complejidad

El Análisis de Componentes Principales (PCA, por sus siglas en inglés) es una técnica de reducción de dimensionalidad. Su objetivo es transformar un conjunto de variables posiblemente correlacionadas en un nuevo conjunto de variables no correlacionadas llamadas componentes principales.

Estos nuevos componentes son combinaciones lineales de los originales y están ordenados de tal manera que el primer componente principal (PC1) captura la mayor cantidad posible de variabilidad de los datos, el segundo (PC2) captura la segunda mayor cantidad de variabilidad (siendo ortogonal/no correlacionado con el PC1), y así sucesivamente. A menudo, los primeros dos o tres componentes son suficientes para explicar una gran parte de la varianza total, permitiéndonos visualizar y analizar los datos en un espacio de menor dimensión.

Fórmulas Matemáticas

El PCA se basa en la diagonalización de la matriz de covarianzas $\Sigma$. Los componentes principales se encuentran resolviendo la siguiente ecuación de valores y vectores propios (eigenvalues y eigenvectors):

$$( - ) = 0

$$Donde:

$\mathbf{\Sigma}$ es la matriz de covarianzas.
$\lambda$ son los valores propios (eigenvalues). Representan la cantidad de varianza explicada por cada componente principal.
$\mathbf{e}$ son los vectores propios (eigenvectors). Definen la “dirección” o los coeficientes de los componentes principales.
$\mathbf{I}$ es la matriz identidad.

El Componente Principal $k$ (PCk) se calcula como:

\[PC_k= e_{k1}X_1 + e_{k2}X_2 + \cdots + e_{kp}X_p\]

Ejemplo y Código en RStudio

Usaremos el dataset mtcars para reducir la dimensionalidad de las características de los coches.

# 1. Cargar los datos
# Usaremos solo las columnas numéricas de mtcars.
data_pca <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]

# Es MUY importante estandarizar los datos antes de PCA si las variables
# tienen escalas diferentes (ej. millas vs. caballos de fuerza).
data_pca_scaled <- scale(data_pca)

# 2. Realizar el PCA
# La función prcomp() realiza el análisis.
pca_result <- prcomp(data_pca_scaled, center = TRUE, scale. = TRUE)

# 3. Explorar los resultados
# summary() nos muestra la importancia de cada componente.
summary(pca_result)

## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.2552 1.0754 0.58724 0.39741 0.3599 0.27542 0.22181
## Proportion of Variance 0.7266 0.1652 0.04926 0.02256 0.0185 0.01084 0.00703
## Cumulative Proportion  0.7266 0.8918 0.94107 0.96364 0.9821 0.99297 1.00000

# Ver los vectores propios (loadings)
print(pca_result$rotation)

##             PC1         PC2        PC3          PC4        PC5         PC6
## mpg  -0.4127573  0.08296098 -0.2416477 -0.766798834  0.2127946  0.09002238
## cyl   0.4247315  0.07844163 -0.1880252 -0.193926827 -0.2383825 -0.78055217
## disp  0.4225036 -0.08239922  0.1180359 -0.587679498 -0.1488730  0.16169873
## hp    0.3877611  0.33696384  0.2027400  0.006884691  0.8314576 -0.04607937
## drat -0.3311703  0.44858845  0.7552915 -0.117073263 -0.2217502 -0.23319304
## wt    0.3913153 -0.32236122  0.4405532 -0.107346915 -0.1666473  0.36612190
## qsec -0.2399275 -0.74932087  0.2943885 -0.061382684  0.3278152 -0.40735764
##              PC7
## mpg   0.35069828
## cyl   0.27276548
## disp -0.63803744
## hp    0.03873894
## drat -0.03702814
## wt    0.61280387
## qsec -0.13083413

# 4. Visualizar los resultados
# Un biplot muestra tanto las observaciones como las variables en el nuevo espacio.
# Nos ayuda a interpretar qué variables contribuyen a cada componente.
biplot(pca_result, scale = 1)

Interpretación:

summary(pca_result): Fíjate en la “Proportion of Variance” y “Cumulative Proportion”. Verás que los primeros dos componentes (PC1 y PC2) explican un gran porcentaje de la variabilidad total (probablemente más del 80%). Esto significa que podemos representar la mayor parte de la información de las 7 variables originales con solo 2.

pca_result rotation: Esta matriz muestra los “loadings” o coeficientes. Por ejemplo, en PC1, verás que variables como cyl, disp, hp y wt tienen coeficientes negativos grandes, mientras que mpg tiene uno positivo. Esto sugiere que PC1 representa un eje que va desde coches pequeños y eficientes (alto mpg) a coches grandes y potentes (altos cyl, hp, wt).

Biplot: Las flechas representan las variables originales. Flechas que apuntan en la misma dirección están positivamente correlacionadas. Las que apuntan en direcciones opuestas, negativamente. Los puntos son las observaciones (los coches). Coches que están cerca en el gráfico son similares en términos de sus características originales.

Ejemplo 2

# Cargar datos
library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(ggplot2)
library(GGally)
data(USArrests)

# Ver estructura
head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

str(USArrests)

## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

# Estandarizar (importante en PCA si las escalas difieren)
USArrests_scaled <- scale(USArrests)

# Realizar PCA
pca_result <- prcomp(USArrests_scaled, center = TRUE, scale. = TRUE)

# Resumen
summary(pca_result)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

# Gráfico de varianza explicada
fviz_eig(pca_result, addlabels = TRUE)

## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

# Biplot
fviz_pca_biplot(pca_result, 
                col.ind = "cos2", 
                gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                repel = TRUE)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretación

PC1 y PC2 explican el 80-90% de la varianza.
Las flechas indican la dirección y magnitud de cada variable original.
Los puntos son las observaciones (estados).

Módulo 3: Análisis de Clúster

Teoría: Encontrando Grupos Naturales

El análisis de clúster es una técnica de aprendizaje no supervisado. Su objetivo es agrupar un conjunto de observaciones de tal manera que las observaciones dentro de un mismo grupo (o clúster) sean más similares entre sí que con las de otros grupos. No hay una variable de “respuesta” o “etiqueta” predefinida; el algoritmo busca la estructura inherente en los datos.

Existen dos tipos principales:

Clustering Jerárquico: Construye una jerarquía de clústeres, ya sea de forma aglomerativa (empezando con cada punto como un clúster y fusionándolos) o divisiva (empezando con un gran clúster y dividiéndolo). El resultado se visualiza en un dendrograma.
Clustering de Partición (K-Means): Divide las observaciones en un número predefinido de clústeres ($k$). El algoritmo asigna iterativamente cada observación al clúster cuyo centroide (media) esté más cercano.

Fórmulas Matemáticas

El concepto central es la distancia. La métrica más común es la Distancia Euclidiana entre dos puntos $A = (a_1, a_2, ..., a_p)$ y $B = (b_1, b_2, ..., b_p)$:

\[d(A, B) = \sqrt{\sum_{i=1}^{p} (a_i - b_i)^2}\] En K-Means, el objetivo es minimizar la suma de cuadrados dentro de los clústeres (Within-Cluster Sum of Squares, WCSS):

\[\text{WCSS} = \sum_{k=1}^{K} \sum_{i \in C_k} d(x_i, \mu_k)^2\]

Donde K es el número de clústeres, Ck es el clúster k, y μk es el centroide del clúster k.

Ejemplo y Código en RStudio

Vamos a usar los datos iris para ver si podemos identificar las tres especies de flores sin usar sus etiquetas, solo basándonos en sus medidas.

1. Clustering Jerárquico

# 1. Preparar los datos
# Usaremos las 4 variables numéricas y las estandarizaremos.
data_cluster <- iris[, 1:4]
data_cluster_scaled <- scale(data_cluster)

# 2. Calcular la matriz de distancias
dist_matrix <- dist(data_cluster_scaled, method = "euclidean")

# 3. Realizar el clustering jerárquico aglomerativo
# Usaremos el método de Ward, que minimiza la varianza dentro de los clústeres.
hierarchical_cluster <- hclust(dist_matrix, method = "ward.D2")

# 4. Visualizar el dendrograma
plot(hierarchical_cluster, main = "Dendrograma de Flores Iris")
# Podemos cortar el árbol en 3 clústeres.
rect.hclust(hierarchical_cluster, k = 3, border = "red")

Interpretación del Dendrograma: El eje vertical representa la distancia o “disimilitud”. Las fusiones que ocurren en la parte inferior del gráfico son entre observaciones muy similares. Para decidir el número de clústeres, buscas el “salto” vertical más grande que puedas hacer sin cortar ninguna línea horizontal y cuentas las líneas verticales que cruzas. En este caso, cortar a una altura de ~4 nos daría claramente 3 grupos.

Eejemplo

# Distancia + enlace de Ward
d <- dist(USArrests_scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Dendrograma
fviz_dend(hc, k = 4, rect = TRUE, cex = 0.7) +
  labs(title = "Dendrograma: Clustering Jerárquico")

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

2. K-Means Clustering

# 1. Determinar el número óptimo de clústeres (k)
# Método del codo (Elbow Method)
library(factoextra)
library(ggplot2)
wcss <- numeric(10)
for (i in 1:10) {
  wcss[i] <- sum(kmeans(data_cluster_scaled, centers = i)$withinss)
}

ggplot() +
  geom_point(aes(x = 1:10, y = wcss), color = 'blue') +
  geom_line(aes(x = 1:10, y = wcss), color = 'blue') +
  ggtitle("Método del Codo para Encontrar K Óptimo") +
  xlab("Número de Clústeres (K)") +
  ylab("WCSS") +
  theme_minimal()

# El "codo" aparece en k=3, lo que sugiere 3 clústeres.

# 2. Realizar el clustering K-Means con k=3
set.seed(123) # Para reproducibilidad
kmeans_result <- kmeans(data_cluster_scaled, centers = 3, nstart = 25)

# 3. Explorar los resultados
print(kmeans_result$size) # Tamaño de cada clúster

## [1] 50 53 47

print(kmeans_result$centers) # Centroides de cada clúster

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  -1.01119138  0.85041372   -1.3006301  -1.2507035
## 2  -0.05005221 -0.88042696    0.3465767   0.2805873
## 3   1.13217737  0.08812645    0.9928284   1.0141287

# 4. Visualizar los clústeres
# Usaremos los dos primeros componentes principales para visualizar.

fviz_cluster(kmeans_result, data = data_cluster_scaled,
             palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
             geom = "point",
             ellipse.type = "convex",
             ggtheme = theme_bw()
)

Interpretación:

El método del codo muestra que a partir de 3 clústeres, añadir más no reduce significativamente la varianza total dentro de los grupos.

Los resultados de kmeans_result te dirán cuántas observaciones cayeron en cada clúster.

La visualización te mostrará los tres grupos que el algoritmo encontró. Si comparas estos grupos con las especies reales de iris$Species, verás que K-Means hizo un trabajo excelente para separar las flores basándose únicamente en sus medidas.

Ejemplo

# Método del codo
fviz_nbclust(USArrests_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(title = "Método del codo para K-means")

# Aplicar K-means
set.seed(123)
km <- kmeans(USArrests_scaled, centers = 4, nstart = 25)

# Visualizar en espacio PCA
fviz_cluster(km, data = USArrests_scaled,
             palette = "jco",
             ellipse.type = "euclid") +
  labs(title = "Clusters K-means proyectados en PCA")

Módulo 4: Análisis Discriminante Lineal (LDA)

Teoría: Clasificando ObservacionesA diferencia del clustering, el Análisis Discriminante Lineal (LDA) es una técnica de aprendizaje supervisado. Su objetivo principal es la clasificación. Dado un conjunto de observaciones que ya pertenecen a grupos conocidos (etiquetas), el LDA busca encontrar una combinación lineal de las variables que mejor separe (o discrimine) esos grupos.Una vez que se encuentran estas “funciones discriminantes”, se pueden usar para predecir a qué grupo pertenece una nueva observación. El LDA asume que las variables predictoras siguen una distribución normal y que las covarianzas entre los grupos son iguales.Fórmulas MatemáticasEl LDA busca maximizar la relación entre la varianza entre los grupos ($S_B$) y la varianza dentro de los grupos ($S_W$). Esto se logra encontrando un vector de coeficientes $w$ que maximiza el siguiente criterio:\[J(w) = \frac{w^T S_B w}{w^T S_W w} \]Donde:

$S_B$ es la matriz de covarianza entre grupos (between-group covariance).
$S_W$ es la matriz de covarianza dentro de los grupos (within-group covariance).

La solución a este problema también se encuentra a través de un problema de valores y vectores propios.

Ejemplo y Código en RStudio

El dataset iris es el ejemplo canónico para LDA, ya que tenemos las medidas (predictores) y la especie (la clase que queremos predecir).

# 1. Cargar el paquete necesario y preparar los datos
# El paquete MASS contiene la función lda().
#install.packages("MASS")
library(MASS)

# Datos de iris
data_lda <- iris

# 2. Dividir los datos en entrenamiento y prueba
# Esto es crucial para evaluar el modelo de forma honesta.
set.seed(123)
sample_index <- sample(1:nrow(data_lda), 0.7 * nrow(data_lda))
train_data <- data_lda[sample_index, ]
test_data <- data_lda[-sample_index, ]

# 3. Construir el modelo LDA
# La fórmula Species ~ . significa "predecir Especie usando todas las demás variables".
lda_model <- lda(Species ~ ., data = train_data)

# Imprimir el modelo
print(lda_model)

## Call:
## lda(Species ~ ., data = train_data)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3428571  0.3047619  0.3523810 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.966667    3.394444     1.461111   0.2555556
## versicolor     5.971875    2.787500     4.309375   1.3500000
## virginica      6.586486    2.948649     5.529730   1.9945946
## 
## Coefficients of linear discriminants:
##                    LD1        LD2
## Sepal.Length  0.887242 -0.1977697
## Sepal.Width   1.267866 -1.8705232
## Petal.Length -2.101235  1.1717820
## Petal.Width  -2.934756 -3.1681626
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9937 0.0063

# 4. Realizar predicciones en el conjunto de prueba
predictions <- predict(lda_model, newdata = test_data)

# Ver las predicciones
predicted_classes <- predictions$class
print(predicted_classes)

##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     versicolor versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor virginica  versicolor versicolor
## [31] versicolor versicolor virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

# 5. Evaluar el rendimiento del modelo
# Creamos una tabla de confusión para ver qué tan bien clasificó.
confusion_matrix <- table(Predicted = predicted_classes, Actual = test_data$Species)
print(confusion_matrix)

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0         17         0
##   virginica       0          1        13

# Calcular la precisión (accuracy)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Precisión del modelo:", round(accuracy * 100, 2), "%"))

## [1] "Precisión del modelo: 97.78 %"

# 6. Visualizar los resultados
# lda_model crea nuevas dimensiones (LD1, LD2) que separan óptimamente los grupos.
# Podemos graficar las observaciones en este nuevo espacio.
lda_plot_data <- cbind(train_data, predict(lda_model)$x)

ggplot(lda_plot_data, aes(x = LD1, y = LD2, color = Species)) +
geom_point() +
ggtitle("Visualización del Análisis Discriminante Lineal") +
theme_minimal()

Interpretación:

print(lda_model): Te mostrará los “coeficientes de la discriminante lineal”. Estos te indican qué variables son más importantes para separar los grupos.

Matriz de Confusión: Es la herramienta clave para la evaluación. La diagonal muestra las clasificaciones correctas. Los valores fuera de la diagonal son los errores.

Precisión: Un porcentaje alto (en este caso, probablemente cercano al 98-100%) indica que el modelo es muy bueno para clasificar las flores en su especie correcta basándose en sus medidas.

Gráfico LDA: Verás qué tan bien las funciones discriminantes (LD1, LD2) separan los puntos de las tres especies. Idealmente, los clústeres de puntos de cada color estarán muy separados y serán muy compactos.

Modulo 5: Análisis Factorial (FA)

Objetivo Explicar las correlaciones entre variables observadas mediante factores latentes (no observables).

Modelo factorial ortogonal \[X=μ+LF+ε\] Donde:

$X∈R^p$ :variables observadas (centradas).

$F∈R^m$ :factores comunes $(m<p)$, $E[F]=0$ , $Cov(F)=Im$

$L∈ R^{p×m}$ : matriz de cargas factoriales.

$ε$ : errores únicos, $E[ε]=0 , Cov(ε)=Ψ$ (diagonal).

Entonces:

$Σ=LL^⊤ +Ψ$

library(psych) 
# Determinar número de factores
fa.parallel(USArrests_scaled, fa = "fa", n.iter = 100)

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

# Extraer 2 factores con rotación Varimax
fa_model <- fa(USArrests_scaled, nfactors = 2, rotate = "varimax")

# Cargas factoriales
print(fa_model$loadings, cutoff = 0.3)

## 
## Loadings:
##          MR1   MR2  
## Murder   0.948      
## Assault  0.841 0.308
## UrbanPop       0.669
## Rape     0.586 0.560
## 
##                  MR1   MR2
## SS loadings    1.953 0.856
## Proportion Var 0.488 0.214
## Cumulative Var 0.488 0.702

# Diagrama de factores
fa.diagram(fa_model)

Analisis Multivariado

Andrew Arnedo

2025-10-17

Ejemplo y Código en RStudio