Importar la base de datos
library(readxl)
data=read_excel(file.choose())
head(data)
## # A tibble: 6 × 12
## `Acidez fija` `Acidez variable` `Acido cítrico` `azúcar residual` Cloruros
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 0.27 0.36 20.7 0.045
## 2 6.3 0.3 0.34 1.6 0.049
## 3 8.1 0.28 0.4 6.9 0.05
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.4 6.9 0.05
## # ℹ 7 more variables: `Dioxido de azufre libre` <dbl>,
## # `Dioxido de azufre total` <dbl>, Densidad <dbl>, pH <dbl>, Sulfatos <dbl>,
## # `alcohol (%)` <dbl>, calidad <dbl>
Eliminamos la variable “Calidad” para trabar un regresión posterior.
pca = prcomp(data[-12] , scale=TRUE) #Scale = estandarización
pca
## Standard deviations (1, .., p=11):
## [1] 1.7950638 1.2550856 1.1052924 1.0092187 0.9865772 0.9688867 0.8524072
## [8] 0.7741825 0.6435399 0.5380401 0.1436979
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## Acidez fija 0.157218451 -0.587558208 0.1213683 0.01858383
## Acidez variable 0.005089494 0.051728054 -0.5909715 0.27411517
## Acido cítrico 0.144049843 -0.345294562 0.5043969 0.14851432
## azúcar residual 0.427408368 0.008749392 -0.2143199 -0.27376531
## Cloruros 0.212011065 -0.008800308 -0.1023674 0.71071228
## Dioxido de azufre libre 0.300334387 0.290355136 0.2794101 -0.30558549
## Dioxido de azufre total 0.406652203 0.244032391 0.1243753 -0.06045562
## Densidad 0.511523597 0.006296796 -0.1292029 -0.02206110
## pH -0.128831885 0.581344397 0.1266715 0.09775335
## Sulfatos 0.043379327 0.222695370 0.4332440 0.44205953
## alcohol (%) -0.437237835 -0.035568666 0.1059032 -0.14107870
## PC5 PC6 PC7 PC8
## Acidez fija 0.25104839 -0.1035307 0.19784897 -0.58835527
## Acidez variable 0.64261658 0.1223385 -0.26935495 -0.02837266
## Acido cítrico 0.05390510 0.1320967 -0.70548123 0.15228698
## azúcar residual 0.01139144 -0.2894469 -0.21275955 0.38818585
## Cloruros -0.32862831 0.3958211 0.07948377 0.10015094
## Dioxido de azufre libre 0.17691226 0.4944936 0.16677879 0.08179901
## Dioxido de azufre total 0.29300991 0.2763199 0.06772962 -0.24731437
## Densidad -0.08458824 -0.3276509 -0.11038544 -0.06902650
## pH -0.11982553 -0.1933412 -0.42731310 -0.53388135
## Sulfatos 0.40058526 -0.4810392 0.30856238 0.27039413
## alcohol (%) 0.33741948 0.1392842 -0.12892247 0.19585929
## PC9 PC10 PC11
## Acidez fija 0.33052283 0.13170530 -0.171290475
## Acidez variable -0.14590968 0.22372176 -0.017056659
## Acido cítrico -0.20201133 0.03735137 -0.009721121
## azúcar residual 0.40896853 -0.09446795 -0.490225929
## Cloruros 0.39353539 -0.05337405 -0.025399445
## Dioxido de azufre libre 0.14407339 0.56745057 0.030908440
## Dioxido de azufre total -0.15454024 -0.70912033 -0.035560953
## Densidad 0.08788799 0.06837403 0.759779368
## pH 0.26129767 0.11073292 -0.141197197
## Sulfatos -0.01169757 0.05770758 -0.041832843
## alcohol (%) 0.62109070 -0.27260860 0.357961330
De la salida anterior , apriori se identifican 4 componentes que superan la unidad.
Componente 1: 1.7950638 Componente 2: 1.2550856 Componente 3: 1.1052924 Componente 4: 1.0092187
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7951 1.2551 1.1053 1.00922 0.98658 0.96889 0.85241
## Proportion of Variance 0.2929 0.1432 0.1111 0.09259 0.08848 0.08534 0.06605
## Cumulative Proportion 0.2929 0.4361 0.5472 0.63979 0.72827 0.81361 0.87967
## PC8 PC9 PC10 PC11
## Standard deviation 0.77418 0.64354 0.53804 0.14370
## Proportion of Variance 0.05449 0.03765 0.02632 0.00188
## Cumulative Proportion 0.93416 0.97181 0.99812 1.00000
considerando la regla del autovalor >1 complementando a la proporición acumulada, tenemos 4 componentes retiendo el 63.98% de información.
library(factoextra)
## Cargando paquete requerido: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_screeplot(pca,addlabels=TRUE )
library(factoextra)
fviz_screeplot(pca,choice =c("eigenvalue") ,addlabels=TRUE )
La decisión es quedarnos con 4 componentes.
head(pca$x[,1:4] ) # filtrar los 4 componente que retienen el 64% de la información
## PC1 PC2 PC3 PC4
## [1,] 3.6765681 -0.5451776 -0.930326898 -1.1352584
## [2,] -0.6445220 0.4307260 -0.356294322 0.9990146
## [3,] 0.1552747 -1.1896785 -0.017529733 0.2701919
## [4,] 1.4552255 0.0996700 -0.001956186 -0.4229793
## [5,] 1.4552255 0.0996700 -0.001956186 -0.4229793
## [6,] 0.1552747 -1.1896785 -0.017529733 0.2701919
# "1:4 " hace referencia que filtra esas 4 columnas.
reduccion=pca$x[,1:4]
Actividad Final
Comparar el R^2 de los datos originales vs los 4 componentes principales.
modelo_1=lm(calidad ~ . ,data=data) #Procedimiento regular
summary(modelo_1)
##
## Call:
## lm(formula = calidad ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.880e+01 7.987 1.71e-15 ***
## `Acidez fija` 6.552e-02 2.087e-02 3.139 0.00171 **
## `Acidez variable` -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## `Acido cítrico` 2.209e-02 9.577e-02 0.231 0.81759
## `azúcar residual` 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## Cloruros -2.473e-01 5.465e-01 -0.452 0.65097
## `Dioxido de azufre libre` 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## `Dioxido de azufre total` -2.857e-04 3.781e-04 -0.756 0.44979
## Densidad -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## Sulfatos 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## `alcohol (%)` 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
El procedimiento clásico nos da un R^2 = 28.03%
##armando la base
data_reducida=cbind(reduccion,data$calidad)
data_reducida=as.data.frame(data_reducida)
modelo_2=lm(V5 ~. , data=data_reducida)
summary(modelo_2)
##
## Call:
## lm(formula = V5 ~ ., data = data_reducida)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9397 -0.5414 -0.0277 0.5148 3.3002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.877909 0.011495 511.356 < 2e-16 ***
## PC1 -0.146506 0.006404 -22.877 < 2e-16 ***
## PC2 0.041009 0.009159 4.477 7.73e-06 ***
## PC3 0.174888 0.010401 16.815 < 2e-16 ***
## PC4 -0.167387 0.011391 -14.695 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8045 on 4893 degrees of freedom
## Multiple R-squared: 0.1756, Adjusted R-squared: 0.1749
## F-statistic: 260.5 on 4 and 4893 DF, p-value: < 2.2e-16
El procedimiento con ACP nos da un R^2 = 17.49%
En resumen Modelo R^2 Modelo_1 : 11 variables = 28.03% Modelo_2 : 4 Componentes = 17.49%