Ejercicio 01

Importar la base de datos

library(readxl)
data=read_excel(file.choose())
head(data)
## # A tibble: 6 × 12
##   `Acidez fija` `Acidez variable` `Acido cítrico` `azúcar residual` Cloruros
##           <dbl>             <dbl>           <dbl>             <dbl>    <dbl>
## 1           7                0.27            0.36              20.7    0.045
## 2           6.3              0.3             0.34               1.6    0.049
## 3           8.1              0.28            0.4                6.9    0.05 
## 4           7.2              0.23            0.32               8.5    0.058
## 5           7.2              0.23            0.32               8.5    0.058
## 6           8.1              0.28            0.4                6.9    0.05 
## # ℹ 7 more variables: `Dioxido de azufre libre` <dbl>,
## #   `Dioxido de azufre total` <dbl>, Densidad <dbl>, pH <dbl>, Sulfatos <dbl>,
## #   `alcohol (%)` <dbl>, calidad <dbl>

Componentes principales

Eliminamos la variable “Calidad” para trabar un regresión posterior.

Indentificamos autovalores > 1

pca = prcomp(data[-12] , scale=TRUE)  #Scale = estandarización
pca
## Standard deviations (1, .., p=11):
##  [1] 1.7950638 1.2550856 1.1052924 1.0092187 0.9865772 0.9688867 0.8524072
##  [8] 0.7741825 0.6435399 0.5380401 0.1436979
## 
## Rotation (n x k) = (11 x 11):
##                                  PC1          PC2        PC3         PC4
## Acidez fija              0.157218451 -0.587558208  0.1213683  0.01858383
## Acidez variable          0.005089494  0.051728054 -0.5909715  0.27411517
## Acido cítrico            0.144049843 -0.345294562  0.5043969  0.14851432
## azúcar residual          0.427408368  0.008749392 -0.2143199 -0.27376531
## Cloruros                 0.212011065 -0.008800308 -0.1023674  0.71071228
## Dioxido de azufre libre  0.300334387  0.290355136  0.2794101 -0.30558549
## Dioxido de azufre total  0.406652203  0.244032391  0.1243753 -0.06045562
## Densidad                 0.511523597  0.006296796 -0.1292029 -0.02206110
## pH                      -0.128831885  0.581344397  0.1266715  0.09775335
## Sulfatos                 0.043379327  0.222695370  0.4332440  0.44205953
## alcohol (%)             -0.437237835 -0.035568666  0.1059032 -0.14107870
##                                 PC5        PC6         PC7         PC8
## Acidez fija              0.25104839 -0.1035307  0.19784897 -0.58835527
## Acidez variable          0.64261658  0.1223385 -0.26935495 -0.02837266
## Acido cítrico            0.05390510  0.1320967 -0.70548123  0.15228698
## azúcar residual          0.01139144 -0.2894469 -0.21275955  0.38818585
## Cloruros                -0.32862831  0.3958211  0.07948377  0.10015094
## Dioxido de azufre libre  0.17691226  0.4944936  0.16677879  0.08179901
## Dioxido de azufre total  0.29300991  0.2763199  0.06772962 -0.24731437
## Densidad                -0.08458824 -0.3276509 -0.11038544 -0.06902650
## pH                      -0.11982553 -0.1933412 -0.42731310 -0.53388135
## Sulfatos                 0.40058526 -0.4810392  0.30856238  0.27039413
## alcohol (%)              0.33741948  0.1392842 -0.12892247  0.19585929
##                                 PC9        PC10         PC11
## Acidez fija              0.33052283  0.13170530 -0.171290475
## Acidez variable         -0.14590968  0.22372176 -0.017056659
## Acido cítrico           -0.20201133  0.03735137 -0.009721121
## azúcar residual          0.40896853 -0.09446795 -0.490225929
## Cloruros                 0.39353539 -0.05337405 -0.025399445
## Dioxido de azufre libre  0.14407339  0.56745057  0.030908440
## Dioxido de azufre total -0.15454024 -0.70912033 -0.035560953
## Densidad                 0.08788799  0.06837403  0.759779368
## pH                       0.26129767  0.11073292 -0.141197197
## Sulfatos                -0.01169757  0.05770758 -0.041832843
## alcohol (%)              0.62109070 -0.27260860  0.357961330

De la salida anterior , apriori se identifican 4 componentes que superan la unidad.

Componente 1: 1.7950638 Componente 2: 1.2550856 Componente 3: 1.1052924 Componente 4: 1.0092187

IDentificar el % de retención acumulada

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7951 1.2551 1.1053 1.00922 0.98658 0.96889 0.85241
## Proportion of Variance 0.2929 0.1432 0.1111 0.09259 0.08848 0.08534 0.06605
## Cumulative Proportion  0.2929 0.4361 0.5472 0.63979 0.72827 0.81361 0.87967
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.77418 0.64354 0.53804 0.14370
## Proportion of Variance 0.05449 0.03765 0.02632 0.00188
## Cumulative Proportion  0.93416 0.97181 0.99812 1.00000

considerando la regla del autovalor >1 complementando a la proporición acumulada, tenemos 4 componentes retiendo el 63.98% de información.

Grafico de sedimentación

library(factoextra)
## Cargando paquete requerido: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_screeplot(pca,addlabels=TRUE )

library(factoextra)
fviz_screeplot(pca,choice =c("eigenvalue") ,addlabels=TRUE )

La decisión es quedarnos con 4 componentes.

head(pca$x[,1:4] ) # filtrar los 4 componente que retienen el 64% de la información
##             PC1        PC2          PC3        PC4
## [1,]  3.6765681 -0.5451776 -0.930326898 -1.1352584
## [2,] -0.6445220  0.4307260 -0.356294322  0.9990146
## [3,]  0.1552747 -1.1896785 -0.017529733  0.2701919
## [4,]  1.4552255  0.0996700 -0.001956186 -0.4229793
## [5,]  1.4552255  0.0996700 -0.001956186 -0.4229793
## [6,]  0.1552747 -1.1896785 -0.017529733  0.2701919
            # "1:4 " hace referencia que filtra esas 4 columnas.
reduccion=pca$x[,1:4]

Actividad Final

Comparar el R^2 de los datos originales vs los 4 componentes principales.

modelo_1=lm(calidad ~ . ,data=data)  #Procedimiento regular
summary(modelo_1)
## 
## Call:
## lm(formula = calidad ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.502e+02  1.880e+01   7.987 1.71e-15 ***
## `Acidez fija`              6.552e-02  2.087e-02   3.139  0.00171 ** 
## `Acidez variable`         -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## `Acido cítrico`            2.209e-02  9.577e-02   0.231  0.81759    
## `azúcar residual`          8.148e-02  7.527e-03  10.825  < 2e-16 ***
## Cloruros                  -2.473e-01  5.465e-01  -0.452  0.65097    
## `Dioxido de azufre libre`  3.733e-03  8.441e-04   4.422 9.99e-06 ***
## `Dioxido de azufre total` -2.857e-04  3.781e-04  -0.756  0.44979    
## Densidad                  -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## pH                         6.863e-01  1.054e-01   6.513 8.10e-11 ***
## Sulfatos                   6.315e-01  1.004e-01   6.291 3.44e-10 ***
## `alcohol (%)`              1.935e-01  2.422e-02   7.988 1.70e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

El procedimiento clásico nos da un R^2 = 28.03%

##armando la base
data_reducida=cbind(reduccion,data$calidad)
data_reducida=as.data.frame(data_reducida)
modelo_2=lm(V5 ~. , data=data_reducida)
summary(modelo_2)
## 
## Call:
## lm(formula = V5 ~ ., data = data_reducida)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9397 -0.5414 -0.0277  0.5148  3.3002 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.877909   0.011495 511.356  < 2e-16 ***
## PC1         -0.146506   0.006404 -22.877  < 2e-16 ***
## PC2          0.041009   0.009159   4.477 7.73e-06 ***
## PC3          0.174888   0.010401  16.815  < 2e-16 ***
## PC4         -0.167387   0.011391 -14.695  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8045 on 4893 degrees of freedom
## Multiple R-squared:  0.1756, Adjusted R-squared:  0.1749 
## F-statistic: 260.5 on 4 and 4893 DF,  p-value: < 2.2e-16

El procedimiento con ACP nos da un R^2 = 17.49%

En resumen Modelo R^2 Modelo_1 : 11 variables = 28.03% Modelo_2 : 4 Componentes = 17.49%