Funciones principales

dim()

Con la funcion dim() vemos las dimensiones del dataset

library(rattle.data)
dim(wine)
## [1] 178  14

summary()

Con la funciĂ³n summary() vemos el resumen estadĂ­stico del dataset.

summary(wine)
##  Type      Alcohol          Malic            Ash          Alcalinity   
##  1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
##  2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
##  3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50  
##         Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
##         3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
##         Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
##    Magnesium         Phenols        Flavanoids    Nonflavanoids   
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300  
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700  
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400  
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619  
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375  
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600  
##  Proanthocyanins     Color             Hue            Dilution    
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0

str()

Con el comando str(), vemos que tipos de datos componen el dataset, si son nĂºmeros, enteros o factores componen el dataset.

str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

glimpse()

Con la funciĂ³n glimpse() de dataset dplyr podemos ver quĂ© tipos de datos componen el dataset

library(dplyr)
glimpse(wine)
## Observations: 178
## Variables: 14
## $ Type            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Alcohol         <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 14...
## $ Malic           <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1.6...
## $ Ash             <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2.1...
## $ Alcalinity      <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 14....
## $ Magnesium       <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98, 105,...
## $ Phenols         <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2.8...
## $ Flavanoids      <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2.9...
## $ Nonflavanoids   <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0.2...
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1.9...
## $ Color           <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5.2...
## $ Hue             <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1.0...
## $ Dilution        <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2.8...
## $ Proline         <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 1045...

ggpairs()

La funciĂ³n ggpairs de GGally nos permite ver de un vistazo todas las variables y sus relaciones con las demĂ¡s en el mismo dataset.

library(GGally)
ggpairs(wine, aes(colour = Type, alpha = 0.4))

Estudio de las caracterĂ­sticas (features)

Vemos si hay correlaciones entre las variables

library(dplyr)
wine1<-select(wine, -Type)
library(corrplot)
var.corr<-cor(wine1)
corrplot(var.corr, method="circle", type="upper", order="hclust")

Para mĂ¡s informaciĂ³n consultar: http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram

AnĂ¡lisis de Componentes Principales

En primer lugar calculamos las componentes principales a lo largo de las 2 primeras componentes principales.

library(ade4)
library(factoextra)
res.pca<-dudi.pca(wine1, scannf = FALSE, nf=2)

Visualizar las componentes principales

Para mĂ¡s informaciĂ³n http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/119-pca-in-r-using-ade4-quick-scripts/

fviz_eig(res.pca)

Visualizar las variables

fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

Para realizar un biplot

fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                )