Analisis Exploratorio de Chocolate Bar Ratings

En la siguiente seccióon analizaremos cada atributo de la base de datos Chocolate Bar Ratings disponible en: https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings

library(DataExplorer)

Renombramos las columnas

colnames(data)<-c("Company",
                   "Origin",
                   "ValueREF",
                   "Review.Date",
                   "Cacao.Percent",
                   "Company.Location",
                   "Rating",
                   "Bean.Type",
                   "Bean.Origin")
datos<-data.frame(data)
#Convertir dato caracter a numerico
datos$Cacao.Percent<-as.numeric(gsub(pattern="%", replacement = "", x=datos$Cacao.Percent))
#Resumen de los datos
summary(datos)

##    Company             Origin             ValueREF     Review.Date  
##  Length:1795        Length:1795        Min.   :   5   Min.   :2006  
##  Class :character   Class :character   1st Qu.: 576   1st Qu.:2010  
##  Mode  :character   Mode  :character   Median :1069   Median :2013  
##                                        Mean   :1036   Mean   :2012  
##                                        3rd Qu.:1502   3rd Qu.:2015  
##                                        Max.   :1952   Max.   :2017  
##  Cacao.Percent   Company.Location       Rating       Bean.Type        
##  Min.   : 42.0   Length:1795        Min.   :1.000   Length:1795       
##  1st Qu.: 70.0   Class :character   1st Qu.:2.875   Class :character  
##  Median : 70.0   Mode  :character   Median :3.250   Mode  :character  
##  Mean   : 71.7                      Mean   :3.186                     
##  3rd Qu.: 75.0                      3rd Qu.:3.500                     
##  Max.   :100.0                      Max.   :5.000                     
##  Bean.Origin       
##  Length:1795       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Incluimos algunos graficos

##    Company             Origin             ValueREF     Review.Date  
##  Length:1795        Length:1795        Min.   :   5   Min.   :2006  
##  Class :character   Class :character   1st Qu.: 576   1st Qu.:2010  
##  Mode  :character   Mode  :character   Median :1069   Median :2013  
##                                        Mean   :1036   Mean   :2012  
##                                        3rd Qu.:1502   3rd Qu.:2015  
##                                        Max.   :1952   Max.   :2017  
##  Cacao.Percent   Company.Location       Rating       Bean.Type        
##  Min.   : 42.0   Length:1795        Min.   :1.000   Length:1795       
##  1st Qu.: 70.0   Class :character   1st Qu.:2.875   Class :character  
##  Median : 70.0   Mode  :character   Median :3.250   Mode  :character  
##  Mean   : 71.7                      Mean   :3.186                     
##  3rd Qu.: 75.0                      3rd Qu.:3.500                     
##  Max.   :100.0                      Max.   :5.000                     
##  Bean.Origin       
##  Length:1795       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

El gráfico indica que existen variables de tipo cuantitativas y cualitativas.

## [1] 1795    9

## [1] 9

## [1] 9

El análisis de la variable company, indica que es tipo de datos carácter, por tanto, el grafico asociado es Pie, se visualiza una diversidad de empresas. Asimismo, la variable Origin, es de tipo carácter también le corresponde el tipo de grafico Pie, se aprecia que uno de los mayores porcentajes del origen radica en Madagascar. El análisis de la variable ValueREF, indica que es tipo de dato numérico, por tanto, el grafico asociado es Histograma, se visualiza la mayor cantidad de datos presenta un valor de referencia de 1500. Asimismo, la variable Review.Date, es de tipo numérico también le corresponde el tipo de grafico histograma, se aprecia que con mayores datos que fueron revisados están entre 2014-2015, en tanto entre año 2007-2008 se realizaron el menor número de revisiones. El análisis de la variable Cacao.Percent, indica que es tipo de datos numérico, por tanto, el grafico asociado es Histograma, se visualiza que la mayor cantidad de datos presenta porcentajes entre 65% y 70%, y menor cantidad de datos con % menor a 50%. Se indica también que la variable Cocoa.Percent tenia asignado tipo de dato carácter, el cual se tuvo que convertir a tipo de dato numérico. Asimismo, la variable Company.Location, es de tipo carácter le corresponde el tipo de grafico Pie, se aprecia que la mayoría de empresas son de USA, luego un % son de Francia, otro % de Canadá, U.K e Italia. El análisis de la variable Rating, indica que es tipo de datos numérico, por tanto, el grafico asociado es Histograma, se visualiza que la mayor cantidad de datos presenta valoraciones entre 3 y 3.5 punto y menor cantidad de datos con valoraciones entre 1 a 1.5 puntos Asimismo, la variable Bean.Type, es de tipo carácter le corresponde el tipo de grafico Pie, se aprecia que la mayoría de variedad de utilizada, aun no tiene una tipificación definida, no obstante, las variedades más usadas son Trinitario y criollo seguido otras variedades en poca proporción. El análisis de la variable Bean.Origin, es de tipo carácter le corresponde el tipo de grafico Pie, se aprecia que el origen de la mayor variedad de cacao se obtiene de Venezuela, Ecuador, Perú, Madagascar y Republica Dominicana.

Filtrar algunos datos

library(gapminder)
library(dslabs)

## 
## Attaching package: 'dslabs'

## The following object is masked from 'package:gapminder':
## 
##     gapminder

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

names(datos)

## [1] "Company"          "Origin"           "ValueREF"         "Review.Date"     
## [5] "Cacao.Percent"    "Company.Location" "Rating"           "Bean.Type"       
## [9] "Bean.Origin"

# Chocolate según Rating superior a 4
datos %>%
    filter(datos$Rating >4)

##   Company        Origin ValueREF Review.Date Cacao.Percent Company.Location
## 1  Amedei         Chuao      111        2007            70            Italy
## 2  Amedei Toscano Black       40        2006            70            Italy
##   Rating  Bean.Type Bean.Origin
## 1      5 Trinitario   Venezuela
## 2      5      Blend

# Chocolate según Cacao.porcentaje > 85 y Rating mayor a la media
datos %>%
  filter(Cacao.Percent >85 & Rating > mean(Rating))

##         Company             Origin ValueREF Review.Date Cacao.Percent
## 1           Adi         Vanua Levu      705        2011            88
## 2         Bisou             Belize     1486        2015            86
## 3       C-Amaro            Ecuador     1185        2013           100
## 4           Kto             Belize     1426        2014            90
## 5          Ocho        PNG, Voodoo     1411        2014            88
## 6          Soma           Crazy 88     1077        2013            88
## 7  Szanto Tibor            Baracoa     1704        2015            88
## 8          Taza          Alto Beni      785        2011            87
## 9          TCHO      Peru- Ecuador      915        2012            99
## 10       Videri Dominican Republic     1211        2014            90
## 11       Videri               Dark     1117        2013            90
## 12       Zotter      El Ceibo Coop      879        2012            90
##    Company.Location Rating  Bean.Type                  Bean.Origin
## 1              Fiji   3.50 Trinitario                         Fiji
## 2            U.S.A.   3.25 Trinitario                       Belize
## 3             Italy   3.50                                 Ecuador
## 4            U.S.A.   3.25 Trinitario                       Belize
## 5       New Zealand   3.25                        Papua New Guinea
## 6            Canada   4.00            Guat., D.R., Peru, Mad., PNG
## 7           Hungary   3.25 Trinitario                         Cuba
## 8            U.S.A.   3.25                                 Bolivia
## 9            U.S.A.   3.25                           Peru, Ecuador
## 10           U.S.A.   3.25                      Dominican Republic
## 11           U.S.A.   3.75                  Central and S. America
## 12          Austria   3.25                                 Bolivia

#Graficar porcentaje de cacao según rating, y asignar el color segun porcentaje de cacao utilizado

cacao_vs_rating <-datos %>% # Se extrae solo los valores del % de cacao y la valoracion
  select (Cacao.Percent, Rating)

#Grafico Cacao.Percent vs Rating
par(mfrow = c(1, 1))
ggplot(cacao_vs_rating,
       aes(Cacao.Percent,
           Rating,
           color=Cacao.Percent))+
 geom_point(size=3, alpha = 0.8)

Se interpreta que el promedio de Rating que obtiene el Chocolate contiene Cacao entre 60% y 80%.

Analisis de Datos Faltantes

apply(X = is.na(datos), MARGIN = 2, FUN = mean)

##          Company           Origin         ValueREF      Review.Date 
##                0                0                0                0 
##    Cacao.Percent Company.Location           Rating        Bean.Type 
##                0                0                0                0 
##      Bean.Origin 
##                0

apply(X = is.na(cacao_vs_rating), MARGIN = 2, FUN = mean)

## Cacao.Percent        Rating 
##             0             0

#Usando plot missiing para ver si existe algun dato faltante

plot_missing(datos)

#A realizar la evaluacion de la existencia de datos faltantes se evidencia que no hay.
plot_histogram(datos)

plot_density(datos)

Usando Plot_missing, se evalua la existencia de datos faltantes, Observamos que no haya ningún valor perdido en este conjunto de datos.

Verificar Datos Outliers

par(mfrow = c(1,2))

boxplot(datos$ValueREF, col = "orange", horizontal = TRUE, main="Diagrama Cajas ValueREF")
boxplot(datos$Cacao.Percent, col = "orange", horizontal = TRUE, main="Diagrama de Cajas Porcentaje Cacao")

boxplot(datos$Review.Date, col = "orange", horizontal = TRUE, main="Diagrama de Cajas Review.Date")
boxplot(datos$Rating, col = "orange", horizontal = TRUE, main="Diagrama Cajas Rating")

min(datos$Cacao.Percent)

## [1] 42

mean(datos$Cacao.Percent)

## [1] 71.69833

median(datos$Cacao.Percent)

## [1] 70

max(datos$Cacao.Percent)

## [1] 100

min(datos$Rating)

## [1] 1

mean(datos$Rating)

## [1] 3.185933

median(datos$Rating)

## [1] 3.25

max(datos$Rating)

## [1] 5

No se visualiza valores atípicos en la variable ValueREF Asimismo, se visualiza que variable Cacao.Percent presenta posibles datos atípicos, los cuales al analizar se encuentra que son valores que oscilan cerca del mínimo (42%) y cerca del máximo (100%) siendo el promedio 71.69%, cuya mediana es 70%

No se visualiza valores atípicos en la variable Review.Date Asimismo, se visualiza que variable Rating presenta posibles datos atípicos, los cuales al analizar se encuentra que son valores que oscilan cerca del mínimo (1) y cerca del máximo (5) siendo el promedio 3.18, cuya mediana es 3.25.

par(mfrow = c(1,1))
## valores de los outliers

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

head(cacao_vs_rating)

##   Cacao.Percent Rating
## 1            63   3.75
## 2            70   2.75
## 3            70   3.00
## 4            70   3.50
## 5            70   3.50
## 6            70   2.75

attach(cacao_vs_rating)

a <- which(Cacao.Percent %in% boxplot.stats(Cacao.Percent)$out)  ## donde estan??
b <- which(Cacao.Percent %in% boxplot.stats(Rating)$out)  ## donde estan??

## uniendo outliers por interseccion de sus variables individuales

inter<- intersect(a,b)
inter

## integer(0)

plot(cacao_vs_rating)
points(cacao_vs_rating[inter, ],
       col = "red",
       cex = 1.5,
       pch = "+")

## uniendo outliers por grupo de sus variables individuales

uni <- union(a,b)
uni

##   [1]   26   28   30   45   46   47   48   49   50   51   53   54   66   67   90
##  [16]   91   93  126  166  174  176  183  185  195  246  250  261  262  267  279
##  [31]  325  326  331  343  344  345  379  393  397  417  430  433  438  445  449
##  [46]  450  460  468  474  505  507  519  527  549  554  558  569  612  615  616
##  [61]  618  632  633  635  636  644  645  654  664  668  676  707  708  728  743
##  [76]  756  759  760  770  771  780  782  785  801  815  816  829  832  835  836
##  [91]  837  867  886  887  893  894  895  897  902  907  932  933  938  939  942
## [106]  976  985  986  987  990 1014 1015 1032 1038 1050 1065 1115 1116 1121 1154
## [121] 1161 1169 1205 1209 1220 1221 1237 1238 1246 1249 1252 1271 1286 1292 1297
## [136] 1320 1347 1350 1351 1353 1360 1374 1404 1412 1416 1424 1437 1444 1462 1467
## [151] 1468 1499 1508 1529 1541 1575 1583 1590 1593 1596 1615 1616 1620 1623 1624
## [166] 1625 1658 1659 1676 1687 1695 1696 1714 1717 1721 1722 1737 1741 1742 1767
## [181] 1770 1771 1784 1787 1790 1794

plot(cacao_vs_rating)
points(cacao_vs_rating[uni, ],
       col = "blue",
       cex = 1.5,
       pch = "*")

Verificar outliers Multivaridados

library(DMwR2)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

datosmulti <- datos %>%
  select(ValueREF,
         Review.Date,
         Cacao.Percent,
         Rating)

puntajes<-lofactor(datosmulti, k=5) # Puntajes anomalos outlier.scores

par(mfrow = c(1,1))

outliers_multivariado<-order(puntajes, decreasing= TRUE)[1:6]
outliers_multivariado

## [1]  185  839  586  250 1624  126

datosmulti[outliers_multivariado,]

##      ValueREF Review.Date Cacao.Percent Rating
## 185       636        2011            88   3.00
## 839       623        2011            80   3.00
## 586       623        2011            72   3.75
## 250       341        2009           100   1.50
## 1624      915        2012            99   3.25
## 126       486        2010           100   1.75

n<-nrow(datosmulti)
etiquetas<-1:n
etiquetas[-outliers_multivariado]<-"."

# grafica de los principales componentes, indica outlier
biplot(prcomp(datosmulti), cex=1,xlabs=etiquetas)

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

# Ajustando la forma de la grafica para el pair
forma <- rep(".", n)
forma[outliers_multivariado]<- "+"

# Ajustando el color de la grafica para el pair
colorv<-rep("black", n)
colorv[outliers_multivariado] <- "blue"
pairs(datosmulti,
      col = colorv,
      pch = forma) # Grafico para analizar las correlaciones entre variables

Analisis de Cluster

datos1 <- datos

colnames(datos1)

## [1] "Company"          "Origin"           "ValueREF"         "Review.Date"     
## [5] "Cacao.Percent"    "Company.Location" "Rating"           "Bean.Type"       
## [9] "Bean.Origin"

# la funcion kmeans(base de datos numerica ,  el numero de cluster)

(kmeans.result <- kmeans(datosmulti,5))

## K-means clustering with 5 clusters of sizes 381, 356, 340, 354, 364
## 
## Cluster means:
##    ValueREF Review.Date Cacao.Percent   Rating
## 1 1417.0709    2014.451      71.92257 3.223097
## 2 1774.6629    2015.750      71.71629 3.245084
## 3  660.6676    2010.765      71.33088 3.180882
## 4  232.4237    2007.692      71.39124 3.085452
## 5 1046.3187    2012.714      72.08791 3.191621
## 
## Clustering vector:
##    [1] 2 2 2 2 2 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 3 3 1 1 3 3 3 3 4 4 4 3 3 3 5 5
##   [38] 5 5 5 2 2 2 2 2 2 2 2 2 5 5 5 5 5 3 3 3 3 4 4 4 4 4 4 4 3 3 5 5 1 1 1 1 1
##   [75] 1 5 4 4 4 4 4 4 4 4 4 4 4 4 5 3 3 3 3 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
##  [112] 2 1 1 1 2 2 2 2 2 2 5 5 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 2 3 3 4 4 4 1 1 5 5
##  [149] 5 1 1 1 1 5 5 5 5 2 2 2 2 2 2 2 2 2 3 3 3 3 2 2 2 2 2 2 5 5 5 5 3 3 3 3 3
##  [186] 3 3 3 3 3 3 1 1 1 1 5 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 1 5 4 2 2 2 2 5 5 2
##  [223] 1 1 1 5 5 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 2 2 1 1 1 5 5
##  [260] 5 5 1 1 1 1 1 1 5 5 5 5 1 2 5 1 1 1 1 1 2 5 3 3 3 2 3 4 4 4 2 2 1 1 1 1 2
##  [297] 2 2 2 1 1 1 1 2 5 3 3 3 2 3 3 3 3 3 3 4 4 2 2 2 5 4 4 4 4 4 4 1 1 1 5 3 3
##  [334] 3 2 2 2 2 2 4 4 4 4 4 4 1 1 1 1 1 3 3 2 2 2 2 2 2 1 1 1 1 1 1 5 5 5 5 5 5
##  [371] 5 1 1 1 1 4 1 2 3 4 1 3 1 1 1 2 2 1 1 1 3 3 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
##  [408] 5 5 5 5 5 1 3 3 3 4 4 4 4 4 4 1 1 4 4 4 4 4 5 2 3 4 3 3 4 4 4 1 2 3 4 4 4
##  [445] 4 2 2 5 3 3 3 5 3 3 3 3 3 3 4 4 3 3 4 4 4 4 1 1 5 5 5 5 5 2 2 2 3 2 2 2 2
##  [482] 2 2 1 5 1 1 1 5 5 5 5 3 3 5 3 3 3 5 5 5 5 5 5 3 4 4 2 2 2 2 1 1 1 1 5 5 5
##  [519] 5 1 1 2 2 4 4 4 4 4 4 4 4 2 4 4 4 2 2 2 1 5 5 5 5 3 3 3 3 3 2 1 1 1 4 4 4
##  [556] 2 2 5 5 5 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2 2 1 1 5 3 3 3 3 3 3 3
##  [593] 3 3 1 3 2 2 2 2 2 1 1 1 2 2 5 2 3 4 2 4 4 4 4 4 4 5 5 1 1 2 3 5 5 5 5 4 4
##  [630] 4 4 3 3 3 3 3 1 1 1 3 2 2 2 3 4 4 4 4 4 4 1 1 1 1 1 2 2 2 2 2 1 1 5 5 1 1
##  [667] 1 1 1 2 2 5 1 5 5 5 5 3 3 1 1 1 1 5 5 5 5 5 5 5 5 5 3 3 3 3 3 3 3 3 3 4 4
##  [704] 4 4 2 2 2 1 5 5 5 5 5 5 3 3 3 2 2 1 1 5 5 5 3 3 1 2 2 2 2 2 2 2 2 2 2 3 4
##  [741] 2 3 4 4 4 4 4 4 4 2 5 5 5 5 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5
##  [778] 5 5 5 4 1 1 3 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 4 4 4
##  [815] 4 4 2 1 1 1 1 4 4 4 4 1 2 5 5 5 5 5 5 5 5 5 5 3 3 3 3 3 3 3 4 4 4 4 1 1 1
##  [852] 1 1 1 5 5 5 3 3 3 3 3 3 3 3 3 2 3 5 1 5 5 1 1 1 1 1 1 4 1 1 1 1 5 5 5 5 3
##  [889] 4 4 4 4 4 4 4 4 4 4 4 5 4 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 5 5 5 3 3 3 3 4
##  [926] 3 2 2 2 2 5 5 5 5 3 4 4 4 4 4 3 3 1 1 1 1 1 2 2 2 1 1 5 4 2 3 3 3 2 2 2 1
##  [963] 1 1 1 1 1 5 2 5 5 5 5 5 5 4 5 1 1 1 5 5 5 1 3 3 3 3 3 3 4 4 4 4 2 5 5 5 5
## [1000] 5 5 5 3 3 3 4 4 4 4 3 4 4 5 3 4 2 1 5 5 3 1 1 1 5 5 5 5 5 2 2 1 1 5 5 5 2
## [1037] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 5 3 3 3 3 5 5 3 3 3 3 5 5 2 2 2 1 1 1 5 5
## [1074] 5 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5 1 1 1 1 2 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1
## [1111] 2 5 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 5 5 5 5 5 5 3 3 3 2 2 2 5 5 5 3 2 2 2 3
## [1148] 3 3 2 1 2 2 2 1 1 2 2 2 2 5 5 5 5 5 1 1 1 5 1 1 1 1 3 4 4 4 4 4 1 1 1 1 5
## [1185] 5 1 1 1 3 3 5 5 5 1 1 1 1 1 2 3 2 2 1 1 2 2 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3
## [1222] 3 2 1 1 5 5 5 5 5 1 1 3 4 4 4 4 4 5 5 2 1 1 1 1 3 5 3 4 4 4 4 4 2 2 1 1 1
## [1259] 1 1 1 1 1 1 2 1 2 1 1 1 5 5 3 4 4 4 4 5 1 1 1 2 3 3 3 3 4 4 4 4 4 4 4 4 4
## [1296] 2 1 1 1 1 1 1 1 1 1 3 3 2 1 3 3 3 3 3 1 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1333] 4 4 4 4 4 4 4 2 1 1 2 5 5 5 5 5 5 5 5 5 5 5 2 2 2 2 5 3 3 3 1 1 1 2 5 3 4
## [1370] 4 4 4 2 1 1 5 5 5 5 5 3 2 2 2 5 5 5 2 1 1 5 5 5 5 5 5 3 3 4 4 4 4 4 3 5 5
## [1407] 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 3 5 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4
## [1444] 4 2 2 1 1 1 1 1 1 3 3 2 3 1 1 5 3 3 3 3 2 2 2 1 1 2 2 2 1 1 1 1 1 2 2 2 2
## [1481] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 2 1 2 2 2 2 1 5 5 5 5 5 2 2 2 2 2 2 2 1 1 5
## [1518] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 3 3 5 5 5 5 5 5 3 3 3 3 3 3 3 4 4 4 4 4 4
## [1555] 1 1 2 2 2 2 2 2 3 3 3 2 2 2 2 2 2 1 5 5 5 2 2 2 2 2 5 3 1 1 5 5 5 5 5 2 2
## [1592] 2 2 2 2 2 2 2 2 2 1 1 5 5 5 5 5 5 3 5 5 5 5 1 1 1 2 2 1 3 3 4 2 5 4 4 4 4
## [1629] 4 4 3 3 3 3 3 3 3 3 3 3 5 5 5 5 1 1 1 1 1 1 1 1 1 2 5 4 4 4 4 4 2 2 1 5 5
## [1666] 3 3 1 3 2 2 1 5 5 3 2 3 1 1 2 2 1 1 5 5 5 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1703] 4 5 5 4 4 4 4 4 4 1 1 5 5 5 5 5 5 5 4 4 4 4 1 1 1 1 1 2 2 2 2 2 3 5 5 5 5
## [1740] 2 5 5 3 3 3 3 3 2 2 2 1 5 5 3 3 5 3 3 5 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3
## [1777] 3 3 5 3 3 5 5 5 5 5 5 5 5 5 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 4025518 3797782 4406056 5795094 4349589
##  (between_SS / total_SS =  95.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

table(datosmulti$Cacao.Percent, kmeans.result$cluster)

##       
##          1   2   3   4   5
##   42     0   0   0   0   1
##   46     0   0   1   0   0
##   50     0   0   1   0   0
##   53     0   0   0   1   0
##   55     1   0   6   6   3
##   56     0   0   1   0   1
##   57     1   0   0   0   0
##   58     0   0   2   3   3
##   60     7   9   6  13   8
##   60.5   0   0   0   1   0
##   61     0   2   1   4   1
##   62     3   2   3   4   2
##   63     4   2   2   3   1
##   64     3   1   9  19   2
##   65    10  12  19  19  18
##   66     3   2   5   7   6
##   67     6   5   5   6   5
##   68     9   8  10   9  11
##   69     2   0   2   1   5
##   70   163 180  97  94 138
##   71     0   3  14  11   3
##   72    45  19  55  32  38
##   72.5   4   0   0   0   0
##   73    12  13   5   3   7
##   73.5   1   0   1   0   0
##   74    11   9  11   4  15
##   75    37  37  38  64  46
##   76     6   7   6   2   2
##   77     9  10   4   8   2
##   78     3   6   4   1   3
##   79     1   0   0   0   0
##   80    24  12   9  13  14
##   81     1   0   2   1   1
##   82     4   2   4   2   5
##   83     1   2   1   0   0
##   84     1   0   1   1   1
##   85     4   9   9   9   5
##   86     1   0   0   0   0
##   87     0   0   1   0   0
##   88     1   1   2   2   2
##   89     0   0   0   0   2
##   90     1   2   0   1   4
##   91     0   0   0   3   0
##   99     0   0   0   1   1
##   100    2   1   3   6   8

table(datosmulti$Rating, kmeans.result$cluster)

##       
##          1   2   3   4   5
##   1      0   0   0   4   0
##   1.5    0   0   4   5   1
##   1.75   0   0   1   1   1
##   2      1   1   5  22   3
##   2.25   5   2   3   0   4
##   2.5   26  18  21  29  33
##   2.75  50  59  53  41  56
##   3     63  54  65 107  52
##   3.25  78  68  57  21  79
##   3.5  100  90  69  61  72
##   3.75  49  50  44  28  39
##   4      9  14  18  33  24
##   5      0   0   0   2   0

library(fpc)
(pamk.result <- pamk(datosmulti))

## $pamobject
## Medoids:
##        ID ValueREF Review.Date Cacao.Percent Rating
## [1,] 1712     1470        2015            70   3.25
## [2,]  313      537        2010            71   3.25
## Clustering vector:
##    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1
##   [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
##   [75] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [112] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2
##  [149] 2 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
##  [186] 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 1
##  [223] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
##  [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1
##  [297] 1 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2
##  [334] 2 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [371] 1 1 1 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [408] 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2
##  [445] 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 1 1
##  [482] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
##  [519] 2 1 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 2 2 2
##  [556] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
##  [593] 2 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 2 2 2 2 2 2 1 1 1 1 1 2 1 1 1 2 2 2
##  [630] 2 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1
##  [667] 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [704] 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 2
##  [741] 1 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
##  [778] 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
##  [815] 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1
##  [852] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2
##  [889] 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
##  [926] 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 1 1 1 1
##  [963] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
## [1000] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1
## [1037] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2
## [1074] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1
## [1111] 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 1 1 1 2 1 1 2 1 1 1 2
## [1148] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1
## [1185] 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [1222] 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1
## [1259] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1296] 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1333] 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2
## [1370] 2 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## [1407] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1444] 2 1 1 1 1 1 1 1 1 2 2 1 2 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1481] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1518] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1555] 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 1 1 1 1
## [1592] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 2 2
## [1629] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 2 2
## [1666] 2 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1703] 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
## [1740] 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [1777] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Objective function:
##    build     swap 
## 308.7768 241.7852 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## 
## $nc
## [1] 2
## 
## $crit
##  [1] 0.0000000 0.6225388 0.5832990 0.5690972 0.5618299 0.5501934 0.5487890
##  [8] 0.5417501 0.5408763 0.5371833

# numero de clusters
pamk.result$nc

## [1] 2

df <- scale(datosmulti) # Scaling the data

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(ggplot2)

fviz_nbclust(df, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)

Se interpreta que según el grafico se debe considerar 4 clústeres, no obstante, el proceso estadístico indica que se debe considerar 2 Clústeres

Arboles de Clasificación

library(ggplot2)
library(lattice)
library(caret)
library(rpart)
library(rpart.plot)
library(dplyr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ data.table::between() masks dplyr::between()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ data.table::first()   masks dplyr::first()
## ✖ lubridate::hour()     masks data.table::hour()
## ✖ lubridate::isoweek()  masks data.table::isoweek()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ data.table::last()    masks dplyr::last()
## ✖ purrr::lift()         masks caret::lift()
## ✖ lubridate::mday()     masks data.table::mday()
## ✖ lubridate::minute()   masks data.table::minute()
## ✖ lubridate::month()    masks data.table::month()
## ✖ lubridate::quarter()  masks data.table::quarter()
## ✖ lubridate::second()   masks data.table::second()
## ✖ Hmisc::src()          masks dplyr::src()
## ✖ Hmisc::summarize()    masks dplyr::summarize()
## ✖ purrr::transpose()    masks data.table::transpose()
## ✖ lubridate::wday()     masks data.table::wday()
## ✖ lubridate::week()     masks data.table::week()
## ✖ lubridate::yday()     masks data.table::yday()
## ✖ lubridate::year()     masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)

oj<-datos
str(oj)

## 'data.frame':    1795 obs. of  9 variables:
##  $ Company         : chr  "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
##  $ Origin          : chr  "Agua Grande" "Kpime" "Atsane" "Akata" ...
##  $ ValueREF        : int  1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
##  $ Review.Date     : int  2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
##  $ Cacao.Percent   : num  63 70 70 70 70 70 70 70 70 70 ...
##  $ Company.Location: chr  "France" "France" "France" "France" ...
##  $ Rating          : num  3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
##  $ Bean.Type       : chr  " " " " " " " " ...
##  $ Bean.Origin     : chr  "Sao Tome" "Togo" "Togo" "Togo" ...

skim_to_wide(oj)

## Warning: 'skim_to_wide' is deprecated.
## Use 'skim()' instead.
## See help("Deprecated")

Data summary
Name	Piped data
Number of rows	1795
Number of columns	9
_______________________
Column type frequency:
character	5
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
Company	1	2	39	0	416
Origin	1	3	45	0	1039
Company.Location	1	4	17	0	60
Bean.Type	1	0	24	1	42
Bean.Origin	1	0	29	1	101

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ValueREF	1	1035.90	552.89	5	576.00	1069.00	1502.0	1952	▆▆▇▇▇
Review.Date	1	2012.33	2.93	2006	2010.00	2013.00	2015.0	2017	▃▃▆▆▇
Cacao.Percent	1	71.70	6.32	42	70.00	70.00	75.0	100	▁▁▇▁▁
Rating	1	3.19	0.48	1	2.88	3.25	3.5	5	▁▂▇▆▁

skim(oj)

Data summary
Name	oj
Number of rows	1795
Number of columns	9
_______________________
Column type frequency:
character	5
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
Company	1	2	39	0	416
Origin	1	3	45	0	1039
Company.Location	1	4	17	0	60
Bean.Type	1	0	24	1	42
Bean.Origin	1	0	29	1	101

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ValueREF	1	1035.90	552.89	5	576.00	1069.00	1502.0	1952	▆▆▇▇▇
Review.Date	1	2012.33	2.93	2006	2010.00	2013.00	2015.0	2017	▃▃▆▆▇
Cacao.Percent	1	71.70	6.32	42	70.00	70.00	75.0	100	▁▁▇▁▁
Rating	1	3.19	0.48	1	2.88	3.25	3.5	5	▁▂▇▆▁

library(summarytools)

## 
## Attaching package: 'summarytools'
## 
## The following object is masked from 'package:tibble':
## 
##     view
## 
## The following objects are masked from 'package:Hmisc':
## 
##     label, label<-

dfSummary(oj)

## Data Frame Summary  
## oj  
## Dimensions: 1795 x 9  
## Duplicates: 0  
## 
## ------------------------------------------------------------------------------------------------------------------------
## No   Variable           Stats / Values                  Freqs (% of Valid)    Graph                 Valid      Missing  
## ---- ------------------ ------------------------------- --------------------- --------------------- ---------- ---------
## 1    Company            1. Soma                           47 ( 2.6%)                                1795       0        
##      [character]        2. Bonnat                         27 ( 1.5%)                                (100.0%)   (0.0%)   
##                         3. Fresco                         26 ( 1.4%)                                                    
##                         4. Pralus                         25 ( 1.4%)                                                    
##                         5. A. Morin                       23 ( 1.3%)                                                    
##                         6. Arete                          22 ( 1.2%)                                                    
##                         7. Domori                         22 ( 1.2%)                                                    
##                         8. Guittard                       22 ( 1.2%)                                                    
##                         9. Valrhona                       21 ( 1.2%)                                                    
##                         10. Hotel Chocolat (Coppeneur     19 ( 1.1%)                                                    
##                         [ 406 others ]                  1541 (85.8%)          IIIIIIIIIIIIIIIII                         
## 
## 2    Origin             1. Madagascar                     57 ( 3.2%)                                1795       0        
##      [character]        2. Peru                           45 ( 2.5%)                                (100.0%)   (0.0%)   
##                         3. Ecuador                        42 ( 2.3%)                                                    
##                         4. Dominican Republic             37 ( 2.1%)                                                    
##                         5. Venezuela                      21 ( 1.2%)                                                    
##                         6. Chuao                          19 ( 1.1%)                                                    
##                         7. Sambirano                      19 ( 1.1%)                                                    
##                         8. Ocumare                        17 ( 0.9%)                                                    
##                         9. Ghana                          15 ( 0.8%)                                                    
##                         10. Papua New Guinea              15 ( 0.8%)                                                    
##                         [ 1029 others ]                 1508 (84.0%)          IIIIIIIIIIIIIIII                          
## 
## 3    ValueREF           Mean (sd) : 1035.9 (552.9)      440 distinct values             . . : . .   1795       0        
##      [integer]          min < med < max:                                      . . . : : : : : : :   (100.0%)   (0.0%)   
##                         5 < 1069 < 1952                                       : : : : : : : : : :                       
##                         IQR (CV) : 926 (0.5)                                  : : : : : : : : : :                       
##                                                                               : : : : : : : : : :                       
## 
## 4    Review.Date        Mean (sd) : 2012.3 (2.9)        12 distinct values                  . : .   1795       0        
##      [integer]          min < med < max:                                                .   : : :   (100.0%)   (0.0%)   
##                         2006 < 2013 < 2017                                    .       : : : : : :                       
##                         IQR (CV) : 5 (0)                                      : . : : : : : : : :                       
##                                                                               : : : : : : : : : :                       
## 
## 5    Cacao.Percent      Mean (sd) : 71.7 (6.3)          45 distinct values            :             1795       0        
##      [numeric]          min < med < max:                                              : .           (100.0%)   (0.0%)   
##                         42 < 70 < 100                                                 : :                               
##                         IQR (CV) : 5 (0.1)                                            : :                               
##                                                                                     : : : : .                           
## 
## 6    Company.Location   1. U.S.A.                       764 (42.6%)           IIIIIIII              1795       0        
##      [character]        2. France                       156 ( 8.7%)           I                     (100.0%)   (0.0%)   
##                         3. Canada                       125 ( 7.0%)           I                                         
##                         4. U.K.                          96 ( 5.3%)           I                                         
##                         5. Italy                         63 ( 3.5%)                                                     
##                         6. Ecuador                       54 ( 3.0%)                                                     
##                         7. Australia                     49 ( 2.7%)                                                     
##                         8. Belgium                       40 ( 2.2%)                                                     
##                         9. Switzerland                   38 ( 2.1%)                                                     
##                         10. Germany                      35 ( 1.9%)                                                     
##                         [ 50 others ]                   375 (20.9%)           IIII                                      
## 
## 7    Rating             Mean (sd) : 3.2 (0.5)           13 distinct values          . :             1795       0        
##      [numeric]          min < med < max:                                            : :             (100.0%)   (0.0%)   
##                         1 < 3.2 < 5                                                 : :                                 
##                         IQR (CV) : 0.6 (0.2)                                        : : :                               
##                                                                                 . : : : :                               
## 
## 8    Bean.Type          1.                              887 (49.4%)           IIIIIIIII             1795       0        
##      [character]        2. Trinitario                   419 (23.3%)           IIII                  (100.0%)   (0.0%)   
##                         3. Criollo                      153 ( 8.5%)           I                                         
##                         4. Forastero                     87 ( 4.8%)                                                     
##                         5. Forastero (Nacional)          52 ( 2.9%)                                                     
##                         6. Blend                         41 ( 2.3%)                                                     
##                         7. Criollo, Trinitario           39 ( 2.2%)                                                     
##                         8. Forastero (Arriba)            37 ( 2.1%)                                                     
##                         9. Criollo (Porcelana)           10 ( 0.6%)                                                     
##                         10. Trinitario, Criollo           9 ( 0.5%)                                                     
##                         [ 32 others ]                    61 ( 3.4%)                                                     
## 
## 9    Bean.Origin        1. Venezuela                    214 (11.9%)           II                    1795       0        
##      [character]        2. Ecuador                      193 (10.8%)           II                    (100.0%)   (0.0%)   
##                         3. Peru                         165 ( 9.2%)           I                                         
##                         4. Madagascar                   145 ( 8.1%)           I                                         
##                         5. Dominican Republic           141 ( 7.9%)           I                                         
##                         6.                               73 ( 4.1%)                                                     
##                         7. Nicaragua                     60 ( 3.3%)                                                     
##                         8. Brazil                        58 ( 3.2%)                                                     
##                         9. Bolivia                       57 ( 3.2%)                                                     
##                         10. Belize                       49 ( 2.7%)                                                     
##                         [ 91 others ]                   640 (35.7%)           IIIIIII                                   
## ------------------------------------------------------------------------------------------------------------------------

summary(oj)

##    Company             Origin             ValueREF     Review.Date  
##  Length:1795        Length:1795        Min.   :   5   Min.   :2006  
##  Class :character   Class :character   1st Qu.: 576   1st Qu.:2010  
##  Mode  :character   Mode  :character   Median :1069   Median :2013  
##                                        Mean   :1036   Mean   :2012  
##                                        3rd Qu.:1502   3rd Qu.:2015  
##                                        Max.   :1952   Max.   :2017  
##  Cacao.Percent   Company.Location       Rating       Bean.Type        
##  Min.   : 42.0   Length:1795        Min.   :1.000   Length:1795       
##  1st Qu.: 70.0   Class :character   1st Qu.:2.875   Class :character  
##  Median : 70.0   Mode  :character   Median :3.250   Mode  :character  
##  Mean   : 71.7                      Mean   :3.186                     
##  3rd Qu.: 75.0                      3rd Qu.:3.500                     
##  Max.   :100.0                      Max.   :5.000                     
##  Bean.Origin       
##  Length:1795       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

#set.seed(12345)

partition<-createDataPartition(y=oj$Rating, p=0.8, list = FALSE)

# Total de filas
n<-nrow(oj)

#numero de filas
n_train <-round(0.8*n)

#create un vector de indices aleatorios
train_indices<- sample (1:n, n_train)

#subconjunto del cojunto OJ de los indices
train <-oj[train_indices,]

#data testeo
test <-oj[-train_indices,]

#model chi
oj_model1<-rpart(formula = Rating ~ Cacao.Percent,
                 data = train, method = "class")

oj_model2<-rpart(formula = Rating ~ Cacao.Percent + Review.Date,
                 data = train, method = "class")

oj_model3<-rpart(formula = Rating ~ Cacao.Percent + Review.Date+ValueREF,
                 data = train, method = "class")

rpart.plot(oj_model1, yesno = TRUE)

rpart.plot(oj_model2, yesno = TRUE)
rpart.plot(oj_model3, yesno = TRUE)

printcp(oj_model1)

## 
## Classification tree:
## rpart(formula = Rating ~ Cacao.Percent, data = train, method = "class")
## 
## Variables actually used in tree construction:
## [1] Cacao.Percent
## 
## Root node error: 1125/1436 = 0.78343
## 
## n= 1436 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.013333      0   1.00000 1.0000 0.013875
## 2 0.010000      2   0.97333 1.0098 0.013694

plotcp(oj_model1)

oj_opt <- prune(oj_model1,cp=oj_model1$cptable[which.min(oj_model1$cptable["xerror"]), "CP"])

## PARA REMOVER rm(oj_opt)
rpart.plot(oj_opt, yesno = TRUE)

oj_pred <- predict(oj_model1, test,  type= "class")
oj_pred

##   13   23   31   38   64   68   73   75   76   84   85   86   88   91  101  103 
##  3.5    3  3.5  3.5  3.5  3.5    3  3.5  3.5  3.5  3.5    3  3.5    3  3.5  3.5 
##  105  106  108  111  120  128  135  137  143  146  147  153  155  157  168  170 
##  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5 2.75    3  3.5 
##  175  176  179  182  183  189  190  192  198  200  205  208  216  219  223  225 
## 2.75 2.75  3.5  3.5    3  3.5    3  3.5  3.5  3.5  3.5  3.5 2.75  3.5  3.5  3.5 
##  226  227  228  243  247  249  254  259  273  282  288  304  305  314  329  332 
##  3.5  3.5  3.5  3.5  3.5 2.75  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5 2.75 
##  335  339  354  356  360  361  366  367  381  389  393  396  403  404  405  410 
##  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5 2.75  3.5  3.5  3.5 2.75  3.5 
##  411  421  422  425  430  440  449  452  454  455  460  466  476  477  481  484 
##  3.5  3.5  3.5 2.75    3  3.5 2.75  3.5  3.5  3.5    3  3.5  3.5  3.5  3.5  3.5 
##  486  494  499  501  503  504  507  509  511  519  531  538  539  544  549  550 
##  3.5  3.5 2.75  3.5  3.5  3.5    3  3.5  3.5    3  3.5  3.5  3.5 2.75 2.75  3.5 
##  554  555  562  564  565  570  572  579  587  591  592  594  600  604  608  615 
## 2.75  3.5  3.5  3.5 2.75  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5    3 
##  621  624  628  631  633  641  645  647  648  654  657  659  663  664  679  681 
## 2.75  3.5  3.5  3.5    3  3.5    3    3  3.5 2.75 2.75  3.5  3.5    3  3.5  3.5 
##  687  702  709  722  734  755  757  768  776  781  787  795  798  800  815  821 
##  3.5 2.75  3.5  3.5  3.5  3.5  3.5  3.5  3.5 2.75 2.75  3.5 2.75  3.5    3  3.5 
##  824  833  837  844  848  851  867  869  876  881  883  886  887  891  898  899 
##  3.5  3.5 2.75  3.5  3.5  3.5    3  3.5  3.5  3.5  3.5    3    3  3.5  3.5 2.75 
##  902  919  921  925  928  930  940  941  944  945  952  962  969  974  977  983 
##    3  3.5  3.5    3  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5 
##  987  998 1020 1029 1038 1039 1047 1052 1056 1059 1061 1064 1065 1070 1075 1087 
##    3  3.5  3.5  3.5    3  3.5  3.5  3.5  3.5  3.5  3.5  3.5 2.75  3.5  3.5  3.5 
## 1090 1095 1097 1101 1102 1116 1120 1137 1142 1143 1149 1156 1160 1162 1165 1169 
##  3.5  3.5  3.5 2.75  3.5 2.75    3  3.5  3.5  3.5  3.5 2.75 2.75  3.5  3.5 2.75 
## 1172 1183 1187 1191 1196 1206 1208 1219 1227 1237 1240 1244 1245 1255 1259 1262 
##  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5    3  3.5  3.5  3.5  3.5  3.5  3.5 
## 1263 1266 1271 1283 1285 1286 1294 1299 1304 1305 1310 1315 1318 1319 1321 1327 
##  3.5  3.5    3  3.5 2.75 2.75  3.5 2.75  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5 
## 1329 1330 1333 1334 1338 1343 1353 1354 1356 1357 1358 1361 1370 1375 1379 1382 
##  3.5  3.5  3.5  3.5  3.5 2.75    3  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5 
## 1383 1384 1386 1387 1390 1392 1394 1398 1399 1408 1409 1411 1426 1440 1441 1443 
##  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5    3  3.5 2.75  3.5  3.5 
## 1446 1448 1456 1459 1460 1463 1469 1472 1476 1478 1486 1489 1495 1496 1506 1511 
##  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5  3.5 
## 1524 1528 1533 1537 1541 1546 1554 1557 1563 1564 1572 1573 1576 1579 1582 1584 
##  3.5  3.5  3.5  3.5 2.75  3.5  3.5  3.5  3.5 2.75  3.5  3.5  3.5  3.5  3.5  3.5 
## 1588 1590 1592 1608 1613 1620 1625 1633 1634 1637 1638 1643 1651 1654 1657 1662 
##  3.5    3  3.5  3.5  3.5 2.75    3  3.5  3.5  3.5  3.5  3.5 2.75  3.5  3.5  3.5 
## 1665 1669 1671 1682 1695 1696 1703 1710 1712 1713 1733 1735 1737 1742 1745 1750 
## 2.75  3.5  3.5  3.5 2.75 2.75    3  3.5  3.5  3.5  3.5  3.5 2.75 2.75  3.5  3.5 
## 1761 1764 1765 1767 1774 1778 1790 
##  3.5  3.5  3.5    3 2.75  3.5    3 
## Levels: 1 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4 5

plot(test$Rating, oj_pred, main="Prediccion Vs Real")

rpart.plot(oj_model1)

plot(oj_model1)

plot(varImp(oj_opt))

Arboles de Regresión

carseats_dat <- datosmulti

skim(carseats_dat)

Data summary
Name	carseats_dat
Number of rows	1795
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ValueREF	1	1035.90	552.89	5	576.00	1069.00	1502.0	1952	▆▆▇▇▇
Review.Date	1	2012.33	2.93	2006	2010.00	2013.00	2015.0	2017	▃▃▆▆▇
Cacao.Percent	1	71.70	6.32	42	70.00	70.00	75.0	100	▁▁▇▁▁
Rating	1	3.19	0.48	1	2.88	3.25	3.5	5	▁▂▇▆▁

dfSummary(carseats_dat)

## Data Frame Summary  
## carseats_dat  
## Dimensions: 1795 x 4  
## Duplicates: 168  
## 
## ------------------------------------------------------------------------------------------------------------------
## No   Variable        Stats / Values               Freqs (% of Valid)    Graph                 Valid      Missing  
## ---- --------------- ---------------------------- --------------------- --------------------- ---------- ---------
## 1    ValueREF        Mean (sd) : 1035.9 (552.9)   440 distinct values             . . : . .   1795       0        
##      [integer]       min < med < max:                                   . . . : : : : : : :   (100.0%)   (0.0%)   
##                      5 < 1069 < 1952                                    : : : : : : : : : :                       
##                      IQR (CV) : 926 (0.5)                               : : : : : : : : : :                       
##                                                                         : : : : : : : : : :                       
## 
## 2    Review.Date     Mean (sd) : 2012.3 (2.9)     12 distinct values                  . : .   1795       0        
##      [integer]       min < med < max:                                             .   : : :   (100.0%)   (0.0%)   
##                      2006 < 2013 < 2017                                 .       : : : : : :                       
##                      IQR (CV) : 5 (0)                                   : . : : : : : : : :                       
##                                                                         : : : : : : : : : :                       
## 
## 3    Cacao.Percent   Mean (sd) : 71.7 (6.3)       45 distinct values            :             1795       0        
##      [numeric]       min < med < max:                                           : .           (100.0%)   (0.0%)   
##                      42 < 70 < 100                                              : :                               
##                      IQR (CV) : 5 (0.1)                                         : :                               
##                                                                               : : : : .                           
## 
## 4    Rating          Mean (sd) : 3.2 (0.5)        13 distinct values          . :             1795       0        
##      [numeric]       min < med < max:                                         : :             (100.0%)   (0.0%)   
##                      1 < 3.2 < 5                                              : :                                 
##                      IQR (CV) : 0.6 (0.2)                                     : : :                               
##                                                                           . : : : :                               
## ------------------------------------------------------------------------------------------------------------------

partition1<-createDataPartition(y=carseats_dat$Rating, p=0.8, list = FALSE)

# Total de filas
n1<-nrow(carseats_dat)

#numero de filas
n_train1 <-round(0.8*n1)

#create un vector de indices aleatorios
train_indices1<- sample (1:n1, n_train1)

#subconjunto del cojunto OJ de los indices
train1 <-carseats_dat[train_indices1,]

#data testeo
test1 <-carseats_dat[-train_indices1,]

nrow(test1)

## [1] 359

#model chi
car_model<-rpart(formula = Rating ~ .,
                 data = train1, method = "anova")

rpart.plot(car_model)

printcp(car_model)

## 
## Regression tree:
## rpart(formula = Rating ~ ., data = train1, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Cacao.Percent
## 
## Root node error: 330.8/1436 = 0.23036
## 
## n= 1436 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.054285      0   1.00000 1.00145 0.046383
## 2 0.024825      1   0.94571 0.95626 0.041817
## 3 0.011226      2   0.92089 0.93417 0.041450
## 4 0.010000      3   0.90966 0.93781 0.041827

plotcp(car_model)

car_model2 <- prune(car_model,cp=0.018346)
rpart.plot(car_model2)

car_predict <- predict(car_model2, test1,  type= "vector")
car_predict

##        6       15       16       31       33       36       44       53 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##       55       56       61       62       65       67       69       71 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 3.228595 
##       72       88      103      112      120      124      131      134 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 
##      138      145      148      150      160      161      172      174 
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 
##      177      178      182      183      188      195      197      203 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 3.228595 
##      206      208      210      211      218      220      221      223 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##      240      241      250      251      253      259      260      263 
## 3.228595 3.228595 2.194444 3.228595 3.228595 3.228595 3.228595 3.000000 
##      264      266      267      280      291      294      302      304 
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##      322      326      331      339      346      366      369      370 
## 3.228595 3.228595 2.194444 3.228595 3.228595 3.228595 3.228595 3.228595 
##      372      390      394      398      407      409      412      413 
## 3.228595 3.000000 3.000000 3.228595 3.000000 3.000000 3.228595 3.000000 
##      417      421      423      439      441      447      448      449 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 3.000000 
##      459      470      474      478      481      490      495      496 
## 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 
##      506      507      518      520      523      524      526      531 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##      538      568      577      581      582      583      584      588 
## 3.228595 3.228595 3.228595 3.000000 3.228595 3.228595 3.000000 3.228595 
##      589      596      604      612      618      627      629      634 
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 
##      645      650      661      662      665      667      672      673 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000 3.228595 
##      678      680      682      687      697      699      704      707 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##      708      710      711      727      733      741      742      749 
## 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 
##      756      763      768      777      782      786      788      790 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##      793      804      818      820      831      835      836      838 
## 3.228595 3.228595 3.228595 3.228595 3.000000 2.194444 2.194444 3.228595 
##      843      852      855      859      860      861      867      869 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##      871      882      885      906      914      927      928      935 
## 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 
##      942      952      963      965      971      974      978      979 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##      981      984      985      991      993      996      997     1008 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1012     1016     1021     1023     1028     1030     1039     1043 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1046     1051     1053     1058     1076     1079     1090     1096 
## 3.228595 3.228595 3.228595 3.000000 3.000000 3.228595 3.228595 3.228595 
##     1100     1102     1105     1111     1116     1119     1125     1131 
## 3.228595 3.228595 3.000000 3.228595 3.000000 3.228595 3.228595 3.228595 
##     1135     1138     1139     1140     1153     1156     1160     1162 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000 3.228595 
##     1164     1167     1169     1174     1176     1186     1198     1202 
## 3.000000 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1206     1209     1212     1213     1214     1223     1232     1233 
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 
##     1237     1244     1247     1262     1263     1264     1265     1268 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 
##     1289     1290     1302     1304     1319     1323     1324     1334 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1335     1340     1345     1352     1353     1354     1358     1360 
## 3.228595 3.000000 3.228595 3.000000 3.228595 3.228595 3.228595 3.000000 
##     1369     1372     1374     1383     1394     1406     1412     1420 
## 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595 2.194444 3.228595 
##     1424     1428     1430     1454     1457     1461     1463     1475 
## 3.228595 3.000000 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595 
##     1478     1485     1488     1490     1492     1500     1501     1502 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1505     1506     1507     1510     1513     1516     1522     1527 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1529     1534     1537     1547     1549     1558     1559     1561 
## 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 
##     1564     1571     1572     1574     1577     1586     1591     1592 
## 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1597     1598     1607     1610     1612     1620     1623     1624 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 2.194444 
##     1630     1634     1635     1636     1643     1648     1658     1663 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000 
##     1666     1673     1676     1682     1684     1685     1687     1690 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 
##     1699     1705     1712     1719     1720     1723     1726     1728 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000 
##     1735     1737     1738     1740     1741     1742     1746     1759 
## 3.228595 3.000000 3.228595 3.228595 3.000000 2.194444 3.228595 3.228595 
##     1760     1766     1768     1777     1782     1790     1795 
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595

plot(test1$Rating, car_predict, main="Prediccion Vs Real")
abline(0,1)

RMSE(pred=car_predict, test1$Rating)

## [1] 0.4648658

sd(carseats_dat$Rating)

## [1] 0.4780624

Redes Neuronales

library("neuralnet") # librería

## 
## Attaching package: 'neuralnet'

## The following object is masked from 'package:dplyr':
## 
##     compute

entrada_de_entrenamiento <- as.data.frame(datosmulti)
salida_de_entrenamiento <- sqrt(entrada_de_entrenamiento)

# Columna que enlaza los datos en una variable
datos_de_entrenamiento <-cbind(entrada_de_entrenamiento,salida_de_entrenamiento)
colnames(datos_de_entrenamiento) <- c("Entrada","Salida")

# Capacitaremos a la red neuronal
# Internamente tendrá 10 nodos dentro de una ocultas
# El umbral es un valor numérico que especifica el umbral para las derivadas
# parciales de la función de error como detener criterios.

net.sqrt <- neuralnet(Salida~Entrada, # y en funcion de sus variables Xs
                      datos_de_entrenamiento, # la base de datos
                      hidden=c(10,5,2), # configuracion de la red neuronal
                      threshold=0.001) # tolerancia del error

# Trazar la red neuronal
plot(net.sqrt)

# Prueba de la red neuronal en algunos datos de entrenamiento
datos_de_pruebas <- as.data.frame((1:10)^2) # Generar algunos números al cuadrado
net.results <- compute(net.sqrt, datos_de_pruebas) # Ejecutar a través de la red neuronal

# observemos qué propiedades tiene net.sqrt
ls(net.results)

## [1] "net.result" "neurons"

# Observemos los resultados
print(net.results$net.result)

##           [,1]
##  [1,] 2012.325
##  [2,] 2012.325
##  [3,] 2012.325
##  [4,] 2012.325
##  [5,] 2012.325
##  [6,] 2012.325
##  [7,] 2012.325
##  [8,] 2012.325
##  [9,] 2012.325
## [10,] 2012.325

#Le permite mostrar una mejor versión de los resultados
cleanoutput <- cbind(datos_de_pruebas,sqrt(datos_de_pruebas), as.data.frame(net.results$net.result))
colnames(cleanoutput) <- c("Entrada","Resultado_esperado","Salida_RNN")

print(cleanoutput) # reporta el resultado

##    Entrada Resultado_esperado Salida_RNN
## 1        1                  1   2012.325
## 2        4                  2   2012.325
## 3        9                  3   2012.325
## 4       16                  4   2012.325
## 5       25                  5   2012.325
## 6       36                  6   2012.325
## 7       49                  7   2012.325
## 8       64                  8   2012.325
## 9       81                  9   2012.325
## 10     100                 10   2012.325

Conclusión

Se ha observado tras el análisis exploratorio que existe variable cuantitativas y cualitativas Existe 4 variables de tipo numerica cuyo grafico corresponde Histograma.

Existen 5 variables de tipo caracter cuyo gráfico corresponde es Pie.

No se visualiza datos atípicos en las variables ValueREF, Review.Date.

Se visualiza una posible presencia de datos atípicos en las variables Cacao.Percent y Rating, tras el analisis se deduce que son valores tanto minimos y maximos de cada variable.

Observamos que no haya ningún valor perdido en este conjunto de datos.

Se interpreta que inicialmente la grafica indica seleccionar 4 clústeres, no obstante, el proceso estadístico indica que seleccionarse 2 Clústeres.

Analisis de Datos

Jaime Llanos

2023-10-17

Analisis Exploratorio de Chocolate Bar Ratings

Renombramos las columnas

Incluimos algunos graficos

Filtrar algunos datos

Analisis de Datos Faltantes

Verificar Datos Outliers

Verificar outliers Multivaridados

Analisis de Cluster

Arboles de Clasificación

Arboles de Regresión

Redes Neuronales

Conclusión