In the following section we analyze each attribute of the Chocolate Bar Ratings dataset, available at: https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings
library(DataExplorer)
colnames(data)<-c("Company",
"Origin",
"ValueREF",
"Review.Date",
"Cacao.Percent",
"Company.Location",
"Rating",
"Bean.Type",
"Bean.Origin")
datos<-data.frame(data)
# Convert Cacao.Percent from character to numeric (strip the % sign)
datos$Cacao.Percent<-as.numeric(gsub(pattern="%", replacement = "", x=datos$Cacao.Percent))
# Summary of the data
summary(datos)
## Company Origin ValueREF Review.Date
## Length:1795 Length:1795 Min. : 5 Min. :2006
## Class :character Class :character 1st Qu.: 576 1st Qu.:2010
## Mode :character Mode :character Median :1069 Median :2013
## Mean :1036 Mean :2012
## 3rd Qu.:1502 3rd Qu.:2015
## Max. :1952 Max. :2017
## Cacao.Percent Company.Location Rating Bean.Type
## Min. : 42.0 Length:1795 Min. :1.000 Length:1795
## 1st Qu.: 70.0 Class :character 1st Qu.:2.875 Class :character
## Median : 70.0 Mode :character Median :3.250 Mode :character
## Mean : 71.7 Mean :3.186
## 3rd Qu.: 75.0 3rd Qu.:3.500
## Max. :100.0 Max. :5.000
## Bean.Origin
## Length:1795
## Class :character
## Mode :character
##
##
##
The plot indicates that the dataset contains both quantitative and qualitative variables.
## [1] 1795 9
## [1] 9
## [1] 9
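The breakdown into quantitative and qualitative variables noted above can also be obtained directly with DataExplorer; a minimal sketch, assuming datos is the data frame built above:
# Sketch: overview of dataset structure and variable types
introduce(datos) # row/column counts and discrete vs. continuous columns
plot_intro(datos) # the same information as a plot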
The analysis of the Company variable shows that it is of character type, so its associated plot is a pie chart; it reveals a wide diversity of companies. The Origin variable is also of character type and likewise corresponds to a pie chart; one of the largest shares of origin corresponds to Madagascar. The ValueREF variable is numeric, so its associated plot is a histogram; the largest number of observations have a reference value of around 1500. Review.Date is also numeric and also corresponds to a histogram; most reviews fall between 2014 and 2015, while the fewest reviews were carried out in 2007-2008. The Cacao.Percent variable is numeric, so its associated plot is a histogram; most observations have cacao percentages between 65% and 70%, and few fall below 50%. Note that Cacao.Percent was originally stored as character and had to be converted to numeric. The Company.Location variable is of character type and corresponds to a pie chart; most companies are located in the U.S.A., followed by smaller shares from France, Canada, the U.K. and Italy. The Rating variable is numeric, so its associated plot is a histogram; most ratings lie between 3 and 3.5 points, and few lie between 1 and 1.5 points. The Bean.Type variable is of character type and corresponds to a pie chart; for most records the bean variety is not recorded, and among those that are, Trinitario and Criollo are the most frequently used varieties, with other varieties appearing in small proportions. Finally, the Bean.Origin variable is of character type and corresponds to a pie chart; most of the beans come from Venezuela, Ecuador, Peru, Madagascar and the Dominican Republic.
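As an illustration of the plot types described above, the sketch below reproduces a pie chart for one character variable and a histogram for one numeric variable with base R; the choice of Company.Location and Cacao.Percent is only an example:
# Sketch: pie chart for a character variable and histogram for a numeric variable
loc <- sort(table(datos$Company.Location), decreasing = TRUE)
pie(loc[1:10], main = "Top 10 Company.Location")
hist(datos$Cacao.Percent, main = "Cacao.Percent", xlab = "% cacao")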
library(gapminder)
library(dslabs)
##
## Attaching package: 'dslabs'
## The following object is masked from 'package:gapminder':
##
## gapminder
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
names(datos)
## [1] "Company" "Origin" "ValueREF" "Review.Date"
## [5] "Cacao.Percent" "Company.Location" "Rating" "Bean.Type"
## [9] "Bean.Origin"
# Chocolates with a Rating above 4
datos %>%
  filter(Rating > 4)
## Company Origin ValueREF Review.Date Cacao.Percent Company.Location
## 1 Amedei Chuao 111 2007 70 Italy
## 2 Amedei Toscano Black 40 2006 70 Italy
## Rating Bean.Type Bean.Origin
## 1 5 Trinitario Venezuela
## 2 5 Blend
# Chocolates with Cacao.Percent above 85 and a Rating above the mean
datos %>%
  filter(Cacao.Percent > 85 & Rating > mean(Rating))
## Company Origin ValueREF Review.Date Cacao.Percent
## 1 Adi Vanua Levu 705 2011 88
## 2 Bisou Belize 1486 2015 86
## 3 C-Amaro Ecuador 1185 2013 100
## 4 Kto Belize 1426 2014 90
## 5 Ocho PNG, Voodoo 1411 2014 88
## 6 Soma Crazy 88 1077 2013 88
## 7 Szanto Tibor Baracoa 1704 2015 88
## 8 Taza Alto Beni 785 2011 87
## 9 TCHO Peru- Ecuador 915 2012 99
## 10 Videri Dominican Republic 1211 2014 90
## 11 Videri Dark 1117 2013 90
## 12 Zotter El Ceibo Coop 879 2012 90
## Company.Location Rating Bean.Type Bean.Origin
## 1 Fiji 3.50 Trinitario Fiji
## 2 U.S.A. 3.25 Trinitario Belize
## 3 Italy 3.50 Ecuador
## 4 U.S.A. 3.25 Trinitario Belize
## 5 New Zealand 3.25 Papua New Guinea
## 6 Canada 4.00 Guat., D.R., Peru, Mad., PNG
## 7 Hungary 3.25 Trinitario Cuba
## 8 U.S.A. 3.25 Bolivia
## 9 U.S.A. 3.25 Peru, Ecuador
## 10 U.S.A. 3.25 Dominican Republic
## 11 U.S.A. 3.75 Central and S. America
## 12 Austria 3.25 Bolivia
# Plot cacao percentage against rating, coloring the points by the cacao percentage used
cacao_vs_rating <- datos %>% # keep only the cacao percentage and the rating
  select(Cacao.Percent, Rating)
# Cacao.Percent vs. Rating plot
par(mfrow = c(1, 1))
ggplot(cacao_vs_rating,
aes(Cacao.Percent,
Rating,
color=Cacao.Percent))+
geom_point(size=3, alpha = 0.8)
The plot suggests that the chocolates obtaining ratings around the average contain between 60% and 80% cacao.
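This reading can be checked numerically by averaging the ratings within cacao bands; a minimal sketch using dplyr, with band boundaries chosen only for illustration:
# Sketch: mean rating by cacao-percentage band
cacao_vs_rating %>%
  mutate(banda = cut(Cacao.Percent, breaks = c(40, 60, 80, 100))) %>%
  group_by(banda) %>%
  summarise(rating_medio = mean(Rating), n = n())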
apply(X = is.na(datos), MARGIN = 2, FUN = mean)
## Company Origin ValueREF Review.Date
## 0 0 0 0
## Cacao.Percent Company.Location Rating Bean.Type
## 0 0 0 0
## Bean.Origin
## 0
apply(X = is.na(cacao_vs_rating), MARGIN = 2, FUN = mean)
## Cacao.Percent Rating
## 0 0
# Use plot_missing to check whether there are any missing values
plot_missing(datos)
# The missing-data check shows that there are none.
plot_histogram(datos)
plot_density(datos)
Using plot_missing we assess whether any data are missing; we observe that there are no missing values in this dataset.
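A quick numeric cross-check of the same conclusion (minimal sketch):
sum(is.na(datos)) # total number of missing values across all columns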
par(mfrow = c(1,2))
boxplot(datos$ValueREF, col = "orange", horizontal = TRUE, main="ValueREF Boxplot")
boxplot(datos$Cacao.Percent, col = "orange", horizontal = TRUE, main="Cacao.Percent Boxplot")
boxplot(datos$Review.Date, col = "orange", horizontal = TRUE, main="Review.Date Boxplot")
boxplot(datos$Rating, col = "orange", horizontal = TRUE, main="Rating Boxplot")
min(datos$Cacao.Percent)
## [1] 42
mean(datos$Cacao.Percent)
## [1] 71.69833
median(datos$Cacao.Percent)
## [1] 70
max(datos$Cacao.Percent)
## [1] 100
min(datos$Rating)
## [1] 1
mean(datos$Rating)
## [1] 3.185933
median(datos$Rating)
## [1] 3.25
max(datos$Rating)
## [1] 5
No outliers are visible in the ValueREF variable. The Cacao.Percent variable, by contrast, shows possible outliers; on inspection these are values close to the minimum (42%) and to the maximum (100%), with a mean of 71.7% and a median of 70%.
No outliers are visible in the Review.Date variable. The Rating variable also shows possible outliers; on inspection these are values close to the minimum (1) and to the maximum (5), with a mean of 3.19 and a median of 3.25.
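The specific values flagged by the boxplot whiskers can be listed with boxplot.stats(); a minimal sketch:
# Sketch: values falling outside the boxplot whiskers
sort(unique(boxplot.stats(datos$Cacao.Percent)$out))
sort(unique(boxplot.stats(datos$Rating)$out))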
par(mfrow = c(1,1))
## outlier values
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
head(cacao_vs_rating)
## Cacao.Percent Rating
## 1 63 3.75
## 2 70 2.75
## 3 70 3.00
## 4 70 3.50
## 5 70 3.50
## 6 70 2.75
attach(cacao_vs_rating)
a <- which(Cacao.Percent %in% boxplot.stats(Cacao.Percent)$out) ## where are they?
b <- which(Rating %in% boxplot.stats(Rating)$out) ## where are they?
## combining outliers: intersection of the two variables' individual outlier indices
inter<- intersect(a,b)
inter
## integer(0)
plot(cacao_vs_rating)
points(cacao_vs_rating[inter, ],
col = "red",
cex = 1.5,
pch = "+")
## combining outliers: union of the two variables' individual outlier indices
uni <- union(a,b)
uni
## [1] 26 28 30 45 46 47 48 49 50 51 53 54 66 67 90
## [16] 91 93 126 166 174 176 183 185 195 246 250 261 262 267 279
## [31] 325 326 331 343 344 345 379 393 397 417 430 433 438 445 449
## [46] 450 460 468 474 505 507 519 527 549 554 558 569 612 615 616
## [61] 618 632 633 635 636 644 645 654 664 668 676 707 708 728 743
## [76] 756 759 760 770 771 780 782 785 801 815 816 829 832 835 836
## [91] 837 867 886 887 893 894 895 897 902 907 932 933 938 939 942
## [106] 976 985 986 987 990 1014 1015 1032 1038 1050 1065 1115 1116 1121 1154
## [121] 1161 1169 1205 1209 1220 1221 1237 1238 1246 1249 1252 1271 1286 1292 1297
## [136] 1320 1347 1350 1351 1353 1360 1374 1404 1412 1416 1424 1437 1444 1462 1467
## [151] 1468 1499 1508 1529 1541 1575 1583 1590 1593 1596 1615 1616 1620 1623 1624
## [166] 1625 1658 1659 1676 1687 1695 1696 1714 1717 1721 1722 1737 1741 1742 1767
## [181] 1770 1771 1784 1787 1790 1794
plot(cacao_vs_rating)
points(cacao_vs_rating[uni, ],
col = "blue",
cex = 1.5,
pch = "*")
library(DMwR2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
datosmulti <- datos %>%
select(ValueREF,
Review.Date,
Cacao.Percent,
Rating)
puntajes<-lofactor(datosmulti, k=5) # local outlier factor (LOF) scores
par(mfrow = c(1,1))
outliers_multivariado<-order(puntajes, decreasing= TRUE)[1:6]
outliers_multivariado
## [1] 185 839 586 250 1624 126
datosmulti[outliers_multivariado,]
## ValueREF Review.Date Cacao.Percent Rating
## 185 636 2011 88 3.00
## 839 623 2011 80 3.00
## 586 623 2011 72 3.75
## 250 341 2009 100 1.50
## 1624 915 2012 99 3.25
## 126 486 2010 100 1.75
n<-nrow(datosmulti)
etiquetas<-1:n
etiquetas[-outliers_multivariado]<-"."
# biplot of the principal components; the labels mark the outliers
biplot(prcomp(datosmulti), cex=1,xlabs=etiquetas)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# Set the plotting symbol for the pairs plot
forma <- rep(".", n)
forma[outliers_multivariado]<- "+"
# Set the point color for the pairs plot
colorv<-rep("black", n)
colorv[outliers_multivariado] <- "blue"
pairs(datosmulti,
col = colorv,
pch = forma) # pairs plot to examine the correlations between variables
datos1 <- datos
colnames(datos1)
## [1] "Company" "Origin" "ValueREF" "Review.Date"
## [5] "Cacao.Percent" "Company.Location" "Rating" "Bean.Type"
## [9] "Bean.Origin"
# kmeans(numeric data frame, number of clusters)
(kmeans.result <- kmeans(datosmulti,5))
## K-means clustering with 5 clusters of sizes 381, 356, 340, 354, 364
##
## Cluster means:
## ValueREF Review.Date Cacao.Percent Rating
## 1 1417.0709 2014.451 71.92257 3.223097
## 2 1774.6629 2015.750 71.71629 3.245084
## 3 660.6676 2010.765 71.33088 3.180882
## 4 232.4237 2007.692 71.39124 3.085452
## 5 1046.3187 2012.714 72.08791 3.191621
##
## Clustering vector:
## [1] 2 2 2 2 2 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 3 3 1 1 3 3 3 3 4 4 4 3 3 3 5 5
## [38] 5 5 5 2 2 2 2 2 2 2 2 2 5 5 5 5 5 3 3 3 3 4 4 4 4 4 4 4 3 3 5 5 1 1 1 1 1
## [75] 1 5 4 4 4 4 4 4 4 4 4 4 4 4 5 3 3 3 3 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 1 1 1 2 2 2 2 2 2 5 5 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 2 3 3 4 4 4 1 1 5 5
## [149] 5 1 1 1 1 5 5 5 5 2 2 2 2 2 2 2 2 2 3 3 3 3 2 2 2 2 2 2 5 5 5 5 3 3 3 3 3
## [186] 3 3 3 3 3 3 1 1 1 1 5 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 1 5 4 2 2 2 2 5 5 2
## [223] 1 1 1 5 5 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 2 2 1 1 1 5 5
## [260] 5 5 1 1 1 1 1 1 5 5 5 5 1 2 5 1 1 1 1 1 2 5 3 3 3 2 3 4 4 4 2 2 1 1 1 1 2
## [297] 2 2 2 1 1 1 1 2 5 3 3 3 2 3 3 3 3 3 3 4 4 2 2 2 5 4 4 4 4 4 4 1 1 1 5 3 3
## [334] 3 2 2 2 2 2 4 4 4 4 4 4 1 1 1 1 1 3 3 2 2 2 2 2 2 1 1 1 1 1 1 5 5 5 5 5 5
## [371] 5 1 1 1 1 4 1 2 3 4 1 3 1 1 1 2 2 1 1 1 3 3 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
## [408] 5 5 5 5 5 1 3 3 3 4 4 4 4 4 4 1 1 4 4 4 4 4 5 2 3 4 3 3 4 4 4 1 2 3 4 4 4
## [445] 4 2 2 5 3 3 3 5 3 3 3 3 3 3 4 4 3 3 4 4 4 4 1 1 5 5 5 5 5 2 2 2 3 2 2 2 2
## [482] 2 2 1 5 1 1 1 5 5 5 5 3 3 5 3 3 3 5 5 5 5 5 5 3 4 4 2 2 2 2 1 1 1 1 5 5 5
## [519] 5 1 1 2 2 4 4 4 4 4 4 4 4 2 4 4 4 2 2 2 1 5 5 5 5 3 3 3 3 3 2 1 1 1 4 4 4
## [556] 2 2 5 5 5 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2 2 1 1 5 3 3 3 3 3 3 3
## [593] 3 3 1 3 2 2 2 2 2 1 1 1 2 2 5 2 3 4 2 4 4 4 4 4 4 5 5 1 1 2 3 5 5 5 5 4 4
## [630] 4 4 3 3 3 3 3 1 1 1 3 2 2 2 3 4 4 4 4 4 4 1 1 1 1 1 2 2 2 2 2 1 1 5 5 1 1
## [667] 1 1 1 2 2 5 1 5 5 5 5 3 3 1 1 1 1 5 5 5 5 5 5 5 5 5 3 3 3 3 3 3 3 3 3 4 4
## [704] 4 4 2 2 2 1 5 5 5 5 5 5 3 3 3 2 2 1 1 5 5 5 3 3 1 2 2 2 2 2 2 2 2 2 2 3 4
## [741] 2 3 4 4 4 4 4 4 4 2 5 5 5 5 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5
## [778] 5 5 5 4 1 1 3 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 4 4 4
## [815] 4 4 2 1 1 1 1 4 4 4 4 1 2 5 5 5 5 5 5 5 5 5 5 3 3 3 3 3 3 3 4 4 4 4 1 1 1
## [852] 1 1 1 5 5 5 3 3 3 3 3 3 3 3 3 2 3 5 1 5 5 1 1 1 1 1 1 4 1 1 1 1 5 5 5 5 3
## [889] 4 4 4 4 4 4 4 4 4 4 4 5 4 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 5 5 5 3 3 3 3 4
## [926] 3 2 2 2 2 5 5 5 5 3 4 4 4 4 4 3 3 1 1 1 1 1 2 2 2 1 1 5 4 2 3 3 3 2 2 2 1
## [963] 1 1 1 1 1 5 2 5 5 5 5 5 5 4 5 1 1 1 5 5 5 1 3 3 3 3 3 3 4 4 4 4 2 5 5 5 5
## [1000] 5 5 5 3 3 3 4 4 4 4 3 4 4 5 3 4 2 1 5 5 3 1 1 1 5 5 5 5 5 2 2 1 1 5 5 5 2
## [1037] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 5 3 3 3 3 5 5 3 3 3 3 5 5 2 2 2 1 1 1 5 5
## [1074] 5 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5 1 1 1 1 2 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1
## [1111] 2 5 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 5 5 5 5 5 5 3 3 3 2 2 2 5 5 5 3 2 2 2 3
## [1148] 3 3 2 1 2 2 2 1 1 2 2 2 2 5 5 5 5 5 1 1 1 5 1 1 1 1 3 4 4 4 4 4 1 1 1 1 5
## [1185] 5 1 1 1 3 3 5 5 5 1 1 1 1 1 2 3 2 2 1 1 2 2 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3
## [1222] 3 2 1 1 5 5 5 5 5 1 1 3 4 4 4 4 4 5 5 2 1 1 1 1 3 5 3 4 4 4 4 4 2 2 1 1 1
## [1259] 1 1 1 1 1 1 2 1 2 1 1 1 5 5 3 4 4 4 4 5 1 1 1 2 3 3 3 3 4 4 4 4 4 4 4 4 4
## [1296] 2 1 1 1 1 1 1 1 1 1 3 3 2 1 3 3 3 3 3 1 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1333] 4 4 4 4 4 4 4 2 1 1 2 5 5 5 5 5 5 5 5 5 5 5 2 2 2 2 5 3 3 3 1 1 1 2 5 3 4
## [1370] 4 4 4 2 1 1 5 5 5 5 5 3 2 2 2 5 5 5 2 1 1 5 5 5 5 5 5 3 3 4 4 4 4 4 3 5 5
## [1407] 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 3 5 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4
## [1444] 4 2 2 1 1 1 1 1 1 3 3 2 3 1 1 5 3 3 3 3 2 2 2 1 1 2 2 2 1 1 1 1 1 2 2 2 2
## [1481] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 2 1 2 2 2 2 1 5 5 5 5 5 2 2 2 2 2 2 2 1 1 5
## [1518] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 3 3 5 5 5 5 5 5 3 3 3 3 3 3 3 4 4 4 4 4 4
## [1555] 1 1 2 2 2 2 2 2 3 3 3 2 2 2 2 2 2 1 5 5 5 2 2 2 2 2 5 3 1 1 5 5 5 5 5 2 2
## [1592] 2 2 2 2 2 2 2 2 2 1 1 5 5 5 5 5 5 3 5 5 5 5 1 1 1 2 2 1 3 3 4 2 5 4 4 4 4
## [1629] 4 4 3 3 3 3 3 3 3 3 3 3 5 5 5 5 1 1 1 1 1 1 1 1 1 2 5 4 4 4 4 4 2 2 1 5 5
## [1666] 3 3 1 3 2 2 1 5 5 3 2 3 1 1 2 2 1 1 5 5 5 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [1703] 4 5 5 4 4 4 4 4 4 1 1 5 5 5 5 5 5 5 4 4 4 4 1 1 1 1 1 2 2 2 2 2 3 5 5 5 5
## [1740] 2 5 5 3 3 3 3 3 2 2 2 1 5 5 3 3 5 3 3 5 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3
## [1777] 3 3 5 3 3 5 5 5 5 5 5 5 5 5 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 4025518 3797782 4406056 5795094 4349589
## (between_SS / total_SS = 95.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
table(datosmulti$Cacao.Percent, kmeans.result$cluster)
##
## 1 2 3 4 5
## 42 0 0 0 0 1
## 46 0 0 1 0 0
## 50 0 0 1 0 0
## 53 0 0 0 1 0
## 55 1 0 6 6 3
## 56 0 0 1 0 1
## 57 1 0 0 0 0
## 58 0 0 2 3 3
## 60 7 9 6 13 8
## 60.5 0 0 0 1 0
## 61 0 2 1 4 1
## 62 3 2 3 4 2
## 63 4 2 2 3 1
## 64 3 1 9 19 2
## 65 10 12 19 19 18
## 66 3 2 5 7 6
## 67 6 5 5 6 5
## 68 9 8 10 9 11
## 69 2 0 2 1 5
## 70 163 180 97 94 138
## 71 0 3 14 11 3
## 72 45 19 55 32 38
## 72.5 4 0 0 0 0
## 73 12 13 5 3 7
## 73.5 1 0 1 0 0
## 74 11 9 11 4 15
## 75 37 37 38 64 46
## 76 6 7 6 2 2
## 77 9 10 4 8 2
## 78 3 6 4 1 3
## 79 1 0 0 0 0
## 80 24 12 9 13 14
## 81 1 0 2 1 1
## 82 4 2 4 2 5
## 83 1 2 1 0 0
## 84 1 0 1 1 1
## 85 4 9 9 9 5
## 86 1 0 0 0 0
## 87 0 0 1 0 0
## 88 1 1 2 2 2
## 89 0 0 0 0 2
## 90 1 2 0 1 4
## 91 0 0 0 3 0
## 99 0 0 0 1 1
## 100 2 1 3 6 8
table(datosmulti$Rating, kmeans.result$cluster)
##
## 1 2 3 4 5
## 1 0 0 0 4 0
## 1.5 0 0 4 5 1
## 1.75 0 0 1 1 1
## 2 1 1 5 22 3
## 2.25 5 2 3 0 4
## 2.5 26 18 21 29 33
## 2.75 50 59 53 41 56
## 3 63 54 65 107 52
## 3.25 78 68 57 21 79
## 3.5 100 90 69 61 72
## 3.75 49 50 44 28 39
## 4 9 14 18 33 24
## 5 0 0 0 2 0
library(fpc)
(pamk.result <- pamk(datosmulti))
## $pamobject
## Medoids:
## ID ValueREF Review.Date Cacao.Percent Rating
## [1,] 1712 1470 2015 70 3.25
## [2,] 313 537 2010 71 3.25
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
## [75] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2
## [149] 2 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## [186] 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1 2 2 1
## [223] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2
## [334] 2 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2
## [445] 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
## [519] 2 1 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 2 2 2
## [556] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
## [593] 2 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 2 2 2 2 2 2 1 1 1 1 1 2 1 1 1 2 2 2
## [630] 2 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1
## [667] 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [704] 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 2
## [741] 1 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
## [778] 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
## [815] 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1
## [852] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2
## [889] 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
## [926] 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 1 1 1 1
## [963] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
## [1000] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1
## [1037] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2
## [1074] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1
## [1111] 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 1 1 1 2 1 1 2 1 1 1 2
## [1148] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1
## [1185] 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [1222] 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1
## [1259] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1296] 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1333] 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2
## [1370] 2 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## [1407] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1444] 2 1 1 1 1 1 1 1 1 2 2 1 2 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1481] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1518] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1555] 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 1 1 1 1
## [1592] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 2 2 2
## [1629] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 2 2
## [1666] 2 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1703] 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
## [1740] 1 1 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [1777] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Objective function:
## build swap
## 308.7768 241.7852
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
##
## $nc
## [1] 2
##
## $crit
## [1] 0.0000000 0.6225388 0.5832990 0.5690972 0.5618299 0.5501934 0.5487890
## [8] 0.5417501 0.5408763 0.5371833
# number of clusters
pamk.result$nc
## [1] 2
df <- scale(datosmulti) # Scaling the data
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)
The elbow plot suggests using 4 clusters; however, the statistical procedure above (pamk) indicates that 2 clusters should be used.
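As a cross-check between the two suggestions, the average silhouette width criterion is also available in factoextra; a minimal sketch, assuming df is the scaled matrix created above:
# Sketch: silhouette criterion for choosing the number of clusters
fviz_nbclust(df, kmeans, method = "silhouette")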
library(ggplot2)
library(lattice)
library(caret)
library(rpart)
library(rpart.plot)
library(dplyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ data.table::between() masks dplyr::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ data.table::first() masks dplyr::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ data.table::last() masks dplyr::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second() masks data.table::second()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
oj<-datos
str(oj)
## 'data.frame': 1795 obs. of 9 variables:
## $ Company : chr "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
## $ Origin : chr "Agua Grande" "Kpime" "Atsane" "Akata" ...
## $ ValueREF : int 1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
## $ Review.Date : int 2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
## $ Cacao.Percent : num 63 70 70 70 70 70 70 70 70 70 ...
## $ Company.Location: chr "France" "France" "France" "France" ...
## $ Rating : num 3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
## $ Bean.Type : chr " " " " " " " " ...
## $ Bean.Origin : chr "Sao Tome" "Togo" "Togo" "Togo" ...
skim_to_wide(oj)
## Warning: 'skim_to_wide' is deprecated.
## Use 'skim()' instead.
## See help("Deprecated")
Name | Piped data |
Number of rows | 1795 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Company | 0 | 1 | 2 | 39 | 0 | 416 | 0 |
Origin | 0 | 1 | 3 | 45 | 0 | 1039 | 0 |
Company.Location | 0 | 1 | 4 | 17 | 0 | 60 | 0 |
Bean.Type | 0 | 1 | 0 | 24 | 1 | 42 | 0 |
Bean.Origin | 0 | 1 | 0 | 29 | 1 | 101 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ValueREF | 0 | 1 | 1035.90 | 552.89 | 5 | 576.00 | 1069.00 | 1502.0 | 1952 | ▆▆▇▇▇ |
Review.Date | 0 | 1 | 2012.33 | 2.93 | 2006 | 2010.00 | 2013.00 | 2015.0 | 2017 | ▃▃▆▆▇ |
Cacao.Percent | 0 | 1 | 71.70 | 6.32 | 42 | 70.00 | 70.00 | 75.0 | 100 | ▁▁▇▁▁ |
Rating | 0 | 1 | 3.19 | 0.48 | 1 | 2.88 | 3.25 | 3.5 | 5 | ▁▂▇▆▁ |
skim(oj)
Name | oj |
Number of rows | 1795 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Company | 0 | 1 | 2 | 39 | 0 | 416 | 0 |
Origin | 0 | 1 | 3 | 45 | 0 | 1039 | 0 |
Company.Location | 0 | 1 | 4 | 17 | 0 | 60 | 0 |
Bean.Type | 0 | 1 | 0 | 24 | 1 | 42 | 0 |
Bean.Origin | 0 | 1 | 0 | 29 | 1 | 101 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ValueREF | 0 | 1 | 1035.90 | 552.89 | 5 | 576.00 | 1069.00 | 1502.0 | 1952 | ▆▆▇▇▇ |
Review.Date | 0 | 1 | 2012.33 | 2.93 | 2006 | 2010.00 | 2013.00 | 2015.0 | 2017 | ▃▃▆▆▇ |
Cacao.Percent | 0 | 1 | 71.70 | 6.32 | 42 | 70.00 | 70.00 | 75.0 | 100 | ▁▁▇▁▁ |
Rating | 0 | 1 | 3.19 | 0.48 | 1 | 2.88 | 3.25 | 3.5 | 5 | ▁▂▇▆▁ |
library(summarytools)
##
## Attaching package: 'summarytools'
##
## The following object is masked from 'package:tibble':
##
## view
##
## The following objects are masked from 'package:Hmisc':
##
## label, label<-
dfSummary(oj)
## Data Frame Summary
## oj
## Dimensions: 1795 x 9
## Duplicates: 0
##
## ------------------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ------------------ ------------------------------- --------------------- --------------------- ---------- ---------
## 1 Company 1. Soma 47 ( 2.6%) 1795 0
## [character] 2. Bonnat 27 ( 1.5%) (100.0%) (0.0%)
## 3. Fresco 26 ( 1.4%)
## 4. Pralus 25 ( 1.4%)
## 5. A. Morin 23 ( 1.3%)
## 6. Arete 22 ( 1.2%)
## 7. Domori 22 ( 1.2%)
## 8. Guittard 22 ( 1.2%)
## 9. Valrhona 21 ( 1.2%)
## 10. Hotel Chocolat (Coppeneur 19 ( 1.1%)
## [ 406 others ] 1541 (85.8%) IIIIIIIIIIIIIIIII
##
## 2 Origin 1. Madagascar 57 ( 3.2%) 1795 0
## [character] 2. Peru 45 ( 2.5%) (100.0%) (0.0%)
## 3. Ecuador 42 ( 2.3%)
## 4. Dominican Republic 37 ( 2.1%)
## 5. Venezuela 21 ( 1.2%)
## 6. Chuao 19 ( 1.1%)
## 7. Sambirano 19 ( 1.1%)
## 8. Ocumare 17 ( 0.9%)
## 9. Ghana 15 ( 0.8%)
## 10. Papua New Guinea 15 ( 0.8%)
## [ 1029 others ] 1508 (84.0%) IIIIIIIIIIIIIIII
##
## 3 ValueREF Mean (sd) : 1035.9 (552.9) 440 distinct values . . : . . 1795 0
## [integer] min < med < max: . . . : : : : : : : (100.0%) (0.0%)
## 5 < 1069 < 1952 : : : : : : : : : :
## IQR (CV) : 926 (0.5) : : : : : : : : : :
## : : : : : : : : : :
##
## 4 Review.Date Mean (sd) : 2012.3 (2.9) 12 distinct values . : . 1795 0
## [integer] min < med < max: . : : : (100.0%) (0.0%)
## 2006 < 2013 < 2017 . : : : : : :
## IQR (CV) : 5 (0) : . : : : : : : : :
## : : : : : : : : : :
##
## 5 Cacao.Percent Mean (sd) : 71.7 (6.3) 45 distinct values : 1795 0
## [numeric] min < med < max: : . (100.0%) (0.0%)
## 42 < 70 < 100 : :
## IQR (CV) : 5 (0.1) : :
## : : : : .
##
## 6 Company.Location 1. U.S.A. 764 (42.6%) IIIIIIII 1795 0
## [character] 2. France 156 ( 8.7%) I (100.0%) (0.0%)
## 3. Canada 125 ( 7.0%) I
## 4. U.K. 96 ( 5.3%) I
## 5. Italy 63 ( 3.5%)
## 6. Ecuador 54 ( 3.0%)
## 7. Australia 49 ( 2.7%)
## 8. Belgium 40 ( 2.2%)
## 9. Switzerland 38 ( 2.1%)
## 10. Germany 35 ( 1.9%)
## [ 50 others ] 375 (20.9%) IIII
##
## 7 Rating Mean (sd) : 3.2 (0.5) 13 distinct values . : 1795 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 1 < 3.2 < 5 : :
## IQR (CV) : 0.6 (0.2) : : :
## . : : : :
##
## 8 Bean.Type 1. 887 (49.4%) IIIIIIIII 1795 0
## [character] 2. Trinitario 419 (23.3%) IIII (100.0%) (0.0%)
## 3. Criollo 153 ( 8.5%) I
## 4. Forastero 87 ( 4.8%)
## 5. Forastero (Nacional) 52 ( 2.9%)
## 6. Blend 41 ( 2.3%)
## 7. Criollo, Trinitario 39 ( 2.2%)
## 8. Forastero (Arriba) 37 ( 2.1%)
## 9. Criollo (Porcelana) 10 ( 0.6%)
## 10. Trinitario, Criollo 9 ( 0.5%)
## [ 32 others ] 61 ( 3.4%)
##
## 9 Bean.Origin 1. Venezuela 214 (11.9%) II 1795 0
## [character] 2. Ecuador 193 (10.8%) II (100.0%) (0.0%)
## 3. Peru 165 ( 9.2%) I
## 4. Madagascar 145 ( 8.1%) I
## 5. Dominican Republic 141 ( 7.9%) I
## 6. 73 ( 4.1%)
## 7. Nicaragua 60 ( 3.3%)
## 8. Brazil 58 ( 3.2%)
## 9. Bolivia 57 ( 3.2%)
## 10. Belize 49 ( 2.7%)
## [ 91 others ] 640 (35.7%) IIIIIII
## ------------------------------------------------------------------------------------------------------------------------
summary(oj)
## Company Origin ValueREF Review.Date
## Length:1795 Length:1795 Min. : 5 Min. :2006
## Class :character Class :character 1st Qu.: 576 1st Qu.:2010
## Mode :character Mode :character Median :1069 Median :2013
## Mean :1036 Mean :2012
## 3rd Qu.:1502 3rd Qu.:2015
## Max. :1952 Max. :2017
## Cacao.Percent Company.Location Rating Bean.Type
## Min. : 42.0 Length:1795 Min. :1.000 Length:1795
## 1st Qu.: 70.0 Class :character 1st Qu.:2.875 Class :character
## Median : 70.0 Mode :character Median :3.250 Mode :character
## Mean : 71.7 Mean :3.186
## 3rd Qu.: 75.0 3rd Qu.:3.500
## Max. :100.0 Max. :5.000
## Bean.Origin
## Length:1795
## Class :character
## Mode :character
##
##
##
#set.seed(12345)
partition<-createDataPartition(y=oj$Rating, p=0.8, list = FALSE)
# Total number of rows
n<-nrow(oj)
# Number of training rows (80%)
n_train <-round(0.8*n)
# Create a vector of random row indices
train_indices<- sample (1:n, n_train)
# Training subset of oj taken from those indices
train <-oj[train_indices,]
# Test data
test <-oj[-train_indices,]
# Classification tree models (rpart)
oj_model1<-rpart(formula = Rating ~ Cacao.Percent,
data = train, method = "class")
oj_model2<-rpart(formula = Rating ~ Cacao.Percent + Review.Date,
data = train, method = "class")
oj_model3<-rpart(formula = Rating ~ Cacao.Percent + Review.Date+ValueREF,
data = train, method = "class")
rpart.plot(oj_model1, yesno = TRUE)
rpart.plot(oj_model2, yesno = TRUE)
rpart.plot(oj_model3, yesno = TRUE)
printcp(oj_model1)
##
## Classification tree:
## rpart(formula = Rating ~ Cacao.Percent, data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] Cacao.Percent
##
## Root node error: 1125/1436 = 0.78343
##
## n= 1436
##
## CP nsplit rel error xerror xstd
## 1 0.013333 0 1.00000 1.0000 0.013875
## 2 0.010000 2 0.97333 1.0098 0.013694
plotcp(oj_model1)
# Prune at the CP value with the lowest cross-validated error
oj_opt <- prune(oj_model1, cp = oj_model1$cptable[which.min(oj_model1$cptable[, "xerror"]), "CP"])
rpart.plot(oj_opt, yesno = TRUE)
oj_pred <- predict(oj_model1, test, type= "class")
oj_pred
## 13 23 31 38 64 68 73 75 76 84 85 86 88 91 101 103
## 3.5 3 3.5 3.5 3.5 3.5 3 3.5 3.5 3.5 3.5 3 3.5 3 3.5 3.5
## 105 106 108 111 120 128 135 137 143 146 147 153 155 157 168 170
## 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 2.75 3 3.5
## 175 176 179 182 183 189 190 192 198 200 205 208 216 219 223 225
## 2.75 2.75 3.5 3.5 3 3.5 3 3.5 3.5 3.5 3.5 3.5 2.75 3.5 3.5 3.5
## 226 227 228 243 247 249 254 259 273 282 288 304 305 314 329 332
## 3.5 3.5 3.5 3.5 3.5 2.75 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 2.75
## 335 339 354 356 360 361 366 367 381 389 393 396 403 404 405 410
## 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 2.75 3.5 3.5 3.5 2.75 3.5
## 411 421 422 425 430 440 449 452 454 455 460 466 476 477 481 484
## 3.5 3.5 3.5 2.75 3 3.5 2.75 3.5 3.5 3.5 3 3.5 3.5 3.5 3.5 3.5
## 486 494 499 501 503 504 507 509 511 519 531 538 539 544 549 550
## 3.5 3.5 2.75 3.5 3.5 3.5 3 3.5 3.5 3 3.5 3.5 3.5 2.75 2.75 3.5
## 554 555 562 564 565 570 572 579 587 591 592 594 600 604 608 615
## 2.75 3.5 3.5 3.5 2.75 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3
## 621 624 628 631 633 641 645 647 648 654 657 659 663 664 679 681
## 2.75 3.5 3.5 3.5 3 3.5 3 3 3.5 2.75 2.75 3.5 3.5 3 3.5 3.5
## 687 702 709 722 734 755 757 768 776 781 787 795 798 800 815 821
## 3.5 2.75 3.5 3.5 3.5 3.5 3.5 3.5 3.5 2.75 2.75 3.5 2.75 3.5 3 3.5
## 824 833 837 844 848 851 867 869 876 881 883 886 887 891 898 899
## 3.5 3.5 2.75 3.5 3.5 3.5 3 3.5 3.5 3.5 3.5 3 3 3.5 3.5 2.75
## 902 919 921 925 928 930 940 941 944 945 952 962 969 974 977 983
## 3 3.5 3.5 3 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5
## 987 998 1020 1029 1038 1039 1047 1052 1056 1059 1061 1064 1065 1070 1075 1087
## 3 3.5 3.5 3.5 3 3.5 3.5 3.5 3.5 3.5 3.5 3.5 2.75 3.5 3.5 3.5
## 1090 1095 1097 1101 1102 1116 1120 1137 1142 1143 1149 1156 1160 1162 1165 1169
## 3.5 3.5 3.5 2.75 3.5 2.75 3 3.5 3.5 3.5 3.5 2.75 2.75 3.5 3.5 2.75
## 1172 1183 1187 1191 1196 1206 1208 1219 1227 1237 1240 1244 1245 1255 1259 1262
## 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3 3.5 3.5 3.5 3.5 3.5 3.5
## 1263 1266 1271 1283 1285 1286 1294 1299 1304 1305 1310 1315 1318 1319 1321 1327
## 3.5 3.5 3 3.5 2.75 2.75 3.5 2.75 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5
## 1329 1330 1333 1334 1338 1343 1353 1354 1356 1357 1358 1361 1370 1375 1379 1382
## 3.5 3.5 3.5 3.5 3.5 2.75 3 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5
## 1383 1384 1386 1387 1390 1392 1394 1398 1399 1408 1409 1411 1426 1440 1441 1443
## 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3 3.5 2.75 3.5 3.5
## 1446 1448 1456 1459 1460 1463 1469 1472 1476 1478 1486 1489 1495 1496 1506 1511
## 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5
## 1524 1528 1533 1537 1541 1546 1554 1557 1563 1564 1572 1573 1576 1579 1582 1584
## 3.5 3.5 3.5 3.5 2.75 3.5 3.5 3.5 3.5 2.75 3.5 3.5 3.5 3.5 3.5 3.5
## 1588 1590 1592 1608 1613 1620 1625 1633 1634 1637 1638 1643 1651 1654 1657 1662
## 3.5 3 3.5 3.5 3.5 2.75 3 3.5 3.5 3.5 3.5 3.5 2.75 3.5 3.5 3.5
## 1665 1669 1671 1682 1695 1696 1703 1710 1712 1713 1733 1735 1737 1742 1745 1750
## 2.75 3.5 3.5 3.5 2.75 2.75 3 3.5 3.5 3.5 3.5 3.5 2.75 2.75 3.5 3.5
## 1761 1764 1765 1767 1774 1778 1790
## 3.5 3.5 3.5 3 2.75 3.5 3
## Levels: 1 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4 5
plot(test$Rating, oj_pred, main="Predicted vs. Actual")
rpart.plot(oj_model1)
plot(oj_model1)
plot(varImp(oj_opt))
carseats_dat <- datosmulti
skim(carseats_dat)
Name | carseats_dat |
Number of rows | 1795 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ValueREF | 0 | 1 | 1035.90 | 552.89 | 5 | 576.00 | 1069.00 | 1502.0 | 1952 | ▆▆▇▇▇ |
Review.Date | 0 | 1 | 2012.33 | 2.93 | 2006 | 2010.00 | 2013.00 | 2015.0 | 2017 | ▃▃▆▆▇ |
Cacao.Percent | 0 | 1 | 71.70 | 6.32 | 42 | 70.00 | 70.00 | 75.0 | 100 | ▁▁▇▁▁ |
Rating | 0 | 1 | 3.19 | 0.48 | 1 | 2.88 | 3.25 | 3.5 | 5 | ▁▂▇▆▁ |
dfSummary(carseats_dat)
## Data Frame Summary
## carseats_dat
## Dimensions: 1795 x 4
## Duplicates: 168
##
## ------------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- --------------- ---------------------------- --------------------- --------------------- ---------- ---------
## 1 ValueREF Mean (sd) : 1035.9 (552.9) 440 distinct values . . : . . 1795 0
## [integer] min < med < max: . . . : : : : : : : (100.0%) (0.0%)
## 5 < 1069 < 1952 : : : : : : : : : :
## IQR (CV) : 926 (0.5) : : : : : : : : : :
## : : : : : : : : : :
##
## 2 Review.Date Mean (sd) : 2012.3 (2.9) 12 distinct values . : . 1795 0
## [integer] min < med < max: . : : : (100.0%) (0.0%)
## 2006 < 2013 < 2017 . : : : : : :
## IQR (CV) : 5 (0) : . : : : : : : : :
## : : : : : : : : : :
##
## 3 Cacao.Percent Mean (sd) : 71.7 (6.3) 45 distinct values : 1795 0
## [numeric] min < med < max: : . (100.0%) (0.0%)
## 42 < 70 < 100 : :
## IQR (CV) : 5 (0.1) : :
## : : : : .
##
## 4 Rating Mean (sd) : 3.2 (0.5) 13 distinct values . : 1795 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 1 < 3.2 < 5 : :
## IQR (CV) : 0.6 (0.2) : : :
## . : : : :
## ------------------------------------------------------------------------------------------------------------------
partition1<-createDataPartition(y=carseats_dat$Rating, p=0.8, list = FALSE)
# Total number of rows
n1<-nrow(carseats_dat)
# Number of training rows (80%)
n_train1 <-round(0.8*n1)
# Create a vector of random row indices
train_indices1<- sample (1:n1, n_train1)
# Training subset of carseats_dat taken from those indices
train1 <-carseats_dat[train_indices1,]
# Test data
test1 <-carseats_dat[-train_indices1,]
nrow(test1)
## [1] 359
# Regression tree model (rpart, method = "anova")
car_model<-rpart(formula = Rating ~ .,
data = train1, method = "anova")
rpart.plot(car_model)
printcp(car_model)
##
## Regression tree:
## rpart(formula = Rating ~ ., data = train1, method = "anova")
##
## Variables actually used in tree construction:
## [1] Cacao.Percent
##
## Root node error: 330.8/1436 = 0.23036
##
## n= 1436
##
## CP nsplit rel error xerror xstd
## 1 0.054285 0 1.00000 1.00145 0.046383
## 2 0.024825 1 0.94571 0.95626 0.041817
## 3 0.011226 2 0.92089 0.93417 0.041450
## 4 0.010000 3 0.90966 0.93781 0.041827
plotcp(car_model)
car_model2 <- prune(car_model,cp=0.018346)
rpart.plot(car_model2)
car_predict <- predict(car_model2, test1, type= "vector")
car_predict
## 6 15 16 31 33 36 44 53
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 55 56 61 62 65 67 69 71
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 3.228595
## 72 88 103 112 120 124 131 134
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000
## 138 145 148 150 160 161 172 174
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000
## 177 178 182 183 188 195 197 203
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 3.228595
## 206 208 210 211 218 220 221 223
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 240 241 250 251 253 259 260 263
## 3.228595 3.228595 2.194444 3.228595 3.228595 3.228595 3.228595 3.000000
## 264 266 267 280 291 294 302 304
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 322 326 331 339 346 366 369 370
## 3.228595 3.228595 2.194444 3.228595 3.228595 3.228595 3.228595 3.228595
## 372 390 394 398 407 409 412 413
## 3.228595 3.000000 3.000000 3.228595 3.000000 3.000000 3.228595 3.000000
## 417 421 423 439 441 447 448 449
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 3.000000
## 459 470 474 478 481 490 495 496
## 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595
## 506 507 518 520 523 524 526 531
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 538 568 577 581 582 583 584 588
## 3.228595 3.228595 3.228595 3.000000 3.228595 3.228595 3.000000 3.228595
## 589 596 604 612 618 627 629 634
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595
## 645 650 661 662 665 667 672 673
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000 3.228595
## 678 680 682 687 697 699 704 707
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 708 710 711 727 733 741 742 749
## 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595
## 756 763 768 777 782 786 788 790
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 793 804 818 820 831 835 836 838
## 3.228595 3.228595 3.228595 3.228595 3.000000 2.194444 2.194444 3.228595
## 843 852 855 859 860 861 867 869
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 871 882 885 906 914 927 928 935
## 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595
## 942 952 963 965 971 974 978 979
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 981 984 985 991 993 996 997 1008
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 1012 1016 1021 1023 1028 1030 1039 1043
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 1046 1051 1053 1058 1076 1079 1090 1096
## 3.228595 3.228595 3.228595 3.000000 3.000000 3.228595 3.228595 3.228595
## 1100 1102 1105 1111 1116 1119 1125 1131
## 3.228595 3.228595 3.000000 3.228595 3.000000 3.228595 3.228595 3.228595
## 1135 1138 1139 1140 1153 1156 1160 1162
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000 3.228595
## 1164 1167 1169 1174 1176 1186 1198 1202
## 3.000000 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595
## 1206 1209 1212 1213 1214 1223 1232 1233
## 3.228595 3.000000 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595
## 1237 1244 1247 1262 1263 1264 1265 1268
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595
## 1289 1290 1302 1304 1319 1323 1324 1334
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 1335 1340 1345 1352 1353 1354 1358 1360
## 3.228595 3.000000 3.228595 3.000000 3.228595 3.228595 3.228595 3.000000
## 1369 1372 1374 1383 1394 1406 1412 1420
## 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595 2.194444 3.228595
## 1424 1428 1430 1454 1457 1461 1463 1475
## 3.228595 3.000000 3.228595 3.228595 3.000000 3.228595 3.228595 3.228595
## 1478 1485 1488 1490 1492 1500 1501 1502
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 1505 1506 1507 1510 1513 1516 1522 1527
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 1529 1534 1537 1547 1549 1558 1559 1561
## 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000
## 1564 1571 1572 1574 1577 1586 1591 1592
## 3.000000 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 1597 1598 1607 1610 1612 1620 1623 1624
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.228595 2.194444
## 1630 1634 1635 1636 1643 1648 1658 1663
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000
## 1666 1673 1676 1682 1684 1685 1687 1690
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
## 1699 1705 1712 1719 1720 1723 1726 1728
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.000000 3.000000
## 1735 1737 1738 1740 1741 1742 1746 1759
## 3.228595 3.000000 3.228595 3.228595 3.000000 2.194444 3.228595 3.228595
## 1760 1766 1768 1777 1782 1790 1795
## 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595 3.228595
plot(test1$Rating, car_predict, main="Predicted vs. Actual")
abline(0,1)
RMSE(pred=car_predict, test1$Rating)
## [1] 0.4648658
sd(carseats_dat$Rating)
## [1] 0.4780624
library("neuralnet") # librería
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
entrada_de_entrenamiento <- as.data.frame(datosmulti)
salida_de_entrenamiento <- sqrt(entrada_de_entrenamiento)
# Bind the input and output data into a single training data frame
datos_de_entrenamiento <-cbind(entrada_de_entrenamiento,salida_de_entrenamiento)
colnames(datos_de_entrenamiento) <- c("Entrada","Salida")
# Train the neural network
# The hidden layers contain 10, 5 and 2 nodes, respectively
# The threshold is a numeric value specifying the stopping criterion for the
# partial derivatives of the error function
net.sqrt <- neuralnet(Salida~Entrada, # response as a function of the input variable(s)
datos_de_entrenamiento, # the training data
hidden=c(10,5,2), # hidden-layer configuration
threshold=0.001) # error tolerance (stopping threshold)
# Plot the neural network
plot(net.sqrt)
# Test the neural network on some data
datos_de_pruebas <- as.data.frame((1:10)^2) # Generate some squared numbers
net.results <- compute(net.sqrt, datos_de_pruebas) # Run them through the neural network
# See which components the result object contains
ls(net.results)
## [1] "net.result" "neurons"
# Look at the results
print(net.results$net.result)
## [,1]
## [1,] 2012.325
## [2,] 2012.325
## [3,] 2012.325
## [4,] 2012.325
## [5,] 2012.325
## [6,] 2012.325
## [7,] 2012.325
## [8,] 2012.325
## [9,] 2012.325
## [10,] 2012.325
# Present a cleaner version of the results
cleanoutput <- cbind(datos_de_pruebas,sqrt(datos_de_pruebas), as.data.frame(net.results$net.result))
colnames(cleanoutput) <- c("Entrada","Resultado_esperado","Salida_RNN")
print(cleanoutput) # report the result
## Entrada Resultado_esperado Salida_RNN
## 1 1 1 2012.325
## 2 4 2 2012.325
## 3 9 3 2012.325
## 4 16 4 2012.325
## 5 25 5 2012.325
## 6 36 6 2012.325
## 7 49 7 2012.325
## 8 64 8 2012.325
## 9 81 9 2012.325
## 10 100 10 2012.325
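The constant prediction of 2012.325 (essentially the mean of Review.Date) suggests that the response was mis-mapped: cbind() produces eight columns but colnames() renames only the first two, so Salida ends up being Review.Date rather than a square root. A corrected sketch of the square-root example, restricted to a single input column and shown here only as an illustration:
# Sketch: train on a single input column so that Salida really is sqrt(Entrada)
set.seed(123)
entrenamiento <- data.frame(Entrada = runif(50, min = 1, max = 100))
entrenamiento$Salida <- sqrt(entrenamiento$Entrada)
net_sqrt <- neuralnet(Salida ~ Entrada, entrenamiento, hidden = 10, threshold = 0.01)
prueba <- data.frame(Entrada = (1:10)^2)
cbind(prueba,
      Esperado = sqrt(prueba$Entrada),
      Prediccion = compute(net_sqrt, prueba)$net.result)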
The exploratory analysis shows that the dataset contains both quantitative and qualitative variables: there are 4 numeric variables, for which the appropriate plot is a histogram,
and 5 character variables, for which the appropriate plot is a pie chart.
No outliers are visible in the ValueREF and Review.Date variables.
Possible outliers are visible in the Cacao.Percent and Rating variables; on closer inspection they are simply values near the minimum and the maximum of each variable.
There are no missing values in this dataset.
Although the elbow plot initially suggests selecting 4 clusters, the statistical procedure (pamk) indicates that 2 clusters should be selected.