Class

Machine learning: the study of learning from data (data-driven) in order to make predictions from observations.

3 types of classification algorithms

Supervised classification

Semi-supervised classification

Unsupervised classification

Topic 1.

Objects:

A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9)

Step 1. Determine the number of groups or clusters

3

Step 2. Randomly select the centroids

C1 = A1 C2 = A4 C3 = A7

Step 3. Assign each object to the nearest centroid

$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

Iteration 1

Object: distance(object, centroid)

A1: d(A1,A1) = 0 (A1 is itself a centroid), d(A1,A4) = 3.61, d(A1,A7) = 8.06, so A1 ∈ Cluster 1

A2: d(A2,A1) = 5, d(A2,A4) = 4.24, d(A2,A7) = 3.16, so A2 is closest to A7

A3, A4, A5, A6, A7, A8: computed in the same way.

Summary of iteration 1

Cluster 1 {A1}, 2{A2,A7}, 3{A3,A4,A5,A6,A8}
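The distance calculations of iteration 1 can be checked with a few lines of base R. This is only a minimal sketch: the names `objs`, `cents`, `dist_to_cents` and `nearest` are illustrative and do not come from the notes. The labels C1, C2, C3 follow the order chosen in Step 2, and cluster numbers are only labels, so the numbering may differ from a hand-written summary.

```r
# Objects A1..A8 from the notes, one row per object
objs <- matrix(c(2, 10,  2, 5,  8, 4,  5, 8,  7, 5,  6, 4,  1, 2,  4, 9),
               ncol = 2, byrow = TRUE,
               dimnames = list(paste0("A", 1:8), c("x", "y")))

# Initial centroids chosen in Step 2: C1 = A1, C2 = A4, C3 = A7
cents <- objs[c("A1", "A4", "A7"), ]
rownames(cents) <- c("C1", "C2", "C3")

# Euclidean distance from every object (rows) to every centroid (columns)
dist_to_cents <- sapply(rownames(cents), function(k) {
  sqrt((objs[, "x"] - cents[k, "x"])^2 + (objs[, "y"] - cents[k, "y"])^2)
})

# Each object goes to the centroid with the smallest distance
nearest <- colnames(dist_to_cents)[apply(dist_to_cents, 1, which.min)]
names(nearest) <- rownames(objs)

print(round(dist_to_cents, 2))
print(nearest)
```

The first two rows of the printed matrix should reproduce the distances written out for A1 and A2.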

Step 4. Update each centroid's position to the mean position of the objects that belong to its group or cluster.

C1 = (2,10) C2 = (1.5,3.5) C3 = (6,6)
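The centroid update is just a per-cluster mean of the coordinates. The sketch below, in base R, takes the grouping from the iteration 1 summary as given; `objs`, `grupo` and `nuevos_centroides` are illustrative names, not part of the notes.

```r
# Objects A1..A8 from the notes
objs <- data.frame(x = c(2, 2, 8, 5, 7, 6, 1, 4),
                   y = c(10, 5, 4, 8, 5, 4, 2, 9),
                   row.names = paste0("A", 1:8))

# Cluster labels as written in the iteration 1 summary:
# 1 = {A1}, 2 = {A2, A7}, 3 = {A3, A4, A5, A6, A8}
grupo <- c(A1 = 1, A2 = 2, A3 = 3, A4 = 3, A5 = 3, A6 = 3, A7 = 2, A8 = 3)

# New centroid of each cluster = mean x and mean y of its members
nuevos_centroides <- aggregate(objs, by = list(cluster = grupo), FUN = mean)
print(nuevos_centroides)
```

The three rows should match C1 = (2,10), C2 = (1.5,3.5) and C3 = (6,6).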

Step 5. Repeat steps 3 and 4 until the centroids stop moving, or move less than a threshold distance at each step

Summary of iteration 2

Cluster 1 {A1,A8}, 2{A2,A7}, 3{A3,A4,A5,A6}

Updated centroids: C1 = (3,9.5) C2 = (1.5,3.5) C3 = (6.5,5.25)

Summary of iteration 3

Cluster 1 {A1,A4,A8}, 2{A2,A7}, 3{A3,A5,A6}

Updated centroids: C1 = (3.67,9) C2 = (1.5,3.5) C3 = (7,4.33)
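Steps 3 to 5 can be put together in a short loop. This is a sketch in base R under stated assumptions, not the implementation used by `kmeans()`: the threshold `umbral` and all variable names are hypothetical, and the code does not handle a cluster becoming empty (which does not happen with these eight objects).

```r
objs <- as.matrix(data.frame(x = c(2, 2, 8, 5, 7, 6, 1, 4),
                             y = c(10, 5, 4, 8, 5, 4, 2, 9),
                             row.names = paste0("A", 1:8)))

cents  <- objs[c("A1", "A4", "A7"), ]  # Step 2: C1 = A1, C2 = A4, C3 = A7
umbral <- 1e-9                         # hypothetical movement threshold
iter   <- 0

repeat {
  iter <- iter + 1
  # Step 3: squared distance of every object to every centroid, then nearest
  d2 <- sapply(seq_len(nrow(cents)), function(k) {
    rowSums((objs - matrix(cents[k, ], nrow(objs), 2, byrow = TRUE))^2)
  })
  grupo <- apply(d2, 1, which.min)
  # Step 4: move each centroid to the mean of its members
  nuevos <- t(sapply(seq_len(nrow(cents)), function(k) {
    colMeans(objs[grupo == k, , drop = FALSE])
  }))
  # Step 5: stop when no centroid moved more than the threshold
  movimiento <- max(sqrt(rowSums((nuevos - cents)^2)))
  cents <- nuevos
  if (movimiento < umbral) break
}

print(iter)
print(grupo)
print(round(cents, 2))
```

With these starting centroids the loop stops on its fourth pass, once the groups {A1, A4, A8}, {A3, A5, A6} and {A2, A7} no longer change. Cluster numbers here follow the order of the starting centroids.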

Theory

# 1. Create the dataset

x <- c(2, 2, 8, 5, 7, 6, 1, 4)
y <- c(10, 5, 4, 8, 5, 4, 2, 9)

agrup <- data.frame(x, y)


# 2. Determine the number of groups
grupos <- 3

# 3. Run the clustering
segmentos <- kmeans(agrup, grupos)
segmentos
## K-means clustering with 3 clusters of sizes 2, 1, 5
## 
## Cluster means:
##     x    y
## 1 4.5  8.5
## 2 2.0 10.0
## 3 4.8  4.0
## 
## Clustering vector:
## [1] 2 3 3 1 3 3 3 1
## 
## Within cluster sum of squares by cluster:
## [1]  1.0  0.0 44.8
##  (between_SS / total_SS =  54.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# 4. Review the group assignment
asignación <- cbind(agrup, cluster=segmentos$cluster)
asignación
##   x  y cluster
## 1 2 10       2
## 2 2  5       3
## 3 8  4       3
## 4 5  8       1
## 5 7  5       3
## 6 6  4       3
## 7 1  2       3
## 8 4  9       1
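When `kmeans()` receives only a number of groups it picks random starting centroids, so two runs can return different groupings, and the result need not match a hand calculation. The sketch below shows two ways to control this with arguments that `kmeans()` already provides (`centers`, `algorithm`, `nstart`); the object names `inicio`, `seg_manual` and `seg_multi` are illustrative.

```r
x <- c(2, 2, 8, 5, 7, 6, 1, 4)
y <- c(10, 5, 4, 8, 5, 4, 2, 9)
agrup <- data.frame(x, y)

# Option 1: start from the hand-picked centroids A1, A4, A7 (rows 1, 4, 7)
# and use the plain assign/update iteration (Lloyd)
inicio <- agrup[c(1, 4, 7), ]
seg_manual <- kmeans(agrup, centers = inicio, algorithm = "Lloyd")
print(seg_manual$cluster)
print(seg_manual$centers)

# Option 2: keep random starts, but fix the seed and keep the best of 25 runs
set.seed(123)
seg_multi <- kmeans(agrup, centers = 3, nstart = 25)
print(seg_multi$tot.withinss)
```

Comparing `tot.withinss` between runs is a simple way to see which grouping is tighter: lower is better for a fixed number of groups.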
# 5. Plot
# install.packages("factoextra")
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_cluster(segmentos, data=agrup,
             palette=c("red","blue","darkgreen"),
             ellipse.type = "euclid",
             star.plot = T,
             repel = T,
             ggtheme = theme())
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

# 6. Optimize the number of groups
library(cluster)
library(data.table)
set.seed(123)
optimización <- clusGap(agrup, FUN=kmeans,nstart=1,K.max = 7)
plot(optimización, xlab="Number of clusters K")

# The highest point of the plot (the largest gap statistic) suggests the optimal number of groups
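The value of K can also be read from the `clusGap()` result instead of from the plot. The sketch below assumes the `cluster` package is installed and uses its `maxSE()` helper; `gap`, `tabla`, `k_max_gap` and `k_1se` are illustrative names. The argument is written out as `FUNcluster`, its full name, which `FUN=` matches by partial matching.

```r
library(cluster)  # provides clusGap() and maxSE()

x <- c(2, 2, 8, 5, 7, 6, 1, 4)
y <- c(10, 5, 4, 8, 5, 4, 2, 9)
agrup <- data.frame(x, y)

set.seed(123)
gap <- clusGap(agrup, FUNcluster = kmeans, nstart = 1, K.max = 7,
               verbose = FALSE)

tabla <- gap$Tab                        # columns: logW, E.logW, gap, SE.sim
k_max_gap <- which.max(tabla[, "gap"])  # highest point of the curve

# A more conservative rule that takes the simulation error into account
k_1se <- maxSE(tabla[, "gap"], tabla[, "SE.sim"], method = "firstSEmax")

print(k_max_gap)
print(k_1se)
```

The two rules can disagree. With only eight objects the gap curve is noisy, so either value is a suggestion rather than a definitive answer.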

Activity

Importing the dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()     masks data.table::between()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::first()       masks data.table::first()
## ✖ lubridate::hour()    masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ dplyr::last()        masks data.table::last()
## ✖ lubridate::mday()    masks data.table::mday()
## ✖ lubridate::minute()  masks data.table::minute()
## ✖ lubridate::month()   masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second()  masks data.table::second()
## ✖ purrr::transpose()   masks data.table::transpose()
## ✖ lubridate::wday()    masks data.table::wday()
## ✖ lubridate::week()    masks data.table::week()
## ✖ lubridate::yday()    masks data.table::yday()
## ✖ lubridate::year()    masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ventas <- read.csv("ventas.csv")
summary(ventas)
##     BillNo            Itemname            Quantity            Date          
##  Length:522064      Length:522064      Min.   :-9600.00   Length:522064     
##  Class :character   Class :character   1st Qu.:    1.00   Class :character  
##  Mode  :character   Mode  :character   Median :    3.00   Mode  :character  
##                                        Mean   :   10.09                     
##                                        3rd Qu.:   10.00                     
##                                        Max.   :80995.00                     
##                                                                             
##      Hour               Price              CustomerID       Country         
##  Length:522064      Min.   :-11062.060   Min.   :12346    Length:522064     
##  Class :character   1st Qu.:     1.250   1st Qu.:13950    Class :character  
##  Mode  :character   Median :     2.080   Median :15265    Mode  :character  
##                     Mean   :     3.827   Mean   :15317                      
##                     3rd Qu.:     4.130   3rd Qu.:16837                      
##                     Max.   : 13541.330   Max.   :18287                      
##                                          NA's   :134041                     
##      Total          
##  Min.   :-11062.06  
##  1st Qu.:     3.75  
##  Median :     9.78  
##  Mean   :    19.69  
##  3rd Qu.:    17.40  
##  Max.   :168469.60  
## 
#count(ventas,BillNo, sort=TRUE)
#count(ventas,Itemname, sort=TRUE)
#count(ventas,Date, sort=TRUE)
#count(ventas,Hour, sort=TRUE)
#count(ventas,Country, sort=TRUE)
# How many NAs are there in the dataset?

sum(is.na(ventas))
## [1] 134041
# How many NAs are there per variable?

sapply(ventas, function(x) sum(is.na(x)))
##     BillNo   Itemname   Quantity       Date       Hour      Price CustomerID 
##          0          0          0          0          0          0     134041 
##    Country      Total 
##          0          0
ventasact <- ventas %>% select("BillNo","CustomerID","Total") %>%  na.omit() %>%  filter(Total > 0)

ticket <- aggregate(Total ~ CustomerID + BillNo, data = ventasact, FUN = sum)

ticket_promedio <- aggregate(Total ~ CustomerID, data = ticket, FUN = mean)

visitas <- ventasact %>% group_by(CustomerID) %>% summarise(Visitas = n_distinct(BillNo))

objetos <- merge(ticket_promedio, visitas, by="CustomerID")

rownames(objetos) <- objetos$CustomerID
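To see what these aggregation steps produce, it helps to run them on a handful of rows. The data frame below is hypothetical toy data, not taken from `ventas.csv`, and the `*_toy` names are illustrative. The sketch uses base R only, with `tapply()` standing in for the `n_distinct()` count.

```r
# Hypothetical toy sales lines (not taken from ventas.csv)
toy <- data.frame(
  BillNo     = c("b1", "b1", "b2", "b3", "b3", "b4"),
  CustomerID = c(101, 101, 101, 202, 202, 202),
  Total      = c(10, 5, 30, 8, 2, 20)
)

# One row per bill: total spent on that bill
ticket_toy <- aggregate(Total ~ CustomerID + BillNo, data = toy, FUN = sum)

# One row per customer: mean bill total (the average ticket)
promedio_toy <- aggregate(Total ~ CustomerID, data = ticket_toy, FUN = mean)

# One row per customer: number of distinct bills (the visits)
n_bills <- tapply(toy$BillNo, toy$CustomerID, function(b) length(unique(b)))
visitas_toy <- data.frame(CustomerID = as.numeric(names(n_bills)),
                          Visitas = as.vector(n_bills))

objetos_toy <- merge(promedio_toy, visitas_toy, by = "CustomerID")
print(objetos_toy)
```

Customer 101 has bills of 15 and 30, so an average ticket of 22.5 over 2 visits. Customer 202 has bills of 10 and 20, so an average ticket of 15 over 2 visits.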


# Outliers are the values that fall outside the following limits:
# Lower limit = Q1 - 1.5*IQR
# Upper limit = Q3 + 1.5*IQR
# Q1: first quartile, Q3: third quartile

# Visits column
IQR_V <- IQR(objetos$Visitas)
LI_V <- 1-1.5*IQR_V   # Q1 = 1
LS_V <- 5+1.5*IQR_V   # Q3 = 5
objetos <- objetos[objetos$Visitas <=11,]
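The quartiles do not have to be typed in by hand: `quantile()` returns them directly. The sketch below applies the same 1.5*IQR rule to a hypothetical vector of visit counts (not real data); `visitas_toy`, `lim_inf`, `lim_sup` and `dentro` are illustrative names.

```r
# Hypothetical visit counts with one extreme customer (not real data)
visitas_toy <- c(1, 1, 2, 2, 3, 4, 5, 5, 6, 40)

q   <- quantile(visitas_toy, probs = c(0.25, 0.75))  # Q1 and Q3
iqr <- IQR(visitas_toy)                              # Q3 - Q1
lim_inf <- q[[1]] - 1.5 * iqr
lim_sup <- q[[2]] + 1.5 * iqr

# Keep only the values inside the limits
dentro <- visitas_toy[visitas_toy >= lim_inf & visitas_toy <= lim_sup]

print(c(lim_inf, lim_sup))
print(dentro)
```

Computing the limits from the data keeps the filter consistent if the dataset changes, whereas hard-coded quartiles have to be updated by hand.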


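# Average ticket column
# Assumption: CustomerID is dropped here (it is already kept in the row names)
# so that only the two numeric features remain for clustering
objetos <- objetos[, c("Visitas", "Total")]
colnames(objetos) <- c("Visitas","TicketPromedio")
IQR_TP <- IQR(objetos$TicketPromedio)
LI_TP <- 178.30-1.5*IQR_TP   # Q1 = 178.30
LS_TP <- 426.63+1.5*IQR_TP   # Q3 = 426.63
objetos <- objetos[objetos$TicketPromedio <=791.69, ]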
# 2. Determine the number of groups
gruposact <- 4

# 3. Run the clustering
segmentosact <- kmeans(objetos, gruposact)


# 4. Review the group assignment
asignaciónact <- cbind(objetos, cluster=segmentosact$cluster)

# 5. Plot
# install.packages("factoextra")
library(ggplot2)
library(factoextra)

fviz_cluster(segmentosact, data=objetos,
             palette=c("red","blue","darkgreen","yellow"),
             ellipse.type = "euclid",
             star.plot = T,
             repel = T,
             ggtheme = theme())

# 6. Optimize the number of groups
library(cluster)
library(data.table)
set.seed(123)
optimización <- clusGap(objetos, FUN=kmeans,nstart=1,K.max = 99)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 188200)

## Warning: did not converge in 10 iterations

plot(optimización, xlab="Number of clusters K")

# The highest point of the plot (the largest gap statistic) suggests the optimal number of groups
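The "did not converge in 10 iterations" warnings come from the default `iter.max = 10` of `kmeans()`. Since `clusGap()` forwards extra arguments to the clustering function (this is how `nstart` reaches `kmeans()`), a larger `iter.max` can be passed the same way. The sketch below shows this on hypothetical random data standing in for `objetos`. The smaller `K.max` and `B` are assumptions chosen only to keep the example fast.

```r
library(cluster)  # provides clusGap()

set.seed(123)
# Hypothetical stand-in for `objetos`: 300 random customers, two features
toy <- data.frame(Visitas = rpois(300, 3) + 1,
                  TicketPromedio = rgamma(300, shape = 2, scale = 150))

# iter.max and nstart are passed through to kmeans()
gap_toy <- clusGap(toy, FUNcluster = kmeans, K.max = 8, B = 20,
                   nstart = 5, iter.max = 50, verbose = FALSE)

print(round(gap_toy$Tab, 3))
```

Raising `iter.max` gives each fit more room to converge, and a moderate `K.max` avoids fitting dozens of groups that are unlikely to be useful for customer segments.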