file.choose()
bd12 <- read.csv("C:\\Users\\sofia\\OneDrive\\Documentos\\Usaarrests.csv")
bd13 <- bd12
rownames(bd13)<- bd13$Lugar
bd14 <- bd13
bd14 <- subset(bd14, select = -c(Lugar))
summary(bd14)
##      Murder          Assault         UrbanPop          Rape
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
##     cluster
##  Min.   :1.00
##  1st Qu.:1.25
##  Median :2.00
##  Mean   :2.32
##  3rd Qu.:3.00
##  Max.   :4.00
boxplot(bd14)
The boxplot shows outliers in Rape (values above the upper limit), but they will not be removed because they lie very close to the rest of the data.
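The outlier check above can also be done numerically. The following is a minimal sketch, assuming the CSV holds the same values as R's built-in `USArrests` dataset (the summary statistics match, but this is an assumption); `stats_rape` and `outlier_states` are illustrative names. It uses `boxplot.stats()`, which applies the same 1.5 × IQR rule that `boxplot()` uses to draw points beyond the whiskers.

```r
# Assumption: the built-in USArrests data stands in for the CSV used above.
data(USArrests)

# boxplot.stats() returns the five-number summary ($stats) and the
# values lying beyond 1.5 * IQR from the hinges ($out).
stats_rape <- boxplot.stats(USArrests$Rape, coef = 1.5)

# Map the flagged values back to the states they belong to.
outlier_states <- rownames(USArrests)[USArrests$Rape %in% stats_rape$out]

print(stats_rape$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outlier_states)
```

Comparing the flagged values with the upper whisker in `stats_rape$stats` gives a concrete measure of how far they sit from the rest of the data.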
# Standardize every column (mean 0, standard deviation 1) before clustering
bd15 <- as.data.frame(scale(bd14))
segmentos <- kmeans(bd15, 4)
segmentos
## K-means clustering with 4 clusters of sizes 5, 16, 21, 8
##
## Cluster means:
##       Murder    Assault   UrbanPop       Rape    cluster
## 1 -0.5299035 -1.0098676 -1.4742901 -0.8575347 -1.2706509
## 2 -0.4894375 -0.3826001  0.5758298 -0.2616538 -0.3080366
## 3  0.9681443  0.9765436  0.1370530  0.7978278  1.0212880
## 4 -1.2313140 -1.1670594 -0.5899924 -1.0350312 -1.2706509
##
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California
##              3              3              3              3              3
##       Colorado    Connecticut       Delaware        Florida        Georgia
##              3              2              2              3              3
##         Hawaii          Idaho       Illinois        Indiana           Iowa
##              2              4              3              2              4
##         Kansas       Kentucky      Louisiana          Maine       Maryland
##              2              1              3              4              3
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri
##              2              3              4              3              3
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey
##              1              4              3              4              2
##     New Mexico       New York North Carolina   North Dakota           Ohio
##              3              3              3              4              2
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
##              2              2              2              2              3
##   South Dakota      Tennessee          Texas           Utah        Vermont
##              1              3              3              2              1
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming
##              2              2              1              4              2
##
## Within cluster sum of squares by cluster:
## [1] 3.984580 16.212213 55.042281 3.875042
## (between_SS / total_SS = 67.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
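Because `kmeans()` starts from randomly chosen centers, the cluster labels and sizes printed above can change from one run to the next. The sketch below shows one common way to make the result repeatable: fix the seed and request several random starts with `nstart`. It assumes the built-in `USArrests` dataset stands in for the CSV, and it clusters only the four crime variables. The printed centers above include a `cluster` column, which suggests the CSV already carried labels from an earlier run that were scaled and clustered together with the other variables. The names `scaled`, `fit_a`, and `fit_b` and the seed value are illustrative.

```r
# Assumption: the built-in USArrests data stands in for the CSV used above.
data(USArrests)
scaled <- as.data.frame(scale(USArrests))

# Same seed + same arguments => same partition on every run.
set.seed(123)
fit_a <- kmeans(scaled, centers = 4, nstart = 25)

set.seed(123)
fit_b <- kmeans(scaled, centers = 4, nstart = 25)

# TRUE when both runs assign every state to the same cluster label.
print(identical(fit_a$cluster, fit_b$cluster))
print(fit_a$size)
```

With `nstart = 25`, `kmeans()` keeps the best of 25 random initializations (lowest total within-cluster sum of squares), which also reduces the chance of reporting a poor local optimum.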
asignacion <- cbind(bd14, cluster = segmentos$cluster)
head(asignacion,10)
##             Murder Assault UrbanPop Rape cluster cluster
## Alabama       13.2     236       58 21.2       4       3
## Alaska        10.0     263       48 44.5       3       3
## Arizona        8.1     294       80 31.0       3       3
## Arkansas       8.8     190       50 19.5       4       3
## California     9.0     276       91 40.6       3       3
## Colorado       7.9     204       78 38.7       3       3
## Connecticut    3.3     110       77 11.1       2       2
## Delaware       5.9     238       72 15.8       2       2
## Florida       15.4     335       80 31.9       3       3
## Georgia       17.4     211       60 25.8       4       3
write.csv(asignacion,"arrestos_segmentados.csv")
install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_cluster(segmentos, data = bd15,
             palette = c("red", "blue", "black", "darkgreen"),
             ellipse.type = "euclid",
             star.plot = TRUE,
             repel = TRUE,
             ggtheme = theme())
install.packages("data.table")
install.packages("cluster")
library(data.table)
library(cluster)
set.seed(123)
optimizacion <- clusGap(bd15, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
plot(optimizacion, xlab = "Numero de clusters k")
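The gap-statistic plot is usually read by eye, but the suggested number of clusters can also be extracted programmatically. The sketch below is one way to do that with `maxSE()` from the `cluster` package, applied to the `Tab` matrix that `clusGap()` returns. It assumes the built-in `USArrests` dataset stands in for the CSV and uses only the four crime variables; `gap` and `best_k` are illustrative names, and `"firstSEmax"` is just one of the selection rules `maxSE()` offers.

```r
library(cluster)

# Assumption: the built-in USArrests data stands in for the CSV used above.
data(USArrests)
scaled <- scale(USArrests)

set.seed(123)
gap <- clusGap(scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

# gap$Tab holds one row per k, with the gap value and its simulation
# standard error. maxSE() applies a selection rule to those two columns.
best_k <- maxSE(f = gap$Tab[, "gap"],
                SE.f = gap$Tab[, "SE.sim"],
                method = "firstSEmax")
print(best_k)
```

Different rules (for example `"Tibs2001SEmax"` or `"globalmax"`) can suggest different values of k, so the result is best treated as guidance alongside the plot rather than a definitive answer.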
In this analysis we built 4 clusters and found that the states closest to the axis in the cluster plot are those with the highest crime levels, mainly California, Nevada, New York, Arizona, and Colorado. In contrast, the states farthest from the axis are the safest, or those with the lowest crime levels, such as West Virginia, Vermont, and North Dakota.
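A reading like the one above can be cross-checked against the average of each variable per cluster, expressed in the original units rather than in standardized scores. The sketch below is a minimal illustration, assuming the built-in `USArrests` dataset stands in for the CSV; `fit` and `profile` are illustrative names, and the cluster numbering it produces will not necessarily match the labels printed earlier.

```r
# Assumption: the built-in USArrests data stands in for the CSV used above.
data(USArrests)

set.seed(123)
fit <- kmeans(scale(USArrests), centers = 4, nstart = 25)

# Mean of each original (unscaled) variable within each cluster.
profile <- aggregate(USArrests, by = list(cluster = fit$cluster), FUN = mean)

# Order clusters from lowest to highest average murder rate.
print(profile[order(profile$Murder), ])
```

Clusters with high means for Murder, Assault, and Rape correspond to the higher-crime groups, which makes it easier to say which states belong to the safer or less safe segments.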