students <- read.csv("C:/Users/LUIS 1/Desktop/MachineLearningR/data/t1/data-conversion.csv")
str(students)
## 'data.frame': 10 obs. of 5 variables:
## $ Age : int 23 13 36 31 58 29 39 50 23 36
## $ State : chr "NJ" "NY" "NJ" "VA" ...
## $ Gender: chr "F" "M" "M" "F" ...
## $ Height: int 61 55 66 64 70 63 67 70 61 66
## $ Income: int 5000 1000 3000 4000 30000 10000 50000 55000 2000 20000
head(students)
## Age State Gender Height Income
## 1 23 NJ F 61 5000
## 2 13 NY M 55 1000
## 3 36 NJ M 66 3000
## 4 31 VA F 64 4000
## 5 58 NY F 70 30000
## 6 29 TX F 63 10000
Two vectors are created: one with the break points and one with the category labels.
bp <- c(-Inf, 10000, 31000, Inf)
names <- c("Low", "Average", "High")
The cut() function divides the range of the vector into intervals and codes each value according to the interval it falls in. The leftmost interval corresponds to level one, the next one to level two, and so on. By default the intervals are closed on the right, so an income of exactly 10000 falls in "Low".
print(students$Income.cat <- cut(students$Income, breaks = bp, labels = names))
## [1] Low Low Low Low Average Low High High Low
## [10] Average
## Levels: Low Average High
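To make the boundary behaviour concrete, here is a minimal sketch that compares the default right-closed intervals with left-closed ones (`right = FALSE`). The `income` vector copies the Income column shown above so the block runs without the CSV file. The names `labs`, `right.closed` and `left.closed` are illustrative and not part of the original script.

```r
# Income values copied from the str() output above, so no CSV is needed
income <- c(5000, 1000, 3000, 4000, 30000, 10000, 50000, 55000, 2000, 20000)
bp <- c(-Inf, 10000, 31000, Inf)
labs <- c("Low", "Average", "High")   # illustrative name for the labels

# Default: intervals are (a, b], so 10000 belongs to the first interval
right.closed <- cut(income, breaks = bp, labels = labs)

# right = FALSE: intervals are [a, b), so 10000 moves to the second interval
left.closed <- cut(income, breaks = bp, labels = labs, right = FALSE)

# Side-by-side comparison; only the row with income 10000 differs
print(data.frame(income, right.closed, left.closed))
```

Which convention fits depends on how the thresholds are defined, for example "up to and including 10000" versus "10000 or more".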
print(students$Income.cat2 <- cut(students$Income, breaks = bp))
## [1] (-Inf,1e+04] (-Inf,1e+04] (-Inf,1e+04] (-Inf,1e+04]
## [5] (1e+04,3.1e+04] (-Inf,1e+04] (3.1e+04, Inf] (3.1e+04, Inf]
## [9] (-Inf,1e+04] (1e+04,3.1e+04]
## Levels: (-Inf,1e+04] (1e+04,3.1e+04] (3.1e+04, Inf]
print(students$Income.cat3 <- cut(students$Income,
                                  breaks = 4,
                                  labels = c("Level 1", "Level 2",
                                             "Level 3", "Level 4")))
## [1] Level 1 Level 1 Level 1 Level 1 Level 3 Level 1 Level 4 Level 4 Level 1
## [10] Level 2
## Levels: Level 1 Level 2 Level 3 Level 4
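When `breaks` is a single number, cut() splits the range of the data into that many equal-width intervals and moves the outer limits out by 0.1% of the range so that the minimum and maximum are included. The sketch below leaves out the labels to show the intervals R actually computes, then counts how many values fall in each. The `income` vector again copies the Income column shown above, and `income.bins` is an illustrative name.

```r
# Income values copied from the str() output above
income <- c(5000, 1000, 3000, 4000, 30000, 10000, 50000, 55000, 2000, 20000)

# Four equal-width intervals; without labels the levels show the cut points
income.bins <- cut(income, breaks = 4)
print(levels(income.bins))

# Number of observations per interval
print(table(income.bins))
```

Equal-width intervals can be quite unbalanced on skewed data such as income, as the counts show here.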
The dummies package causes problems under R version 4, so the calls below are left commented out.
students <- read.csv("C:/Users/LUIS 1/Desktop/MachineLearningR/data/t1/data-conversion.csv")
#install.packages("dummies")
#library(dummies)
#students.dummy <- dummy.data.frame(students, sep = ".")
#names(students.dummy)
#dummy(students$State, sep=".")
#dummy.data.frame(students, names = c("State", "Gender"), sep = ".")
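Since the dummies calls above are commented out, one possible alternative is base R's model.matrix(), which needs no extra package. This is a swapped-in technique, not the approach of the original script. The sketch assumes a small data frame, `students6`, built by hand from the six rows printed by head(students) above. The names `students6`, `state.dummy` and `gender.dummy` are illustrative.

```r
# First six rows of the data, copied from the head() output above
students6 <- data.frame(
  Age    = c(23, 13, 36, 31, 58, 29),
  State  = c("NJ", "NY", "NJ", "VA", "NY", "TX"),
  Gender = c("F", "M", "M", "F", "F", "F"),
  stringsAsFactors = FALSE
)

# "- 1" drops the intercept so every category gets its own 0/1 column
state.dummy  <- model.matrix(~ State - 1, data = students6)
gender.dummy <- model.matrix(~ Gender - 1, data = students6)

# Combine the numeric column with the indicator columns
students.dummy <- cbind(students6["Age"], state.dummy, gender.dummy)
print(names(students.dummy))
print(students.dummy)
```

Each variable is expanded separately because a single formula such as `~ State + Gender - 1` would drop one Gender level to avoid collinearity.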