Get the data.
customer_data <- read.table('custdata.tsv', header=TRUE, sep='\t')
summary(customer_data)
##      custid        sex     is.employed         income
##  Min.   :   2068   F:440   Mode :logical   Min.   : -8700
##  1st Qu.: 345667   M:560   FALSE:73        1st Qu.: 14600
##  Median : 693403           TRUE :599       Median : 35000
##  Mean   : 698500           NA's :328       Mean   : 53505
##  3rd Qu.:1044606                           3rd Qu.: 67000
##  Max.   :1414286                           Max.   :615000
##
##              marital.stat  health.ins
##  Divorced/Separated:155   Mode :logical
##  Married           :516   FALSE:159
##  Never Married     :233   TRUE :841
##  Widowed           : 96   NA's :0
##
##
##
##                        housing.type recent.move     num.vehicles
##  Homeowner free and clear    :157   Mode :logical   Min.   :0.000
##  Homeowner with mortgage/loan:412   FALSE:820       1st Qu.:1.000
##  Occupied with no rent       : 11   TRUE :124       Median :2.000
##  Rented                      :364   NA's :56        Mean   :1.916
##  NA's                        : 56                   3rd Qu.:2.000
##                                                     Max.   :6.000
##                                                     NA's   :56
##       age              state.of.res
##  Min.   :  0.0   California  :100
##  1st Qu.: 38.0   New York    : 71
##  Median : 50.0   Pennsylvania: 70
##  Mean   : 51.7   Texas       : 56
##  3rd Qu.: 64.0   Michigan    : 52
##  Max.   :146.7   Ohio        : 51
##                  (Other)     :600
The summary shows several NAs: is.employed has 328 of them, while housing.type, recent.move and num.vehicles each have 56. To drop or not to drop…
Many R analysis functions drop rows with missing data by default. If we recode the NAs in is.employed as an explicit “missing” category, those rows are kept.
customer_data$is.employed.fix <- ifelse(is.na(customer_data$is.employed),
                                        "missing",
                                        ifelse(customer_data$is.employed,
                                               "employed",
                                               "not employed"))
summary(as.factor(customer_data$is.employed.fix))
##     employed      missing not employed
##          599          328           73
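To see the row-dropping behavior in isolation, here is a minimal sketch on a made-up five-row data frame. The name `toy` and its values are illustrative assumptions, not part of custdata.tsv, and the sketch assumes the session still uses R's default `na.action` option.

```r
# Hypothetical toy data; names and values are illustrative only.
toy <- data.frame(
  y = c(10, 12, 9, 15, 11),
  employed = c(TRUE, NA, FALSE, TRUE, NA)
)

# With the default na.action (na.omit), lm() fits only the complete rows.
fit_raw <- lm(y ~ employed, data = toy)
n_raw <- nobs(fit_raw)

# Recode NA as its own category so that no row is lost.
toy$employed.fix <- ifelse(is.na(toy$employed), "missing",
                           ifelse(toy$employed, "employed", "not employed"))
fit_fix <- lm(y ~ employed.fix, data = toy)
n_fix <- nobs(fit_fix)

# Number of rows each fit actually used.
c(raw = n_raw, fixed = n_fix)
```

The first fit uses only the three complete rows, while the second uses all five, because "missing" is now an ordinary level of the variable.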
Missing numeric values can be filled with the column mean in the same way. Note that the column is named income, not Income: R names are case-sensitive, so customer_data$Income is NULL, and mean() on it only returns NA with a warning.
meanIncome <- mean(customer_data$income, na.rm=T)
Income.fix <- ifelse(is.na(customer_data$income),
                     meanIncome,
                     customer_data$income)
summary(Income.fix)
In this data set income has no NAs (see the summary above, where its mean is 53505), so Income.fix is identical to income.
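Mean imputation only changes anything when a column actually has gaps. A minimal sketch on a made-up vector follows; `toy_income` and the `was_missing` flag are illustrative assumptions, not part of the data set.

```r
# Hypothetical income values with two gaps; not taken from custdata.tsv.
toy_income <- c(20000, NA, 35000, 50000, NA)

# na.rm = TRUE computes the mean over the observed values only.
mean_income <- mean(toy_income, na.rm = TRUE)

# Keep a flag so later analysis can still tell which values were imputed.
was_missing <- is.na(toy_income)
income_fix <- ifelse(was_missing, mean_income, toy_income)

income_fix
```

Keeping the `was_missing` flag alongside the filled column is one possible design choice: it preserves the information that a value was absent, which the imputed number alone hides.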
Converting continuous variables to discrete
Here we add a Boolean column that records whether income is less than $20,000.
customer_data$income.lt.20K <- customer_data$income < 20000
summary(customer_data$income.lt.20K)
##    Mode   FALSE    TRUE    NA's
## logical     678     322       0
Converting ages into ranges with cut()
brks <- c(0, 25, 65, Inf)
customer_data$age.range <- cut(customer_data$age, breaks=brks, include.lowest=T)
summary(customer_data$age.range)
##   [0,25]  (25,65] (65,Inf]
##       56      732      212
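The include.lowest argument matters here because the data contains ages of exactly 0. The sketch below, on made-up ages placed at the break points (`ages` is an illustrative vector, not the customer data), shows what cut() does with and without it.

```r
# Hypothetical ages chosen to sit on or next to the break points.
ages <- c(0, 25, 26, 65, 66)
brks <- c(0, 25, 65, Inf)

# Default: intervals are (a, b], so an age of exactly 0 falls outside every bin.
default_bins <- cut(ages, breaks = brks)

# include.lowest = TRUE closes the first interval, giving [0,25].
closed_bins <- cut(ages, breaks = brks, include.lowest = TRUE)

table(closed_bins, useNA = "ifany")
```

Without include.lowest, the age of 0 becomes NA. With it, that age lands in the first bin, and the upper break points 25 and 65 still belong to the lower of their two neighboring intervals.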
Normalization and Rescaling
summary(customer_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0    38.0    50.0    51.7    64.0   146.7
mean_age <- mean(customer_data$age)
customer_data$age.normalized <- customer_data$age / mean_age
summary(customer_data$age.normalized)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.7350  0.9671  1.0000  1.2380  2.8370
sd_age <- sd(customer_data$age)
mean_age
## [1] 51.69981
sd_age
## [1] 18.86343
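One common use of a mean and standard deviation together is to center and scale a variable. The sketch below contrasts that with dividing by the mean, on made-up ages. The vector `ages` and the names `age_ratio` and `age_z` are illustrative assumptions, and whether the analysis needs this step depends on the model being built.

```r
# Hypothetical ages; not the customer data.
ages <- c(20, 40, 60, 80)

mean_age <- mean(ages)
sd_age <- sd(ages)   # sample standard deviation (n - 1 denominator)

# Ratio to the mean: a value of 1 means "typical age".
age_ratio <- ages / mean_age

# Centered and scaled: 0 means "typical age", +1 or -1 means one sd away.
age_z <- (ages - mean_age) / sd_age

data.frame(ages, age_ratio, age_z)
```

The ratio keeps the original sign and zero point, while the centered and scaled version expresses each age as a distance from the mean in units of standard deviation.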