Managing Data


Cleaning Data



Missing Values

Get the data.

customer_data <- read.table('custdata.tsv', header=T, sep='\t')
summary(customer_data)
##      custid        sex     is.employed         income      
##  Min.   :   2068   F:440   Mode :logical   Min.   : -8700  
##  1st Qu.: 345667   M:560   FALSE:73        1st Qu.: 14600  
##  Median : 693403           TRUE :599       Median : 35000  
##  Mean   : 698500           NA's :328       Mean   : 53505  
##  3rd Qu.:1044606                           3rd Qu.: 67000  
##  Max.   :1414286                           Max.   :615000  
##                                                            
##              marital.stat health.ins     
##  Divorced/Separated:155   Mode :logical  
##  Married           :516   FALSE:159      
##  Never Married     :233   TRUE :841      
##  Widowed           : 96   NA's :0        
##                                          
##                                          
##                                          
##                        housing.type recent.move      num.vehicles  
##  Homeowner free and clear    :157   Mode :logical   Min.   :0.000  
##  Homeowner with mortgage/loan:412   FALSE:820       1st Qu.:1.000  
##  Occupied with no rent       : 11   TRUE :124       Median :2.000  
##  Rented                      :364   NA's :56        Mean   :1.916  
##  NA's                        : 56                   3rd Qu.:2.000  
##                                                     Max.   :6.000  
##                                                     NA's   :56     
##       age              state.of.res
##  Min.   :  0.0   California  :100  
##  1st Qu.: 38.0   New York    : 71  
##  Median : 50.0   Pennsylvania: 70  
##  Mean   : 51.7   Texas       : 56  
##  3rd Qu.: 64.0   Michigan    : 52  
##  Max.   :146.7   Ohio        : 51  
##                  (Other)     :600

The summary shows several columns with NAs. is.employed has 328, while recent.move, num.vehicles, and housing.type each have 56. To drop or not to drop…
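One way to count NAs per column, and to check whether the NAs in two columns fall on the same rows, is sketched below. The `toy` data frame is a made-up stand-in for customer_data, so the counts are illustrative only.

```r
# Hypothetical toy data standing in for customer_data
toy <- data.frame(
  is.employed  = c(TRUE, NA, FALSE, NA, TRUE),
  recent.move  = c(FALSE, TRUE, NA, FALSE, NA),
  num.vehicles = c(1, 2, NA, 0, NA)
)

# Number of NAs in each column
na_counts <- colSums(is.na(toy))
na_counts

# Do recent.move and num.vehicles have NAs in exactly the same rows?
same_rows <- identical(is.na(toy$recent.move), is.na(toy$num.vehicles))
same_rows
```

If several columns are missing on the same rows, those rows probably share a cause (for example, a failed join), which bears on whether dropping them is safe.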

Many R analysis functions drop rows with missing data by default. If we recode the NAs as an explicit “missing” category, those rows are kept.
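A small sketch of this effect, using a made-up `toy` data frame and `lm()` as the example analysis function. It assumes the default `na.action` option (`na.omit`) is in effect.

```r
# Hypothetical toy data: two of six rows have a missing predictor
toy <- data.frame(
  y        = c(1, 2, 3, 4, 5, 6),
  employed = c(TRUE, FALSE, NA, TRUE, NA, FALSE)
)

# With the default na.action, lm() silently drops the two NA rows
fit_na <- lm(y ~ employed, data = toy)
n_na <- nobs(fit_na)

# Recode NA as its own category so every row is retained
toy$employed.fix <- ifelse(is.na(toy$employed), "missing",
                           ifelse(toy$employed, "employed", "not employed"))
fit_fix <- lm(y ~ employed.fix, data = toy)
n_fix <- nobs(fit_fix)

# Rows used by each fit
c(with_NA = n_na, recoded = n_fix)
```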

customer_data$is.employed.fix <- ifelse(is.na(customer_data$is.employed), 
                                   "missing",
                                   ifelse(customer_data$is.employed==T,
                                          "employed",
                                          "not employed"))

summary(as.factor(customer_data$is.employed.fix))
##     employed      missing not employed 
##          599          328           73
# Impute missing incomes with the mean of the observed values.
# income has no NAs in this data set, so Income.fix matches income here.
meanIncome <- mean(customer_data$income, na.rm=T)
Income.fix <- ifelse(is.na(customer_data$income),
                     meanIncome,
                     customer_data$income)
summary(Income.fix)

Data Transformations

Converting continuous variables to discrete

Here we add a Boolean variable indicating whether income is less than $20,000.

customer_data$income.lt.20K <- customer_data$income < 20000
summary(customer_data$income.lt.20K)
##    Mode   FALSE    TRUE    NA's 
## logical     678     322       0

Converting ages into ranges with cut()

brks <- c(0, 25, 65, Inf)
customer_data$age.range <- cut(customer_data$age, breaks=brks, include.lowest=T)
summary(customer_data$age.range)
##   [0,25]  (25,65] (65,Inf] 
##       56      732      212
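To see how cut() treats the boundary values, here is a minimal sketch on a made-up vector of ages. By default the intervals are closed on the right, so a value equal to the lowest break is left out unless include.lowest=TRUE is set.

```r
brks <- c(0, 25, 65, Inf)
ages <- c(0, 25, 26, 65, 70)   # hypothetical ages chosen to hit the boundaries

# With include.lowest=TRUE the first interval is [0,25], so age 0 is kept:
# 0 and 25 fall in [0,25]; 26 and 65 in (25,65]; 70 in (65,Inf]
with_lowest <- cut(ages, breaks = brks, include.lowest = TRUE)
as.character(with_lowest)

# Without it the first interval is (0,25], and age 0 becomes NA
without_lowest <- cut(ages, breaks = brks)
as.character(without_lowest)
```

This matters for the customer data because the minimum age in the summary is 0.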

Normalization and Rescaling

summary(customer_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    38.0    50.0    51.7    64.0   146.7
mean_age <- mean(customer_data$age)
customer_data$age.normalized <- customer_data$age / mean_age

summary(customer_data$age.normalized)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.7350  0.9671  1.0000  1.2380  2.8370
sd_age <- sd(customer_data$age)
mean_age
## [1] 51.69981
sd_age
## [1] 18.86343
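The notes compute sd_age but do not use it. A common next step, shown here as an assumption about where this was heading, is to standardize the variable by subtracting the mean and dividing by the standard deviation. The sketch uses a made-up `ages` vector rather than customer_data.

```r
ages <- c(20, 35, 50, 65, 80)   # hypothetical ages

mean_age <- mean(ages)
sd_age   <- sd(ages)

# Standardize: the result is in units of standard deviations from the mean
ages_z <- (ages - mean_age) / sd_age
ages_z

# Base R's scale() performs the same centering and scaling
ages_scaled <- as.vector(scale(ages))
all.equal(ages_z, ages_scaled)
```

A standardized value near 0 is a typical age, while a value beyond about 2 or below about -2 is unusually old or young relative to the rest of the data.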