Exploring Data


Data summaries and typical problems

  • Missing values
  • Invalid values and outliers
  • Data ranges that are too wide or too narrow



Notice in the summary below that is.employed is missing for 328 of the 1,000 customers, roughly a third of the data.

housing.type, recent.move, and num.vehicles are also missing data: 56 values each.

customer_data <- read.table('custdata.tsv', header=TRUE, sep='\t')
summary(customer_data)
##      custid        sex     is.employed         income      
##  Min.   :   2068   F:440   Mode :logical   Min.   : -8700  
##  1st Qu.: 345667   M:560   FALSE:73        1st Qu.: 14600  
##  Median : 693403           TRUE :599       Median : 35000  
##  Mean   : 698500           NA's :328       Mean   : 53505  
##  3rd Qu.:1044606                           3rd Qu.: 67000  
##  Max.   :1414286                           Max.   :615000  
##                                                            
##              marital.stat health.ins     
##  Divorced/Separated:155   Mode :logical  
##  Married           :516   FALSE:159      
##  Never Married     :233   TRUE :841      
##  Widowed           : 96   NA's :0        
##                                          
##                                          
##                                          
##                        housing.type recent.move      num.vehicles  
##  Homeowner free and clear    :157   Mode :logical   Min.   :0.000  
##  Homeowner with mortgage/loan:412   FALSE:820       1st Qu.:1.000  
##  Occupied with no rent       : 11   TRUE :124       Median :2.000  
##  Rented                      :364   NA's :56        Mean   :1.916  
##  NA's                        : 56                   3rd Qu.:2.000  
##                                                     Max.   :6.000  
##                                                     NA's   :56     
##       age              state.of.res
##  Min.   :  0.0   California  :100  
##  1st Qu.: 38.0   New York    : 71  
##  Median : 50.0   Pennsylvania: 70  
##  Mean   : 51.7   Texas       : 56  
##  3rd Qu.: 64.0   Michigan    : 52  
##  Max.   :146.7   Ohio        : 51  
##                  (Other)     :600
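
The NA counts are scattered across the summary output, so it can help to pull them out per column. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical stand-ins, not rows from custdata.tsv); the same two calls should work on customer_data.

```r
# Toy stand-in for customer_data (hypothetical values, not the real file).
toy <- data.frame(
  is.employed  = c(TRUE, NA, FALSE, NA, TRUE),
  housing.type = c("Rented", NA, "Rented",
                   "Homeowner with mortgage/loan", "Rented"),
  num.vehicles = c(1, NA, 2, 0, 3),
  income       = c(14600, 35000, -8700, 67000, 615000)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Fraction of rows missing, which makes columns easier to compare.
na_fraction <- colMeans(is.na(toy))
print(round(na_fraction, 2))
```

On the real data, colSums(is.na(customer_data)) should report the same NA counts that summary() shows.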



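A few values in this summary look suspect: a negative income (-8700), a minimum age of 0, and a maximum age of 146.7. Summarizing the two columns on their own makes these easier to see.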

summary(customer_data$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -8700   14600   35000   53500   67000  615000
summary(customer_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    38.0    50.0    51.7    64.0   146.7
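
Once suspicious values show up in a summary, a natural next step is to count and inspect the offending rows. This is a rough sketch on made-up data; the toy data frame and the max_plausible_age cutoff of 120 are assumptions chosen for illustration, not rules taken from the data set.

```r
# Toy data reusing the extreme values seen in the summaries above.
toy <- data.frame(
  income = c(-8700, 14600, 35000, 67000, 615000),
  age    = c(0, 38, 50, 64, 146.7)
)

# Assumed cutoff for a believable age; adjust to suit the domain.
max_plausible_age <- 120

# Logical flags for each kind of suspect value.
bad_income <- toy$income < 0
bad_age    <- toy$age <= 0 | toy$age > max_plausible_age

# Show the flagged rows, then count each kind of problem.
print(toy[bad_income | bad_age, ])
print(c(negative_income = sum(bad_income), implausible_age = sum(bad_age)))
```

Flagging is deliberately separate from fixing here: whether an age of 0 is an error or a code for "unknown" is something to confirm with whoever produced the data.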

Spotting problems with graphics/visualizations

Histogram!

library(ggplot2)

ggplot(customer_data) +
  geom_histogram(aes(x=age),
                 binwidth=5, 
                 fill="gray")

Density plot!

library(scales)

ggplot(customer_data) + 
  geom_density(aes(x=income)) + 
  scale_x_continuous(labels=dollar)

Log-scaled density plot!

ggplot(customer_data) + geom_density(aes(x=income)) +
   scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) +
   annotation_logticks(sides="bt")
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 79 rows containing non-finite values (stat_density).
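
The two warnings come from the log transform: log10 of a negative number is NaN and log10 of zero is -Inf, so rows with non-positive income are dropped before the density is estimated. A small sketch on a made-up income vector (the values are hypothetical) shows the effect:

```r
# Hypothetical incomes, including one negative value and one zero.
income <- c(-8700, 0, 100, 14600, 35000)

# log10 warns on the negative value; suppress the warning for the demo.
logged <- suppressWarnings(log10(income))
print(logged)

# Non-finite results (NaN, -Inf) are what the plot has to discard.
non_finite <- !is.finite(logged)
print(sum(non_finite))

# The discarded entries are exactly the non-positive incomes.
stopifnot(sum(non_finite) == sum(income <= 0))
```

By the same logic, sum(customer_data$income <= 0) should presumably match the 79 rows reported in the warning.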

Bar Charts!

ggplot(customer_data) +
  geom_bar(aes(x=marital.stat), fill="gray")

ggplot(customer_data) + geom_bar(aes(x=state.of.res), fill="gray") +
  coord_flip() +
  theme(axis.text.y = element_text(size=rel(0.8)))

Producing a bar chart with sorted categories.

state_sums <- table(customer_data$state.of.res)
statef <- as.data.frame(state_sums)
colnames(statef) <- c("state.of.res", "count")
summary(statef)
##      state.of.res     count       
##  Alabama   : 1    Min.   :  1.00  
##  Alaska    : 1    1st Qu.:  5.00  
##  Arizona   : 1    Median : 12.00  
##  Arkansas  : 1    Mean   : 20.00  
##  California: 1    3rd Qu.: 26.25  
##  Colorado  : 1    Max.   :100.00  
##  (Other)   :44
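# Reorder the factor levels by count so the bars are drawn in sorted order.
statef <- transform(statef, state.of.res=reorder(state.of.res, count))
ggplot(statef) +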
  geom_bar(aes(x=state.of.res, y=count),
           stat="identity",
           fill="gray") +
  coord_flip() +
  theme(axis.text.y=element_text(size=rel(0.8)))
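
ggplot2 draws the bars of a factor in the order of its levels, which is alphabetical by default. Sorting the bars therefore comes down to reordering the levels by count. The sketch below uses base R's reorder() on a tiny made-up table (state and toyf are hypothetical names) to show how the level order changes:

```r
# Hypothetical state column: Ohio x3, Texas x2, Alaska x1.
state <- factor(c("Ohio", "Texas", "Ohio", "Alaska", "Texas", "Ohio"))

# Tabulate and convert to a data frame, as done for statef above.
toyf <- as.data.frame(table(state))
colnames(toyf) <- c("state", "count")

# Default level order is alphabetical.
before <- levels(toyf$state)
print(before)

# reorder() sorts the levels by the accompanying numeric value (ascending).
toyf$state <- reorder(toyf$state, toyf$count)
after <- levels(toyf$state)
print(after)
```

With coord_flip(), ascending level order places the largest count at the top of the chart.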