Practical Data Science (Chapter 3)

Exploring Data

Data Summaries and typical problems

Missing values
Invalid values and outliers
Data ranges that are too wide or too narrow

Notice in the summary below that is.employed is missing for 328 of the 1,000 customers, roughly a third of the rows.

housing.type, recent.move, and num.vehicles are each missing 56 values.

# Load the tab-separated customer data; the first row holds the column names.
customer_data <- read.table('custdata.tsv', header=TRUE, sep='\t')
summary(customer_data)

##      custid        sex     is.employed         income      
##  Min.   :   2068   F:440   Mode :logical   Min.   : -8700  
##  1st Qu.: 345667   M:560   FALSE:73        1st Qu.: 14600  
##  Median : 693403           TRUE :599       Median : 35000  
##  Mean   : 698500           NA's :328       Mean   : 53505  
##  3rd Qu.:1044606                           3rd Qu.: 67000  
##  Max.   :1414286                           Max.   :615000  
##                                                            
##              marital.stat health.ins     
##  Divorced/Separated:155   Mode :logical  
##  Married           :516   FALSE:159      
##  Never Married     :233   TRUE :841      
##  Widowed           : 96   NA's :0        
##                                          
##                                          
##                                          
##                        housing.type recent.move      num.vehicles  
##  Homeowner free and clear    :157   Mode :logical   Min.   :0.000  
##  Homeowner with mortgage/loan:412   FALSE:820       1st Qu.:1.000  
##  Occupied with no rent       : 11   TRUE :124       Median :2.000  
##  Rented                      :364   NA's :56        Mean   :1.916  
##  NA's                        : 56                   3rd Qu.:2.000  
##                                                     Max.   :6.000  
##                                                     NA's   :56     
##       age              state.of.res
##  Min.   :  0.0   California  :100  
##  1st Qu.: 38.0   New York    : 71  
##  Median : 50.0   Pennsylvania: 70  
##  Mean   : 51.7   Texas       : 56  
##  3rd Qu.: 64.0   Michigan    : 52  
##  Max.   :146.7   Ohio        : 51  
##                  (Other)     :600
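The NA counts that summary() reports can also be pulled out directly, which is handy when a data frame has many columns. The following is a minimal sketch on a made-up toy data frame; the column names mirror customer_data, but `toy` and its values are assumptions for illustration only.

```r
# Hypothetical toy data; values are invented, not taken from custdata.tsv.
toy <- data.frame(
  is.employed  = c(TRUE, NA, FALSE, NA, TRUE),
  num.vehicles = c(2, 1, NA, 0, 3),
  income       = c(35000, 14600, 67000, -8700, 0)
)

# is.na() on a data frame gives a logical matrix;
# colSums() then counts the missing entries per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# colMeans() gives the fraction of rows missing in each column.
na_frac <- colMeans(is.na(toy))
print(na_frac)
```

The same two calls should work unchanged on customer_data once it is loaded.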

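Several values here look invalid. income has a negative minimum (-8700), and age runs from 0 to 146.7. Values like these may be data-entry errors or sentinel codes for "unknown", so they should be checked before any modeling.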

summary(customer_data$income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -8700   14600   35000   53500   67000  615000

summary(customer_data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    38.0    50.0    51.7    64.0   146.7
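One way to follow up on suspicious minimums and maximums is to pull out the offending rows and look at them. This is only a sketch on made-up data: `toy`, `max_plausible_age`, and the cutoff of 120 are assumptions, and what counts as "invalid" depends on the domain.

```r
# Hypothetical toy data; the extreme values echo the summary output,
# but the rows themselves are invented.
toy <- data.frame(
  custid = 1:6,
  income = c(35000, -8700, 14600, 67000, 615000, 0),
  age    = c(50, 38, 0, 146.7, 64, 29)
)

# Assumed cutoff for a believable customer age.
max_plausible_age <- 120

# Flag negative incomes and ages outside (0, max_plausible_age].
bad_income <- toy$income < 0
bad_age    <- toy$age <= 0 | toy$age > max_plausible_age

# Keep only the rows with at least one suspicious value.
suspect <- toy[bad_income | bad_age, ]
print(suspect)
```

Whether flagged rows are dropped, corrected, or converted to NA is a separate decision; the sketch only finds them.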

Spotting problems with graphics/visualizations

Histogram!

library(ggplot2)

ggplot(customer_data) +
  geom_histogram(aes(x=age),
                 binwidth=5, 
                 fill="gray")

Density plot!

library(scales)

ggplot(customer_data) + 
  geom_density(aes(x=income)) + 
  scale_x_continuous(labels=dollar)

Log-scaled density plot!

ggplot(customer_data) + geom_density(aes(x=income)) +
   scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) +
   annotation_logticks(sides="bt")

## Warning in scale$trans$trans(x): NaNs produced

## Warning: Removed 79 rows containing non-finite values (stat_density).
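These warnings are what a log scale produces for non-positive values: the logarithm of a negative number is NaN and the logarithm of zero is -Inf, and neither can be plotted. The 79 removed rows are presumably customers with zero or negative income, though the output does not break the number down. A minimal sketch on a made-up vector:

```r
# Hypothetical income values, including one negative and one zero.
income <- c(-8700, 0, 100, 14600, 35000)

# log10() warns on the negative value; suppress it to keep the output clean.
logged <- suppressWarnings(log10(income))
print(logged)

# Count the values a log-scaled plot would have to drop.
n_dropped <- sum(!is.finite(logged))
print(n_dropped)
```

Running `sum(customer_data$income <= 0)` on the real data would be one way to check the count against the warning.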

Bar charts!

ggplot(customer_data) +
  geom_bar(aes(x=marital.stat), fill="gray")

ggplot(customer_data) + geom_bar(aes(x=state.of.res), fill="gray") +
  coord_flip() +
  theme(axis.text.y = element_text(size=rel(0.8)))

Producing a bar chart with sorted categories.

state_sums <- table(customer_data$state.of.res)
statef <- as.data.frame(state_sums)
colnames(statef) <- c("state.of.res", "count")
summary(statef)

##      state.of.res     count       
##  Alabama   : 1    Min.   :  1.00  
##  Alaska    : 1    1st Qu.:  5.00  
##  Arizona   : 1    Median : 12.00  
##  Arkansas  : 1    Mean   : 20.00  
##  California: 1    3rd Qu.: 26.25  
##  Colorado  : 1    Max.   :100.00  
##  (Other)   :44

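# Reorder the factor levels by count so the bars are drawn in sorted order;
# otherwise ggplot2 keeps the default alphabetical order.
statef <- transform(statef,
                    state.of.res=reorder(state.of.res, count))

ggplot(statef) +
  geom_bar(aes(x=state.of.res, y=count),
           stat="identity",
           fill="gray") +
  coord_flip() +
  theme(axis.text.y=element_text(size=rel(0.8)))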
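Sorting the bars comes down to the order of the factor levels, because ggplot2 draws categories in level order. The sketch below shows what reorder() does to the levels on a small made-up table; `toy` is hypothetical, and the count for Alabama is invented.

```r
# Hypothetical toy table of per-state counts.
toy <- data.frame(
  state.of.res = factor(c("Alabama", "California", "Ohio", "Texas")),
  count        = c(5, 100, 51, 56)
)

# By default the levels are alphabetical.
print(levels(toy$state.of.res))

# reorder() sorts the levels by the values in count (ascending).
toy_sorted <- transform(toy, state.of.res = reorder(state.of.res, count))
sorted_levels <- levels(toy_sorted$state.of.res)
print(sorted_levels)
```

With coord_flip(), ascending level order puts the largest category at the top of the chart.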