Notice in the summary below that is.employed is missing for 328 of the 1,000 customers, about a third of the data.
housing.type, recent.move, and num.vehicles are each missing 56 values.
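# Read the tab-separated customer file; the first row holds the column names.
# stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0),
# which is what the factor-style summary output below reflects.
customer_data <- read.table('custdata.tsv', header=TRUE, sep='\t', stringsAsFactors=TRUE)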
summary(customer_data)
##      custid        sex      is.employed         income
##  Min.   :   2068   F:440   Mode :logical   Min.   : -8700
##  1st Qu.: 345667   M:560   FALSE:73        1st Qu.: 14600
##  Median : 693403           TRUE :599       Median : 35000
##  Mean   : 698500           NA's :328       Mean   : 53505
##  3rd Qu.:1044606                           3rd Qu.: 67000
##  Max.   :1414286                           Max.   :615000
##
##              marital.stat health.ins
##  Divorced/Separated:155   Mode :logical
##  Married           :516   FALSE:159
##  Never Married     :233   TRUE :841
##  Widowed           : 96   NA's :0
##
##
##
##                        housing.type recent.move      num.vehicles
##  Homeowner free and clear    :157   Mode :logical   Min.   :0.000
##  Homeowner with mortgage/loan:412   FALSE:820       1st Qu.:1.000
##  Occupied with no rent       : 11   TRUE :124       Median :2.000
##  Rented                      :364   NA's :56        Mean   :1.916
##  NA's                        : 56                   3rd Qu.:2.000
##                                                     Max.   :6.000
##                                                     NA's   :56
##       age              state.of.res
##  Min.   :  0.0   California  :100
##  1st Qu.: 38.0   New York    : 71
##  Median : 50.0   Pennsylvania: 70
##  Mean   : 51.7   Texas       : 56
##  3rd Qu.: 64.0   Michigan    : 52
##  Max.   :146.7   Ohio        : 51
##                  (Other)     :600
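The missing-value counts that summary() reports can also be pulled out directly. Here is a minimal sketch on a toy data frame (the `toy` values are made up for illustration and only borrow the column names from customer_data). The identical count of 56 for housing.type, recent.move, and num.vehicles suggests, but does not prove, that the same rows are missing all three, so the sketch also counts rows where two columns are missing together.

```r
# Toy stand-in for customer_data; the values are made up for illustration.
toy <- data.frame(
  is.employed  = c(TRUE, NA, FALSE, NA, TRUE),
  housing.type = c("Rented", NA, "Rented", "Homeowner with mortgage/loan", "Rented"),
  num.vehicles = c(1, NA, 2, 0, 3),
  income       = c(14600, 35000, -8700, 67000, 53500)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows where two columns are missing together hint at a shared cause.
both_missing <- sum(is.na(toy$housing.type) & is.na(toy$num.vehicles))
print(both_missing)
```

Running the same two expressions on customer_data would show whether the 56 missing values really do line up row by row.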
A few values here look wrong. For instance, a negative income (-8700)? An age of 0? An age of 146.7?
summary(customer_data$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -8700   14600   35000   53500   67000  615000
summary(customer_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0    38.0    50.0    51.7    64.0   146.7
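Once the summaries point at suspicious extremes, the offending rows can be pulled out for a closer look. This is only a sketch on a toy data frame: the extremes (-8700, 0, 146.7, 615000) are copied from the summaries, the pairing of values into rows is made up, and `max_plausible_age` is a hypothetical cutoff, not something the data defines.

```r
# Toy stand-in for customer_data; row pairings are made up for illustration.
toy <- data.frame(
  income = c(-8700, 14600, 35000, 615000, 67000),
  age    = c(38, 0, 50, 146.7, 64)
)

# max_plausible_age is a hypothetical cutoff chosen only for this sketch.
max_plausible_age <- 110

# Keep rows with a negative income, a non-positive age, or an implausibly high age.
suspicious <- subset(toy, income < 0 | age <= 0 | age > max_plausible_age)
print(suspicious)
```

Whether such rows are data-entry errors or sentinel values (for example, 0 standing in for "unknown age") is a question for whoever produced the data.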
Histogram!
library(ggplot2)
ggplot(customer_data) +
  geom_histogram(aes(x=age),
                 binwidth=5,
                 fill="gray")
Density plot!
library(scales)
ggplot(customer_data) +
  geom_density(aes(x=income)) +
  scale_x_continuous(labels=dollar)
Log-scaled density plot!
ggplot(customer_data) +
  geom_density(aes(x=income)) +
  scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) +
  annotation_logticks(sides="bt")
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 79 rows containing non-finite values (stat_density).
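The two warnings are most likely caused by incomes that are zero or negative, since a logarithm is undefined there. The 79 dropped rows are presumably the customers with non-positive income, though that has not been checked against the data here. A small sketch of the arithmetic, on a made-up vector in which only -8700 comes from the summary:

```r
# Incomes chosen to cover the three cases; only -8700 comes from the summary.
income <- c(-8700, 0, 100, 35000)

# log10 of a negative number is NaN (R also warns); log10(0) is -Inf.
logged <- suppressWarnings(log10(income))
print(logged)

# Neither NaN nor -Inf is finite, so a log-scaled plot has to drop both.
dropped <- sum(!is.finite(logged))
print(dropped)
```

Counting `sum(customer_data$income <= 0)` on the real data would confirm or refute the guess about the 79 rows.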
Bar Charts!
ggplot(customer_data) +
  geom_bar(aes(x=marital.stat), fill="gray")
ggplot(customer_data) +
  geom_bar(aes(x=state.of.res), fill="gray") +
  coord_flip() +
  theme(axis.text.y = element_text(size=rel(0.8)))
Producing a bar chart with sorted categories.
state_sums <- table(customer_data$state.of.res)
statef <- as.data.frame(state_sums)
colnames(statef) <- c("state.of.res", "count")
summary(statef)
##      state.of.res     count
##  Alabama   : 1    Min.   :  1.00
##  Alaska    : 1    1st Qu.:  5.00
##  Arizona   : 1    Median : 12.00
##  Arkansas  : 1    Mean   : 20.00
##  California: 1    3rd Qu.: 26.25
##  Colorado  : 1    Max.   :100.00
##  (Other)   :44
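# Re-level state.of.res by count so the bars are drawn in sorted order;
# without this step the states stay in alphabetical order.
statef <- transform(statef, state.of.res=reorder(state.of.res, count))
ggplot(statef) +
  geom_bar(aes(x=state.of.res, y=count),
           stat="identity",
           fill="gray") +
  coord_flip() +
  theme(axis.text.y=element_text(size=rel(0.8)))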
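Sorting the bars comes down to the order of the factor levels, because ggplot draws categories in level order. A minimal sketch of how reorder() changes that order, using a toy data frame with three of the state counts shown in the first summary:

```r
# Three states with the counts reported in the summary of customer_data.
toy <- data.frame(
  state.of.res = factor(c("California", "Ohio", "Texas")),
  count        = c(100, 51, 56)
)

# Default factor levels are alphabetical.
print(levels(toy$state.of.res))

# reorder() re-levels the factor by the value of count, ascending.
toy$state.of.res <- reorder(toy$state.of.res, toy$count)
print(levels(toy$state.of.res))
```

With coord_flip(), the first level is drawn at the bottom, so ascending order puts the largest state at the top of the chart.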