source("more/cdc.R")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
  1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

There are 20,000 cases and 9 variables. genhlth is categorical (ordinal, since its levels run from excellent to poor). exerany, hlthplan, and smoke100 are categorical (nominal). height, weight, wtdesire, and age are numerical (continuous). gender is categorical (nominal).

  2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
prop.table(table(cdc$exerany))
## 
##      0      1 
## 0.2543 0.7457
prop.table(table(cdc$gender))
## 
##       m       f 
## 0.47845 0.52155
summary(cdc$gender)
##     m     f 
##  9569 10431
# There are 9,569 males in the sample.
prop.table(table(cdc$genhlth))
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
# About 23% of the sample reports being in excellent health.
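
The exercise also asks for the interquartile range, which the summaries report only indirectly. A minimal sketch, assuming nothing beyond the quartiles printed by summary() above; the vector names are hypothetical. On the full data, IQR(cdc$height) and IQR(cdc$age) would be the direct calls and should agree up to the rounding in summary().

```r
# Quartiles copied from the summary() output above (not recomputed from cdc).
height_q <- c(q1 = 64, q3 = 70)
age_q    <- c(q1 = 31, q3 = 57)

# IQR = third quartile minus first quartile.
height_iqr <- unname(height_q["q3"] - height_q["q1"])
age_iqr    <- unname(age_q["q3"] - age_q["q1"])

print(c(height = height_iqr, age = age_iqr))
```

From the printed quartiles this gives an IQR of 6 inches for height and 26 years for age.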
  3. What does the mosaic plot reveal about smoking habits and gender?

A larger proportion of men than women report having smoked at least 100 cigarettes in their lifetime.

  4. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23<- subset(cdc, age < 23)
under23_and_smoke <- subset(under23, smoke100==1)
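
The two-step filter above should be equivalent to a single subset() call with both conditions joined by &. A small sketch on a hypothetical toy data frame (the values are made up; only the column names match cdc):

```r
# Hypothetical toy data with the same column names as cdc.
toy <- data.frame(
  age      = c(19, 22, 23, 35, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Two steps: filter on age first, then on smoking status.
step1    <- subset(toy, age < 23)
two_step <- subset(step1, smoke100 == 1)

# One step: both conditions combined with &.
one_step <- subset(toy, age < 23 & smoke100 == 1)

# Both approaches keep the same rows.
print(all.equal(two_step, one_step))
```

On the real data the one-step form would be subset(cdc, age < 23 & smoke100 == 1).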
  5. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Those who reported excellent health tended to have lower BMIs than those who did not, and BMI tends to increase as self-reported health gets worse.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$smoke100)

Many people talk about using smoking to reduce weight. However, the box plots suggest that respondents who have smoked at least 100 cigarettes have about the same BMI as those who have not.
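
The BMI calculation used for the box plots can be wrapped in a small function to make the units explicit. This is only a sketch; bmi_from_imperial is a hypothetical helper name, and the example inputs are the median weight and median height reported by summary() elsewhere in this report.

```r
# BMI from pounds and inches; the factor 703 converts lb/in^2 to kg/m^2.
bmi_from_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Median weight (165 lb) and median height (67 in) from the summaries in this report.
example_bmi <- bmi_from_imperial(165, 67)
print(round(example_bmi, 2))
```

The same function works on whole columns, for example bmi_from_imperial(cdc$weight, cdc$height).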

On Your Own

ggplot(cdc,aes(x=weight, wtdesire))+geom_point()+scale_y_continuous(limits=c(0,700))+scale_x_continuous(limits=c(0,700))

Most points fall below the y = x line, where desired weight is less than current weight, meaning that most people want to lose weight.

wdiff<-cdc$wtdesire-cdc$weight
cdc<- cbind(cdc,wdiff)
summary(cdc$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
ggplot(cdc,aes(x=wdiff))+geom_histogram(binwidth = 10)+scale_x_continuous(limits=c(-150,50), breaks=seq(-150,50, by =10))
## Warning: Removed 66 rows containing non-finite values (stat_bin).

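Most people want to lose weight. The median desired change is -10 pounds and the third quartile is 0, so at least half of the respondents want to lose 10 pounds or more, and no more than a quarter want to gain weight. The mean (-14.59) is pulled below the median by extreme values, which I left off the histogram. The histogram is left skewed, with most values close to 0, showing that most people do not desire a radical weight change.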
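
The sign of wdiff says whether a respondent wants to lose weight (negative), stay the same (zero), or gain weight (positive), so the share in each group is just the mean of a logical comparison. A minimal sketch on hypothetical values, not taken from cdc:

```r
# Hypothetical desired-minus-actual weights in pounds (not taken from cdc).
wdiff_toy <- c(-40, -25, -10, -10, -5, 0, 0, 0, 5, 20)

share_lose <- mean(wdiff_toy < 0)   # want to weigh less
share_same <- mean(wdiff_toy == 0)  # content with current weight
share_gain <- mean(wdiff_toy > 0)   # want to weigh more

print(c(lose = share_lose, same = share_same, gain = share_gain))
```

On the real data the same comparisons would be applied to cdc$wdiff.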

cdcm<-subset(cdc,gender=="m")
cdcf<-subset(cdc,gender=="f")
summary(cdcf$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00
summary(cdcm$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")

ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")

#I'd like to look at them scaled the same way for a better comparison
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-300,500))

ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-300,500))

#without the outliers
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-40,20))
## Warning: Removed 919 rows containing non-finite values (stat_boxplot).

## Warning: Removed 919 rows containing non-finite values (stat_boxplot).

ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-40,20))
## Warning: Removed 1284 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1284 rows containing non-finite values (stat_boxplot).
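
One caveat about the plots without outliers: setting limits on scale_y_continuous() drops observations outside the range before the box plot statistics are computed, which is what the warnings above report. The boxes therefore summarize truncated data. coord_cartesian(ylim = ...) zooms the view without dropping rows. A sketch on hypothetical values (not taken from cdc), with the ggplot2 part guarded in case the package is not installed:

```r
# Hypothetical wdiff values with two extreme observations (not taken from cdc).
wdiff_toy <- c(-300, -50, -20, -10, -5, 0, 0, 10, 500)

# Values that survive limits = c(-40, 20); everything else is removed.
kept <- wdiff_toy[wdiff_toy >= -40 & wdiff_toy <= 20]

median_all  <- median(wdiff_toy)  # median of all values
median_kept <- median(kept)       # median after out-of-range values are dropped
print(c(all = median_all, kept = median_kept))

# Zooming with coord_cartesian() keeps every row in the box plot statistics.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(wdiff = wdiff_toy), aes(x = 1, y = wdiff)) +
    geom_boxplot() +
    coord_cartesian(ylim = c(-40, 20))
}
```

In this toy example the median shifts once the extreme values are dropped, so the two approaches can draw different boxes for the same data.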

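Men have many more outliers on the weight-gain side than women: the largest desired gain is 500 pounds among men versus 83 among women. Men also tend to want to lose less weight than women (median -5 versus -10 pounds, mean -10.71 versus -18.15).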
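
The per-gender comparison can also be done without splitting the data frame, by grouping a single column. A minimal sketch using base R tapply() on a hypothetical toy data frame (values are made up; only the column names match cdc):

```r
# Hypothetical toy data with the same column names as cdc.
toy <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  wdiff  = c(-20, -8, 10, -30, -12, 0)
)

# Median desired weight change within each gender.
medians <- tapply(toy$wdiff, toy$gender, median)
print(medians)
```

On the real data, tapply(cdc$wdiff, cdc$gender, summary) would give both summaries in one call.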

summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0
cdcmean<-subset(cdc, weight<190 & weight > 68)
count(cdcmean)/20000
##        n
## 1 0.7161

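71.6% of the weights fall strictly between 68 and 190 pounds. These bounds are the minimum and the third quartile from the summary above, not the mean plus or minus one standard deviation, so this figure only approximates the share within 1 standard deviation of the mean; the exact interval would use mean(cdc$weight) and sd(cdc$weight). The weights are not normal but right skewed (the mean, 169.7, is above the median, 165.0, and the maximum is 500), so it would not be surprising for the share to differ from the normal 68%.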
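
The proportion within one standard deviation of the mean can be computed directly from mean() and sd(). A minimal sketch on hypothetical weights; prop_within_sd is an illustrative helper name, not part of the lab code.

```r
# Share of values within k standard deviations of the mean.
prop_within_sd <- function(x, k = 1) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Hypothetical weights in pounds (not taken from cdc).
weights_toy <- c(120, 140, 150, 165, 170, 185, 190, 210, 260, 320)
print(prop_within_sd(weights_toy))
```

On the real data the call would be prop_within_sd(cdc$weight), which avoids hard-coding both the interval bounds and the sample size.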