source("more/cdc.R")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
There are 20,000 cases and 9 variables. genhlth is categorical and ordinal, since its levels run from excellent to poor. exerany, hlthplan, smoke100, and gender are categorical and nominal. height, weight, wtdesire, and age are numerical and continuous.
Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
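The question also asks for the interquartile range, which is the third quartile minus the first. A minimal sketch using the quartiles printed above; the variable names and the toy vector are illustrative and not part of the lab code:

```r
# Quartiles copied from the summary() output above
height_q1 <- 64; height_q3 <- 70
age_q1 <- 31; age_q3 <- 57

height_iqr <- height_q3 - height_q1  # 70 - 64
age_iqr <- age_q3 - age_q1           # 57 - 31

# IQR() computes the same quantity directly from data.
# toy_heights is a made-up vector, not the cdc data.
toy_heights <- c(60, 64, 67, 70, 72)
toy_iqr <- IQR(toy_heights)
```

With the data loaded, IQR(cdc$height) and IQR(cdc$age) should agree with the differences computed from the summaries.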
table(cdc$exerany)/20000
##
## 0 1
## 0.2543 0.7457
table(cdc$gender)/20000
##
## m f
## 0.47845 0.52155
summary(cdc$gender)
## m f
## 9569 10431
## There are 9569 males in the sample.
table(cdc$genhlth)/20000
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
## About 23% of the sample reports excellent health.
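The relative frequencies above divide each table by a hard-coded 20000. A sketch of an alternative that does not depend on knowing the sample size, shown on a made-up factor (toy_gender is hypothetical, not the cdc data):

```r
# Toy stand-in for cdc$gender
toy_gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(toy_gender)      # absolute frequencies
rel_freq <- prop.table(counts)   # counts divided by their total

# Counts and proportions side by side
freq_summary <- cbind(n = counts, proportion = rel_freq)
```

Applied to the lab data, prop.table(table(cdc$gender)) should reproduce the proportions printed above.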
More men than women smoke.
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23 <- subset(cdc, age < 23)
under23_and_smoke <- subset(under23, smoke100==1)
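The two-step filter can also be written as a single subset() call by joining the conditions with &. A sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Toy stand-in with the two columns the filter needs
toy <- data.frame(
  age = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Keep rows that satisfy both conditions at once
under23_and_smoke_toy <- subset(toy, age < 23 & smoke100 == 1)
```

The intermediate under23 object is only needed if it is reused elsewhere.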
Those who reported having excellent health had lower BMIs than those who did not. BMI tends to increase as self-reported health gets worse.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$smoke100)
Many people talk about using smoking to reduce weight. However, the boxplots suggest that smokers and non-smokers have about the same BMI distribution.
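The BMI formula used above (weight in pounds divided by height in inches squared, times 703) can be wrapped in a small function and compared across groups numerically as well as visually. A sketch on made-up values; bmi_from_imperial and toy are hypothetical names:

```r
# BMI from weight in pounds and height in inches
bmi_from_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Toy stand-in for the cdc columns
toy <- data.frame(
  weight = c(150, 200, 120),
  height = c(65, 70, 62),
  smoke100 = c(0, 1, 0)
)
toy$bmi <- bmi_from_imperial(toy$weight, toy$height)

# Median BMI within each smoking group
median_bmi <- tapply(toy$bmi, toy$smoke100, median)
```

Comparing group medians this way gives a number to set beside the boxplots.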
ggplot(cdc,aes(x=weight, wtdesire))+geom_point()+scale_y_continuous(limits=c(0,700))+scale_x_continuous(limits=c(0,700))
Most points fall below the y = x line, where desired weight is less than current weight, meaning that most people want to lose weight.
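The visual impression from the scatterplot can be checked by counting how many respondents have a desired weight below their current weight. A sketch on made-up values (the toy vectors are hypothetical):

```r
# Toy stand-ins for cdc$weight and cdc$wtdesire
weight_toy <- c(150, 200, 120, 180)
wtdesire_toy <- c(140, 180, 125, 180)

# TRUE where the point lies below the y = x line
below_line <- wtdesire_toy < weight_toy

# The mean of a logical vector is the proportion of TRUE values
prop_want_to_lose <- mean(below_line)
```

On the lab data the same idea would be mean(cdc$wtdesire < cdc$weight).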
Consider the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning the result to a new object called wdiff.
wdiff <- cdc$wtdesire - cdc$weight
cdc<- cbind(cdc,wdiff)
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
wdiff is a continuous numerical variable. If it is 0, the person is at the weight they want to be. If it is negative, they want to lose weight; if it is positive, they want to gain weight.
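The sign interpretation of wdiff can be made explicit by labelling each value. A sketch on made-up values; wdiff_toy and the label text are illustrative:

```r
# Toy wdiff values: desired weight minus current weight
wdiff_toy <- c(-10, 0, 15)

# Negative: wants to lose; zero: at desired weight; positive: wants to gain
goal <- ifelse(wdiff_toy < 0, "lose",
               ifelse(wdiff_toy > 0, "gain", "same"))

# Tabulate the labels
goal_counts <- table(goal)
```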
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
summary(cdc$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
ggplot(cdc,aes(x=wdiff))+geom_histogram(binwidth = 10)+scale_x_continuous(limits=c(-150,50), breaks=seq(-150,50, by =10))
## Warning: Removed 66 rows containing non-finite values (stat_bin).
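Most people want to lose weight. The median of wdiff is -10, so half of the respondents want to lose at least 10 pounds, and the middle 50% fall between -21 and 0 (an IQR of 21). The mean (-14.59) is below the median because it is pulled down by respondents who want to lose a large amount, so the distribution is left skewed. The 66 observations outside the range -150 to 50 were left off the histogram. Most people do not desire a radical weight change.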
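One quick check on the direction of skew is to compare the mean with the median, and to count how many values the plot limits leave out. A sketch on made-up values (wdiff_toy is hypothetical; the limits match the histogram call above):

```r
# Toy wdiff values with one large negative outlier
wdiff_toy <- c(-300, -21, -10, -10, 0, 0, 5)

center <- c(mean = mean(wdiff_toy), median = median(wdiff_toy))

# A mean below the median points to a longer left tail
mean_below_median <- center[["mean"]] < center[["median"]]

# Values that limits = c(-150, 50) would drop from the plot
n_outside <- sum(wdiff_toy < -150 | wdiff_toy > 50)
```

This is a rule of thumb rather than a formal test of skewness.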
cdcm<-subset(cdc,gender=="m")
cdcf<-subset(cdc,gender=="f")
summary(cdcf$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
summary(cdcm$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")
ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")
#I'd like to look at them scaled the same way for a better comparison
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-300,500))
ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-300,500))
#without the outliers
ggplot(cdcm,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-40,20))
## Warning: Removed 919 rows containing non-finite values (stat_boxplot).
## Warning: Removed 919 rows containing non-finite values (stat_boxplot).
ggplot(cdcf,aes(y=wdiff,x=1))+geom_boxplot()+stat_boxplot(geom = "errorbar")+scale_y_continuous(limits=c(-40,20))
## Warning: Removed 1284 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1284 rows containing non-finite values (stat_boxplot).
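Men tend to want to lose less weight than women: the median desired change is -5 pounds for men versus -10 for women, and the mean is -10.71 versus -18.15. Men also have many more high outliers, respondents who want to gain weight; their maximum is 500, compared with 83 for women.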
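The comparison between men and women can also be made without splitting the data into two objects, by grouping on gender. A sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Toy stand-in with gender and wdiff columns
toy <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  wdiff = c(-20, -5, 0, -27, -10, 0)
)

# Median wdiff within each gender
medians <- tapply(toy$wdiff, toy$gender, median)

# Full five-number summary plus mean for each gender
by_gender <- tapply(toy$wdiff, toy$gender, summary)
```

Mapping gender to the x axis, as in ggplot(cdc, aes(x = gender, y = wdiff)) + geom_boxplot(), would likewise put both boxplots on a shared scale in one figure.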
Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
cdcmean<-subset(cdc, weight<190 & weight > 68)
count(cdcmean)/20000
## n
## 1 0.7161
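71.6% of weights fall strictly between 68 and 190 pounds. These cutoffs are the minimum and the third quartile from the summary rather than the mean minus and plus one standard deviation, so 71.6% is only a rough stand-in for the requested proportion; the exact bounds come from mean(cdc$weight) and sd(cdc$weight). The weights are not normally distributed but slightly right skewed (the mean, 169.7, exceeds the median, 165, and the maximum is 500), so a proportion that differs from the normal 68% would not be surprising.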
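The proportion within one standard deviation of the mean can be computed directly from mean() and sd(). A sketch on made-up weights (weight_toy is hypothetical, so the resulting proportion says nothing about the cdc data):

```r
# Toy stand-in for cdc$weight
weight_toy <- c(120, 140, 165, 170, 190, 250)

m <- mean(weight_toy)
s <- sd(weight_toy)

# TRUE where a weight lies within one standard deviation of the mean
within_1sd <- abs(weight_toy - m) < s

# Proportion of weights inside the interval (m - s, m + s)
prop_within <- mean(within_1sd)
```

Using mean() on the logical vector also avoids hard-coding the sample size in the denominator.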