rm(list=ls())
source("more/cdc.R")
cat("There are",nrow(cdc),"cases in the dataset and", ncol(cdc),"columns or variables.")
## There are 20000 cases in the dataset and 9 columns or variables.
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
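genhlth and gender are categorical variables (the levels of genhlth have a natural order, so it is ordinal). height, weight, wtdesire and age are numeric variables recorded as whole numbers; weight, wtdesire and age are stored as integers, while height is stored as a double.
exerany, hlthplan and smoke100 are dichotomous categorical variables, coded 0/1 and stored as numeric.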
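The classification above can be checked programmatically. The following is only a sketch: `toy` is a hypothetical miniature stand-in for cdc with made-up rows, and the 0/1 test is one possible heuristic for spotting indicator columns, not a general rule.

```r
# Hypothetical miniature stand-in for cdc; the rows are made up for illustration.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 60),
  weight  = c(175L, 125L, 105L),
  gender  = factor(c("m", "f", "f"), levels = c("m", "f"))
)

# Storage class of each column (factor, numeric or integer)
col_class <- sapply(toy, class)
print(col_class)

# Heuristic: a numeric column holding only 0s and 1s is really a
# dichotomous categorical variable stored as a number.
is_indicator <- sapply(toy, function(x) is.numeric(x) && all(x %in% c(0, 1)))
print(names(toy)[is_indicator])
```

The storage class alone does not settle the statistical type: exerany is numeric in storage but categorical in meaning, which is why the second check is useful.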
Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
cat("Interquartile range for height is",IQR(cdc$height),"\n")
## Interquartile range for height is 6
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
cat("Interquartile range for age is",IQR(cdc$age))
## Interquartile range for age is 26
Relative frequency distribution for gender:
table(cdc$gender)/20000
##
## m f
## 0.47845 0.52155
Relative frequency distribution for exerany:
table(cdc$exerany)/20000
##
## 0 1
## 0.2543 0.7457
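Dividing by a hard-coded 20000 works here, but it breaks silently if the number of rows changes. A minimal sketch of an alternative, assuming a made-up response vector `exer` rather than the real cdc column, uses `prop.table()` so the denominator is taken from the data itself:

```r
# Hypothetical 0/1 responses, made up for illustration (not the cdc data).
exer <- c(0, 1, 1, 1, 0, 1, 1, 1)

# table() counts each value; prop.table() divides the counts by their total
rel_freq <- prop.table(table(exer))
print(rel_freq)
```

On the full data set the same idea would read `prop.table(table(cdc$exerany))`.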
library(plyr)
count(cdc$gender[cdc$gender=="m"])
## x freq
## 1 m 9569
count(cdc$genhlth[cdc$genhlth=="excellent"])
## x freq
## 1 excellent 4657
There are 9569 males in the sample; 23.285% of the total sample reports excellent health.
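The same count and proportion can be obtained in base R without plyr. This is a sketch on made-up vectors (`gender` and `genhlth` below are hypothetical, not the cdc columns); it relies on the fact that TRUE counts as 1 when a logical vector is summed or averaged.

```r
# Hypothetical responses, made up for illustration.
gender  <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))
genhlth <- factor(c("good", "excellent", "fair", "excellent", "good"))

# Number of males: sum of a logical vector counts the TRUE entries
n_male <- sum(gender == "m")

# Proportion in excellent health: mean of a logical vector is the share of TRUEs
p_excellent <- mean(genhlth == "excellent")

cat("Males:", n_male, " Proportion excellent:", p_excellent, "\n")
```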
What does the mosaic plot reveal about smoking habits and gender?
According to the plot, a larger proportion of males than females in this sample have smoked at least 100 cigarettes.
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke<-subset(cdc,age<23 & smoke100==1)
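For comparison, the same filter can be written with logical indexing instead of `subset()`. The sketch below uses a hypothetical five-row data frame `toy`, not the cdc data, and checks that both forms select the same rows.

```r
# Hypothetical respondents, made up for illustration.
toy <- data.frame(age = c(19, 22, 23, 40, 21),
                  smoke100 = c(1, 0, 1, 1, 1))

# subset() evaluates the condition inside the data frame
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Logical indexing spells out the data frame name for each column
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]

print(by_subset)
print(isTRUE(all.equal(by_subset, by_index)))
```

Note that the respondent aged exactly 23 is excluded, because the condition is a strict inequality.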
Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:
\[ BMI = \frac{weight~(lb)}{height~(in)^2} * 703 \]
703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).
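The factor 703 can be derived from the unit definitions. The sketch below assumes the standard definitions of the pound (0.45359237 kg) and the inch (0.0254 m); the helper names `bmi_imperial` and `bmi_metric` are hypothetical and exist only for this illustration. The test person (175 lb, 70 in) is the first respondent shown in the str() output.

```r
kg_per_lb <- 0.45359237   # kilograms in one pound
m_per_in  <- 0.0254       # meters in one inch

# lb / in^2 -> kg / m^2 : multiply by kg_per_lb and divide by m_per_in^2
conversion <- kg_per_lb / m_per_in^2
print(round(conversion, 2))

# Hypothetical helpers for the two unit systems
bmi_imperial <- function(weight_lb, height_in) weight_lb / height_in^2 * 703
bmi_metric   <- function(weight_kg, height_m)  weight_kg / height_m^2

# Both routes should agree up to the rounding of 703
b_imp <- bmi_imperial(175, 70)
b_met <- bmi_metric(175 * kg_per_lb, 70 * m_per_in)
print(c(imperial = b_imp, metric = b_met))
```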
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
boxplot(bmi ~ cdc$gender, main="BMI vs gender")
The second boxplot explores the relationship between gender and BMI; given biological differences between the sexes, a difference might be expected. The plot shows that females in the sample had a lower median BMI and a wider IQR than males.
hist(bmi)
hist(bmi, breaks = 50)
Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
plot(cdc$weight~cdc$wtdesire)
Weight and desired weight are positively and roughly linearly associated.
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
cdc$wdiff<-cdc$wtdesire-cdc$weight
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
“wdiff” is an integer (discrete numeric) variable giving the number of pounds the participant would like to gain or lose. If the value is 0, the participant is already at their desired weight; if negative, the participant would like to lose weight; if positive, the participant would like to gain weight.
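The sign convention can be made explicit by labelling each value. In this sketch the first four weight and wtdesire pairs are taken from the str() output above, and a fifth person who wants to gain weight is added as a made-up example; the label names are hypothetical.

```r
# First four pairs from the str() output, plus one made-up person (130 -> 140)
weight   <- c(175L, 125L, 105L, 132L, 130L)
wtdesire <- c(175L, 115L, 105L, 124L, 140L)

wdiff <- wtdesire - weight
print(class(wdiff))   # integer minus integer stays integer

# sign() returns -1, 0 or 1; adding 2 turns that into an index 1, 2 or 3
goal <- c("lose", "maintain", "gain")[sign(wdiff) + 2]
print(data.frame(weight, wtdesire, wdiff, goal))
```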
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
summary(cdc$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
library(ggplot2)
library(ggthemes)
ggplot(cdc,aes(wdiff))+geom_histogram()+
theme_bw()+
ggtitle("Difference of desired weight and actual weight",subtitle="20000 responses, collected from CDC")+
xlab("difference")+
ylab("frequency")+
theme(legend.position="none")+
theme(axis.text.y = element_blank())
plot(table(cdc$wdiff))
IQR(cdc$wdiff)
## [1] 21
cat("The most common value for wdiff is",unique(cdc$wdiff)[which.max(tabulate(match(cdc$wdiff,unique(cdc$wdiff))))])
## The most common value for wdiff is 0
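According to the histogram, most participants would like to remain at their present weight or lose up to around 20-25 lbs.
The quartiles confirm this: the middle 50% of the sample falls between -21 and 0 (IQR = 21), that is, between wanting to lose 21 lbs. and wanting no change. The mean is -14.59 and the median is -10, so the distribution is centered below zero and skewed toward weight loss. The extreme values (-300 and 500), most likely data-entry errors, stretch the range far beyond the bulk of the data. The mode of wdiff is 0.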
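The mode can also be found with a frequency table, which is easier to read than the match/tabulate expression. This is a sketch on a made-up vector (`wdiff` below is hypothetical, not the cdc column). One difference to keep in mind: when several values tie for the highest count, this version returns the smallest of them, whereas the match/tabulate version returns the one that appears first in the data.

```r
# Hypothetical weight differences, made up for illustration.
wdiff <- c(0L, -10L, 0L, -8L, -20L, 0L, -10L, 5L)

# table() counts each distinct value, sorted in increasing order
freq <- table(wdiff)
print(freq)

# which.max() picks the position of the largest count;
# the names of the table hold the values as text
mode_value <- as.integer(names(freq)[which.max(freq)])
cat("The most common value is", mode_value, "\n")
```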
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
boxplot(cdc$wdiff~cdc$gender,main="wdiff vs gender")
cdc.alt<-subset(cdc,wdiff<100 & wdiff>-100)
boxplot(cdc.alt$wdiff~cdc.alt$gender,main="wdiff vs gender,-100<wdiff<100")
In the sample, women wanted to lose weight more often than men. On average, men tended to be happier with their weight than women. This association is easier to see in the second plot above, which is restricted to observations with wdiff strictly between -100 and 100.
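The exercise also asks for numerical summaries by gender, which `tapply()` can provide. The sketch below uses only the first ten respondents shown in the str() output (stored in a hypothetical data frame `toy`), so its numbers illustrate the mechanics and say nothing reliable about the full sample.

```r
# First ten rows as shown in the str() output above
toy <- data.frame(
  gender   = factor(c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m"),
                    levels = c("m", "f")),
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180),
  wtdesire = c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170)
)
toy$wdiff <- toy$wtdesire - toy$weight

# Apply a summary function to wdiff separately within each gender
med_by_gender  <- tapply(toy$wdiff, toy$gender, median)
mean_by_gender <- tapply(toy$wdiff, toy$gender, mean)
print(rbind(median = med_by_gender, mean = mean_by_gender))
```

On the full data set, `tapply(cdc$wdiff, cdc$gender, summary)` would give the five-number summary and mean for each group.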
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
weight.sd<-sd(cdc$weight)
weight.mean<-mean(cdc$weight)
cdc$zscore<-(cdc$weight-weight.mean)/(weight.sd)
hist(cdc$zscore)
length(cdc$zscore[cdc$zscore<=1 & cdc$zscore>=-1])/20000
## [1] 0.7076
70.76% of the sample indicated a weight that fell within one standard deviation of the mean (169.7).
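The proportion within one standard deviation can also be computed without storing a z-score column or hard-coding the sample size. This sketch uses only the first ten weights shown in the str() output (the vector `weight` here is that small excerpt, not the full column) and checks that the direct and z-score routes agree.

```r
# First ten weights as shown in the str() output above
weight <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)

w_mean <- mean(weight)
w_sd   <- sd(weight)

# Direct route: share of observations no further than one SD from the mean
within_1sd <- mean(abs(weight - w_mean) <= w_sd)

# z-score route, as in the analysis above, with length() replacing 20000
z <- (weight - w_mean) / w_sd
within_1sd_z <- sum(z <= 1 & z >= -1) / length(z)

cat("mean:", w_mean, " sd:", round(w_sd, 2), "\n")
cat("proportion within one SD:", within_1sd, "\n")
```

For a roughly bell-shaped distribution this proportion is expected to be near 68%, which is close to the 70.76% found in the full sample.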