source("more/cdc.R")
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
  1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Number of cases

nrow(cdc)
## [1] 20000

Number of Variables

ncol(cdc)
## [1] 9
Variable   Data type
genhlth    categorical, ordinal
exerany    categorical
hlthplan   categorical
smoke100   categorical
height     numerical, continuous
weight     numerical, continuous
wtdesire   numerical, discrete (arguably continuous, but few people would state a desired weight such as 180.25 lbs)
age        numerical, discrete (continuous if months are taken into account)
gender     categorical
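
The classifications above are judgments about what each variable measures. How R stores a column is a separate question and can be checked directly. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from `cdc`):

```r
# Hypothetical toy data frame mimicking a few cdc columns (values are made up)
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 67)
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# str() gives a compact overview of dimensions and column types
str(toy)
```

A 0/1 indicator such as `exerany` can be stored as a number and still be a categorical variable, so the storage class alone does not settle the data type.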
  1. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Summary and IQR : Height

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
IQR(cdc$height)
## [1] 6

Summary and IQR : Age

summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
IQR(cdc$age)
## [1] 26
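
The interquartile range is the third quartile minus the first, so both results can be checked against the `summary()` output. A short sketch, with the quartiles copied from the output above and a made-up vector `x` (hypothetical values) to show the same identity:

```r
# Quartiles copied from the summary() output above
height_q1 <- 64; height_q3 <- 70
age_q1 <- 31;    age_q3 <- 57

height_iqr <- height_q3 - height_q1  # 6, matching IQR(cdc$height)
age_iqr    <- age_q3 - age_q1        # 26, matching IQR(cdc$age)

# The same identity on a made-up vector (hypothetical values)
x <- c(18, 25, 31, 43, 57, 60, 99)
iqr_by_hand <- unname(diff(quantile(x, c(0.25, 0.75))))
stopifnot(isTRUE(all.equal(iqr_by_hand, IQR(x))))
```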

Relative Frequency Distribution for gender

table(cdc$gender)/nrow(cdc)
## 
##       m       f 
## 0.47845 0.52155

Relative Frequency Distribution for exerany

table(cdc$exerany)/nrow(cdc)
## 
##      0      1 
## 0.2543 0.7457
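
Dividing a table of counts by the number of rows is one way to get relative frequencies; base R's `prop.table()` does the same normalisation. A small sketch on a made-up vector (the values are hypothetical, not the `cdc` data):

```r
# Hypothetical toy vector; not the cdc data
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)

# Manual version: counts divided by the number of observations
rel_freq_manual <- counts / length(gender)

# Equivalent: prop.table() divides each count by the table total
rel_freq <- prop.table(counts)

print(rel_freq)
```

The two agree whenever the vector has no missing values, because `table()` drops `NA` by default while `length()` does not.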

Number of Males

nrow(subset(cdc, gender == "m", select = c("gender")))
## [1] 9569

Or

summary(cdc$gender)
##     m     f 
##  9569 10431
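
A count like this can also be taken without building a subset, by summing a logical comparison or by indexing a table by name. A sketch on a made-up vector (hypothetical values, not the `cdc` data):

```r
# Hypothetical toy vector; not the cdc data
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_males <- sum(gender == "m")

# Same count read from a frequency table by level name
n_males_tab <- table(gender)[["m"]]

print(c(n_males, n_males_tab))
```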

Excellent Health

nrow(subset(cdc, genhlth == "excellent", select = c("genhlth")))/(nrow(cdc))
## [1] 0.23285

Or

table(cdc$genhlth)/nrow(cdc)
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
  1. What does the mosaic plot reveal about smoking habits and gender?

A slightly larger proportion of males than females report having smoked at least 100 cigarettes in their lifetime.
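
The comparison a mosaic plot makes visually can also be read as row proportions of the two-way table. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, chosen only to illustrate the calls):

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "m", "f", "f", "f", "f"),
                    levels = c("m", "f")),
  smoke100 = c(1, 1, 1, 0, 1, 0, 0, 1)
)

# Two-way table of counts: rows are gender, columns are smoke100
tab <- table(toy$gender, toy$smoke100)

# Row proportions: share of each gender in each smoke100 category
row_props <- prop.table(tab, margin = 1)
print(row_props)

# The mosaic plot draws the same table as tiles sized by the counts
mosaicplot(tab, main = "smoke100 by gender (toy data)")
```

Comparing the `1` column across rows gives the numeric counterpart of comparing tile heights in the plot.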

  1. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
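
The same filter can be written with logical indexing instead of `subset()`. The sketch below uses a made-up data frame (`toy` is hypothetical) and includes a missing age to show one difference between the two forms: `subset()` drops rows where the condition is `NA`, while plain bracket indexing would not, so `which()` is used here.

```r
# Hypothetical toy data with one missing age; not the cdc data
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21, NA),
  smoke100 = c(1, 0, 1, 1, 1, 1)
)

# subset() keeps rows where the condition is TRUE and drops NA results
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Equivalent logical indexing; which() discards the NA in the condition
keep <- toy$smoke100 == 1 & toy$age < 23
by_index <- toy[which(keep), ]

print(by_subset)
print(by_index)
```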
  1. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

The box plot shows that median BMI tends to rise as self-reported general health (genhlth) gets worse. The interquartile range (the middle 50% of values) also widens as genhlth gets worse.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$exerany)

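Variable chosen: exerany, since regular exercise plausibly helps keep weight down. The box plot shows that people who exercise tend to have a lower BMI, with a narrower spread. The high-BMI outliers in the exercising group may include very muscular people (Arnold Schwarzenegger, for example), since BMI does not take muscle mass into account.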
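
What is read off a box plot (centre and spread per group) can be backed up with numbers using `tapply()`. A sketch on made-up data (the `toy` data frame is hypothetical), using the same BMI formula as above with weight in pounds and height in inches:

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  weight  = c(150, 200, 180, 130, 220, 160),  # pounds
  height  = c(65, 70, 72, 62, 68, 66),        # inches
  exerany = c(1, 0, 1, 1, 0, 0)
)

# Same BMI formula as in the lab
toy$bmi <- (toy$weight / toy$height^2) * 703

# Median and interquartile range of BMI within each exerany group
median_by_group <- tapply(toy$bmi, toy$exerany, median)
iqr_by_group    <- tapply(toy$bmi, toy$exerany, IQR)

print(median_by_group)
print(iqr_by_group)
```

The medians correspond to the thick line in each box and the IQRs to the box heights.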


On Your Own

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.