1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

While the question asks for summaries of two specific variables, the data set has so few variables that it is quicker to summarize all of them at once. The full summary also lets the analyst eyeball things like desired weight being about 15 lbs lower than actual weight, and how closely each variable follows a normal curve.
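
The 15 lb gap can be checked with a little arithmetic on the summary output. The sketch below is only illustrative: the values are copied by hand from the printed table rather than read from cdc, and the variable names are made up for this example. The mean-minus-median comparison is a rough hint of skew, not a test of normality.

```r
# Values copied from the summary output above (not computed from the data)
weight_stats   <- c(median = 165.0, mean = 169.7)
wtdesire_stats <- c(median = 150.0, mean = 155.1)

# Gap between actual and desired weight, in pounds
weight_gap <- weight_stats - wtdesire_stats
print(weight_gap)

# Mean minus median: a value near zero is consistent with a symmetric
# distribution, a positive value suggests a longer right tail
skew_hint <- c(height = 67.18 - 67.00, age = 45.07 - 43.00)
print(skew_hint)
```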

## The IQR for the height variable in cdc is 6
## The IQR for the age variable in cdc is 26
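
The code that printed these two lines is not shown. As a rough check, the same values follow from the quartiles in the summary output. The sketch below uses those printed quartiles, and its variable names are illustrative rather than taken from the original analysis.

```r
# Quartiles copied from the 1st Qu. and 3rd Qu. rows of the summary output
height_q <- c(q1 = 64, q3 = 70)
age_q    <- c(q1 = 31, q3 = 57)

# IQR = third quartile minus first quartile
height_iqr <- height_q[["q3"]] - height_q[["q1"]]
age_iqr    <- age_q[["q3"]] - age_q[["q1"]]

cat("Height IQR:", height_iqr, "\n")
cat("Age IQR:", age_iqr, "\n")
```

With the cdc data frame loaded, base R's `IQR(cdc$height)` and `IQR(cdc$age)` would compute the same quantity directly from the data.
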
gen = cdc$gender
genfreq = table(gen)
gen.relfreq = genfreq / nrow(cdc)
cbind(gen.relfreq)
##   gen.relfreq
## m     0.47845
## f     0.52155

There are 9,569 men in the data, representing 48% of respondents.

exercise = cdc$exerany
exercisefreq = table(exercise)
exercise.relfreq = exercisefreq / nrow(cdc)
cbind(exercise.relfreq)
##   exercise.relfreq
## 0           0.2543
## 1           0.7457

genhlth = cdc$genhlth
genhlthfreq = table(genhlth)
genhlth.relfreq = genhlthfreq / nrow(cdc)
cbind(genhlth.relfreq)
##           genhlth.relfreq
## excellent         0.23285
## very good         0.34860
## good              0.28375
## fair              0.10095
## poor              0.03385

23% of respondents report that they are in excellent health.
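
The three relative frequency tables above repeat the same table-then-divide pattern. A more compact route, sketched here, uses base R's `prop.table()`. Since cdc is not loaded in this snippet, the counts are typed in from the summary output, and `genhlth_counts` is an illustrative name.

```r
# Counts copied from the genhlth column of the summary output
genhlth_counts <- c(excellent = 4657, `very good` = 6972, good = 5675,
                    fair = 2019, poor = 677)

# prop.table() divides each count by the total of all counts
genhlth_relfreq <- prop.table(genhlth_counts)
print(round(genhlth_relfreq, 5))
```

With the data loaded, `prop.table(table(cdc$genhlth))` would give the same proportions, assuming the variable has no missing values (by default `table()` drops them, so the denominator would no longer equal `nrow(cdc)`).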