1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Cases: 20000
Variables: 9

genhlth: categorical, ordinal
exerany: categorical, nominal
hlthplan: categorical, nominal
smoke100: categorical, nominal
height: numerical, discrete
weight: numerical, discrete
wtdesire: numerical, discrete
age: numerical, discrete
gender: categorical, nominal
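
The counts and types above can be read off the data frame directly. The sketch below uses a small hypothetical data frame named `toy` (its values are invented for illustration and are not the cdc data); the same calls would presumably apply to `cdc`.

```r
# Hypothetical toy data frame standing in for cdc (values invented)
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(1, 0, 1),
  height  = c(70, 64, 67)
)

n_cases <- nrow(toy)                # rows are cases
n_vars  <- ncol(toy)                # columns are variables
var_classes <- sapply(toy, class)   # storage class of each column

str(toy)                            # compact overview of dimensions and classes
```

Note that the storage class is only a hint: a 0/1 column such as `exerany` is stored as a number but is categorical in meaning.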

  1. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
source("C:/Users/Andrew/Documents/R/win-library/3.1/IS606/labs/Lab1/more/cdc.R")
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
70-64
## [1] 6
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
57-31
## [1] 26
table(cdc$gender)/20000
## 
##       m       f 
## 0.47845 0.52155
table(cdc$exerany)/20000
## 
##      0      1 
## 0.2543 0.7457
table(cdc$gender)
## 
##     m     f 
##  9569 10431
#males = 9569

table(cdc$genhlth)/20000
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
#excellent = .23285
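
As an alternative to subtracting quartiles by hand and dividing by a hard-coded sample size, base R offers `IQR()` and `prop.table()`. This is a minimal sketch on hypothetical toy vectors, not the cdc data.

```r
# Hypothetical toy vectors (values invented for illustration)
height <- c(60, 64, 66, 67, 70, 72, 75)
gender <- factor(c("m", "f", "f", "m", "f", "m", "f"), levels = c("m", "f"))

height_iqr <- IQR(height)                 # third quartile minus first quartile
gender_rel <- prop.table(table(gender))   # counts divided by their own total
n_males    <- sum(gender == "m")          # count of a single category
```

Because `prop.table()` divides by the table's own total, the result stays correct if the number of rows changes.
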
  1. What does the mosaic plot reveal about smoking habits and gender?

A larger share of men than women report having smoked at least 100 cigarettes in their lifetime.
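
The comparison a mosaic plot makes visually can also be put in numbers with a two-way table of row proportions. The sketch below uses hypothetical toy vectors (invented values, not the cdc data) to show the idea.

```r
# Hypothetical toy vectors (values invented for illustration)
gender   <- factor(c("m", "m", "m", "m", "f", "f", "f", "f", "f", "f"),
                   levels = c("m", "f"))
smoke100 <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

counts    <- table(gender, smoke100)          # raw counts per cell
row_share <- prop.table(counts, margin = 1)   # proportions within each gender

row_share
```

Comparing shares within each gender, rather than raw counts, avoids being misled when the two groups differ in size.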

  1. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1) 
summary(under23_and_smoke)
##       genhlth       exerany          hlthplan         smoke100
##  excellent:110   Min.   :0.0000   Min.   :0.0000   Min.   :1  
##  very good:244   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1  
##  good     :204   Median :1.0000   Median :1.0000   Median :1  
##  fair     : 53   Mean   :0.8145   Mean   :0.6952   Mean   :1  
##  poor     :  9   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1  
##                  Max.   :1.0000   Max.   :1.0000   Max.   :1  
##      height          weight         wtdesire          age        gender 
##  Min.   :59.00   Min.   : 85.0   Min.   : 80.0   Min.   :18.00   m:305  
##  1st Qu.:65.00   1st Qu.:130.0   1st Qu.:125.0   1st Qu.:19.00   f:315  
##  Median :68.00   Median :155.0   Median :150.0   Median :20.00          
##  Mean   :67.92   Mean   :158.9   Mean   :152.2   Mean   :20.22          
##  3rd Qu.:71.00   3rd Qu.:180.0   3rd Qu.:175.0   3rd Qu.:21.00          
##  Max.   :79.00   Max.   :350.0   Max.   :315.0   Max.   :22.00
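
For comparison, the same filter can be written with logical indexing instead of `subset()`. This is a sketch on a hypothetical toy data frame (invented values), meant only to show that the two forms select the same rows.

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(age      = c(19, 22, 23, 40, 21),
                  smoke100 = c(1, 0, 1, 1, 1))

# subset() evaluates the condition inside the data frame
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Logical indexing needs the toy$ prefix and a trailing comma to keep all columns
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]
```

One difference worth knowing: `subset()` drops rows where the condition is `NA`, while logical indexing returns them as rows of `NA`.
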
  1. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

It shows a box plot of BMI for each self-reported general health category. The figure suggests that people with higher BMIs are more likely to report worse general health.

exerany (whether the respondent exercised in the past month) is likely associated with better health and lower BMI, since people who exercise burn more calories and tend to weigh less. As the box plot shows, those who exercised in the past month have a slightly lower median BMI and a narrower IQR, although there are still many outliers.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$exerany)
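
The median and IQR that the box plot displays can also be computed per group with `tapply()`. The sketch below applies the same BMI formula to a hypothetical toy data frame (invented values, not the cdc data).

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(weight  = c(150, 200, 130, 180),   # pounds
                  height  = c(65, 70, 62, 68),       # inches
                  exerany = c(1, 0, 1, 0))

# BMI from pounds and inches: weight / height^2 * 703
toy$bmi <- toy$weight / toy$height^2 * 703

bmi_median <- tapply(toy$bmi, toy$exerany, median)   # one median per group
bmi_iqr    <- tapply(toy$bmi, toy$exerany, IQR)      # one IQR per group
```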


On Your Own

plot(cdc$weight ~ cdc$wtdesire)

With current weight on the y-axis and desired weight on the x-axis, the relationship is roughly linear with a slope that appears to be above 1, suggesting that people generally want to lose some weight.
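
One way to check the slope rather than judge it by eye is a least-squares fit. This is a minimal sketch on hypothetical toy values (not the cdc data); `fit` and `slope` are illustrative names.

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(wtdesire = c(120, 140, 160, 180, 200),
                  weight   = c(125, 150, 175, 200, 230))

# Regress current weight on desired weight, matching the axes of the plot
fit   <- lm(weight ~ wtdesire, data = toy)
slope <- unname(coef(fit)["wtdesire"])

slope > 1   # TRUE when weight rises faster than desired weight
```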

wdiff <- cdc$wtdesire - cdc$weight

wdiff is numerical and discrete.

If wdiff is 0, the respondent is satisfied with their current weight.

If wdiff is negative, they want to lose weight; if it is positive, they want to gain weight.
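
The three cases can be turned into labels with `sign()`, which makes them easy to count. The sketch below uses a hypothetical toy vector; the label names are chosen here for illustration.

```r
# Hypothetical toy vector of desired minus current weight (values invented)
wdiff <- c(-20, 0, 15, -5, 0)

# sign() returns -1, 0 or 1, which maps onto the three interpretations
goal <- factor(sign(wdiff),
               levels = c(-1, 0, 1),
               labels = c("lose", "satisfied", "gain"))

table(goal)   # how many respondents fall in each case
```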

boxplot(wdiff)

hist(wdiff, breaks = 40)

plot(wdiff)

summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

The median of wdiff is -10 and the mean is -14.59, so it is typical for people to want to lose around 10 to 15 pounds.

The wdiff histogram is unimodal with a slight left skew, so there are some people who want to lose a lot of weight and few people who want to gain weight.

The IQR spans -21 to 0 pounds (a width of 21), although there are many outliers, mostly people who want to lose weight.
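
The skew and outlier claims can be checked numerically: a mean below the median is consistent with a left skew, and the usual 1.5 × IQR rule flags outliers. This sketch uses a hypothetical toy vector (invented values); note that `boxplot()` computes its fences from hinges, so its outliers may differ slightly from the quantile-based ones here.

```r
# Hypothetical toy vector (values invented for illustration)
wdiff <- c(-120, -60, -30, -21, -15, -10, -10, -5, 0, 0, 0, 10)

q         <- quantile(wdiff, c(0.25, 0.75))   # first and third quartiles
iqr_width <- unname(q[2] - q[1])

# 1.5 x IQR fences beyond the quartiles
lower_fence <- unname(q[1]) - 1.5 * iqr_width
upper_fence <- unname(q[2]) + 1.5 * iqr_width
outliers    <- wdiff[wdiff < lower_fence | wdiff > upper_fence]

left_skew_hint <- mean(wdiff) < median(wdiff)   # mean pulled below the median
```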

genwdiff <- data.frame(wdiff, cdc$gender)
summary(subset(genwdiff, cdc.gender == "m"))
##      wdiff         cdc.gender
##  Min.   :-300.00   m:9569    
##  1st Qu.: -20.00   f:   0    
##  Median :  -5.00             
##  Mean   : -10.71             
##  3rd Qu.:   0.00             
##  Max.   : 500.00
summary(subset(genwdiff, cdc.gender == "f"))
##      wdiff         cdc.gender
##  Min.   :-300.00   m:    0   
##  1st Qu.: -27.00   f:10431   
##  Median : -10.00             
##  Mean   : -18.15             
##  3rd Qu.:   0.00             
##  Max.   :  83.00
boxplot(genwdiff$wdiff ~ genwdiff$cdc.gender)

Women (median = -10) generally appear to want to lose a few more pounds than men (median = -5), and the middle half of women's values is spread slightly wider (IQR = 27) than men's (IQR = 20). Interestingly, more men than women appear to want to gain weight.
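
The per-gender comparison can also be computed in one pass with `tapply()`, including the share of each gender that wants to gain weight. This is a sketch on a hypothetical toy data frame (invented values, not the cdc data).

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(
  wdiff  = c(-20, -5, 0, 10, -30, -10, -10, 0),
  gender = factor(rep(c("m", "f"), each = 4), levels = c("m", "f"))
)

med        <- tapply(toy$wdiff, toy$gender, median)     # median per gender
spread     <- tapply(toy$wdiff, toy$gender, IQR)        # IQR per gender
gain_share <- tapply(toy$wdiff > 0, toy$gender, mean)   # share wanting to gain
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, so `gain_share` compares the groups on an equal footing even when their sizes differ.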

avgwt <- mean(cdc$weight)
sdwt <- sd(cdc$weight)
instdev <- subset(cdc, weight < (avgwt + sdwt) & weight > (avgwt - sdwt))
dim(instdev)[1]/dim(cdc)[1]
## [1] 0.7076

Mean of weight = 169.7

Standard deviation of weight = 40.08

Proportion of weights within one standard deviation of the mean = 0.7076
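
The same proportion can be obtained without building a subset, by averaging a logical vector. The sketch below uses a hypothetical toy vector (invented values). For reference, a normal distribution has about 68% of its values within one standard deviation of the mean.

```r
# Hypothetical toy vector of weights in pounds (values invented)
weight <- c(120, 150, 160, 170, 180, 190, 260)

# TRUE where a weight lies strictly within one SD of the mean
within_1sd  <- abs(weight - mean(weight)) < sd(weight)
prop_within <- mean(within_1sd)   # proportion of TRUE values
```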