source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
head(cdc, n=10)
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 2       good       0        1        1     64    125      115  33      f
## 3       good       1        1        1     60    105      105  49      f
## 4       good       1        1        0     66    132      124  42      f
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 9       good       0        1        1     65    150      130  27      f
## 10      good       1        1        0     70    180      170  44      m
tail(cdc, n=10)
##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19991 excellent       1        1        0     71    195      190  43
## 19992 very good       1        1        1     72    210      175  52
## 19993 very good       1        1        0     71    180      180  36
## 19994 very good       0        1        1     63    165      120  31
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19991      m
## 19992      m
## 19993      m
## 19994      f
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m
  1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
# Cases and variables in the data set
dim(cdc)
## [1] 20000     9

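There are 20,000 cases and 9 variables.

genhlth: categorical (ordinal)
exerany: categorical (binary indicator stored as 0/1)
hlthplan: categorical (binary indicator stored as 0/1)
smoke100: categorical (binary indicator stored as 0/1)
height: numerical, continuous (recorded as whole numbers)
weight: numerical, continuous (recorded as whole numbers)
wtdesire: numerical, continuous (recorded as whole numbers)
age: numerical, continuous (recorded in whole years)
gender: categorical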
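
The 0/1 columns such as exerany are stored as numbers, so R summarizes them numerically by default. One possible way to treat such a column as a category is to convert it to a factor. The sketch below uses three rows copied from the head(cdc) output as a stand-in for the full data frame; the name sample_cdc and the labels "no" and "yes" are illustrative assumptions, not part of the data set.

```r
# Stand-in for a few rows of cdc (values copied from the head() output above)
sample_cdc <- data.frame(
  genhlth  = c("good", "good", "good"),
  exerany  = c(0, 0, 1),
  smoke100 = c(0, 1, 1),
  height   = c(70, 64, 60)
)

# Inspect how R stores each column
print(sapply(sample_cdc, class))

# Treat a 0/1 indicator as a categorical variable
# (the labels "no" and "yes" are illustrative, not part of the data set)
sample_cdc$exerany_f <- factor(sample_cdc$exerany,
                               levels = c(0, 1),
                               labels = c("no", "yes"))

# Counts per category instead of a numeric five-number summary
print(table(sample_cdc$exerany_f))
```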

summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0
mean(cdc$weight) 
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
table(cdc$smoke100)
## 
##     0     1 
## 10559  9441
table(cdc$smoke100)/20000
## 
##       0       1 
## 0.52795 0.47205
barplot(table(cdc$smoke100))
smoke <- table(cdc$smoke100)
barplot(smoke)

2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

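Height interquartile range: 70 - 64 = 6
Age interquartile range: 57 - 31 = 26
Relative frequency of gender: m = 9569/20000 = 0.478, f = 10431/20000 = 0.522
Relative frequency of exerany: 1 = 0.7457 (the mean of a 0/1 variable), 0 = 0.2543
Males in the sample: 9569
Proportion in excellent health: 4657/20000 = 0.233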
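
The interquartile ranges and relative frequencies can also be computed directly rather than read off the summaries. The following is a minimal sketch using the ten rows printed by head(cdc, n=10) as a stand-in; sample_cdc is a hypothetical name, and on the full data the same calls would take cdc$height, cdc$gender, and so on.

```r
# Stand-in for cdc: the ten rows printed by head(cdc, n=10) above
sample_cdc <- data.frame(
  height  = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70),
  age     = c(77, 33, 49, 42, 55, 55, 31, 45, 27, 44),
  exerany = c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1),
  gender  = c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m")
)

# Interquartile range: third quartile minus first quartile
height_iqr <- IQR(sample_cdc$height)
age_iqr    <- IQR(sample_cdc$age)

# Relative frequencies: counts divided by the total number of rows
gender_rel  <- prop.table(table(sample_cdc$gender))
exerany_rel <- prop.table(table(sample_cdc$exerany))

print(c(height = height_iqr, age = age_iqr))
print(gender_rel)
print(exerany_rel)
```

Using prop.table() avoids hard-coding the sample size, so the same lines keep working on a subset of the data.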

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
mean(cdc$height) 
## [1] 67.1829
var(cdc$height)
## [1] 17.0235
median(cdc$height)
## [1] 67
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
mean(cdc$age) 
## [1] 45.06825
var(cdc$age)
## [1] 295.5886
median(cdc$age)
## [1] 43
summary(cdc$gender)
##     m     f 
##  9569 10431
summary(cdc$exerany)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.7457  1.0000  1.0000
summary(cdc$genhlth)
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677
table(cdc$gender,cdc$smoke100)
##    
##        0    1
##   m 4547 5022
##   f 6012 4419
mosaicplot(table(cdc$gender,cdc$smoke100))

3. What does the mosaic plot reveal about smoking habits and gender?

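A larger share of males than females fall in the smoke100 = 1 group: 5022 of 9569 males (about 52%) versus 4419 of 10431 females (about 42%).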
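
Because the sample has more females than males, comparing proportions within each gender is more informative than comparing raw counts. A small sketch, with the counts copied from the table(cdc$gender, cdc$smoke100) output above; the object names counts and row_props are illustrative.

```r
# Counts copied from table(cdc$gender, cdc$smoke100) above
counts <- matrix(c(4547, 5022,
                   6012, 4419),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("m", "f"),
                                 smoke100 = c("0", "1")))

# Proportion within each gender (margin = 1 makes each row sum to 1)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```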

cdc[567,6]
## [1] 160
  4. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

The box plot shows the distribution of BMI within each general health category. People with lower BMIs tend to report better health.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$smoke100)

Variable chosen: smoke100. People who don't smoke typically fall in a lower BMI range.
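
To back up a reading of the box plot with numbers, the median BMI in each group can be computed with tapply(). This is a sketch on the ten rows printed by head(cdc, n=10), which are far too few to support any conclusion; it only illustrates the calculation. The names sample_cdc and bmi_median_by_group are assumptions.

```r
# Stand-in for cdc: the ten rows printed by head(cdc, n=10) above
sample_cdc <- data.frame(
  height   = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70),
  weight   = c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180),
  smoke100 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0)
)

# Same BMI formula as above: weight / height^2, scaled by 703
bmi <- (sample_cdc$weight / sample_cdc$height^2) * 703

# Median BMI within each smoke100 group
bmi_median_by_group <- tapply(bmi, sample_cdc$smoke100, median)

print(round(bmi, 1))
print(bmi_median_by_group)
```

On the full data, replacing sample_cdc with cdc gives the group medians that correspond to the center lines of the box plot.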

plot(cdc$weight ~ cdc$wtdesire)

The relationship is generally upward sloping: for most people, desired weight is close to current weight. Some people want to be significantly heavier or lighter, but most do not.

# Current weight minus desired weight (the c() wrapper is not needed)
wtdiff <- cdc$weight - cdc$wtdesire
plot(wtdiff)

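Since wtdiff is current weight minus desired weight, a positive value means the person weighs more than they would like and wants to lose weight, a negative value means they want to gain weight, and zero means they are at their desired weight.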

Most people seem comfortable with their current weight and do not want to stray far from it.
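
The sign of the difference can be tabulated rather than judged from the plot. A minimal sketch, assuming the difference is defined as current weight minus desired weight and using the weight and wtdesire values from the ten rows printed by head(cdc, n=10); the category labels are illustrative.

```r
# weight and wtdesire for the ten rows printed by head(cdc, n=10) above
weight   <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)
wtdesire <- c(175, 115, 105, 124, 130, 114, 185, 160, 130, 170)

# Current weight minus desired weight
wtdiff <- weight - wtdesire

# Classify each person by the sign of the difference
# (labels are illustrative; negative = below desired weight)
direction <- factor(sign(wtdiff),
                    levels = c(-1, 0, 1),
                    labels = c("wants to gain", "at desired weight", "wants to lose"))

print(table(direction))
print(summary(wtdiff))
```

A histogram, hist(wtdiff), would show the same information as a distribution instead of an index plot.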