Load the CDC dataset:

source("http://www.openintro.org/stat/data/cdc.R")

EXERCISE 1.

  a. Use the commands dim (you used dim last week) and names to display the size of the dataset and the number of variables.
dim(cdc)
## [1] 20000     9
  b. How many cases are there in this data set? 20000
  c. How many variables? 9
  d. For each variable, identify its data type (e.g. categorical, discrete).
     genhlth - categorical, ordinal
     exerany - categorical, binary
     hlthplan - categorical, binary
     smoke100 - categorical, binary
     gender - categorical, nominal
     height - quantitative, continuous
     weight - quantitative, continuous
     wtdesire - quantitative, continuous
     age - quantitative, continuous
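
One way to back up these classifications is to ask R how each column is stored. The sketch below uses a small made-up data frame, toy, that borrows a few of the column names; its values are invented for illustration and are not taken from the CDC data. The same calls should work on cdc once it is loaded.

```r
# Hypothetical stand-in for cdc: invented values, same style of columns.
toy <- data.frame(
  genhlth  = factor(c("good", "excellent", "fair"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  smoke100 = c(0, 1, 1),
  height   = c(70, 64, 67),
  gender   = factor(c("m", "f", "f"))
)

dim(toy)                            # rows (cases), then columns (variables)
names(toy)                          # variable names
col_classes <- sapply(toy, class)   # storage class of each column
col_classes
levels(toy$genhlth)                 # categories of a factor, in stored order
```

Storage class is only a hint. Here smoke100 is stored as a number but encodes a yes/no answer, so it is still a categorical variable.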

EXERCISE 2.

  a. Obtain a summary for height.
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
  b. Obtain the interquartile range for height. 70 - 64 = 6
  c. Obtain a summary for age.
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
  d. Obtain the interquartile range for age.
57-31
## [1] 26
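
The interquartile range is the third quartile minus the first, which is where 70 - 64 = 6 and 57 - 31 = 26 come from in the summaries above. R also has a built-in IQR() function. A minimal sketch, using a made-up vector of heights (the values are invented, not CDC data):

```r
# Hypothetical heights in inches, invented for illustration.
heights <- c(60, 62, 64, 66, 68)

q <- quantile(heights, c(0.25, 0.75))   # first and third quartiles
iqr_by_hand <- unname(q[2] - q[1])      # Q3 minus Q1
iqr_builtin <- IQR(heights)             # same calculation in one call
```

On the real data, IQR(cdc$height) and IQR(cdc$age) should agree with the subtractions above.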

EXERCISE 3.

  a. Compute the frequency distribution for gender.
table(cdc$gender)
## 
##     m     f 
##  9569 10431
  b. How many males are in the sample? 9569
  c. Compute the relative frequency distribution for the reported health status.
table(cdc$genhlth)/20000
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
  d. What percent report being in excellent health? 23.285%

  e. Create a mosaic plot to see the smoking habits based on gender.

mosaicplot(table(cdc$gender,cdc$smoke100))

f. What does the mosaic plot reveal about smoking habits and gender? A larger proportion of males than females report having smoked at least 100 cigarettes.
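
The relative frequencies and the mosaic plot comparison can both be put into numbers with prop.table(). The sketch below uses a made-up data frame, toy, whose values are invented for illustration. It computes relative frequencies without hard-coding the sample size, then the share of smokers within each gender.

```r
# Hypothetical stand-in for cdc: invented values.
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "f", "f", "f", "f", "f"),
                    levels = c("m", "f")),
  smoke100 = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# Relative frequencies: counts divided by the total number of rows.
rel_freq <- prop.table(table(toy$gender))

# Share of smokers within each gender: margin = 1 normalises each row.
tab <- table(toy$gender, toy$smoke100)
row_props <- prop.table(tab, margin = 1)
row_props
```

Each row of row_props sums to 1, so the "1" column can be compared directly between males and females. That is the comparison the mosaic plot shows visually.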

EXERCISE 4.

  a. Create a new object called under45_and_smoke that contains all observations of respondents under the age of 45 that have smoked at least 100 cigarettes in their lifetime.
  b. Obtain the head of under45_and_smoke so that you can look at the top rows/observations/cases. Also, notice that the object includes all the columns for those that meet the criteria, not just their age and smoking status.
under45_and_smoke <- subset(cdc, age < 45 & smoke100 == 1)
head(under45_and_smoke)
  c. Based on looking at the first rows of under45_and_smoke, how do you know that your object is correct (that you used the right command in step a.)? Every row shown by head should have an age below 45 and smoke100 equal to 1. head shows six rows by default, so seeing six results does not by itself confirm the filter.

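  d. Obtain the frequency distribution of the age of those who have smoked 100 cigarettes or more.
## 
##     18    19    20    21    22    23    24    25    26
##     92   110   127   154   137   151   158   125   141
##     27    28    29    30    31    32    33    34    35
##    150   166   158   170   152   142   159   156   201
##     36    37    38    39    40    41    42    43    44
##    187   198   186   202   237   186   226   254   192
##     45    46    47    48    49    50    51    52    53
##    230   186   202   195   178   201   128   198   181
##     54    55    56    57    58    59    60    61    62
##    132   169   132   160   148   135   121   113   126
##     63    64    65    66    67    68    69    70    71
##    107   131   152    95   108   123   106   111   112
##     72    73    74    75    76    77    78    79    80
##    110    96   104    88    83    65    67    64    62
##     81    82    83    84    85    86    87    88    89
##     35    40    36    25    26     7     9     6     5
##     90    91    92    93    96    97    99   Inf
##      1     6     4     2     1     1     1 10559
The counts at ages 18 to 99 sum to 9441 smokers. The Inf column is not an age: its count of 10559 equals 20000 - 9441, the remaining respondents, which suggests the table was built from an expression such as age divided by smoke100 (division by 0 gives Inf for non-smokers).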

  e. Obtain a barplot of the frequency distribution. (Put the smoking variable first and then age within the table for the barplot command to get a nice plot.)

barplot(table(cdc$smoke100, cdc$age))
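
The subsetting steps in this exercise can be sketched on a small made-up data frame, toy (values invented for illustration, not CDC data). The sketch combines the two conditions with &, checks every row of the result rather than only the first six, and tabulates ages for smokers only so that non-smokers do not appear in the table.

```r
# Hypothetical stand-in for cdc: invented values.
toy <- data.frame(
  age      = c(23, 51, 44, 45, 30, 67),
  smoke100 = c(1, 1, 0, 1, 1, 0),
  height   = c(70, 64, 67, 72, 61, 66)
)

# Keep rows where BOTH conditions hold; & combines them element by element.
under45_and_smoke <- subset(toy, age < 45 & smoke100 == 1)

# Check every row, not just the first few shown by head().
all_ok <- all(under45_and_smoke$age < 45 & under45_and_smoke$smoke100 == 1)

# Age frequencies among smokers only (no entries for non-smokers).
smoker_ages <- table(toy$age[toy$smoke100 == 1])
```

Calling barplot(smoker_ages) would then draw one bar per age for smokers only.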

EXERCISE 5.

  a. Compare the heights of males and females using the boxplot function.
  b. Obtain the summary statistics for the heights of males.
boxplot(cdc$height ~ cdc$gender)

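# Subset heights to males first; a second argument to summary() does not filter,
# so summary(cdc$height, cdc$gender == "m") returned the all-respondent values
# from Exercise 2 (median 67.00, mean 67.18). Rerun for the male-only output.
summary(cdc$height[cdc$gender == "m"])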
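  c. How do the summary statistics for males compare to what you see in the boxplot? They should agree: the median corresponds to the line inside the male box, the first and third quartiles to the box edges, and the minimum and maximum to the most extreme points plotted.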

  d. Compute the BMI for each respondent and create a boxplot comparing it across general health categories.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

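e. What does this box plot of bmi against general health show? Median BMI appears to increase as reported health moves from excellent to poor, although the boxes overlap considerably, so this is a tendency rather than a rule.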

  f. Pick another categorical variable from the data set and see how it relates to BMI.
boxplot(bmi ~ cdc$age)

g. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest. I chose age. I thought BMI might run higher in the middle age range as metabolism slows down. The figure seems to suggest that this was generally correct: the higher BMI range seems to run from the mid-30s to about 58. Note that age is numerical rather than categorical, so boxplot draws a separate box for every distinct age.
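
The main steps of this exercise can be sketched on a small made-up data frame, toy (values invented for illustration, not CDC data). It subsets inside the brackets to summarise one group, computes BMI with the same formula used above, and bins age into a few categories with cut() so that it behaves like a categorical variable. The break points are arbitrary choices, not something the lab prescribes.

```r
# Hypothetical stand-in for cdc: invented values.
toy <- data.frame(
  gender = factor(c("m", "f", "m", "f", "m", "f")),
  height = c(70, 64, 72, 62, 68, 66),         # inches
  weight = c(180, 130, 200, 120, 160, 150),   # pounds
  age    = c(22, 37, 45, 58, 63, 80)
)

# Summary of height for males only, then the median for each gender.
male_summary <- summary(toy$height[toy$gender == "m"])
median_by_gender <- tapply(toy$height, toy$gender, median)

# BMI from pounds and inches: weight / height^2 * 703.
toy$bmi <- toy$weight / toy$height^2 * 703

# Bin age into ordered categories; the break points are arbitrary choices.
toy$age_group <- cut(toy$age, breaks = c(17, 34, 54, Inf),
                     labels = c("18-34", "35-54", "55+"))
table(toy$age_group)
```

With the binned variable, boxplot(bmi ~ age_group, data = toy) gives one box per age band rather than one per year of age, which makes a mid-life rise in BMI easier to judge.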