1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
- There are 20000 observations (cases) and 9 variables.
- Categorical variables
- genhlth (ordinal), exerany, hlthplan, smoke100 (the three 0/1 indicators are stored as num), gender
- Discrete variables
- height, weight, wtdesire, age
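The 0/1 indicator columns are stored as num, so str() does not label them as categorical. A minimal sketch of making that explicit is shown here, assuming a small made-up data frame (the `toy` object and its values are hypothetical, not taken from cdc):

```r
# Hypothetical toy data; values are illustrative only, not from cdc
toy <- data.frame(exerany = c(0, 1, 1, 0),
                  age     = c(34L, 51L, 27L, 45L))

# Recode the 0/1 indicator as a labelled factor so R treats it as categorical
toy$exerany <- factor(toy$exerany, levels = c(0, 1), labels = c("no", "yes"))

# Report the class of each column
sapply(toy, class)
```

Whether to recode the real columns is a design choice; keeping them as 0/1 makes mean() return a proportion directly.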
2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
# Height numerical summary
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
# Height Interquartile range
IQR(cdc$height)
## [1] 6
# Age numerical summary
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
# Age Interquartile range
IQR(cdc$age)
## [1] 26
# Relative frequency distribution for gender
table(cdc$gender)/nrow(cdc)
##
## m f
## 0.47845 0.52155
# Relative frequency distribution for exerany
table(cdc$exerany)/nrow(cdc)
##
## 0 1
## 0.2543 0.7457
# The sample contains 9569 males.
table(cdc$gender)
##
## m f
## 9569 10431
# Proportion of the sample reporting excellent health: 0.23285
table(cdc$genhlth)/nrow(cdc)
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
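Dividing a table by nrow() works because no values are missing. An equivalent sketch using base R's prop.table() is given here on a made-up vector (the `gender` values are hypothetical, not taken from cdc):

```r
# Hypothetical toy vector; values are illustrative only
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)        # absolute frequencies
props  <- prop.table(counts)   # relative frequencies, summing to 1

props
```

prop.table() divides by the table total rather than the row count, so the two approaches can differ when a column contains NA values that table() drops.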
3. What does the mosaic plot reveal about smoking habits and gender?
# The mosaic plot shows that a larger proportion of males than females have smoked at least 100 cigarettes.
mosaicplot(table(cdc$gender, cdc$smoke100))
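The comparison read off the mosaic plot can also be put into numbers with row-wise proportions. This is a rough sketch on made-up data (the `gender` and `smoke100` vectors are hypothetical, not the cdc columns):

```r
# Hypothetical toy data; values are illustrative only
gender   <- c("m", "m", "m", "f", "f", "f", "f", "m")
smoke100 <- c(1, 1, 0, 0, 1, 0, 0, 1)

# Cross-tabulate, then convert each row (one per gender) to proportions
tab       <- table(gender, smoke100)
row_props <- prop.table(tab, margin = 1)

row_props
```

With margin = 1 each row sums to 1, so the "1" column gives the share of smokers within each gender.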

4. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke, 10)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
## 262 fair 0 1 1 71 185 185 20 m
## 296 fair 1 1 1 72 185 170 19 m
## 297 excellent 1 0 1 63 105 100 19 m
## 300 fair 1 1 1 71 185 150 18 m
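The same filter can be written with logical indexing instead of subset(). A small sketch on a made-up data frame (the `toy` object is hypothetical, not cdc) might look like this:

```r
# Hypothetical toy data; values are illustrative only
toy <- data.frame(age      = c(19L, 45L, 22L, 30L, 21L),
                  smoke100 = c(1, 1, 0, 0, 1))

# subset() evaluates the condition inside the data frame
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Logical indexing spells out the data frame name for each column
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]

nrow(by_subset)
nrow(by_index)
```

One difference worth knowing: subset() drops rows where the condition is NA, while logical indexing returns those rows filled with NA.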
On Your Own
1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
# As weight increases, so does the desired weight.
library(ggplot2)
ggplot(cdc, aes(x = weight, y = wtdesire)) +
  geom_point() +
  labs(x = "Weight", y = "Desired Weight")

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
cdc$wdiff <- cdc$wtdesire - cdc$weight
3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?
# wdiff is a numeric, discrete variable: both weights are integers, so their difference is too.
# If wdiff < 0, the person wants to weigh less.
# If wdiff = 0, the person is content with their weight.
# If wdiff > 0, the person wants to weigh more.
typeof(cdc$wdiff)
## [1] "integer"
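The integer result follows from R's arithmetic rules: subtracting two integer vectors yields an integer vector, while mixing in a double promotes the result. A tiny sketch with made-up values (not taken from cdc):

```r
# Hypothetical values; illustrative only
int_diff   <- 150L - 175L   # integer minus integer
mixed_diff <- 150L - 175    # integer minus double

typeof(int_diff)
typeof(mixed_diff)
```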
4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
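# Center: the median is -10 and the mean is -14.59, so at least half of the respondents want to weigh less than they do now. Spread: the IQR is 21 (from -21 to 0), with extremes at -300 and 500. Shape: the mean lies below the median, which suggests a left skew.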
summary(cdc$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
ggplot(data = cdc, aes(x = wdiff)) + geom_histogram(binwidth = 30)
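The claim about how people feel can be quantified by the share of negative, zero, and positive differences. The following is a sketch on a made-up vector (the `wdiff` values here are hypothetical, not the cdc column); it relies on the fact that the mean of a logical vector is the proportion of TRUE values:

```r
# Hypothetical toy vector; values are illustrative only
wdiff <- c(-20L, -10L, 0L, 5L, -30L, 0L, -15L, 10L)

# mean() of a logical vector gives the proportion of TRUE values
sign_props <- c(less = mean(wdiff < 0),
                same = mean(wdiff == 0),
                more = mean(wdiff > 0))

sign_props
```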

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
# Outliers are hidden in the boxplot below. The numerical summaries and the plot show that the median wdiff is lower for women (-10) than for men (-5), so women tend to be farther from their desired weight than men are. Women also have a larger spread of wdiff: their IQR is 27 versus 20 for men.
summary(cdc$wdiff[cdc$gender == "m"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
summary(cdc$wdiff[cdc$gender == "f"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
boxplot(wdiff ~ gender, data = cdc, outline = FALSE, ylab = "wdiff")
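The two per-gender summaries can also be computed in one step with tapply(). Here is a minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not cdc):

```r
# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  gender = factor(c("m", "f", "f", "m", "f", "m"), levels = c("m", "f")),
  wdiff  = c(-5L, -20L, -10L, 0L, -30L, -10L)
)

# Apply a summary function to wdiff within each gender group
medians <- tapply(toy$wdiff, toy$gender, median)
iqrs    <- tapply(toy$wdiff, toy$gender, IQR)

medians
iqrs
```

This keeps the group comparison in a single named vector, which is easier to scan than two separate summary() printouts.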

6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
# mean 169.7
avg <- mean(cdc$weight)
avg
## [1] 169.683
# SD 40.08 (stored as weight_sd so the sd() function is not masked)
weight_sd <- sd(cdc$weight)
weight_sd
## [1] 40.08097
# Weights within one standard deviation of the mean: 70.76%
wwos <- subset(cdc, weight > avg - weight_sd & weight < avg + weight_sd)
nrow(wwos)/nrow(cdc)
## [1] 0.7076
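The same proportion can be obtained without building an intermediate data frame, by taking the mean of a logical vector. This is a sketch on made-up weights (the `weight` vector is hypothetical, not the cdc column):

```r
# Hypothetical toy vector; values are illustrative only
weight <- c(120, 150, 160, 170, 180, 250)

avg       <- mean(weight)
weight_sd <- sd(weight)

# TRUE where a weight lies strictly within one SD of the mean
within_one_sd <- abs(weight - avg) < weight_sd

# Proportion of TRUE values
mean(within_one_sd)
```

Using abs(weight - avg) expresses the two-sided condition in a single comparison, which avoids mistakes with the direction of the inequalities.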