Set up workspace

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
knitr::opts_chunk$set(echo = TRUE)
rm(list=ls())

Load data

Questions to answer:

  1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

    • There are 20000 cases in the data set.
    • There are 9 variables.
    • The variable types are as follows:
##   variable                                type
## 1  genhlth               categorical (ordinal)
## 2  exerany                categorical (binary)
## 3 hlthplan                categorical (binary)
## 4 smoke100                categorical (binary)
## 5   height discrete (handled as a continuous?)
## 6   weight discrete (handled as a continuous?)
## 7 wtdesire discrete (handled as a continuous?)
## 8      age discrete (handled as a continuous?)
## 9   gender                categorical (binary)
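
The counts and types above can be checked directly from the data frame. A minimal sketch, assuming a small made-up data frame `toy` as a hypothetical stand-in for `cdc` (the values are invented for illustration):

```r
# Hypothetical stand-in for `cdc`; the values are made up for illustration.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 67),
  gender  = factor(c("m", "f", "f"), levels = c("m", "f"))
)

dim(toy)            # number of cases (rows) and variables (columns)
sapply(toy, class)  # storage class of each column
str(toy)            # compact overview of the whole data frame
```

The storage class only says how R holds a column; whether a 0/1 column such as `exerany` counts as categorical is still a judgment about what the values mean.
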
  2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0
heightIQR <- IQR(cdc$height)  # 3rd quartile - 1st quartile: 70 - 64
weightIQR <- IQR(cdc$weight)  # 3rd quartile - 1st quartile: 190 - 140

The IQRs for height and weight are 6 inches and 50 pounds, respectively.
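
The exercise asks about height and age, whereas the summaries above cover height and weight. As a minimal sketch, the quartiles and IQR of any numeric column can be obtained with `quantile()` and `IQR()`. The `age` vector here is made up for illustration; with the lab data it would be `cdc$age`.

```r
# Hypothetical ages, made up for illustration; with the real data use cdc$age.
age <- c(18, 25, 31, 43, 57, 62, 99)

quantile(age, c(0.25, 0.75))        # first and third quartiles
IQR(age)                            # third quartile minus first quartile
diff(quantile(age, c(0.25, 0.75)))  # same value, computed by hand
```
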

The relative frequency distributions for gender and any exercise are as follows:

prop.table(table(cdc$gender))
## 
##       m       f 
## 0.47845 0.52155
prop.table(table(cdc$exerany))
## 
##      0      1 
## 0.2543 0.7457
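
The last two parts of the question can be answered the same way. Multiplying the male share above by the sample size gives 0.47845 × 20000 = 9569 males. The sketch below shows both calculations on a made-up data frame `toy`, a hypothetical stand-in for `cdc`; it assumes the relevant `genhlth` level is spelled `"excellent"`.

```r
# Hypothetical stand-in for `cdc`; the values are made up for illustration.
toy <- data.frame(
  gender  = factor(c("m", "f", "f", "m", "f"), levels = c("m", "f")),
  genhlth = c("excellent", "good", "excellent", "fair", "very good")
)

table(toy$gender)                 # counts per gender
sum(toy$gender == "m")            # number of males
mean(toy$genhlth == "excellent")  # proportion reporting excellent health
```
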
  3. What does the mosaic plot reveal about smoking habits and gender?
mosaicplot(table(cdc$gender,cdc$smoke100))  

A higher proportion of males than females appear to have smoked at least 100 cigarettes.
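
The visual impression from the mosaic plot can be backed with row proportions. A minimal sketch on a made-up data frame `toy` (a hypothetical stand-in for `cdc`), using `prop.table()` with `margin = 1` so that each gender's row sums to 1:

```r
# Hypothetical stand-in for `cdc`; the values are made up for illustration.
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "f", "f", "f", "f"), levels = c("m", "f")),
  smoke100 = c(1, 1, 0, 1, 0, 0, 0)
)

counts <- table(toy$gender, toy$smoke100)
row_props <- prop.table(counts, margin = 1)  # proportions within each gender
row_props
```

Comparing the `1` column across rows gives the share of smokers within each gender, which is the quantity the mosaic plot encodes as tile height.
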

  4. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- cdc[cdc$age < 23 & cdc$smoke100 == 1,]
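
As a quick sanity check on this kind of logical indexing, the same filter can be written with base R's `subset()`. The sketch below uses a made-up data frame `toy` as a hypothetical stand-in for `cdc`:

```r
# Hypothetical stand-in for `cdc`; the values are made up for illustration.
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

by_index  <- toy[toy$age < 23 & toy$smoke100 == 1, ]
by_subset <- subset(toy, age < 23 & smoke100 == 1)

all.equal(by_index, by_subset)  # both approaches keep the same rows
nrow(by_index)                  # number of matching respondents
```
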
  5. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

BMI tends to be lower among respondents who report better general health.

boxplot(bmi ~ cdc$exerany)

Initially, I expected that “any exercise” would be associated with a lower BMI, but the relationship is less clear-cut than that. Although the exercisers appear to have a slightly lower median BMI, their range is much wider. This makes sense considering that people may be more inclined to exercise if their BMI is outside the “normal” range.
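
Reading medians and spreads off a box plot is approximate, so a per-group numeric summary can help. A minimal sketch using `tapply()` on a made-up data frame `toy` (a hypothetical stand-in for `cdc`), with BMI computed by the same formula as above:

```r
# Hypothetical stand-in for `cdc`; the values are made up for illustration.
toy <- data.frame(
  weight  = c(150, 200, 180, 130, 220, 165),
  height  = c(65, 70, 72, 62, 68, 66),
  exerany = c(1, 0, 1, 1, 0, 0)
)
toy$bmi <- (toy$weight / toy$height^2) * 703

tapply(toy$bmi, toy$exerany, median)  # median BMI per group
tapply(toy$bmi, toy$exerany, IQR)     # spread of the middle 50% per group
tapply(toy$bmi, toy$exerany, range)   # minimum and maximum per group
```

Note that a wider range in the larger group may partly reflect its size, since more observations give more chances for extreme values.
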

On Your Own

plot(cdc$weight, cdc$wtdesire)

As expected, most participants had a desired weight at or below their actual weight.

cdc$wdiff <- cdc$wtdesire - cdc$weight

wdiff is a discrete numerical variable. If wdiff is 0, the desired weight matches the person’s actual weight. If it’s positive (wtdesire > weight), the person would like to gain weight; if negative, the person would like to lose weight.
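
The three cases described above can be tallied directly from the sign of the difference. A minimal sketch with made-up weights (hypothetical values, not taken from `cdc`):

```r
# Hypothetical weights in pounds, made up for illustration.
weight   <- c(180, 150, 200, 130, 165)
wtdesire <- c(160, 150, 180, 140, 165)
wdiff    <- wtdesire - weight

mean(wdiff < 0)   # share wanting to lose weight
mean(wdiff == 0)  # share content with their current weight
mean(wdiff > 0)   # share wanting to gain weight
```
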

summary(cdc$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
boxplot(cdc$wdiff)

Most people would like to lose weight, but some would like to gain quite a bit. The size of the largest desired gain (500 lbs) is surprising to me. Two outliers in particular may be questionable data points.
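
To look at suspicious points more closely, the values drawn beyond the box plot whiskers can be extracted with `boxplot.stats()`. A minimal sketch on a made-up `wdiff` vector (hypothetical values); on the real data this rule would likely flag many more points than the two extreme ones, so the result would still need to be inspected by hand:

```r
# Hypothetical differences in pounds, made up for illustration.
wdiff <- c(-30, -20, -15, -10, -10, -5, 0, 0, 5, 500)

out <- boxplot.stats(wdiff)$out  # points beyond the whiskers (1.5 * IQR by default)
out
which(wdiff %in% out)            # positions of the flagged observations
```
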

boxplot(cdc$wdiff ~ cdc$gender)

It looks like there are more men who would like to gain weight (perhaps those looking to bulk up muscle mass through exercise?). Both outliers for weight gain were men.

meanWt <- mean(cdc$weight, na.rm = TRUE)
sdWt <- sd(cdc$weight, na.rm = TRUE)

# 1 if weight lies strictly within one SD of the mean, 0 otherwise
cdc$winOneSD <- as.numeric(cdc$weight > meanWt - sdWt &
                           cdc$weight < meanWt + sdWt)

prop.table(table(cdc$winOneSD))
## 
##      0      1 
## 0.2924 0.7076

About 71% of participants were within one standard deviation of the mean weight, close to the roughly 68% expected under a normal distribution.
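
The benchmark for "what would be expected" comes from the normal distribution. As a small sketch, the share of a normal distribution within one standard deviation of its mean can be computed with `pnorm()` and compared with the observed share from the table above:

```r
expected <- pnorm(1) - pnorm(-1)  # about 0.683 for a normal distribution
observed <- 0.7076                # share reported in the table above
observed - expected               # how far the sample departs from the benchmark
```

The weight distribution is not exactly normal (its mean of 169.7 sits above its median of 165.0, which suggests right skew), so a small departure from 68% is unsurprising.
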