Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We begin by loading the data set of 20,000 observations into the R workspace.

source("more/cdc.R")

On Your Own

  1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

The plot suggests a positive, roughly linear relationship. Most points are concentrated between 100 and 300 lbs on both axes: most respondents weigh within that range, and their desired weights mostly fall within it as well.

plot(cdc$wtdesire, cdc$weight)
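
One way to make the relationship easier to read is to label the axes and add a reference line where current weight equals desired weight. The sketch below is only an illustration: `toy` is a small made-up data frame standing in for the real `cdc` columns, and the numbers in it are not survey data.

```r
# Hypothetical toy data standing in for cdc$weight and cdc$wtdesire
toy <- data.frame(
  weight   = c(120, 150, 180, 210, 250),
  wtdesire = c(120, 145, 170, 190, 210)
)

# Desired weight on the x-axis, current weight on the y-axis
plot(toy$wtdesire, toy$weight,
     xlab = "Desired weight (lbs)",
     ylab = "Current weight (lbs)")

# Dashed line y = x: points above it weigh more than they would like
abline(a = 0, b = 1, lty = 2)

# Correlation coefficient as a numerical measure of linear association
r <- cor(toy$wtdesire, toy$weight)
r
```

On the real data, the same calls would presumably be applied to `cdc$wtdesire` and `cdc$weight`.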

  1. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff <- cdc$wtdesire - cdc$weight

  1. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

wdiff is numerical data: it is the difference, in pounds, between two numerical variables. Only its sign has a categorical interpretation. A value of 0 means that a person’s weight is exactly his/her desired weight, a positive value means that the person’s weight is below his/her desired weight, and a negative value means that the person’s weight is above his/her desired weight.
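
The sign interpretation can be checked on a few made-up rows. This is a minimal sketch, assuming a hypothetical data frame `toy` and a hypothetical label column `status`; neither is part of the lab's data.

```r
# Hypothetical toy rows: one at desired weight, one above, one below
toy <- data.frame(
  weight   = c(150, 180, 120),
  wtdesire = c(150, 160, 130)
)

# Same definition as in the lab: desired weight minus current weight
toy$wdiff <- toy$wtdesire - toy$weight

# The difference itself is numeric
class(toy$wdiff)

# Translate the sign of wdiff into a label
toy$status <- ifelse(toy$wdiff == 0, "at desired weight",
              ifelse(toy$wdiff > 0, "wants to gain", "wants to lose"))
toy
```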

  1. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

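Based on the histogram of wdiff, the distribution is centered slightly below 0 and is skewed to the left, with a longer tail of negative values and a few extreme values on both sides. Most observations lie within a few tens of pounds of 0. This suggests that most respondents feel they are at, or a little above, their desired weight, and that more of them want to lose weight than gain it.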

hist(wdiff, breaks = 100)
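
Numerical summaries can back up what the histogram shows about center and spread. The sketch below uses a short made-up vector, `toy_wdiff`, as a stand-in for the real `wdiff`; the values are illustrative only.

```r
# Hypothetical stand-in for wdiff (not survey data)
toy_wdiff <- c(-40, -20, -10, -10, -5, 0, 0, 10)

# Center: median and mean
center_median <- median(toy_wdiff)
center_mean   <- mean(toy_wdiff)

# Spread: interquartile range and standard deviation
spread_iqr <- IQR(toy_wdiff)
spread_sd  <- sd(toy_wdiff)

# A mean below the median is one informal sign of left skew
center_mean < center_median

summary(toy_wdiff)
```

On the real data, `summary(wdiff)`, `IQR(wdiff)`, and `sd(wdiff)` would give the corresponding figures.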

  1. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

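Excluding outliers, men generally seem to view their weight more favorably than women do: their current weight tends to be closer to their desired weight. The median difference is -5 lbs for men versus -10 lbs for women, and the mean difference is -10.71 lbs versus -18.15 lbs.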

male_data <- subset(cdc, gender == "m")
male_wdiff <- male_data$wtdesire - male_data$weight

female_data <- subset(cdc, gender == "f")
female_wdiff <- female_data$wtdesire - female_data$weight

summary(male_wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00
summary(female_wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00
boxplot(wdiff ~ cdc$gender)
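
The same comparison can be made without creating separate subsets, by grouping on gender directly. This is a sketch under stated assumptions: `toy` is a made-up data frame with the same column names as `cdc`, and its values are illustrative only.

```r
# Hypothetical toy data with the same column names as cdc
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "f", "f", "f")),
  weight   = c(180, 200, 170, 140, 160, 130),
  wtdesire = c(175, 190, 170, 130, 140, 125)
)
toy$wdiff <- toy$wtdesire - toy$weight

# Apply summary() to wdiff within each gender
tapply(toy$wdiff, toy$gender, summary)

# Group medians as a named vector
medians <- tapply(toy$wdiff, toy$gender, median)
medians

# Side-by-side box plot using the formula interface
boxplot(wdiff ~ gender, data = toy, ylab = "wtdesire - weight (lbs)")
```

Storing wdiff as a column of the data frame keeps it aligned with gender, so the formula in `boxplot()` can refer to both by name.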

  1. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

About 70.76% of respondents have a weight within one standard deviation of the mean.

avgwt <- mean(cdc$weight)
stdev <- sd(cdc$weight)

within_1_pos_sd <- avgwt + stdev
within_1_neg_sd <- avgwt - stdev

sd_data <- subset(cdc, weight >= within_1_neg_sd & weight <= within_1_pos_sd)

nrow(sd_data)/nrow(cdc) * 100
## [1] 70.76
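
The same proportion can be computed more compactly, because the mean of a logical vector is the fraction of TRUE values. The sketch below assumes a made-up vector `toy_weight` in place of `cdc$weight`.

```r
# Hypothetical stand-in for cdc$weight (not survey data)
toy_weight <- c(100, 150, 160, 170, 180, 250)

# TRUE where a weight lies within one standard deviation of the mean
within_1_sd <- abs(toy_weight - mean(toy_weight)) <= sd(toy_weight)

# Proportion of TRUE values; multiply by 100 for a percentage
prop_within <- mean(within_1_sd)
prop_within * 100
```

For comparison, a normal distribution has about 68% of its values within one standard deviation of the mean, so the 70.76% found above is close to that benchmark.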

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.