The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
We begin by loading the data set of 20,000 observations into the R workspace.
Start by loading the DATA606 library.
Then, read in the CDC data using the script cdc.R.
In the instructions it has us read in from just “more/cdc.R”, but I am running using R instead of RStudio so this was not working.
So I supply the full path to this file, which is in a subdirectory below where I have installed DATA606.
library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
source('/Library/Frameworks/R.framework/Versions/3.4/Resources/library/DATA606/labs/Lab1/more/cdc.R')
set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.
To view the names of the variables, type the command
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.
We can have a look at the first few entries (rows) of our data with the command
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
and similarly we can look at the last few by typing
tail(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995 good 0 1 1 69 224 224 73
## 19996 good 1 1 0 66 215 140 23
## 19997 excellent 0 1 0 73 200 185 35
## 19998 poor 0 1 0 65 216 150 57
## 19999 good 1 1 0 67 165 165 81
## 20000 good 1 1 1 69 170 165 83
## gender
## 19995 m
## 19996 f
## 19997 m
## 19998 f
## 19999 f
## 20000 m
You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.
The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, second quartile, and maximum. For weight this is
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter
190 - 140
## [1] 50
R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type
mean(cdc$weight)
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
table(cdc$smoke100)
##
## 0 1
## 10559 9441
or instead look at the relative frequency distribution by typing
table(cdc$smoke100)/20000
##
## 0 1
## 0.52795 0.47205
Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the Introduction to R; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.
barplot(table(cdc$smoke100))
Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:
smoke <- table(cdc$smoke100)
barplot(smoke)
Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it in as the input for barplot. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. This is another important idea that we’ll return to later.
height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
table(cdc$gender)
##
## m f
## 9569 10431
table(cdc$gender)/nrow(cdc)
##
## m f
## 0.47845 0.52155
#There are 9,569 males making up 47.8% of the sample.
table(cdc$exerany)/nrow(cdc)
##
## 0 1
## 0.2543 0.7457
#74.6% of the sample reports getting some exercise.
table(cdc$genhlth)/nrow(cdc)
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
#23.3% of the sample reports being in excellent health.
The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.
table(cdc$gender,cdc$smoke100)
##
## 0 1
## m 4547 5022
## f 6012 4419
Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.
mosaicplot(table(cdc$gender,cdc$smoke100))
We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/barplot example above).
Going to skip copying this section into my Rmd to avoid the resulting Rpubs being too long.
Just going to answer the question.
Question 3 appears twice. Going to switch it to 4.
under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.under23_and_smoke <- cdc[which(cdc$age < 23 & cdc$smoke100 == 1),]
With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We’ve already looked at categorical data such as smoke and gender so now let’s turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.
boxplot(cdc$height)
You can compare the locations of the components of the box by examining the summary statistics.
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with
boxplot(cdc$height ~ cdc$gender)
The notation here is new. The ~ character can be read versus or as a function of. So we’re asking R to give us a box plots of heights where the groups are defined by gender.
Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:
\[ BMI = \frac{weight~(lb)}{height~(in)^2} * 703 \]
703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).
The following two lines first make a new object called bmi and then creates box plots of these values, defining groups by the variable cdc$genhlth.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.
It’s a bit hard to see what’s going on in the boxplot given the lower range is a bit squished because of the few extreme outliers. Let’s try replotting setting ylim at 50.
boxplot(bmi ~ cdc$genhlth,ylim=c(10,50),las=2)
Based on this figure, there does not appear to be an especially strong relationship between BMI and self-reported general health.
It does seem like those with very good to excellent health tend to have somewhat lower BMIs, but there is plenty of overlap between groups (low-to-mid BMIs with less good health and higher BMIs with better health).
Let’s check the association of exercise with BMI. We might expect those who exercise to have lower BMI.
boxplot(bmi ~ cdc$exerany)
boxplot(bmi ~ cdc$exerany,ylim=c(10,50))
While individuals who exercise do seem to have lower BMI (lower median), the effect is not as strong as I expected it to be.
Finally, let’s make some histograms. We can look at the histogram for the age of our respondents with the command
hist(cdc$age)
Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding an argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50 breaks.
par(mfrow=c(1,2))
hist(bmi)
hist(bmi, breaks = 50)
You get more resolution with more breaks.
At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.
Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
library(ggplot2)
ggplot(cdc,aes(wtdesire,weight)) + geom_point(alpha=1/10)
#Pretty sure the people who want to 600+ pounds can be safely classified as extreme outliers if not possible errors.
#Let's replot minus these two points.
ggplot(cdc[which(cdc$wtdesire < 400),],aes(wtdesire,weight)) + geom_point(alpha=1/10)
Looks like there are many people whose current weight is equal to their desired weight. Good for them! For those whose values are different, it tends to be that their current weight is higher than their desired weight more often than lower. Current weight and desired weight are also correlated. Which makes sense. Given two people with the same body type (say “slightly overweight”), a man and a woman, or a short vs. tall person, will have different both current weights, and goal weights to be “normal weight” or “lean”.
wdiff <- cdc$wtdesire - cdc$weight
wdiff of 0 means the person is exactly at the weight they want to be. wdiff < 0 means the person would like to lose weight. wdiff > 0 means the person would like to gain weight.
wdiff is a variable resulting from a transformation of the original data. It is also numeric.
hist(wdiff,xlab="Desired weight - current weight",ylab="Number of individuals",main="",labels=TRUE)
#Plot minus the two individuals who want to be 600+ pounds again.
hist(wdiff[which(cdc$wtdesire < 400)],xlab="Desired weight - current weight",ylab="Number of individuals",main="",labels=TRUE)
median(wdiff)
## [1] -10
We see even more clearly now that there are many more individuals who want to lose weight than gain weight.
Wanting to lose less than 20 pounds seems to be the most common.
The center of the data is -10, meaning that half of the people want to lose at least 10 pounds.
The data is somewhat left-skewed, with a few outliers of people who want to lose over 150-200 pounds.
#Make a data frame of wdiff vs. gender.
wdiff_vs_gender <- data.frame(wdiff = wdiff,gender = cdc$gender)
#Remove the two outliers who want to be 600+ pounds.
wdiff_vs_gender <- wdiff_vs_gender[which(cdc$wtdesire < 600),]
boxplot(wdiff ~ gender,data=wdiff_vs_gender)
boxplot(wdiff ~ gender,data=wdiff_vs_gender,ylim=c(-200,120))
Similar to the BMI vs. health plot, there is a lot of overlap between the genders, but we do see somewhat of a trend of women tending to want to lose more weight and men more often than women either being closer to 0 (OK with current weight) or wanting to gain weight.
Finally, let’s get the mean and standard deviation of weight, and determine what proportion of the weights are within one standard deviation of the mean.
mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
one_sd_under <- mean(cdc$weight) - sd(cdc$weight)
one_sd_over <- mean(cdc$weight) + sd(cdc$weight)
length(which(cdc$weight >= one_sd_under & cdc$weight <= one_sd_over))/nrow(cdc)
## [1] 0.7076
#Just out of curiosity, let's also separate the mean by gender.
aggregate(weight ~ gender,data=cdc,FUN=mean)
## gender weight
## 1 m 189.3227
## 2 f 151.6662
Answer - around 71% of people have weight within one SD of the mean, which is about what we would expect.