Intro to Data

Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.

source("more/cdc.R")

Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information: the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

To view the names of the variables in the cdc data frame, use the names function.

names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair, or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or had not (0). The other variables record the respondent’s height in inches (height), weight in pounds (weight), desired weight (wtdesire), age in years (age), and gender.

1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

There are 20,000 rows, each representing one case.

There are 9 variables: genhlth - categorical (ordinal), exerany - categorical, hlthplan - categorical, smoke100 - categorical, height - numeric (discrete), weight - numeric (discrete), wtdesire - numeric (discrete), age - numeric (discrete), and gender - categorical.
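The counts and storage types above can also be read off programmatically. The following is a minimal sketch that uses a small made-up data frame (toy is a hypothetical stand-in, not the real cdc data) so that it runs on its own:

```r
# Toy stand-in for cdc (hypothetical values, for illustration only)
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 60),
  weight  = c(175L, 125L, 105L)
)

n_cases <- nrow(toy)           # number of rows (cases)
n_vars  <- ncol(toy)           # number of columns (variables)
types   <- sapply(toy, class)  # storage class of each column

print(dim(toy))
print(types)
```

Note that the storage class is not the same as the statistical type: a 0/1 column such as exerany is stored as a number but is categorical in meaning.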

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum.

2. Create a numerical summary for height and age, and compute the interquartile range for each.

For height and age, the summaries are:

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

The interquartile range (third quartile minus first quartile) for height is

70-64
## [1] 6

The interquartile range for age is

57-31
## [1] 26
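The subtraction above can also be done directly with R's quantile and IQR functions. A minimal sketch on a made-up vector (heights holds hypothetical values, not the cdc data):

```r
# Toy heights in inches (hypothetical values)
heights <- c(60, 62, 64, 66, 67, 68, 70, 72, 75)

q <- quantile(heights, c(0.25, 0.75))  # first and third quartiles
iqr_manual  <- unname(q[2] - q[1])     # Q3 - Q1, as computed by hand above
iqr_builtin <- IQR(heights)            # the same calculation in one call

print(q)
print(iqr_builtin)
```

On the real data, IQR(cdc$height) and IQR(cdc$age) should agree with the hand calculation, up to the rounding that summary applies to its printed output.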

Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

table(cdc$gender)
## 
##     m     f 
##  9569 10431

There are 9569 males in the sample.

The relative frequency distributions for gender and exerany:

table(cdc$gender)/20000
## 
##       m       f 
## 0.47845 0.52155
table(cdc$exerany)/20000
## 
##      0      1 
## 0.2543 0.7457

The share of the sample reporting excellent health, expressed as a percentage (a proportion of about 0.233):

sum(cdc$genhlth == "excellent")/20000*100
## [1] 23.285
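Dividing by a hard-coded 20000 works for this sample but breaks if the number of rows changes. As a sketch of an alternative, prop.table divides a table by its own total, and the mean of a logical vector gives the proportion of TRUE values. The vector below is made up for illustration:

```r
# Toy responses (hypothetical values)
genhlth <- factor(c("excellent", "good", "good", "fair", "excellent"))

rel_freq <- prop.table(table(genhlth))          # counts divided by their total
prop_excellent <- mean(genhlth == "excellent")  # share of TRUE values

print(rel_freq)
print(prop_excellent)
```

With the real data, the equivalent calls would be prop.table(table(cdc$gender)) and mean(cdc$genhlth == "excellent").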

3. What does the mosaic plot reveal about smoking habits and gender?

mosaicplot(table(cdc$gender,cdc$smoke100))

The mosaic plot indicates that males were more likely than females to report having smoked at least 100 cigarettes.
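A mosaic plot is a picture of a two-way table, so the same comparison can be checked numerically with row proportions. A minimal sketch on made-up data (the gender and smoke100 vectors here are hypothetical, not taken from cdc):

```r
# Toy data (hypothetical values)
gender   <- factor(c("m", "m", "m", "f", "f", "f", "f"), levels = c("m", "f"))
smoke100 <- c(1, 1, 0, 1, 0, 0, 0)

counts    <- table(gender, smoke100)         # two-way table of counts
row_props <- prop.table(counts, margin = 1)  # proportions within each gender

print(counts)
print(row_props)
```

Each row of row_props sums to 1, so the "1" column gives the share of smokers within each gender, which is what the relative tile heights in the mosaic plot represent.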

Quantitative data

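Body mass index (BMI) is computed here as weight in pounds divided by the square of height in inches, multiplied by 703. The following two lines first make a new object called bmi and then create box plots of these values, defining groups by the variable cdc$genhlth.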

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

4. What does this box plot show?

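The box plots show that BMI tends to be higher among respondents who report worse general health: the distributions shift upward as the health rating moves from excellent toward poor.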
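The factor 703 converts pounds per square inch into the usual BMI units of kilograms per square meter. To put numbers behind the box plot, one option is to compute the median BMI within each health category with tapply. This is a sketch on made-up values; the helper bmi_from_imperial is a hypothetical name introduced only for illustration:

```r
# Hypothetical helper: BMI from weight in pounds and height in inches
bmi_from_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Toy data (hypothetical values)
weight  <- c(175, 125, 105, 200)
height  <- c(70, 64, 60, 68)
genhlth <- factor(c("good", "excellent", "excellent", "good"))

bmi <- bmi_from_imperial(weight, height)
median_by_health <- tapply(bmi, genhlth, median)  # one median per category

print(median_by_health)
```

On the real data, tapply(bmi, cdc$genhlth, median) would give the center line of each box in the plot.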

Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$age)

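The variable chosen here is age. Strictly speaking, age is numerical rather than categorical, but boxplot treats each distinct age as its own group. There does seem to be a general trend towards a higher BMI during middle age.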
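Because age is numerical, one way to turn it into a genuinely categorical variable is to bin it with cut before plotting. The break points and labels below are arbitrary choices made for illustration, and the ages are made up:

```r
# Toy ages (hypothetical values)
age <- c(18, 25, 34, 35, 49, 50, 64, 65, 80, 99)

# Intervals are right-closed by default: (17,34], (34,49], (49,64], (64,Inf]
age_group <- cut(age,
                 breaks = c(17, 34, 49, 64, Inf),
                 labels = c("18-34", "35-49", "50-64", "65+"))

print(table(age_group))
```

With the real data, boxplot(bmi ~ cut(cdc$age, ...)) would then draw one box per age group instead of one box per year of age, which may make a trend easier to see.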

On Your Own

- Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight, cdc$wtdesire, pch=".")

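The points cluster along a distinct line, showing a positive relationship: people who weigh more generally have higher desired weights. Most points fall on or below the line where desired weight equals current weight, so most respondents want to weigh the same as or less than they currently do.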

- Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff <- cdc$wtdesire - cdc$weight
str(wdiff)
##  int [1:20000] 0 -10 0 -8 -20 0 -9 -10 -20 -10 ...
summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

- What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

wdiff is numerical (stored as integers, so discrete). If wdiff is 0, there is no difference between the desired weight and the current weight. If it is positive, the desired weight is higher than the current weight, and if it is negative, the desired weight is lower than the current weight.
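The three cases can be made explicit by classifying each value by its sign. A minimal sketch on made-up values (the labels "lose", "keep", and "gain" are names chosen here for illustration):

```r
# Toy differences, desired weight minus current weight (hypothetical values)
wdiff <- c(0L, -10L, 0L, -8L, -20L, 5L)

# sign() returns -1, 0, or 1; map those onto readable labels
goal <- factor(sign(wdiff),
               levels = c(-1, 0, 1),
               labels = c("lose", "keep", "gain"))

print(table(goal))
```

Applied to the real wdiff, this table would count how many respondents want to lose weight, stay the same, or gain weight.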

- Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

hist(wdiff, breaks = 100)

boxplot(wdiff, ylim = c(-100,100))

The histogram shows a nearly symmetric but slightly left-skewed distribution. It is centered below zero, with a median of -10 pounds and a mean of -14.59 pounds, and the middle half of the values spans from -21 to 0 pounds (an interquartile range of 21 pounds). The box plot confirms that the median is negative. Together, the plots show that more people want to lose weight than want to gain it.

- Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

mweight <- subset(cdc$weight, cdc$gender == "m")
mwtdesire <- subset(cdc$wtdesire, cdc$gender == "m")
mwdiff <- mwtdesire - mweight
summary(mwdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00
fweight <- subset(cdc$weight, cdc$gender == "f")
fwtdesire <- subset(cdc$wtdesire, cdc$gender == "f")
fwdiff <- fwtdesire - fweight
summary(fwdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00
boxplot(wdiff ~ cdc$gender, ylim = c(-100,100))

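Women tend to want to lose more weight than men. The median difference is -10 pounds for women versus -5 pounds for men, the mean is -18.15 versus -10.71, and the first quartile is -27 versus -20.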
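The per-gender summaries can also be computed without building separate vectors for each group. A sketch using tapply on made-up values (not the cdc data):

```r
# Toy data (hypothetical values)
wdiff  <- c(-5, 0, -20, -10, -30, -5)
gender <- factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f"))

median_by_gender   <- tapply(wdiff, gender, median)    # center of each group
share_wanting_loss <- tapply(wdiff < 0, gender, mean)  # share with wdiff < 0

print(median_by_gender)
print(share_wanting_loss)
```

The second quantity addresses a slightly different question from the median: it measures how many people in each group want to lose weight, rather than how much they want to lose.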

- Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0
mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
within <- ifelse( (cdc$weight > (mean(cdc$weight)-1*sd(cdc$weight))) &
                  (cdc$weight < (mean(cdc$weight)+1*sd(cdc$weight))), 1, 0)
totalSamples <- nrow(cdc)

sum(within) / totalSamples
## [1] 0.7076

The proportion of the weights that are within one standard deviation of the mean is 70.76%.
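The same calculation can be written more compactly by taking the mean of a logical vector. This is a sketch only; prop_within_sd is a hypothetical helper name, and the weights are made up. It keeps the strict inequalities used above:

```r
# Hypothetical helper: share of values strictly within k SDs of the mean
prop_within_sd <- function(x, k = 1) {
  mean(abs(x - mean(x)) < k * sd(x))
}

# Toy weights in pounds (hypothetical values)
weights <- c(120, 140, 150, 160, 165, 170, 180, 190, 210, 300)

print(prop_within_sd(weights))         # within one standard deviation
print(prop_within_sd(weights, k = 3))  # within three standard deviations
```

Calling prop_within_sd(cdc$weight) on the real data should reproduce the proportion computed above, and it avoids naming a variable within, which masks a base R function of the same name.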