Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.

source("http://www.openintro.org/stat/data/cdc.R")

The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.

To view the names of the variables, type the command

names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

  1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
#How many cases are there in this data set
nrow(cdc)
## [1] 20000
#How many variables

ncol(cdc)
## [1] 9

For each variable, identify its data type (e.g. categorical, discrete)

-genhlth: categorical ordinal

-exerany: numerical discrete

-hlthplan: numerical discrete

-smoke100: numerical discrete

-height: numerical Continuous

-weight: numerical Continuous

-wtdesire: numerical Continuous

-age: numerical Continuous

-gender: categorical nominal

We can have a look at the first few entries (rows) of our data with the command

head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

and similarly we can look at the last few by typing

tail(cdc)
##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, second quartile, and maximum. For weight this is

summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter

190 - 140
## [1] 50

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type

mean(cdc$weight) 
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)
## 
##     0     1 
## 10559  9441

or instead look at the relative frequency distribution by typing

table(cdc$smoke100)/20000
## 
##       0       1 
## 0.52795 0.47205

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the Introduction to R; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

barplot(table(cdc$smoke100))

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:

smoke <- table(cdc$smoke100)

barplot(smoke)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it in as the input for barplot. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. This is another important idea that we’ll return to later.

  1. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
#numerical summary for `height`
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
#numerical summary for `age`
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
#compute the interquartile range for each


#interquartile range for height
70-64
## [1] 6
#interquartile range for age
57-31
## [1] 26
#Compute the relative frequency distribution for`gender`
table(cdc$gender)/20000
## 
##       m       f 
## 0.47845 0.52155
#Compute the relative frequency distribution for`exerany`
table(cdc$exerany)/20000
## 
##      0      1 
## 0.2543 0.7457
#How many males are in the sample
table(cdc$gender)[1]
##    m 
## 9569
#What proportion of the sample reports being in excellent health
table(cdc$genhlth)[1]/20000
## excellent 
##   0.23285

The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.

table(cdc$gender,cdc$smoke100)
##    
##        0    1
##   m 4547 5022
##   f 6012 4419

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

mosaicplot(table(cdc$gender,cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/barplot example above).

  1. What does the mosaic plot reveal about smoking habits and gender?
table(cdc$gender,cdc$smoke100)
##    
##        0    1
##   m 4547 5022
##   f 6012 4419

From the mosaic plot, we can see that male smoker is more probable to smoke more than 100 cigarette than female smoker.Male smoke more heavily compare to female.

Interlude: How R thinks about data

  1. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke<-subset(cdc,cdc$age<23 & cdc$smoke100==1)

head(under23_and_smoke)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f

Quantitative data

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

  1. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

The box plot compares BMI data with different health status. It indicates that better health status have average lower BMI.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi~cdc$exerany)

I choose the exercise data to see whether those people exercise in the past one month have a lower BMI. As the box plot shows, exerany(1) has a lower average BMI compare to exerany(0). Exercise can lower the BMI data.


On Your Own

plot(cdc$weight,cdc$wtdesire,xlab="weight",ylab="wtdesire",main="Weight vs Desired_weight")

The plot is not linear and more and more people want lighter weight when their actual weight increase.
cdc$wdiff<-cdc$wtdesire-cdc$weight

head(cdc$wdiff)
## [1]   0 -10   0  -8 -20   0

Answer: wdiff is a numerical data type. If an observation wdiff is 0, it means respondent’s actual weight is equal to their desired weight. If wdiff is positive or negative, it means that the respondent is blow or above their desired weight, respectively.

summary(cdc$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
hist(cdc$wdiff,breaks=60)

The center of the wdiff is median number -10 and distribution is skewed to the right. The IQR is 21.The plot shows that most people want to be lighter than their actual weight.

summary(cdc$wdiff[cdc$gender == "m"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00
summary(cdc$wdiff[cdc$gender == "f"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00
 boxplot(cdc$wdiff ~ cdc$gender, outline = F,ylab = "wdiff")

From the boxplot, we can see that female have lower weight different, which means female have higher standard of their desired weight.

#mean
meandata<-mean(cdc$weight)
meandata
## [1] 169.683
#standard deviation
sddata<-sd(cdc$weight)
sddata
## [1] 40.08097
#what proportion of the weights are within one standard deviation of the mean.
dim(subset(cdc,cdc$weight<meandata+sddata & cdc$weight>meandata-sddata ))
## [1] 14152    10

Because 14152 records are within one standard deviation of the mean, so the proportion is 14152/20000=70.76%.