knitr::opts_chunk$set(echo = TRUE)

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.

source("http://www.openintro.org/stat/data/cdc.R")

The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.

To view the names of the variables, type the command

names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"   "wtdesire"
## [8] "age"      "gender"

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

  1. How many cases (observations) are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Answer:-

We can have a look at the first few entries (rows) of our data with the command

head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

and similarly we can look at the last few by typing

tail(cdc)
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 19995      good       0        1        1     69    224      224  73      m
## 19996      good       1        1        0     66    215      140  23      f
## 19997 excellent       0        1        0     73    200      185  35      m
## 19998      poor       0        1        0     65    216      150  57      f
## 19999      good       1        1        0     67    165      165  81      f
## 20000      good       1        1        1     69    170      165  83      m

The types of data are given below:

genhlth-categorical exerany-binary hlthplan-binary smoke100-binary height-continuous weight-continuous wtdesire-continuous age-continuous gender-binary

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. We can avoid this problem by using tibbles. We’ll first load the tidyverse package and then convert cdc to a tibble.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
cdc <- as_tibble(cdc)

Now typing cdc to view the data will not cause a problem as a tibble (which is just a special type of data frame) will only print out by default the first 10 rows and show as many of the variables/columns as possible based on the width of the window you have open.

cdc
## # A tibble: 20,000 x 9
##    genhlth   exerany hlthplan smoke100 height weight wtdesire   age gender
##    <fct>       <dbl>    <dbl>    <dbl>  <dbl>  <int>    <int> <int> <fct> 
##  1 good            0        1        0     70    175      175    77 m     
##  2 good            0        1        1     64    125      115    33 f     
##  3 good            1        1        1     60    105      105    49 f     
##  4 good            1        1        0     66    132      124    42 f     
##  5 very good       0        1        0     61    150      130    55 f     
##  6 very good       1        1        0     64    114      114    55 f     
##  7 very good       1        1        0     71    194      185    31 m     
##  8 very good       0        1        0     67    170      160    45 m     
##  9 good            0        1        1     65    150      130    27 f     
## 10 good            1        1        0     70    180      170    44 m     
## # ... with 19,990 more rows

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, second quartile, and maximum. For weight this is

summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter

190 - 140
## [1] 50

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type

mean(cdc$weight) 
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)
## 
##     0     1 
## 10559  9441

or instead look at the relative frequency distribution by typing

table(cdc$smoke100)/20000
## 
##       0       1 
## 0.52795 0.47205

Notice how R automatically divides all entries in the table by 20,000 in the command above.

Next, we make a bar plot of the entries. Since smoke100 is a categorical variable, we need to remember to convert it to a factor.

ggplot(cdc, aes(x = factor(smoke100))) +
  geom_bar()

  1. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
#summary for height
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
#Interquartile Range for Height:
70-64
## [1] 6
#summary for age
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
#Interquartile range for height
57-31
## [1] 26
#Relative frequency distribution for gender 
gender <- cdc$gender
fgender <- table(gender)
relfgender <- fgender/nrow(cdc)
relfgender
## gender
##       m       f 
## 0.47845 0.52155
#Relative frequency distribution for exerany
exerany <- cdc$exerany
fexerany <- table(exerany)
relfexerany <- fexerany/nrow(cdc) 
relfexerany
## exerany
##      0      1 
## 0.2543 0.7457
#how many males are in the sample
table(gender)[1]
##    m 
## 9569
#What proportion of the sample reports being in excellent health?
cat((table(cdc$genhlth)[1]/nrow(cdc))*100,"%")
## 23.285 %

The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.

table(cdc$gender,cdc$smoke100)
##    
##        0    1
##   m 4547 5022
##   f 6012 4419

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a tile plot of this table, we would enter the following command.

cdc %>% mutate(smoke100 = factor(smoke100)) %>% 
        count(gender, smoke100) %>% 
        ggplot(aes(x = gender, y = smoke100, fill = n)) +
          geom_tile()

  1. What does this plot reveal about smoking habits and gender?

The tile plot shows that a randomly selected male is more probable for smoking at least 100 cigarettes than female.

Interlude: How R thinks about data

We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type

dim(cdc)
## [1] 20000     9

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

cdc[567,6]
## # A tibble: 1 x 1
##   weight
##    <int>
## 1    160

which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list of variable names

colnames(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"   "wtdesire"
## [8] "age"      "gender"

To see the weights for the first 10 respondents we can type

cdc[1:10,6]
## # A tibble: 10 x 1
##    weight
##     <int>
##  1    175
##  2    125
##  3    105
##  4    132
##  5    150
##  6    114
##  7    194
##  8    170
##  9    150
## 10    180

In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

Finally, if we want all of the data for the first 10 respondents, type

cdc[1:10,]
## # A tibble: 10 x 9
##    genhlth   exerany hlthplan smoke100 height weight wtdesire   age gender
##    <fct>       <dbl>    <dbl>    <dbl>  <dbl>  <int>    <int> <int> <fct> 
##  1 good            0        1        0     70    175      175    77 m     
##  2 good            0        1        1     64    125      115    33 f     
##  3 good            1        1        1     60    105      105    49 f     
##  4 good            1        1        0     66    132      124    42 f     
##  5 very good       0        1        0     61    150      130    55 f     
##  6 very good       1        1        0     64    114      114    55 f     
##  7 very good       1        1        0     71    194      185    31 m     
##  8 very good       0        1        0     67    170      160    45 m     
##  9 good            0        1        1     65    150      130    27 f     
## 10 good            1        1        0     70    180      170    44 m

By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567th, or rows 1 through 10.

cdc[,6]
## # A tibble: 20,000 x 1
##    weight
##     <int>
##  1    175
##  2    125
##  3    105
##  4    132
##  5    150
##  6    114
##  7    194
##  8    170
##  9    150
## 10    180
## # ... with 19,990 more rows

Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the data set. An alternative method to access the weight data is by referring to the name. We can use any of the variable names to select items in our data set.

Weight<-cdc$weight

The dollar-sign tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing

cdc$weight[567]
## [1] 160

Similarly, for just the first 10 respondents

cdc$weight[1:10]
##  [1] 175 125 105 132 150 114 194 170 150 180

The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and dollar-sign notation are widely used, which one you choose to use depends on your personal preference.

Remembering dplyr

We will generally use the arrange, filter, mutate, select, summarize, and group_by functions of the dplyr package to subset our data.

  1. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime.
library(dplyr)
under23_and_smoke<-cdc %>% filter(smoke100==1 & age<23)
head(under23_and_smoke)
## # A tibble: 6 x 9
##   genhlth   exerany hlthplan smoke100 height weight wtdesire   age gender
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>  <int>    <int> <int> <fct> 
## 1 excellent       1        0        1     66    185      220    21 m     
## 2 very good       1        0        1     70    160      140    18 f     
## 3 excellent       1        1        1     74    175      200    22 m     
## 4 good            1        1        1     64    190      140    20 f     
## 5 very good       1        1        1     62     92       92    21 f     
## 6 very good       1        0        1     64    125      115    22 f

Quantitative data

With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We’ve already looked at categorical data such as smoke and gender so now let’s turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.

ggplot(cdc, aes(x = '', y = height)) +
  geom_boxplot()

You can compare the locations of the components of the box by examining the summary statistics.

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with

ggplot(cdc, aes(x = gender, y = height)) +
  geom_boxplot()

So here, we are asking R to give us box plots of heights where the groups are defined by gender.

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI) (http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:

\[ BMI = \frac{weight~(lb)}{height~(in)^2} * 703 \]

703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).

The following lines create a new variable called bmi and then a box plot of these values for each group defined by the variable cdc$genhlth.

cdc <- cdc %>% mutate(bmi = (weight / height)*703)
ggplot(cdc, aes(x = genhlth, y = bmi)) +
  geom_boxplot()

Remember that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.

  1. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

    The boxplot shows the relationship between a respondent’s BMI and their assesment of their overall health.

    Now, I am picking another variable from data named exerany (Monthly Exercise).

    I think that this variable correlated to ibm as exercise create variation on weight.

bmi = (cdc$weight /cdc$ height)*703
boxplot(bmi ~ cdc$exerany)

From above box plot, it is clear that people who did not exercise in the past month have high bmi’s .

Finally, let’s make some histograms. We can look at the histogram for the age of our respondents:

ggplot(cdc, aes(x = age)) + 
  geom_histogram(fill = 'lightblue', colour = 'black', binwidth = 3)

Histograms are generally a very good way to see the shape of a single distribution, but remember that the shape can change depending on how the data is split between the different bins. You can control the number of bins by specifying the binwidth argument.

At this point, we’ve done a rough first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI.


On Your Own

cor(cdc$weight,cdc$wtdesire)
## [1] 0.8000521

From above result, we can say that the relation between weight and wtdesire is strong.

With increasing the weight, the desired weight also increases.

typeof(New_cdc$wdiff)
## [1] "integer"

Therefore, the type of data of wdiff is an integer.

If an observation ‘wdiff’ is 0,means the person’s weight is also the desired weight for that person. If an observation ‘wdiff’ is negative, the person’s weight is less than his/her wtdesire, means that his/her is underweight. If an observation ‘wdiff’ is positive, the person’s weight is more than his/her wtdesire, means that his/her is overweight.

mean(New_cdc$wdiff)
## [1] -14.5891
var(New_cdc$wdiff)
## [1] 578.2032
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(ggplot(data = New_cdc) +
  geom_histogram(mapping = aes(x = wdiff) ,
                 fill = "orange",
                 color = "blue",
                 bins = 15) )
# finding shape using density
ggplot(New_cdc) +
  geom_density(aes(wdiff))

People feel relax about their current weight because most of people of weight is almost closer to their desired weight.

boxplot(cdc$weight - cdc$wtdesire ~ cdc$gender)

From above box plot, it is clear that there is no major difference between weight of men and women.

mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
plot(density(cdc$weight))

From above density, the shape of curve is almost same as normal distribution. Approximately 68% of values drawn from a normal distribution are within one standerd deviation.