Data Analysis and Statistical Inference

The Behavioral Risk Factor Surveillance System (http://www.cdc.gov/brfss) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. The .Rdata file can be found at http://s3.amazonaws.com/assets.datacamp.com/course/dasi/cdc.Rdata.

# cdc data frame
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/cdc.Rdata"))

Which variables are you working with?

The object cdc is a data matrix, with each row representing a case, and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.

To see what kind of data your data frame contains, you can use the names function in R. This function returns a vector of variable names, in which each name corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health from excellent down to poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime. More information about the survey is available at http://www.cdc.gov/brfss.

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

Taking a peek at your data

You can have a look at the first or last few entries (rows) of your data frame with the commands head and tail, respectively.

Note that you could also look at all of the data frame at once by typing its name into the console. However, since cdc has 20,000 rows, this would mean flooding the console screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.

head(cdc)

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

tail(cdc)

##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m
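
Both head and tail accept an optional second argument giving the number of rows to return. A minimal sketch on a small stand-in data frame (the name toy and the id column are made up for illustration; the heights are the first ten values of cdc$height):

```r
# Small stand-in for cdc; 'toy' and 'id' are illustrative names
toy <- data.frame(id = 1:10,
                  height = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70))

first3 <- head(toy, n = 3)  # first three rows
last2  <- tail(toy, n = 2)  # last two rows

first3
last2
```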

dim(cdc)

## [1] 20000     9

class(cdc$genhlth)

## [1] "factor"

table(cdc$smoke100)

## 
##     0     1 
## 10559  9441

Turning info into knowledge

Numerical data

Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information: the data.

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. Let’s start with a numerical summary.

You can use the functions mean, var and median to calculate the (surprise, surprise) mean, variance and median of certain variables of your data frame.

The function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum.

mean(cdc$weight)

## [1] 169.7

var(cdc$weight)

## [1] 1606

median(cdc$weight)

## [1] 165

summary(cdc$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      68     140     165     170     190     500
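
The variance is measured in squared units (pounds squared here), so its square root, the standard deviation, is often easier to interpret. A small sketch using the six weights printed by head(cdc); the vector name w is made up for illustration. For the full sample, the square root of 1606 is about 40 pounds.

```r
# Weights (in pounds) of the first six respondents, copied from head(cdc)
w <- c(175, 125, 105, 132, 150, 114)

mean(w)    # 133.5
median(w)  # 128.5

# sd() is the square root of var()
all.equal(sd(w), sqrt(var(w)))  # TRUE
```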

Categorical data

Another type of data you can analyze is categorical data. For this type of variable it makes more sense to look at absolute or relative frequency.

The function table counts the number of times each kind of category occurs in a variable. For example, to see the number of people who have smoked at least 100 cigarettes in their lifetime, try table(cdc$smoke100) in the console.

You can also get the relative frequencies by dividing the table by the number of observations in your data frame.

# Create the frequency table here:
table(cdc$genhlth)

## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677

# Create the relative frequency table here:
table(cdc$genhlth) / length(cdc$genhlth)

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
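
Base R also provides prop.table, which divides each count by the total for you. A sketch that rebuilds the counts by hand so it runs without the data file; on the real data the equivalent call would be prop.table(table(cdc$genhlth)). The names counts and rel_freq are illustrative.

```r
# Counts copied from the frequency table of genhlth
counts <- c(excellent = 4657, "very good" = 6972, good = 5675,
            fair = 2019, poor = 677)

rel_freq <- prop.table(counts)  # same as counts / sum(counts)
rel_freq
sum(rel_freq)  # relative frequencies add up to 1
```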

Creating your first barplot

Now that you have calculated the relevant frequencies in the previous exercise, you can represent them graphically. A barplot is a very appropriate type of graph for this task.

Making a barplot is easy: pass the frequency table to the function barplot, for example by nesting the table call inside the barplot call.

barplot(table(cdc$smoke100))
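
A bar plot is easier to read with axis labels and readable category names. A sketch using the smoke100 counts printed earlier; the label text and the name smoke_counts are illustrative choices, not part of the lab.

```r
# Counts copied from table(cdc$smoke100)
smoke_counts <- c("0" = 10559, "1" = 9441)

# barplot() draws the bars and returns the x positions of their midpoints
mids <- barplot(smoke_counts,
                names.arg = c("No", "Yes"),
                xlab = "Smoked at least 100 cigarettes",
                ylab = "Number of respondents",
                main = "smoke100")
```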

Questions

Create a numerical summary for gender.

How many males are in the sample?

summary(cdc$gender)

##     m     f 
##  9569 10431

Compute the relative frequency distribution of genhlth.

What proportion of the sample reports being in excellent health? Choose the closest answer. Remember there are 20000 observations in the sample.

table(cdc$genhlth)

## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677

table(cdc$genhlth) / length(cdc$genhlth)

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385

Even prettier: the Mosaic Plot

The table command can be used to tabulate any number of variables that you provide. This means you can investigate how different categories relate to each other. For example, you can see how many people have smoked at least 100 cigarettes in the different general health groups by executing table(cdc$genhlth, cdc$smoke100). (Try it in the console!)

To have a nice graphical representation of this distribution you can use the mosaicplot command.

table(cdc$genhlth, cdc$smoke100)

##            
##                0    1
##   excellent 2879 1778
##   very good 3758 3214
##   good      2782 2893
##   fair       911 1108
##   poor       229  448

mosaicplot(table(cdc$genhlth, cdc$smoke100))

gender_smokers = table(cdc$gender, cdc$smoke100)
gender_smokers

##    
##        0    1
##   m 4547 5022
##   f 6012 4419

# Plot the mosaicplot:

mosaicplot(gender_smokers)

What does the mosaic plot reveal about smoking habits and gender?

Remember: 1 indicates that a respondent has smoked at least 100 cigarettes.

Answer: males are more likely than females to have smoked at least 100 cigarettes.
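
The mosaic plot can be backed up with row proportions. A sketch that rebuilds the gender by smoke100 counts as a matrix so it runs on its own; with the real table, prop.table(gender_smokers, 1) would do the same. Roughly 52% of men versus 42% of women report having smoked at least 100 cigarettes.

```r
# Counts copied from the gender_smokers table
gender_smokers <- matrix(c(4547, 5022,
                           6012, 4419),
                         nrow = 2, byrow = TRUE,
                         dimnames = list(gender = c("m", "f"),
                                         smoke100 = c("0", "1")))

# margin = 1 divides each row by its row total
row_props <- prop.table(gender_smokers, margin = 1)
round(row_props, 3)
```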

Interlude: How R thinks about data

which(names(cdc) == 'height')

## [1] 5

which(names(cdc) == 'weight')

## [1] 6

# Create the subsets:
height_1337 = cdc[1337, 5]
weight_111 = cdc[111, 6]

# Print the results:
height_1337

## [1] 70

weight_111

## [1] 210

which(names(cdc) == 'hlthplan')

## [1] 3

which(names(cdc) == 'height')

## [1] 5

first8 = cdc[1:8, 3:5]

which(names(cdc) == 'weight')

## [1] 6

which(names(cdc) == 'gender')

## [1] 9

wt_gen_10_20 = cdc[10:20, 6:9]

# Print the subsets:
first8

##   hlthplan smoke100 height
## 1        1        0     70
## 2        1        1     64
## 3        1        1     60
## 4        1        0     66
## 5        1        0     61
## 6        1        0     64
## 7        1        0     71
## 8        1        0     67

wt_gen_10_20

##    weight wtdesire age gender
## 10    180      170  44      m
## 11    186      175  46      m
## 12    168      148  62      m
## 13    185      220  21      m
## 14    170      170  69      m
## 15    170      170  23      m
## 16    185      175  79      m
## 17    156      150  47      m
## 18    185      185  76      m
## 19    200      190  43      m
## 20    125      120  33      f

resp205 = cdc[205,]
ht_wt = cdc[, c('height', 'weight')]

# Print the subsets:
resp205

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 205    fair       0        0        1     61    200      125  49      f

head(ht_wt)

##   height weight
## 1     70    175
## 2     64    125
## 3     60    105
## 4     66    132
## 5     61    150
## 6     64    114

resp1000_smk = cdc$smoke100[1000]
first30_ht = cdc$height[1:30]

# Print the subsets:
resp1000_smk

## [1] 1

first30_ht

##  [1] 70 64 60 66 61 64 71 67 65 70 69 69 66 70 69 73 67 71 75 67 69 65 73
## [24] 67 64 68 67 69 61 74

A little more on subsetting

It’s often useful to extract all individuals (cases) in a data frame that have specific characteristics. You can accomplish this through conditioning commands.

First, consider expressions like cdc$gender == "m" or cdc$age > 30 (try them in the console!). These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male or older than 30, respectively.

Suppose now you want to extract just the data for the men in the sample, or just for those over 30. You can simply use subset to do that. For example, the command subset(cdc, cdc$gender == "m") will return a data frame that only contains the men from the cdc data frame. (Note the double equal sign!)

very_good = subset(cdc, cdc$genhlth == "very good")
age_gt50 = subset(cdc, cdc$age > 50)

head(very_good)

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 20 very good       1        1        0     67    125      120  33      f
## 21 very good       1        1        0     69    200      150  48      f

head(age_gt50)

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 12      fair       1        1        1     69    168      148  62      m
## 14 excellent       1        1        1     70    170      170  69      m
## 16      good       1        1        1     73    185      175  79      m
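
Under the hood, subset keeps the rows where the condition is TRUE, which is the same thing bracket indexing does with a logical vector. A sketch on a made-up data frame (toy and its values are hypothetical, not taken from cdc):

```r
# Hypothetical toy data frame standing in for cdc
toy <- data.frame(gender = c("m", "f", "f", "m", "f"),
                  age    = c(77, 33, 49, 25, 55))

is_male <- toy$gender == "m"  # one TRUE/FALSE value per row
is_male
sum(is_male)                  # TRUE counts as 1, so this counts the men

# These two lines select the same rows
men_a <- subset(toy, gender == "m")
men_b <- toy[is_male, ]
all.equal(men_a, men_b)
```

Inside subset, column names can be used directly (gender rather than toy$gender); both spellings work.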

Subset - one last time

What makes conditioning commands really powerful is the fact that you can use several of these conditions together with the logical operators & and |.

The & is read “and” so that subset(cdc, cdc$gender == "f" & cdc$age > 30) will give you the data for women over the age of 30. The | character is read “or” so that subset(cdc, cdc$gender == "f" | cdc$age > 30) will take people who are women or over the age of 30.

In principle, you may use as many “and” and “or” clauses as you like when forming a subset.

# Create the subset:
under23_and_smoke = subset(cdc, cdc$age < 23 & cdc$smoke100 == 1)

# Print the result
head(under23_and_smoke)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f

How many observations are in the subset under23_and_smoke that you created in the previous exercise, i.e. how many people in the sample are under the age of 23 and have smoked at least 100 cigarettes in their lifetime?

sum(cdc$age < 23 & cdc$smoke100 == 1)

## [1] 620
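
The difference between & and | is easiest to see by counting. A sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical toy data frame standing in for cdc
toy <- data.frame(age      = c(21, 18, 40, 22, 65, 30),
                  smoke100 = c(1, 0, 1, 1, 0, 0))

both   <- toy$age < 23 & toy$smoke100 == 1  # both conditions must hold
either <- toy$age < 23 | toy$smoke100 == 1  # at least one must hold

sum(both)    # 2
sum(either)  # 4

# Counting the rows of the subset gives the same answer as summing the condition
nrow(subset(toy, age < 23 & smoke100 == 1)) == sum(both)
```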

Visualizing with box plots

With our subsetting tools in hand, let’s return to the task of the day: making basic summaries of the BRFSS questionnaire.

You already looked at categorical data such as smoke100 and gender so now let’s turn our attention to numerical data. A common way to visualize numerical data is with a box plot.

To construct a box plot for a single variable, you use the boxplot function. Example: boxplot(cdc$weight).

boxplot(cdc$height)

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    48.0    64.0    67.0    67.2    70.0    93.0
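
By default, boxplot draws each whisker out to the most extreme observation within 1.5 times the interquartile range of the box and plots anything beyond that as an individual point. The cutoffs can be worked out by hand from the quartiles printed by summary. This is only a sketch: boxplot computes the box from hinges, which can differ slightly from the quartiles that summary reports.

```r
# Quartiles of cdc$height copied from the summary() output
q1 <- 64
q3 <- 70

iqr   <- q3 - q1          # 6
lower <- q1 - 1.5 * iqr   # 55
upper <- q3 + 1.5 * iqr   # 79

c(lower = lower, upper = upper)
```

With a minimum of 48 and a maximum of 93 inches, the height data has points beyond both cutoffs.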

More on box plots

A box plot provides a thumbnail sketch of a variable, which makes it well suited to comparing a variable across several categories. So you can, for example, compare the heights of men and women with boxplot(cdc$height ~ cdc$gender).

This notation is new. The ~ operator can be read “versus” or “as a function of”. So you’re asking R to give you a box plot of heights where the groups are defined by gender.

boxplot(cdc$weight ~ cdc$smoke100)
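
The formula can also refer to columns by name when the data frame is passed through the data argument, and tapply gives a numerical counterpart to the grouped plot. A sketch on the first eight respondents as printed earlier in the lab (the names toy and med are illustrative):

```r
# Heights and genders of the first eight respondents, copied from the output above
toy <- data.frame(height = c(70, 64, 60, 66, 61, 64, 71, 67),
                  gender = c("m", "f", "f", "f", "f", "f", "m", "m"))

# With data = toy, the formula can use bare column names
boxplot(height ~ gender, data = toy)

# The same grouping, summarised numerically: median height per gender
med <- tapply(toy$height, toy$gender, median)
med
```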

One last box plot

You can also calculate new variables that don’t directly show up in your data frame. Say, for example, you want to create a box plot of the sum of weight and height versus gender. This is of course nonsense, but it can be done like this:

w_height = cdc$weight + cdc$height
boxplot(w_height ~ cdc$gender)

Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 observations in cdc. That is, for each of the 20,000 participants, we take their weight and add it to their height. The result is 20,000 sums, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.
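
The element-by-element behaviour is easy to check on a handful of values. A sketch using the first six respondents from head(cdc):

```r
# Weights (lb) and heights (in) of the first six respondents, from head(cdc)
weight <- c(175, 125, 105, 132, 150, 114)
height <- c(70, 64, 60, 66, 61, 64)

w_height <- weight + height  # element by element: 175 + 70, 125 + 64, ...
w_height                     # 245 189 165 198 211 178
length(w_height)             # 6, one sum per respondent
```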

Now let’s try a more meaningful calculation.

* Calculate the BMI for each respondent and assign it to bmi.
* Draw a box plot of the BMI versus the general health of the respondents.

# BMI from weight in pounds and height in inches; 703 converts to kg/m^2
bmi = (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
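
The factor 703 converts pounds per square inch into kilograms per square metre, the scale on which BMI is usually reported. A worked check for the first respondent in head(cdc), who weighs 175 pounds and is 70 inches tall; the helper name bmi_imperial is made up for this sketch.

```r
# BMI from weight in pounds and height in inches
bmi_imperial <- function(weight_lb, height_in) {
  weight_lb / height_in^2 * 703
}

bmi_1 <- bmi_imperial(175, 70)
round(bmi_1, 1)  # 25.1
```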

Which of the following is false based on the box plot of BMI vs. general health?

The median BMI is roughly 25 for all general health categories, and there is a slight increase in median BMI as general health status declines (from excellent to poor).
The IQR increases slightly as general health status declines (from excellent to poor).
Among people with excellent health, there are some with unusually low BMIs compared to the rest of the group.
The distributions of BMIs within each health status group are left skewed.

Answer (the false statement): The distributions of BMIs within each health status group are left skewed.

Histograms

Finally, let’s make some histograms. You can look at the histogram for the age of our respondents with the command hist(cdc$age). Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins.

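With the breaks argument, you have more control over the number of bins. For example, the command hist(cdc$weight, breaks=50) asks for roughly 50 bins. R treats a single number as a suggestion and rounds the break points to convenient values, so the actual number of bins may differ.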

hist(bmi)

hist(bmi, breaks=50)

hist(bmi, breaks=100)
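
To see how many bins R actually used, you can ask hist for the histogram object instead of a drawing. A sketch on simulated values (x is random data generated for this example, not BRFSS data):

```r
set.seed(1)
x <- rnorm(1000, mean = 26, sd = 5)  # simulated values, not BRFSS data

# plot = FALSE returns the histogram object without drawing it
h10  <- hist(x, breaks = 10,  plot = FALSE)
h100 <- hist(x, breaks = 100, plot = FALSE)

length(h10$counts)   # number of bins actually used
length(h100$counts)

# Every observation lands in exactly one bin
sum(h100$counts) == length(x)
```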

Weight vs. Desired Weight

In the previous lab, when exploring how percentages of boys and girls born vary over time (two numerical variables) we used a scatterplot. Now let’s use the same tools to compare people’s actual weight against their desired weight.

plot(cdc$weight, cdc$wtdesire)
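
Two small additions can make the scatterplot easier to read: a reference line where desired weight equals actual weight, and the correlation coefficient as a numerical summary of the linear association. A sketch on the six respondents printed by head(cdc); the axis labels are illustrative choices.

```r
# Actual and desired weights of the first six respondents, from head(cdc)
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)

plot(weight, wtdesire, xlab = "Weight (lb)", ylab = "Desired weight (lb)")
abline(a = 0, b = 1, lty = 2)  # points below this line want to weigh less

r <- cor(weight, wtdesire)  # Pearson correlation, between -1 and 1
r
```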

Based on the plot you made in the previous exercise, which of the following is true about the relationship between weight and desired weight?

moderately weak negative linear association
moderately weak positive linear association
moderately strong positive linear association
moderately weak negative linear association

Answer: moderately strong positive linear association

You’ve reached the end of this lab! At this point, you’ve done a good first pass at analyzing the information in the BRFSS questionnaire. You’ve found an interesting association between smoking and gender, and can now say something about the relationship between people’s assessment of their general health and their own BMI. You’ve also picked up essential computing tools (summary statistics, subsetting, and plots) that will serve you well throughout this course.

DataCamp - Lab 1: Introduction to Data

Alexey

Saturday, June 28, 2014
