Overview: In this lab exercise, you will create graphical and numerical summaries of data.
Objectives: At the end of this lab you will be able to:
Many tasks and commands that were explained in the first lab will be used here with less direction. Refer to the first lab if more direction is needed.
Create a subdirectory named Lab 2 in the PUBHBIO 2210 Labs directory you created in Lab 1.
Download all Lab 2 files from Carmen to the Lab 2 directory created.
Change the author and date information in the lab header.
Sometimes it is desirable to look at a breakdown of the number of people in different categories in a dataset. We will summarize some of the categorical variables in the NHANES dataset.
In the code chunk below, load the nhanes.RData file and print the dataset. Refer to Lab 1 for how to load and print a .RData file.
# Enter code here
load("nhanes.RData")
nhanes
## # A tibble: 100 x 33
##       id race  ethnicity sex     age familySize urban region
##    <int> <fct> <fct>     <fct> <int>      <int> <fct> <fct>
##  1     1 black not hisp~ fema~    56          1 metr~ midwe~
##  2     2 white not hisp~ fema~    73          1 other west
##  3     3 white not hisp~ fema~    25          2 metr~ south
##  4     4 white mexican-~ fema~    53          2 other south
##  5     5 white mexican-~ fema~    68          2 other south
##  6     6 white not hisp~ fema~    44          3 other west
##  7     7 black not hisp~ fema~    28          2 metr~ south
##  8     8 white not hisp~ male     74          2 other midwe~
##  9     9 white not hisp~ fema~    65          1 other north~
## 10    10 white other hi~ fema~    61          3 metr~ west
## # ... with 90 more rows, and 25 more variables: pir <dbl>,
## # yrsEducation <int>, maritalStatus <fct>,
## # healthStatus <ord>, heightInSelf <int>,
## # weightLbSelf <int>, beer <int>, wine <int>,
## # liquor <int>, everSmoke <fct>, smokeNow <fct>,
## # active <ord>, SBP <int>, DBP <int>, weightKg <dbl>,
## # heightCm <dbl>, waist <dbl>, tricep <dbl>, thigh <dbl>,
## # BMD <dbl>, RBC <dbl>, lead <dbl>, cholesterol <int>,
## # triglyceride <int>, hdl <int>
This dataset is ready for analysis: all the variable types were set and value labels were assigned in Lab 1. Make sure that you have opened the right version of the dataset. Look at the variable race. Do you see the words white and black or the numbers 1 and 2? You should see the labels white and black, not the numeric codes.
First we will make a frequency table using the tally() function.
For example:
# Not evaluated
tally( ~ sex, data = mydata)
creates a frequency table of the variable sex from the dataset named mydata.
In the code chunk below, make a frequency table for the variable region.
# Enter code here
tally( ~ region, data = nhanes)
## region
## northeast   midwest     south      west
##        14        20        44        22
The ~ region within the tally() function in the above code is a simple formula that describes a model. A formula such as ~ race | region can be read as “race by region”: race is tabulated separately within each region. Formulas like this can be used to create more complex tables.
Using the tally() function, create a table of race by region in the code chunk below.
# Enter code here
tally( ~ race|region, data = nhanes)
##        region
## race    northeast midwest south west
##   white        13      16    29   20
##   black         1       4    15    2
We will talk much more about relationships between variables in later modules.
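As a small illustration of the conditioning formula, the sketch below builds a tiny made-up data frame (the name toy and its values are hypothetical, not NHANES data) and tabulates race within region, first as counts and then as proportions using the format argument of tally(). It assumes the mosaic package is installed.

```r
library(mosaic)

# Hypothetical toy data, for illustration only (not from NHANES)
toy <- data.frame(
  race   = factor(c("white", "white", "black", "white", "black", "white")),
  region = factor(c("south", "south", "south", "west", "west", "west"))
)

# Counts of race within each region
counts <- tally( ~ race | region, data = toy)
counts

# Same table as proportions; each region's column sums to 1
props <- tally( ~ race | region, data = toy, format = "proportion")
props
```

Proportions are often easier to compare than counts when the regions contain different numbers of people.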
Instead of presenting a table, we could also create a bar chart using the gf_bar() function.
For example,
# Not evaluated
gf_bar( ~ sex, data = mydata)
would create a bar chart of sex from mydata.
In the code chunk below, create a bar chart of the region variable from the nhanes data.
# Enter code here
gf_bar( ~region, data=nhanes)
The bar chart is rather plain. We can make it prettier and more informative by adding a title, proper axis labels, and a caption, using the gf_labs() command.
For example:
# Not evaluated
gf_bar( ~ sex, data = mydata) %>%
  gf_labs(
    title = "Plot Title",
    x = "horizontal axis label",
    y = "vertical axis label",
    caption = "A caption describes the chart."
  )
In the code chunk below, re-create the bar chart of region using more informative labels. Include an appropriate label for the x-axis and the y-axis, as well as a plot title of your choosing.
# Enter code here
gf_bar( ~ region, data = nhanes) %>%
  gf_labs(
    title = "Distribution of Sample Across Regions of the United States",
    x = "Region of the United States",
    y = "Number of Individuals",
    caption = "This graph shows how many individuals from the sample live in each of these predefined regions of the United States."
  )
The first continuous variable we will analyze is age. The functions gf_histogram() and gf_boxplot() produce histograms and boxplots, respectively, using the same command format as gf_bar() above. In the code chunk below, create a histogram and a boxplot for the variable age.
# Enter code here
gf_histogram(~age, data=nhanes)
gf_boxplot(~age, data=nhanes)
The default histogram looks rather noisy. We can fix this by adjusting the width of “bins” using the binwidth argument.
For example:
# Not evaluated
gf_histogram( ~ age, data = mydata, binwidth = 5, color="black")
In the code chunk below, create two histograms for age: one with the bin width set to 20, and one with the bin width set to 10. Keep both histograms in this document.
# Enter code here
gf_histogram( ~ age, data = nhanes, binwidth = 20, color="black")
gf_histogram( ~ age, data = nhanes, binwidth = 10, color="black")
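To see why the bin width changes the picture, the counts behind a histogram can be computed directly. The sketch below uses base R's hist() with plot = FALSE on a short made-up vector of ages (hypothetical values, not the NHANES data). It only approximates what gf_histogram() does, since the two functions may place bin edges differently.

```r
# Hypothetical ages, for illustration only
ages <- c(17, 22, 25, 28, 34, 44, 48, 53, 56, 61, 65, 68, 73, 74, 90)

# Bin edges every 20 years and every 10 years, covering 0 to 100
wide   <- hist(ages, breaks = seq(0, 100, by = 20), plot = FALSE)
narrow <- hist(ages, breaks = seq(0, 100, by = 10), plot = FALSE)

wide$counts    # 5 bins: fewer, taller bars
narrow$counts  # 10 bins: more detail, but noisier

# Both settings account for every observation
sum(wide$counts) == length(ages)
sum(narrow$counts) == length(ages)
```

Wider bins smooth out noise but can hide features of the distribution, so it is worth trying more than one width.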
The favstats() command provides some useful summary statistics.
For example:
# Not evaluated
favstats( ~ height, data = mydata)
prints summary statistics for the variable height from the dataset named mydata.
In the code chunk below, use favstats() with the formula ~ age and data nhanes to compute summary statistics for the variable age.
# Enter code here
favstats(~age, data=nhanes)
##  min    Q1 median    Q3 max  mean       sd   n missing
##   17 34.75   48.5 70.25  90 51.43 21.52858 100       0
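Each entry in the favstats() output has a base R counterpart, which can help when interpreting the table. The sketch below computes them by hand for a short made-up vector x (hypothetical values). favstats() may differ in details such as how quartiles are interpolated, so treat this as an approximation rather than its actual implementation.

```r
# Hypothetical values with one missing entry, for illustration only
x <- c(17, 25, 34, 48, 56, 70, 90, NA)

summary_by_hand <- c(
  min     = min(x, na.rm = TRUE),
  Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
  median  = median(x, na.rm = TRUE),
  Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
  max     = max(x, na.rm = TRUE),
  mean    = mean(x, na.rm = TRUE),
  sd      = sd(x, na.rm = TRUE),
  n       = sum(!is.na(x)),   # number of non-missing values
  missing = sum(is.na(x))     # number of missing values
)
summary_by_hand
```

Note that n counts only the non-missing values, so n plus missing equals the number of rows.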
In the code chunk below, generate (1) summary statistics, (2) a histogram, and (3) a boxplot for the variable beer, using the same commands as before.
# Enter code here
favstats(~beer, data=nhanes)
##  min Q1 median  Q3 max     mean       sd  n missing
##    0  0      0 0.5 365 7.131313 38.21412 99       1
gf_histogram(~beer, data=nhanes)
## Warning: Removed 1 rows containing non-finite values
## (stat_bin).
gf_boxplot(~beer, data=nhanes)
## Warning: Removed 1 rows containing non-finite values
## (stat_boxplot).
Notice that there is a warning when the histogram and boxplot are created. One row was ignored because it contained a “non-finite” value for beer. Looking at the summary statistics, we can see what happened: even though there are 100 rows, the report says that n is 99 and that there is 1 missing value. We can find the row with missing data using filter() and then select() to choose just the columns we want to display.
For example:
# Not evaluated
mydata %>%
  filter(is.na(age)) %>%
  select(id, age, sex)
finds the observations in the dataset mydata with a missing value of age and selects the variables id, age, and sex for those observations. It is important to use the is.na() function here. Writing beer == NA does not work, because any comparison with NA is itself NA and so matches nothing, and writing beer == "NA" compares beer to the text value "NA".
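The difference between these comparisons is easy to check directly. A minimal sketch using a made-up vector (the name beer_example and its values are hypothetical):

```r
# Hypothetical values, for illustration only
beer_example <- c(0, 12, NA, 365)

beer_example == NA      # every result is NA, so nothing is matched
is.na(beer_example)     # TRUE only for the missing entry

# Counting missing values works with is.na() but not with == NA
sum(is.na(beer_example))
sum(beer_example == NA, na.rm = TRUE)
```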
Using the code chunk below, print id, age, sex, and beer for the subjects in nhanes who are missing values of beer.
# Enter code here
nhanes %>%
  filter(is.na(beer)) %>%
  select(id, age, sex, beer)
## # A tibble: 1 x 4
##      id   age sex     beer
##   <int> <int> <fct>  <int>
## 1     5    68 female    NA
There is a very conspicuous outlier: one subject reported an average consumption of over 300 beers per month! We can find out more about this subject using filter(), then using select() to choose just the columns we want to print.
For example:
# Not evaluated
mydata %>%
  filter(age >= 120) %>%
  select(id, age, sex)
prints the variables id, age, and sex from the dataset mydata for all subjects with age greater than or equal to 120.
In the code chunk below, print the variables id, age, sex, and beer for the subject in nhanes who reported over (greater than) 300 beers per month.
# Enter code here
nhanes %>%
  filter(beer > 300) %>%
  select(id, age, sex, beer)
## # A tibble: 1 x 4
##      id   age sex     beer
##   <int> <int> <fct>  <int>
## 1    54    48 female   365
We have two options to remove the outlier: remove the entire row or change the value of beer to NA (missing). Since we don’t want to damage our original dataset, we will create new datasets with the outlier removed. We can then inspect the data with the outlier removed.
For example, if mydata had outliers with ages of 120 or higher, the following commands would create a dataset without those outliers (mydata.no.outliers) and a dataset with those values changed to NA (mydata.outliers.missing), respectively.
# Not evaluated
mydata.no.outliers <- mydata %>% filter(age < 120)
mydata.outliers.missing <- mydata %>%
  mutate(
    age = replace(
      age,        # start with age
      age >= 120, # if age >= 120
      NA          # replace outliers with NA
    )
  )
In the code chunk below, create one dataset with the outlier on beer removed (i.e., a dataset without the outlier), and another dataset with the outlying value of beer changed to NA.
In addition, print summary statistics of beer using favstats() for both datasets created.
# Enter code here
nhanes.no.outliers <- nhanes %>% filter(beer < 120)
nhanes.outliers.missing <- nhanes %>%
  mutate(
    beer = replace(
      beer,
      beer >= 120,
      NA
    )
  )
favstats(~beer, data=nhanes.no.outliers)
##  min Q1 median Q3 max     mean       sd  n missing
##    0  0      0  0 103 3.479592 11.89926 98       0
favstats(~beer, data=nhanes.outliers.missing)
##  min Q1 median Q3 max     mean       sd  n missing
##    0  0      0  0 103 3.479592 11.89926 98       2
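The two summaries agree on every statistic except missing, and a small example shows why. filter() drops any row where the condition evaluates to NA, so the row that was already missing beer disappears along with the outlier, while replace() keeps every row. The sketch below assumes the dplyr package is installed and uses a made-up data frame (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical data: two ordinary values, one missing value, one outlier
toy <- data.frame(id = 1:4, beer = c(0, 12, NA, 365))

# Option 1: drop rows; the NA row is dropped too, because NA < 120 is NA
toy_no_outliers <- toy %>% filter(beer < 120)

# Option 2: keep all rows, but set the outlier to NA
toy_outliers_missing <- toy %>%
  mutate(beer = replace(beer, beer >= 120, NA))

nrow(toy_no_outliers)                   # rows kept by filter()
nrow(toy_outliers_missing)              # all rows kept
sum(is.na(toy_outliers_missing$beer))   # original NA plus the outlier
```

Which option is preferable depends on whether the other variables for the outlying subject are still needed in later analyses.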
Please turn in your completed worksheet to Carmen by the due date.