Overview: In this lab exercise, you will create graphical and numerical summaries of data.
Objectives: At the end of this lab you will be able to:
Many tasks and commands that were explained in the first lab will be used here with less direction. Refer to the first lab if more direction is needed.
Create a subdirectory named Lab 2 in the PUBHBIO 2210 Labs directory you created in your OneDrive folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
lab-02-summaries-blank.html
lab-02-summaries-blank.Rmd
lab-02-summaries-worksheet-blank.docx
nhanes.RData
If you have not downloaded all of these files, do so now.
Save the four downloaded files in the PUBHBIO 2210 Labs/Lab 2 directory (i.e., the Lab 2 folder you just created).
When working on labs, it is important to keep all related files in the same directory.
Change the author and date information in the lab header.
Sometimes it is desirable to look at a breakdown of the number of people in different categories in a dataset. We will summarize some of the categorical variables in the NHANES dataset.
In the code chunk below, load the nhanes.RData file and print the dataset. The latter part of Lab 1 shows how to load and print a .RData file.
# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
## id race ethnicity sex age familySize urban region
## <int> <fct> <fct> <fct> <int> <int> <fct> <fct>
## 1 1 black not hisp… fema… 56 1 metr… midwe…
## 2 2 white not hisp… fema… 73 1 other west
## 3 3 white not hisp… fema… 25 2 metr… south
## 4 4 white mexican-… fema… 53 2 other south
## 5 5 white mexican-… fema… 68 2 other south
## 6 6 white not hisp… fema… 44 3 other west
## 7 7 black not hisp… fema… 28 2 metr… south
## 8 8 white not hisp… male 74 2 other midwe…
## 9 9 white not hisp… fema… 65 1 other north…
## 10 10 white other hi… fema… 61 3 metr… west
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## # maritalStatus <fct>, healthStatus <ord>,
## # heightInSelf <int>, weightLbSelf <int>, beer <int>,
## # wine <int>, liquor <int>, everSmoke <fct>,
## # smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## # weightKg <dbl>, heightCm <dbl>, waist <dbl>, …
This dataset is ready for analysis: all the variable types were set and value labels were assigned in Lab 1. Make sure that you have opened the right version of the dataset. Look at the variable race. Do you see the words white and black, or the numbers 1 and 2?
You should see the value labels white and black, not the numeric codes.
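One way to see what load() does, and why labels rather than codes should appear, is a small save-and-load round trip. This is only a sketch on a made-up data frame named toy (a hypothetical stand-in, not the nhanes data); it assumes nothing beyond base R.

```r
# Toy stand-in for the lab data; the real nhanes.RData is not used here.
toy <- data.frame(
  id   = 1:4,
  race = factor(c(1, 2, 1, 1), levels = c(1, 2), labels = c("white", "black"))
)

# Save the object to a temporary .RData file, remove it, then load it back.
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)
loaded_names <- load(path)  # load() returns the names of the restored objects

print(loaded_names)      # the object comes back under its original name
print(class(toy$race))   # a factor
print(levels(toy$race))  # the labels, not the underlying codes 1 and 2
```

Because load() restores objects under the names they were saved with, there is no assignment on the left of the call.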
First we will make a frequency table using the tally() function.
For example:
# Not evaluated
tally( ~ sex, data = mydata)
creates a frequency table of the variable sex from the dataset named mydata.
In the code chunk below, make a frequency table for the variable region.
# Enter code here
tally( ~ region, data = nhanes)
## region
## northeast midwest south west
## 14 20 44 22
The ~ region within the tally() function in the above code is a simple formula that describes a model. A formula such as ~ race | region can be read as “race by region” (that is, race broken down within each region) and can be used to create more complex tables.
Using the tally() function, create a table of race by region in the code chunk below.
# Enter code here
tally( ~ race | region, data = nhanes)
## region
## race northeast midwest south west
## white 13 16 29 20
## black 1 4 15 2
We will talk much more about relationships between variables in later modules.
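Counts can be hard to compare when the regions have different sizes. As a hedged sketch, the format argument of tally() can report percentages instead; the data frame toy below is made up for illustration and is not the NHANES sample.

```r
library(mosaic)

# Hypothetical toy data, not the NHANES sample.
toy <- data.frame(
  race   = c("white", "white", "black", "white", "black", "white"),
  region = c("south", "south", "south", "west", "west", "west"),
  stringsAsFactors = TRUE
)

# Counts of race within each region, as in the table above.
counts <- tally( ~ race | region, data = toy)

# The same table as percentages computed within each region.
percents <- tally( ~ race | region, data = toy, format = "percent")

print(counts)
print(percents)
```

With a conditioning variable after the |, the percentages are computed separately for each region, so each region's column adds up to 100.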
Instead of presenting a table, we could also create a bar chart using the gf_bar() function.
For example,
# Not evaluated
gf_bar( ~ sex, data = mydata)
would create a bar chart of sex from mydata.
In the code chunk below, create a bar chart of the region variable from the nhanes data.
# Enter code here
gf_bar( ~ region, data = nhanes)
The bar chart is rather plain. We can make it prettier and more informative by adding a title, proper axis labels, and a caption, using the gf_labs() command.
For example:
# Not evaluated
gf_bar( ~ sex, data = mydata) %>%
  gf_labs(
    title = "Plot Title",
    x = "horizontal axis label",
    y = "vertical axis label",
    caption = "A caption describes the chart."
  )
In the code chunk below, re-create the bar chart of region using more informative labels. Include an appropriate label for the x-axis and the y-axis, as well as a plot title of your choosing.
# Enter code here
gf_bar( ~ region, data = nhanes) %>%
  gf_labs(
    title = "Distribution of Regions",
    x = "Region",
    y = "Number of Individuals",
    caption = "This chart shows the number of people in each region from the sample."
  )
The first continuous variable we will analyze is age.
The functions gf_histogram() and gf_boxplot() produce histograms and boxplots, respectively, using the same command format as gf_bar() above. In the code chunk below, create a histogram and a boxplot for the variable age.
# Enter code here
gf_histogram( ~ age, data = nhanes)
gf_boxplot( ~ age, data = nhanes)
The default histogram looks rather noisy. We can fix this by adjusting the width of the “bins” using the binwidth argument.
For example:
# Not evaluated
gf_histogram( ~ age, data = mydata, binwidth = 5, color="black")
In the code chunk below, create two histograms for age: one with the bin width set to 20, and one with the bin width set to 10. Keep both histograms in this document.
# Enter code here
gf_histogram( ~ age, data = nhanes, binwidth = 20, color="black")
gf_histogram( ~ age, data = nhanes, binwidth = 10, color="black")
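To get a feel for what the bin width changes, we can count how many values land in each bin without drawing anything. The sketch below uses base R's cut() on a made-up vector of ages; the helper bin_counts() is hypothetical, and starting the bins at 0 is an assumption that need not match the bin edges gf_histogram() chooses.

```r
# Made-up ages for illustration only (not the NHANES sample).
ages <- c(17, 22, 25, 31, 38, 44, 48, 53, 56, 61, 65, 68, 73, 74, 90)

# Count the values per bin for a given bin width.
# Bins start at 0 and are closed on the left: [0, 20), [20, 40), ...
bin_counts <- function(x, binwidth) {
  upper  <- (floor(max(x) / binwidth) + 1) * binwidth
  breaks <- seq(0, upper, by = binwidth)
  table(cut(x, breaks = breaks, right = FALSE))
}

wide   <- bin_counts(ages, 20)  # fewer, wider bins: smoother but coarser
narrow <- bin_counts(ages, 10)  # more, narrower bins: more detail, more noise
print(wide)
print(narrow)
```

Halving the bin width doubles the number of bins over the same range, which is why the second histogram shows more detail than the first.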
The favstats() command provides some useful summary statistics.
For example:
# Not evaluated
favstats( ~ height, data = mydata)
prints summary statistics for the variable height from the dataset named mydata.
In the code chunk below, use favstats() with the formula ~ age and data nhanes to compute summary statistics for the variable age.
# Enter code here
favstats( ~ age, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 17 34.75 48.5 70.25 90 51.43 21.52858 100 0
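Each column of the favstats() output has a base R counterpart. The sketch below computes them by hand for a made-up vector x with one missing entry; it assumes favstats() uses R's default quantile definition, so treat the quartiles as an approximation of what the lab output reports rather than a guaranteed match.

```r
# Made-up values with one missing entry (not the NHANES data).
x <- c(2, 4, 4, 5, 7, 9, NA)

# Base R counterparts of the columns favstats() reports.
stats <- c(
  min     = min(x, na.rm = TRUE),
  Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
  median  = median(x, na.rm = TRUE),
  Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
  max     = max(x, na.rm = TRUE),
  mean    = mean(x, na.rm = TRUE),
  sd      = sd(x, na.rm = TRUE),
  n       = sum(!is.na(x)),  # non-missing values only
  missing = sum(is.na(x))    # values recorded as NA
)
print(stats)
```

Note that n counts only the non-missing values, so n plus missing equals the number of rows.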
In the code chunk below, generate (1) summary statistics, (2) a histogram, and (3) a boxplot for the variable beer using commands similar to those above.
# Enter code here
favstats( ~ beer, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 0 0 0 0.5 365 7.131313 38.21412 99 1
gf_histogram( ~ beer, data = nhanes, binwidth = 10, color="black")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
gf_boxplot( ~ beer, data=nhanes)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
Notice that there is a warning when the histogram and boxplot are created. One row was ignored because it contained a “non-finite” value for beer. Looking at the summary statistics, we can see what happened: even though there are 100 rows, the report says that n is 99 and that there is 1 missing value. We can find the row with missing data using filter(), and then use select() to choose just the columns we want to display.
For example:
# Not evaluated
mydata %>%
  filter(is.na(age)) %>%
  select(id, age, sex)
finds the observations in the dataset mydata with a missing value of age and selects the variables id, age, and sex for those observations. It is important to use the is.na() function rather than a comparison. Writing beer == NA does not work, because any comparison with NA is itself NA, so no row ever matches. Writing beer == "NA" does not work either, because it compares beer to the text value "NA", which is not a missing value.
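The three comparisons are easy to try on a small vector. This is a minimal sketch in base R; the vector beer here is made up and holds just three values, one of them missing.

```r
# Made-up values; the second entry is missing.
beer <- c(0, NA, 12)

print(beer == NA)    # every result is NA, so nothing is ever TRUE
print(beer == "NA")  # compares with the text "NA"; still never TRUE
print(is.na(beer))   # TRUE only for the missing entry

# Position of the missing entry.
missing_rows <- which(is.na(beer))
print(missing_rows)
```

Since filter() keeps only the rows where the condition is TRUE, the first two conditions would return no rows at all.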
Using the code chunk below, print id, age, sex, and beer for the subjects in nhanes who are missing values of beer.
# Enter code here
nhanes %>%
  filter(is.na(beer)) %>%
  select(id, age, sex, beer)
## # A tibble: 1 × 4
## id age sex beer
## <int> <int> <fct> <int>
## 1 5 68 female NA
There is a very conspicuous outlier: one subject reports an average consumption of over 300 beers per month! We can find out more about this subject using filter(), then using select() to choose just the columns we want to print.
For example:
# Not evaluated
mydata %>%
  filter(age >= 120) %>%
  select(id, age, sex)
prints the variables id, age, and sex from the dataset mydata for all subjects with age greater than or equal to 120.
In the code chunk below, print the variables id, age, sex, and beer for the subject in nhanes who reported over (greater than) 300 beers per month.
# Enter code here
nhanes %>%
  filter(beer > 300) %>%
  select(id, age, sex, beer)
## # A tibble: 1 × 4
## id age sex beer
## <int> <int> <fct> <int>
## 1 54 48 female 365
We have two options for dealing with the outlier: remove the entire row or change the value of beer to NA (missing). Since we don’t want to damage our original dataset, we will create new datasets and then inspect the data with the outlier removed.
For example, if mydata had outliers with ages of 120 or higher, the following commands would create a dataset without those outliers (i.e., mydata.no.outliers) and a dataset with those values changed to NA (i.e., mydata.outliers.missing), respectively.
# Not evaluated
mydata.no.outliers <- mydata %>% filter(age < 120)
mydata.outliers.missing <- mydata %>%
  mutate(
    age = replace(
      age,        # start with age
      age >= 120, # if age >= 120
      NA          # replace outliers with NA
    )
  )
In the code chunk below, create one dataset with the outlier on beer removed (i.e., a dataset without the outlier), and another dataset with the outlying value of beer changed to NA.
In addition, print summary statistics of beer using favstats() for both datasets created.
# Enter code here
nhanes.no.outliers <- nhanes %>% filter(beer < 300)
favstats( ~ beer, data = nhanes.no.outliers)
## min Q1 median Q3 max mean sd n missing
## 0 0 0 0 103 3.479592 11.89926 98 0
nhanes.outliers.missing <- nhanes %>%
  mutate(
    beer = replace(
      beer,
      beer >= 300,
      NA
    )
  )
favstats( ~ beer, data = nhanes.outliers.missing)
## min Q1 median Q3 max mean sd n missing
## 0 0 0 0 103 3.479592 11.89926 98 2
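The two summaries agree on everything except the missing count: 0 in the first dataset and 2 in the second. The reason is that filter() keeps only rows where the condition is TRUE, so the row that was already missing beer is dropped along with the outlier. The sketch below reproduces this on a made-up data frame named toy (hypothetical, not the NHANES sample) and shows one possible way to keep the missing row, which is an illustration rather than a step the lab requires.

```r
library(dplyr)

# Made-up data: one missing value and one extreme value (not the NHANES sample).
toy <- data.frame(id = 1:5, beer = c(0, 2, NA, 365, 10))

# filter() keeps only rows where the condition is TRUE, so the NA row goes too.
no_outliers <- toy %>% filter(beer < 300)

# Keeping the missing row requires asking for it explicitly.
no_outliers_keep_na <- toy %>% filter(is.na(beer) | beer < 300)

# replace() keeps every row and only sets the extreme value to NA.
outliers_missing <- toy %>%
  mutate(beer = replace(beer, beer >= 300, NA))

print(nrow(no_outliers))
print(nrow(no_outliers_keep_na))
print(sum(is.na(outliers_missing$beer)))
```

Which version is preferable depends on whether later analyses need the other variables recorded for those subjects.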
Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click “Submit Assignment” to complete your submission.