Lab 2

Introduction

Overview: In this lab exercise, you will create graphical and numerical summaries of data.

Objectives: At the end of this lab you will be able to:

Create frequency tables and bar charts and calculate proportions;
Create histograms and boxplots;
Calculate numeric summaries such as means and standard deviations;
Identify outliers in your data.

Part 0: Download and organize files

Many tasks and commands that were explained in the first lab will be used here with less direction. Refer to the first lab if more direction is needed.

Create a subdirectory named Lab 2 in the PUBHBIO 2210 Labs directory you created in your OneDrive folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
1. lab-02-summaries-blank.html
2. lab-02-summaries-blank.Rmd
3. lab-02-summaries-worksheet-blank.docx
4. nhanes.RData
If you have not downloaded all of these files, do so now.
Save the four downloaded files in the PUBHBIO 2210 Labs/Lab 2 directory you just created. When working on labs, it is important to keep all related files in the same directory.
Change the author and date information in the lab header.
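
Before moving on, it can help to confirm that all four files ended up in the same folder. The following is a minimal sketch in base R, not part of the original lab; the helper name missing_lab_files() is hypothetical, and it assumes it is run with the Lab 2 folder as the working directory (or given that folder's path).

```r
# Hypothetical helper: list the expected lab files that are NOT in a folder.
missing_lab_files <- function(dir = ".") {
  expected <- c(
    "lab-02-summaries-blank.html",
    "lab-02-summaries-blank.Rmd",
    "lab-02-summaries-worksheet-blank.docx",
    "nhanes.RData"
  )
  # file.exists() is vectorized, so this checks all four paths at once
  expected[!file.exists(file.path(dir, expected))]
}

# An empty result means every file was found in the current folder
missing_lab_files()
```

If any file names are printed, download those files again and save them next to the others.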

Part 1: Summarizing nominal and ordinal variables

Sometimes it is desirable to look at a breakdown of the number of people in different categories in a dataset. We will summarize some of the categorical variables in the NHANES dataset.

In the code chunk below, load the nhanes.RData file and print it. Refer to the latter part of Lab 1 for how to load and print a .RData file.

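# Enter code here
# Load mosaic for tally(), favstats(), and (via ggformula and dplyr) gf_*() and %>%.
# Assumes mosaic is installed; harmless if the setup chunk already loads it.
library(mosaic)
load("nhanes.RData")
print(nhanes)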

## # A tibble: 100 × 33
##       id race  ethnicity sex     age familySize urban region
##    <int> <fct> <fct>     <fct> <int>      <int> <fct> <fct> 
##  1     1 black not hisp… fema…    56          1 metr… midwe…
##  2     2 white not hisp… fema…    73          1 other west  
##  3     3 white not hisp… fema…    25          2 metr… south 
##  4     4 white mexican-… fema…    53          2 other south 
##  5     5 white mexican-… fema…    68          2 other south 
##  6     6 white not hisp… fema…    44          3 other west  
##  7     7 black not hisp… fema…    28          2 metr… south 
##  8     8 white not hisp… male     74          2 other midwe…
##  9     9 white not hisp… fema…    65          1 other north…
## 10    10 white other hi… fema…    61          3 metr… west  
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <fct>, healthStatus <ord>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>,
## #   wine <int>, liquor <int>, everSmoke <fct>,
## #   smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## #   weightKg <dbl>, heightCm <dbl>, waist <dbl>, …

This dataset is ready for analysis: all the variable types were set and value labels were assigned in Lab 1. Make sure that you have opened the right version of the dataset. Look at the variable race. Do you see the words white and black or the numbers 1 and 2? You should see the labels white and black, not the numeric codes.
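
To see what a labeled variable looks like in R, here is a small sketch on made-up values rather than the nhanes data. It assumes, for illustration only, that code 1 stands for white and code 2 for black.

```r
# Toy numeric codes (assumption: 1 = white, 2 = black)
codes <- c(2, 1, 1, 2)

# factor() attaches a text label to each numeric code
race <- factor(codes, levels = c(1, 2), labels = c("white", "black"))

is.factor(race)      # TRUE when labels have been assigned
levels(race)         # the labels, in code order
as.character(race)   # the label shown for each observation
```

Running is.factor() and levels() on nhanes$race in the same way is one quick check that the labeled version of the dataset is loaded.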

First we will make a frequency table using the tally() function.

For example:

# Not evaluated
tally( ~ sex, data = mydata)

creates a frequency table of the variable sex from the dataset named mydata.

In the code chunk below, make a frequency table for the variable region.

# Enter code here
tally( ~ region, data = nhanes)

## region
## northeast   midwest     south      west 
##        14        20        44        22
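
Counts can be turned into proportions by dividing each count by the total. The sketch below does this in base R using the four counts printed above; the object names are illustrative. With mosaic loaded, tally( ~ region, data = nhanes, format = "proportion") is expected to report the same values.

```r
# Counts copied from the frequency table of region
region_counts <- c(northeast = 14, midwest = 20, south = 44, west = 22)

# Proportion = count in the category / total number of subjects
region_props <- region_counts / sum(region_counts)
region_props
```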

The ~ region inside the tally() function in the code above is a simple one-sided formula that names the variable to tabulate. A formula such as ~ race | region can be read as “race by region” (race within each level of region) and can be used to create more complex tables.

Using the tally() function, create a table of race by region in the code chunk below.

# Enter code here
tally( ~ race | region, data = nhanes)

##        region
## race    northeast midwest south west
##   white        13      16    29   20
##   black         1       4    15    2
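
Because the regions have different sample sizes, the counts are easier to compare as proportions within each region. This is a base R sketch that re-enters the counts from the table above by hand; the object names are illustrative and the nhanes data itself is not used.

```r
# Counts copied from the race-by-region table (filled column by column)
race_by_region <- matrix(
  c(13, 1, 16, 4, 29, 15, 20, 2),
  nrow = 2,
  dimnames = list(
    race   = c("white", "black"),
    region = c("northeast", "midwest", "south", "west")
  )
)

# margin = 2 divides each column by its column total,
# giving the proportion of each race within a region
col_props <- prop.table(race_by_region, margin = 2)
round(col_props, 3)
```

Each column of the result sums to 1, so the rows can be compared across regions directly.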

We will talk much more about relationships between variables in later modules.

STOP! Answer Question 1 now.

Instead of presenting a table, we could also create a bar chart using the gf_bar() function.

For example,

# Not evaluated
gf_bar( ~ sex, data = mydata)

would create a bar chart of sex from mydata.

In the code chunk below, create a bar chart of the region variable from the nhanes data.

# Enter code here
gf_bar( ~ region, data = nhanes)

The bar chart is rather plain. We can make it prettier and more informative by adding a title, proper axis labels, and a caption, using the gf_labs() command.

For example:

# Not evaluated
gf_bar( ~ sex, data = mydata) %>%
  gf_labs(
    title = "Plot Title",
    x = "horizontal axis label",
    y = "vertical axis label",
    caption = "A caption describes the chart."
  )

In the code chunk below, re-create the bar chart of region using more informative labels: include an appropriate label for the x-axis and the y-axis, as well as a plot title of your choosing.

# Enter code here
gf_bar( ~ region, data = nhanes) %>%
  gf_labs(
    title = "Distribution of Regions",
    x = "Region",
    y = "Number of Individuals",
    caption = "This chart shows the number of people in each region from the sample."
  )

STOP! Answer Question 2 now.

Part 2: Summarizing continuous variables

The first continuous variable we will analyze is age. The functions gf_histogram() and gf_boxplot() produce histograms and boxplots, respectively, using the same command format as gf_bar() above. In the code chunk below, create a histogram and a boxplot for the variable age.

# Enter code here
gf_histogram( ~ age, data = nhanes)

gf_boxplot( ~ age, data = nhanes)

The default histogram looks rather noisy. We can fix this by adjusting the width of “bins” using the binwidth argument.

For example:

# Not evaluated
gf_histogram( ~ age, data = mydata, binwidth = 5, color="black")

In the code chunk below, create two histograms for age: one with the bin width set to 20, and one with the bin width set to 10. Keep both histograms in this document.

# Enter code here
gf_histogram( ~ age, data = nhanes, binwidth = 20, color="black")

gf_histogram( ~ age, data = nhanes, binwidth = 10, color="black")
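
To see what the bin width does numerically, the sketch below counts how many values fall into each bin for two widths. It is base R on a small illustrative vector (the ten ages printed earlier plus the minimum 17 and maximum 90), not the full dataset, and the helper bin_counts() is hypothetical. The plotting function may place its bin edges differently, so treat this as an illustration of the idea rather than a reproduction of the plots.

```r
# Illustrative ages only: first ten printed rows plus the min and max
ages <- c(56, 73, 25, 53, 68, 44, 28, 74, 65, 61, 17, 90)

# Hypothetical helper: counts per bin, with bins starting at multiples of binwidth
bin_counts <- function(x, binwidth) {
  lower  <- floor(min(x) / binwidth) * binwidth
  upper  <- (floor(max(x) / binwidth) + 1) * binwidth
  breaks <- seq(lower, upper, by = binwidth)
  h <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)
  setNames(h$counts, head(breaks, -1))   # name each count by its bin's left edge
}

bin_counts(ages, 20)   # few wide bins: smooth but coarse
bin_counts(ages, 10)   # more, narrower bins: more detail, more gaps
```

Wider bins pool more observations into each bar, which hides detail; narrower bins show more detail but can leave empty bins.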

STOP! Answer Question 3 now.

The favstats() command provides some useful summary statistics.

For example:

# Not evaluated
favstats( ~ height, data = mydata)

prints summary statistics for the variable height from the dataset named mydata.
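
To make clear what each column of that summary means, here is a rough base R sketch that computes the same nine quantities for a toy vector. The function name fav_summary() is hypothetical, and the sketch assumes the quartiles are computed with R's default quantile() method; it is not the favstats() implementation.

```r
# Hypothetical helper mirroring the columns that favstats() reports
fav_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(
    min     = min(x, na.rm = TRUE),
    Q1      = q[1],
    median  = q[2],
    Q3      = q[3],
    max     = max(x, na.rm = TRUE),
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    n       = sum(!is.na(x)),   # number of non-missing values
    missing = sum(is.na(x))     # number of missing values
  )
}

# Toy heights with one missing value
heights <- c(150, 160, 170, 180, NA)
fav_summary(heights)
```

Note that n counts only the non-missing values, so n plus missing equals the number of rows.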

In the code chunk below, use favstats() with the formula ~ age and data nhanes to compute summary statistics for the variable age.

# Enter code here
favstats( ~ age, data = nhanes)

##  min    Q1 median    Q3 max  mean       sd   n missing
##   17 34.75   48.5 70.25  90 51.43 21.52858 100       0
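
One common rule of thumb for flagging outliers, and the convention boxplots typically use when drawing individual points, is to mark values more than 1.5 × IQR below Q1 or above Q3. The lab does not state this rule, so the sketch below is an added illustration; it uses only the quartiles printed above for age.

```r
# Quartiles copied from the favstats() output for age
q1 <- 34.75
q3 <- 70.25

iqr   <- q3 - q1            # interquartile range
lower <- q1 - 1.5 * iqr     # values below this would be flagged
upper <- q3 + 1.5 * iqr     # values above this would be flagged

c(iqr = iqr, lower = lower, upper = upper)
```

The observed minimum (17) and maximum (90) both lie inside these limits, so this rule flags no ages.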

STOP! Answer Question 4 now.

In the code chunk below, generate (1) summary statistics, (2) a histogram, and (3) a boxplot for the variable beer using the same commands as before.

# Enter code here
favstats( ~ beer, data = nhanes)

##  min Q1 median  Q3 max     mean       sd  n missing
##    0  0      0 0.5 365 7.131313 38.21412 99       1

gf_histogram( ~ beer, data = nhanes, binwidth = 10, color="black")

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

gf_boxplot( ~ beer, data = nhanes)

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

STOP! Answer Question 5 now.

Notice that there is a warning when the histogram and boxplot are created. One row was ignored because it contained a “non-finite” value for beer. Looking at the summary statistics, we can see what happened: even though there are 100 rows, the report says that n is 99 and that there is 1 missing value. We can find the row with missing data using filter() and then select() to choose just the columns we want to display.

For example:

# Not evaluated
mydata %>%
    filter(is.na(age)) %>%
    select(id, age, sex)

finds the observations in the dataset mydata with a missing value of age and selects the variables id, age, and sex for those observations. To test for missing values, use the is.na() function rather than a comparison. Writing beer == NA does not work because comparing anything with NA gives NA, so no rows match, and writing beer == "NA" compares beer to the text value "NA".
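
The difference between is.na() and a comparison with NA is easy to see on a small vector. This is a base R sketch with made-up values, not the nhanes data.

```r
# Toy vector with one missing value in position 3
beer <- c(0, 12, NA, 365)

beer == NA          # every element is NA: the comparison never returns TRUE
is.na(beer)         # TRUE only where the value is missing
which(is.na(beer))  # position(s) of the missing value(s)
sum(is.na(beer))    # how many values are missing
```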

Using the code chunk below, print id, age, sex, and beer for the subjects in nhanes who are missing values of beer.

# Enter code here
nhanes %>%
    filter(is.na(beer)) %>%
    select(id, age, sex, beer)

## # A tibble: 1 × 4
##      id   age sex     beer
##   <int> <int> <fct>  <int>
## 1     5    68 female    NA

STOP! Answer Question 6 now.

There is a very conspicuous outlier: one subject reports an average consumption of over 300 beers per month! We can find out more about this subject using filter(), then select() to choose just the variables (columns) we want to print.

For example:

# Not evaluated
mydata %>%
    filter(age >= 120) %>%
    select(id, age, sex)

prints the variables id, age, and sex from the dataset mydata for all subjects with age greater than or equal to 120.
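
Here is the same pattern run on a small made-up data frame so the result can be inspected. It assumes the dplyr package is installed; the data values are invented for illustration.

```r
library(dplyr)

# Toy data: two subjects have implausible ages of 120 or more
mydata <- data.frame(
  id  = 1:4,
  age = c(34, 125, 61, 120),
  sex = c("female", "male", "female", "male"),
  bmi = c(22.1, 27.4, 30.2, 24.8)
)

# Keep rows with age >= 120, then keep only three of the columns
old_subjects <- mydata %>%
  filter(age >= 120) %>%
  select(id, age, sex)

old_subjects
```

Note that >= includes the subject whose age is exactly 120, whereas > would not.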

In the code chunk below, print the variables id, age, sex, and beer for the subject in nhanes who reported more than 300 beers per month.

# Enter code here
nhanes %>%
    filter(beer > 300) %>%
    select(id, age, sex, beer)

## # A tibble: 1 × 4
##      id   age sex     beer
##   <int> <int> <fct>  <int>
## 1    54    48 female   365

STOP! Answer Question 7 now.

We have two options for handling the outlier: remove the entire row, or change the value of beer to NA (missing). Since we don’t want to damage our original dataset, we will create new datasets and then inspect the data with the outlier removed.

For example, if mydata had outliers with ages of 120 or higher, the following commands would create a dataset without those outliers (mydata.no.outliers) and a dataset with those values changed to NA (mydata.outliers.missing), respectively.

# Not evaluated
mydata.no.outliers <- mydata %>% filter(age < 120)

mydata.outliers.missing <- mydata %>%
  mutate(
    age = replace(
      age,         # start with age
      age >= 120,  # if age >= 120
      NA           # replace outliers with NA
    )
  )
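
One detail is worth checking on toy data before applying these commands: filter() keeps only the rows where the condition is TRUE, so rows where age is already missing are dropped too, while replace() leaves those rows in place. The sketch below assumes the dplyr package is installed and uses made-up values.

```r
library(dplyr)

# Toy data: one missing age (id 2) and one outlier (id 3)
toy <- data.frame(id = 1:5, age = c(34, NA, 150, 61, 47))

# Option 1: drop rows. The NA row goes too, because NA < 120 is not TRUE.
toy.no.outliers <- toy %>% filter(age < 120)

# Option 2: keep every row, but set the outlying value to NA
toy.outliers.missing <- toy %>%
  mutate(age = replace(age, age >= 120, NA))

nrow(toy.no.outliers)                  # rows remaining after filtering
nrow(toy.outliers.missing)             # all rows are still present
sum(is.na(toy.outliers.missing$age))   # original NA plus the former outlier
```

The two datasets give the same summary statistics for age, but they differ in the number of rows and in the count of missing values.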

In the code chunk below, create one dataset with the outlier on beer removed (i.e., a dataset without the outlier), and another dataset with the outlying value of beer changed to NA.

In addition, print summary statistics of beer using favstats() for both datasets created.

# Enter code here
nhanes.no.outliers <- nhanes %>% filter(beer < 300)

favstats( ~ beer, data = nhanes.no.outliers)

##  min Q1 median Q3 max     mean       sd  n missing
##    0  0      0  0 103 3.479592 11.89926 98       0

nhanes.outliers.missing <- nhanes %>%
  mutate(
    beer = replace(
      beer,         # start with beer
      beer >= 300,  # if beer >= 300
      NA            # replace the outlier with NA
    )
  )

favstats( ~ beer, data = nhanes.outliers.missing)

##  min Q1 median Q3 max     mean       sd  n missing
##    0  0      0  0 103 3.479592 11.89926 98       2
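
The change in the mean can be cross-checked by hand. The sketch below uses only numbers printed in this lab (the two means, the values of n, and the outlier value 365). It assumes the printed means are rounded and that beer holds whole numbers, so the total can be recovered by rounding.

```r
# Total beers across the 99 non-missing subjects (mean * n, rounded to a whole number)
total_with_outlier <- round(7.131313 * 99)

# Remove the single outlying value of 365
total_without_outlier <- total_with_outlier - 365

# Mean over the remaining 98 subjects
mean_without_outlier <- total_without_outlier / 98

total_with_outlier
round(mean_without_outlier, 6)
```

The recomputed mean agrees with the value favstats() reports for both new datasets, which shows how much a single extreme value can pull the mean.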

STOP! Answer Questions 8 and 9 now.

Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three files before you click “Submit Assignment” to complete your submission.

Lab 2

Summarizing Data with Numbers and Pictures

Samuel Gilliland

09/10/2025
