Exploratory Data Analysis With R

Rivers School Science Researchers

Exploratory Data Analysis

How can you do this work?

Now, we don’t want to totally discredit the use of paper

Famous Historical Examples
Can be Excellent for Planning

Famous Historical Examples - W.E.B. Du Bois¹

Famous Historical Examples - Public Health by Florence Nightingale¹

Famous Historical Examples - Charles Minard¹

Planning¹

But now that it’s 2025, how can we do this work?

The coding language = R
The Integrated Development Environment (IDE) = Posit Cloud
How to think of this: You write an English paper in Google Docs. In that case, English is the language, and Google Docs is the environment you use to actually type and edit that paper.

Posit Cloud Layout

RStudio Project Layout

Packages

Running the `library` function = Getting the tools you’ll need out of the shed

Start by Always Getting Your Toolboxes

You do this by running the library function for each package you are going to use.
For this tutorial, we will use two packages:
- tidyverse is a collection of packages whose tools make it easier to visualize, summarize, and wrangle data.
- openintro is a package referenced in our textbook, Introduction to Modern Statistics. Among other things, it includes a number of datasets. During this tutorial, you will generate visualizations similar to those seen in the reading.
The code you type and run

library(tidyverse)
library(openintro)
?births14
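
Before `library` can get a toolbox out of the shed, the package has to be installed once in your project. The sketch below is one way to check for that. The helper name `missing_packages` is made up for this example, and the code only reports what is missing rather than installing anything.

```r
# Hypothetical helper: return the names of packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "openintro"))

if (length(to_install) > 0) {
  # install.packages() only needs to be run once per project.
  message(
    "Run once in the console: install.packages(c(",
    paste0('"', to_install, '"', collapse = ", "), "))"
  )
} else {
  message("Both packages are installed, so library() will work.")
}
```

Installing is like buying the toolbox once. Running `library` is getting it out each time you start working.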

Questions to Ask to Start Getting to Know Your Data

What are the observational units/cases?
What are the variables?
How is my data organized?
- You want Tidy Data
  - Each row is an observation
  - Each column is a variable
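
To make the idea of tidy data concrete, here is a small sketch that reshapes an untidy table into a tidy one. The data frame `wide` and its values are made up for illustration. `pivot_longer` comes from the tidyr package, which is part of the tidyverse.

```r
library(tidyr)

# Made-up data: one row per baby, one column per weigh-in.
# This is not tidy, because the day of the weigh-in is stored in column names.
wide <- data.frame(
  baby_id     = c(1, 2),
  weight_day1 = c(7.1, 6.8),
  weight_day7 = c(7.0, 6.9)
)

# pivot_longer() reshapes it so each row is one observation (one weigh-in)
# and each column is one variable (baby_id, day, weight).
tidy <- wide |>
  pivot_longer(
    cols = starts_with("weight_"),
    names_to = "day",
    names_prefix = "weight_",
    values_to = "weight"
  )

tidy
```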

Data for this Introduction to EDA

Throughout the rest of this tutorial, we will explore the births14 dataset.
- This dataset is included in the openintro package.
- Since it is part of an R package, you can type ?births14 to access its documentation and learn more about it.
Description
- “Every year, the US releases to the public a large data set containing information on births recorded in the country. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from the data set released in 2014.”[@openintro]

Answering Questions to Start Getting to Know the `births14` dataset

Running the head function allows you to see the first 6 rows of the data frame

Type and run this code

head(births14)

# A tibble: 6 × 13
   fage  mage mature      weeks premie visits gained weight lowbirthweight sex  
  <int> <dbl> <chr>       <dbl> <chr>   <dbl>  <dbl>  <dbl> <chr>          <chr>
1    34    34 younger mom    37 full …     14     28   6.96 not low        male 
2    36    31 younger mom    41 full …     12     41   8.86 not low        fema…
3    37    36 mature mom     37 full …     10     28   7.51 not low        fema…
4    NA    16 younger mom    38 full …     NA     29   6.19 not low        male 
5    32    31 younger mom    36 premie     12     48   6.75 not low        fema…
6    32    26 younger mom    39 full …     14     45   6.69 not low        fema…
# ℹ 3 more variables: habit <chr>, marital <chr>, whitemom <chr>
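
If six rows are not enough, a few related functions give other quick looks at the same data frame. This is a minimal sketch that assumes the openintro package is installed.

```r
library(openintro)

# head() takes an optional n argument for the number of rows to show.
head(births14, n = 10)

# tail() works the same way but shows the last rows instead.
tail(births14, n = 3)

# names() lists the variables, and dim() gives the number of rows and columns.
names(births14)
dim(births14)
```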

Answering Questions to Start Getting to Know the `births14` dataset

The glimpse function allows you to see how many variables, what type each is, and how many observations are in the data frame.

Type and run this code

glimpse(births14)

Rows: 1,000
Columns: 13
$ fage           <int> 34, 36, 37, NA, 32, 32, 37, 29, 30, 29, 30, 34, 28, 28,…
$ mage           <dbl> 34, 31, 36, 16, 31, 26, 36, 24, 32, 26, 34, 27, 22, 31,…
$ mature         <chr> "younger mom", "younger mom", "mature mom", "younger mo…
$ weeks          <dbl> 37, 41, 37, 38, 36, 39, 36, 40, 39, 39, 42, 40, 40, 39,…
$ premie         <chr> "full term", "full term", "full term", "full term", "pr…
$ visits         <dbl> 14, 12, 10, NA, 12, 14, 10, 13, 15, 11, 14, 16, 20, 15,…
$ gained         <dbl> 28, 41, 28, 29, 48, 45, 20, 65, 25, 22, 40, 30, 31, NA,…
$ weight         <dbl> 6.96, 8.86, 7.51, 6.19, 6.75, 6.69, 6.13, 6.74, 8.94, 9…
$ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low", …
$ sex            <chr> "male", "female", "female", "male", "female", "female",…
$ habit          <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
$ marital        <chr> "married", "married", "married", "not married", "marrie…
$ whitemom       <chr> "white", "white", "not white", "white", "white", "white…
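
The output above shows a few NA values, which mark information that was not recorded. One possible way to see how many values are missing in each variable is sketched below. The name `na_counts` is just a label chosen for this example.

```r
library(tidyverse)
library(openintro)

# For every column, count how many values are missing (NA).
na_counts <- births14 |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# glimpse() lists the 13 counts vertically so they are easy to read.
glimpse(na_counts)
```

Knowing where values are missing helps explain why later code uses filter with is.na before calculating statistics.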

Now that you know a little about the data, start asking questions.

Keep it simple to start.
Know that you’ll go around the wheel a lot, and new questions will arise from your conclusions, but you have to start somewhere.
Examples:
- How much do babies in this sample tend to weigh?
- How many weeks did pregnancies tend to last?
- What percent of babies were born to mothers who were married?
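
The third example question can be approached with a short count. This is one possible sketch, not the only way to do it. The names `marital_percent` and `Percent` are chosen for this example, and any babies with a missing marital value would appear as their own row.

```r
library(tidyverse)
library(openintro)

marital_percent <- births14 |>
  count(marital) |>                      # number of babies in each category
  mutate(Percent = n / sum(n) * 100)     # convert the counts to percentages

marital_percent
```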

How much do babies in this sample tend to weigh?

Start by visualizing the distribution of the weight variable.

births14 |>
  ggplot(aes(x = weight)) +
  geom_histogram(binwidth = 0.5) +
  labs(
    title = "How much do babies in this sample weigh at birth?",
    x = "Weight (Pounds)",
    y = "Number of Babies"
  )

Notes:
- |> is a pipe symbol and should be read as “AND THEN”
- + is used to add layers to a ggplot.
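
To see what “AND THEN” means in practice, the sketch below writes the same calculation two ways using made-up numbers. The native pipe |> requires R 4.1 or later.

```r
values <- c(1, 4, 9)

# Nested version: you have to read it from the inside out.
nested <- sum(sqrt(values))

# Piped version: take values, AND THEN take square roots, AND THEN add them up.
piped <- values |>
  sqrt() |>
  sum()

# Both approaches give the same answer.
identical(nested, piped)
```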

How much do babies in this sample weigh? Statistics

We now need to generate statistics to better describe the weights.
Central Tendency
- Mean
- Median
Spread of the data
- Standard Deviation: Roughly the typical distance of the observations from the mean
- Interquartile Range: The width of the middle 50% of the data (third quartile minus first quartile)
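
To show what the two measures of spread are doing, the sketch below calculates them step by step on a small made-up vector and compares the results with R’s built-in sd and IQR functions.

```r
# Made-up values for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standard deviation: square the distances from the mean, add them up,
# divide by n - 1, then take the square root.
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Interquartile range: third quartile minus first quartile.
quartiles <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr_by_hand <- quartiles[2] - quartiles[1]

# Compare with the built-in functions.
all.equal(sd_by_hand, sd(x))
all.equal(iqr_by_hand, IQR(x))
```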

Code to generate these statistics

births14 |>                               # 1
  filter(!is.na(weight)) |>               # 2
  summarize(                              # 3
    Mean = mean(weight),                  # 4
    Median = median(weight),              # 5
    Standard_Deviation = sd(weight),      # 6
    IQR = IQR(weight)                     # 7
  )

1: We’ll start by asking the computer to reference the births14 data
2: Next, we’ll use the filter function to remove any observations where the weight was not recorded
3: summarize is the function we will use to generate the statistics.
4: mean is what is doing the calculation of the weight variable, while Mean is the name we are giving this calculation.
5: median is what is doing the calculation of the weight variable, while Median is the name we are giving this calculation.
6: sd is what is doing the calculation of the weight variable, while Standard_Deviation is the name we are giving this calculation.
7: The IQR function (right of the equals sign) is what is doing the calculation of the weight variable, while IQR (left of the equals sign) is the name we are giving this calculation.

# A tibble: 1 × 4
   Mean Median Standard_Deviation   IQR
  <dbl>  <dbl>              <dbl> <dbl>
1  7.20   7.31               1.31  1.46
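
Filtering first is one way to handle missing weights. Another common option is the na.rm argument that mean and median accept. The sketch below checks that the two approaches agree. The names `with_filter` and `with_na_rm` are chosen for this example.

```r
library(tidyverse)
library(openintro)

# Approach 1: remove rows with a missing weight, then summarize.
with_filter <- births14 |>
  filter(!is.na(weight)) |>
  summarize(Mean = mean(weight), Median = median(weight))

# Approach 2: keep all rows, but tell each function to skip missing values.
with_na_rm <- births14 |>
  summarize(
    Mean = mean(weight, na.rm = TRUE),
    Median = median(weight, na.rm = TRUE)
  )

all.equal(with_filter, with_na_rm)
```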

Now that you’ve learned something, what new question comes to mind?

What is the relationship between an expecting mother’s smoking habits and the weight of their baby?

More about ggplot layers

What is the relationship between an expecting mother’s smoking habits and the weight of their baby? Visualize

births14 |>
  filter(!is.na(habit)) |>
  ggplot(aes(x = weight)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~habit, nrow = 2)
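
Stacked histograms are one way to compare the two groups. As an alternative sketch, side-by-side boxplots put both groups on the same weight axis. This swaps in geom_boxplot, which the slides above do not use, and the axis labels are chosen for this example.

```r
library(tidyverse)
library(openintro)

habit_boxplot <- births14 |>
  filter(!is.na(habit)) |>                 # drop babies with no recorded habit
  ggplot(aes(x = habit, y = weight)) +     # one box per smoking habit
  geom_boxplot() +
  labs(
    title = "Birth weight by mother's smoking habit",
    x = "Smoking Habit",
    y = "Weight (Pounds)"
  )

habit_boxplot
```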

What is the relationship between an expecting mother’s smoking habits and the weight of their baby? Statistics

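The group_by function does as it says. In this case, it puts babies into groups based on the habit variable: one for those with moms who smoked and one for those with moms who did not smoke. Because this code does not filter out missing values of habit, babies whose mother’s smoking habit was not recorded form a third group, labeled <NA>.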

births14 |> 
  filter(!is.na(weight)) |> 
  group_by(habit) |>
  summarize( 
    Mean = mean(weight), 
    Median = median(weight), 
    Standard_Deviation = sd(weight), 
    IQR = IQR(weight) 
  )

# A tibble: 3 × 5
  habit      Mean Median Standard_Deviation   IQR
  <chr>     <dbl>  <dbl>              <dbl> <dbl>
1 nonsmoker  7.27   7.35               1.23  1.49
2 smoker     6.68   7.03               1.60  1.85
3 <NA>       7.05   7.1                1.91  1.99
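
The <NA> row in the output above comes from babies whose mother’s smoking habit was not recorded. If you only want to compare smokers and nonsmokers, one option is to filter on habit as well, as sketched below. The name `weight_by_habit` is chosen for this example.

```r
library(tidyverse)
library(openintro)

weight_by_habit <- births14 |>
  filter(!is.na(weight), !is.na(habit)) |>   # keep rows with both values recorded
  group_by(habit) |>
  summarize(
    Mean = mean(weight),
    Median = median(weight),
    Standard_Deviation = sd(weight),
    IQR = IQR(weight)
  )

weight_by_habit
```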

Is there a relationship between the weight of a baby and length of the pregnancy?

births14 |>
  ggplot(aes(x = weeks, y = weight)) +    # 1
  geom_point() +                          # 2
  labs(                                   # 3
    title = "Weight vs Length of Pregnancy",
    x = "Length of Pregnancy (weeks)",
    y = "Weight (pounds)"
  )

1: We have to add a y aesthetic so that weight is mapped to the y-axis.
2: geom_point will make the scatter plot
3: You can reuse the labs function and change what is inside the various parentheses
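
The earlier questions paired a visualization with statistics. For two numeric variables, one statistic you might try is the correlation, which describes the strength and direction of a linear relationship. This is a sketch that goes a step beyond the slides. The names `weeks_weight_cor` and `Correlation` are chosen for this example.

```r
library(tidyverse)
library(openintro)

weeks_weight_cor <- births14 |>
  filter(!is.na(weeks), !is.na(weight)) |>    # cor() needs both values present
  summarize(Correlation = cor(weeks, weight))

weeks_weight_cor
```

Values near 1 or -1 suggest a strong linear relationship, while values near 0 suggest a weak one.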

Adding a trend line to a scatter plot

births14 |>
  ggplot(aes(x = weeks, y = weight)) +
  geom_point() +
  geom_smooth(method = 'lm') +            # 1
  labs(
    title = "Weight vs Length of Pregnancy",
    x = "Length of Pregnancy (weeks)",
    y = "Weight (pounds)"
  )

1: geom_smooth will add a trend line. You dictate the type of trend line by defining the method. In this case, I’ve chosen lm (a linear model, which draws a straight line) for purposes of instruction.
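
To see the numbers behind that straight line, you can fit the same kind of linear model yourself with R’s lm function. This is a minimal sketch, and the name `trend_model` is chosen for this example. By default, lm leaves out rows with a missing weeks or weight value.

```r
library(openintro)

# Fit weight as a linear function of weeks.
trend_model <- lm(weight ~ weeks, data = births14)

# The intercept and the slope. The slope is the estimated change in
# weight (pounds) for each additional week of pregnancy.
coef(trend_model)
```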

Bibliography

Rochette, R., & Lê, S. (2020). Grammar of graphics: gg basics. QCBS R Workshop Series. 1
