Learning Objectives

In this lesson students will learn to apply categorical data analysis methods to data sets with fundamentally different structures.

  • Wrangle different types of data using base R and Tidyverse functions in R.
  • Implement exploratory data analysis techniques in R, including the following:
    • Data visualization using ggplot
    • Numerical data analysis (measures of central tendency, measures of precision & uncertainty, quantiles)
    • Categorical data analysis (proportions)
  • Perform a simple linear regression analysis and interpret the results in context of the problem

The tidyverse package is needed for these examples

library(tidyverse)

Palmer Penguins

The Palmer penguins dataset is a collection of body measurements for three penguin species that live on islands in the Palmer Archipelago in Antarctica:

  • Species: Adélie, Chinstrap, and Gentoo
  • Measurements: Bill length, bill depth, flipper length, body mass, island, sex, and year
  • Dataset creator: Dr. Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) team
  • Purpose: To study the relationship between Antarctic penguins’ foraging behavior and environmental variability

STEP 1: R Package

Install the R package and call it into your R environment.

## INSTALL PACKAGE
#install.packages("palmerpenguins")

## LOAD THE LIBRARY
library(palmerpenguins)

## DATA
data("penguins")

STEP 2: Look at the Data

Check out the structure of the data and take a look at it.

## STRUCTURE 
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## HEAD, TAIL, OR VIEW
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Questions of interest:

  • How many penguins are in these data?
  • What variables are included? Are they numeric or categorical?
SPACE FOR ANSWERS:

PART A: EDA for Categorical Variables

STEP 3: One-way Tables and Bar Graphs

Make a table to show how many penguins there are of each species.

## ONE WAY TABLE FOR SPECIES
table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

What are your observations?

SPACE FOR ANSWERS:

Now make a bar Graph using ggplot2.

## CUSTOM COLOR PALETTE
penCol<-c("darkorange","purple","cyan4")

## BAR GRAPH
ggplot(data = penguins, aes(x=species, fill=species))+
  geom_bar()+
  scale_fill_manual(values=penCol)

STEP 4: Two-way Tables, Distributions, and Bar Graphs

Make a two-way table to show how many penguins there are of each species on each island.

## TWO WAY TABLE FOR SPECIES and ISLAND
tab2<-table(penguins$species, penguins$island)
tab2
##            
##             Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0

Create a stacked bar graph to to show the total number of penguins from each island represented in the data set. Fill by species.

## STACKED BAR GRAPH
ggplot(data = penguins, aes(x=island, fill=species))+
  geom_bar()+
  scale_fill_manual(values=penCol)

Now use the table to make conditional distributions for

    1. Distribution of location, given species
    1. Distribution of species, given location
## DISTRIBUTION OF LOCATION GIVEN SPECIES
prop.table(tab2, 1)
##            
##                Biscoe     Dream Torgersen
##   Adelie    0.2894737 0.3684211 0.3421053
##   Chinstrap 0.0000000 1.0000000 0.0000000
##   Gentoo    1.0000000 0.0000000 0.0000000
## DISTRIBUTION OF SPECIES GIVEN LOCATION
prop.table(tab2, 2)
##            
##                Biscoe     Dream Torgersen
##   Adelie    0.2619048 0.4516129 1.0000000
##   Chinstrap 0.0000000 0.5483871 0.0000000
##   Gentoo    0.7380952 0.0000000 0.0000000

Visualize the conditional distribution for species, given location

## FILLED BAR GRAPH
ggplot(data = penguins, aes(x=island, fill=species))+
  geom_bar(position = "fill")+
  scale_fill_manual(values=penCol)

Observations:

Can this possibly tell you anything about how well different species have adapted to different environments or cohabitating?

SPACE FOR ANSWERS:

PART B: EDA for Numeric Variables

STEP 5: Numeric Summaries

Calculate the means and standard deviations body mass and flipper length for each species.

## MEANS AND ST DEV
## BODY MASS AND FLIPPER LENGTH
## BY SPECIES
penguins%>%
  group_by(species)%>%
  summarise(n=n(),
            avg_bodyMass=mean(body_mass_g, na.rm=TRUE), 
            sd_bodyMass=sd(body_mass_g, na.rm=TRUE), 
            avg_flipperLen=mean(flipper_length_mm, na.rm=TRUE), 
            sd_flipperLen=sd(flipper_length_mm, na.rm=TRUE))
## # A tibble: 3 × 6
##   species       n avg_bodyMass sd_bodyMass avg_flipperLen sd_flipperLen
##   <fct>     <int>        <dbl>       <dbl>          <dbl>         <dbl>
## 1 Adelie      152        3701.        459.           190.          6.54
## 2 Chinstrap    68        3733.        384.           196.          7.13
## 3 Gentoo      124        5076.        504.           217.          6.48

STEP 6: Boxplots, Histograms, and Density Plots

Create plots to compare flipper length by species.

## BOXPLOT
ggplot(data=penguins, aes(x=species, y=flipper_length_mm, fill=species))+
  geom_boxplot()+
  scale_fill_manual(values=penCol)+
  theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

## HISTOGRAM (OVERLAPPING)
ggplot(data=penguins, aes(x=flipper_length_mm, fill=species))+
  geom_histogram(alpha = 0.5, 
                 position = "identity")+
  scale_fill_manual(values=penCol)+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

## DENSITY PLOT
ggplot(data=penguins, aes(x=flipper_length_mm, fill=species))+
  geom_density(alpha=0.5)+
  scale_fill_manual(values=penCol)+
  theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_density).

PART C: EDA for Relationships Between Two Numeric Variables

Definition:

Adaptive radiation is a process in evolutionary biology where a single species or small group of species rapidly diversifies into many new species. This process is characterized by the development of ecological and morphological diversity.

Let’s explore if the three species have different beak (bill) morphologies.

STEP 7: Scatter Plot

Make a scatter plot for the relationship between bill length (x-axis) vs bill depth (y-axis).

## SCATTERPLOT 
## X: BILL LENGTH
## Y: BILL DEPTH

ggplot(data = penguins, aes(x=bill_length_mm, y=bill_depth_mm))+
  geom_point()+
  xlab("Bill length (mm)")+
  ylab("Bill depth (mm)")
## Warning: Removed 2 rows containing missing values (geom_point).

Observations:

SPACE FOR ANSWERS:

STEP 8: Exploring Subroups

## SCATTERPLOT WITH SUBGROUPS 
## X: BILL LENGTH
## Y: BILL DEPTH
## COLOR: Species

ggplot(data = penguins, aes(x=bill_length_mm, y=bill_depth_mm, color=species))+
  geom_point()+
  scale_color_manual(values=penCol)+
  xlab("Bill length (mm)")+
  ylab("Bill depth (mm)")
## Warning: Removed 2 rows containing missing values (geom_point).

PART D: Modeling the Data

STEP 9: Adding Smooths

## GEOM_SMOOTH
ggplot(data = penguins, aes(x=bill_length_mm, y=bill_depth_mm, color=species))+
  geom_point()+
  scale_color_manual(values=penCol)+
  xlab("Bill length (mm)")+
  ylab("Bill depth (mm)")+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

STEP 10: Create a Linear Model with Interactions

## LINEAR MOD WITH INTERACTIONS
mod<-lm(bill_depth_mm~bill_length_mm*species, data=penguins)
summary(mod)
## 
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm * species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6574 -0.6675 -0.0524  0.5383  3.5032 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     11.40912    1.13812  10.025  < 2e-16 ***
## bill_length_mm                   0.17883    0.02927   6.110 2.76e-09 ***
## speciesChinstrap                -3.83998    2.05398  -1.870 0.062419 .  
## speciesGentoo                   -6.15812    1.75451  -3.510 0.000509 ***
## bill_length_mm:speciesChinstrap  0.04338    0.04558   0.952 0.341895    
## bill_length_mm:speciesGentoo     0.02601    0.04054   0.642 0.521590    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9548 on 336 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7697, Adjusted R-squared:  0.7662 
## F-statistic: 224.5 on 5 and 336 DF,  p-value: < 2.2e-16