In this lesson students will learn to apply categorical data analysis methods to data sets with fundamentally different structures.
The tidyverse
package is needed for these examples
library(tidyverse)
The Palmer penguins dataset is a collection of body measurements for three penguin species that live on islands in the Palmer Archipelago in Antarctica:
Install the R package and call it into your R environment.
## INSTALL PACKAGE
#install.packages("palmerpenguins")
## LOAD THE LIBRARY
library(palmerpenguins)
## DATA
data("penguins")
Check out the structure of the data and take a look at it.
## STRUCTURE
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## HEAD, TAIL, OR VIEW
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
Questions of interest:
SPACE FOR ANSWERS:
Make a table to show how many penguins there are of each species.
## ONE WAY TABLE FOR SPECIES
table(penguins$species)
##
## Adelie Chinstrap Gentoo
## 152 68 124
What are your observations?
SPACE FOR ANSWERS:
Now make a bar Graph using ggplot2.
## CUSTOM COLOR PALETTE
penCol<-c("darkorange","purple","cyan4")
## BAR GRAPH
ggplot(data = penguins, aes(x=species, fill=species))+
geom_bar()+
scale_fill_manual(values=penCol)
Make a two-way table to show how many penguins there are of each species on each island.
## TWO WAY TABLE FOR SPECIES and ISLAND
tab2<-table(penguins$species, penguins$island)
tab2
##
## Biscoe Dream Torgersen
## Adelie 44 56 52
## Chinstrap 0 68 0
## Gentoo 124 0 0
Create a stacked bar graph to to show the total number of penguins from each island represented in the data set. Fill by species.
## STACKED BAR GRAPH
ggplot(data = penguins, aes(x=island, fill=species))+
geom_bar()+
scale_fill_manual(values=penCol)
Now use the table to make conditional distributions for
## DISTRIBUTION OF LOCATION GIVEN SPECIES
prop.table(tab2, 1)
##
## Biscoe Dream Torgersen
## Adelie 0.2894737 0.3684211 0.3421053
## Chinstrap 0.0000000 1.0000000 0.0000000
## Gentoo 1.0000000 0.0000000 0.0000000
## DISTRIBUTION OF SPECIES GIVEN LOCATION
prop.table(tab2, 2)
##
## Biscoe Dream Torgersen
## Adelie 0.2619048 0.4516129 1.0000000
## Chinstrap 0.0000000 0.5483871 0.0000000
## Gentoo 0.7380952 0.0000000 0.0000000
Visualize the conditional distribution for species, given location
## FILLED BAR GRAPH
ggplot(data = penguins, aes(x=island, fill=species))+
geom_bar(position = "fill")+
scale_fill_manual(values=penCol)
Observations:
Can this possibly tell you anything about how well different species have adapted to different environments or cohabitating?
SPACE FOR ANSWERS:
Calculate the means and standard deviations body mass and flipper length for each species.
## MEANS AND ST DEV
## BODY MASS AND FLIPPER LENGTH
## BY SPECIES
penguins%>%
group_by(species)%>%
summarise(n=n(),
avg_bodyMass=mean(body_mass_g, na.rm=TRUE),
sd_bodyMass=sd(body_mass_g, na.rm=TRUE),
avg_flipperLen=mean(flipper_length_mm, na.rm=TRUE),
sd_flipperLen=sd(flipper_length_mm, na.rm=TRUE))
## # A tibble: 3 × 6
## species n avg_bodyMass sd_bodyMass avg_flipperLen sd_flipperLen
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 152 3701. 459. 190. 6.54
## 2 Chinstrap 68 3733. 384. 196. 7.13
## 3 Gentoo 124 5076. 504. 217. 6.48
Create plots to compare flipper length by species.
## BOXPLOT
ggplot(data=penguins, aes(x=species, y=flipper_length_mm, fill=species))+
geom_boxplot()+
scale_fill_manual(values=penCol)+
theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## HISTOGRAM (OVERLAPPING)
ggplot(data=penguins, aes(x=flipper_length_mm, fill=species))+
geom_histogram(alpha = 0.5,
position = "identity")+
scale_fill_manual(values=penCol)+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## DENSITY PLOT
ggplot(data=penguins, aes(x=flipper_length_mm, fill=species))+
geom_density(alpha=0.5)+
scale_fill_manual(values=penCol)+
theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_density).
Definition:
Adaptive radiation is a process in evolutionary biology where a single species or small group of species rapidly diversifies into many new species. This process is characterized by the development of ecological and morphological diversity.
Let’s explore if the three species have different beak (bill) morphologies.
Make a scatter plot for the relationship between bill length (x-axis) vs bill depth (y-axis).
## SCATTERPLOT
## X: BILL LENGTH
## Y: BILL DEPTH
ggplot(data = penguins, aes(x=bill_length_mm, y=bill_depth_mm))+
geom_point()+
xlab("Bill length (mm)")+
ylab("Bill depth (mm)")
## Warning: Removed 2 rows containing missing values (geom_point).
Observations:
SPACE FOR ANSWERS:
## SCATTERPLOT WITH SUBGROUPS
## X: BILL LENGTH
## Y: BILL DEPTH
## COLOR: Species
ggplot(data = penguins, aes(x=bill_length_mm, y=bill_depth_mm, color=species))+
geom_point()+
scale_color_manual(values=penCol)+
xlab("Bill length (mm)")+
ylab("Bill depth (mm)")
## Warning: Removed 2 rows containing missing values (geom_point).
## GEOM_SMOOTH
ggplot(data = penguins, aes(x=bill_length_mm, y=bill_depth_mm, color=species))+
geom_point()+
scale_color_manual(values=penCol)+
xlab("Bill length (mm)")+
ylab("Bill depth (mm)")+
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
## LINEAR MOD WITH INTERACTIONS
mod<-lm(bill_depth_mm~bill_length_mm*species, data=penguins)
summary(mod)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm * species, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6574 -0.6675 -0.0524 0.5383 3.5032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.40912 1.13812 10.025 < 2e-16 ***
## bill_length_mm 0.17883 0.02927 6.110 2.76e-09 ***
## speciesChinstrap -3.83998 2.05398 -1.870 0.062419 .
## speciesGentoo -6.15812 1.75451 -3.510 0.000509 ***
## bill_length_mm:speciesChinstrap 0.04338 0.04558 0.952 0.341895
## bill_length_mm:speciesGentoo 0.02601 0.04054 0.642 0.521590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9548 on 336 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7697, Adjusted R-squared: 0.7662
## F-statistic: 224.5 on 5 and 336 DF, p-value: < 2.2e-16