DATA151: Comparing Distributions

Learning Objectives

In this lesson students will compare distributions from multiple populations of interest using:

dplyr: group_by and summarise
side-by-side boxplots
side-by-side violin plots
dot plots
beeswarmplots
faceting

Step 0: Library Tidyverse

library(tidyverse)

Step 1: Load the Data

aqi<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/fireAQI_OrCoWa_10192022.csv", 
                   header=TRUE)

Step 2: Numeric Summaries

How does the air quality compare in Oregon, Washington, and Colorado?

## LET'S START WITH MEAN and STANDARD DEV
aqi%>%
  group_by(State)%>%
  summarise(n=n(), 
            avgAQI=mean(AQI, na.rm = TRUE), 
            sdAQI=sd(AQI, na.rm = TRUE), 
            NAs=sum(is.na(AQI)))

## # A tibble: 3 × 5
##   State          n avgAQI sdAQI   NAs
##   <fct>      <int>  <dbl> <dbl> <int>
## 1 Colorado      11   35.6  7.07     0
## 2 Oregon        47   81.0 57.9      3
## 3 Washington    52  117.  72.0      1

Now let’s add more metrics! Let’s use the quartiles. Cutting our data into quartiles means that we have split the data into four even parts.

\(Q_1\): The first quartile is the 25th percentile
\(Q_2\) (Median): The second quartile is also known as the median and is the 50th percentile
\(Q_3\): The third quartile is the 75th percentile

We can use the quantile() function to get any desired quantile from our data set. Quantile is synonymous with percentile. When we use the quantile function we specify the fraction of data below a desired cut off point.

# The first quartile
quantile(aqi$AQI, prob=.25, na.rm = TRUE)

## 25% 
##  39

We can do this for each state:

aqi%>%
  group_by(State)%>%
  summarise(n=n(), 
            min=min(AQI, na.rm = TRUE),
            Q1=quantile(AQI, prob=.25, na.rm = TRUE),
            med=median(AQI, na.rm = TRUE),
            Q3=quantile(AQI, prob=.75, na.rm = TRUE),
            max=max(AQI, na.rm = TRUE))

## # A tibble: 3 × 7
##   State          n   min    Q1   med    Q3   max
##   <fct>      <int> <int> <dbl> <dbl> <dbl> <int>
## 1 Colorado      11    19  33.5  36     38     49
## 2 Oregon        47    13  35.8  62.5  116.   245
## 3 Washington    52    15  67    84    170    359

What do you observe?

Step 3: Side-by-side Boxplots

# BOXPLOT

ggplot(aqi, aes(x=State, y=AQI, fill=State))+
  geom_boxplot()

## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

Step 4: Side-by-side Violin Plots

# Violin

ggplot(aqi, aes(x=State, y=AQI, fill=State))+
  geom_violin()

## Warning: Removed 4 rows containing non-finite values (stat_ydensity).

Step 5: Dot Plots

You could make a dot plot!

# Dot plot

ggplot(aqi, aes(x=State, y=AQI, fill=State))+
  geom_dotplot(binaxis='y')

## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.

## Warning: Removed 4 rows containing non-finite values (stat_bindot).

Step 6: Bee swarm Plots

#install.packages('ggbeeswarm')
library(ggbeeswarm)

ggplot(aqi, aes(x=State, y=AQI, color=State))+
  geom_beeswarm()

## Warning: Removed 4 rows containing missing values (position_beeswarm).

Step 7: Faceting

Faceting can be used with any type of plot to separate subgroups.

Faceting Histograms

Here is a histogram where the y-axis is count.

### Histogram (Counts)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_histogram()+
  facet_grid(State~.)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4 rows containing non-finite values (stat_bin).

We can change the y-axis to density (proportions).

### Histogram (Density)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_histogram(aes(y=..density..))+
  facet_grid(State~.)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4 rows containing non-finite values (stat_bin).

Since there is a spike in the Colorado distribution, this distorts the scale for the distributions for Oregon and Washington. We can fix this by allowing the y-axis to freely vary.

### Histogram (Density - Free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_histogram(aes(y=..density..))+
  facet_grid(State~., scales = "free_y")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4 rows containing non-finite values (stat_bin).

Faceting Density Plots

We can also make a density plot!

### Density Plot (free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_density()+
  facet_grid(State~., scales = "free_y")

## Warning: Removed 4 rows containing non-finite values (stat_density).