In this lesson students will compare distributions from multiple populations of interest using:
dplyr
: group_by
and summarise
boxplots
violin
plotsbeeswarm
plotslibrary(tidyverse)
aqi<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/fireAQI_OrCoWa_10192022.csv",
header=TRUE)
How does the air quality compare in Oregon, Washington, and Colorado?
## LET'S START WITH MEAN and STANDARD DEV
aqi%>%
group_by(State)%>%
summarise(n=n(),
avgAQI=mean(AQI, na.rm = TRUE),
sdAQI=sd(AQI, na.rm = TRUE),
NAs=sum(is.na(AQI)))
## # A tibble: 3 × 5
## State n avgAQI sdAQI NAs
## <fct> <int> <dbl> <dbl> <int>
## 1 Colorado 11 35.6 7.07 0
## 2 Oregon 47 81.0 57.9 3
## 3 Washington 52 117. 72.0 1
Now let’s add more metrics! Let’s use the quartiles. Cutting our data into quartiles means that we have split the data into four even parts.
We can use the quantile()
function to get any desired quantile from our data set. Quantile is synonymous with percentile. When we use the quantile function we specify the fraction of data below a desired cut off point.
# The first quartile
quantile(aqi$AQI, prob=.25, na.rm = TRUE)
## 25%
## 39
We can do this for each state:
aqi%>%
group_by(State)%>%
summarise(n=n(),
min=min(AQI, na.rm = TRUE),
Q1=quantile(AQI, prob=.25, na.rm = TRUE),
med=median(AQI, na.rm = TRUE),
Q3=quantile(AQI, prob=.75, na.rm = TRUE),
max=max(AQI, na.rm = TRUE))
## # A tibble: 3 × 7
## State n min Q1 med Q3 max
## <fct> <int> <int> <dbl> <dbl> <dbl> <int>
## 1 Colorado 11 19 33.5 36 38 49
## 2 Oregon 47 13 35.8 62.5 116. 245
## 3 Washington 52 15 67 84 170 359
What do you observe?
# BOXPLOT
ggplot(aqi, aes(x=State, y=AQI, fill=State))+
geom_boxplot()
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
# Violin
ggplot(aqi, aes(x=State, y=AQI, fill=State))+
geom_violin()
## Warning: Removed 4 rows containing non-finite values (stat_ydensity).
You could make a dot plot!
# Dot plot
ggplot(aqi, aes(x=State, y=AQI, fill=State))+
geom_dotplot(binaxis='y')
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bindot).
#install.packages('ggbeeswarm')
library(ggbeeswarm)
ggplot(aqi, aes(x=State, y=AQI, color=State))+
geom_beeswarm()
## Warning: Removed 4 rows containing missing values (position_beeswarm).
Faceting can be used with any type of plot to separate subgroups.
Here is a histogram where the y-axis is count.
### Histogram (Counts)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_histogram()+
facet_grid(State~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
We can change the y-axis to density (proportions).
### Histogram (Density)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_histogram(aes(y=..density..))+
facet_grid(State~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
Since there is a spike in the Colorado distribution, this distorts the scale for the distributions for Oregon and Washington. We can fix this by allowing the y-axis to freely vary.
### Histogram (Density - Free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_histogram(aes(y=..density..))+
facet_grid(State~., scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
We can also make a density plot!
### Density Plot (free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_density()+
facet_grid(State~., scales = "free_y")
## Warning: Removed 4 rows containing non-finite values (stat_density).