In this lesson, students will learn to work with numeric data to create graphics and numerical summaries.
These data were reported on AirNow on October 19, 2022 for the states of Oregon, Washington, and Colorado.
https://www.airnow.gov/state/?name=oregon
library(tidyverse)
aqi<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/fireAQI_OrCoWa_10192022.csv",
              header=TRUE)
orAQI<-aqi%>%
filter(State=="Oregon")
ggplot(orAQI, aes(x=AQI))+
geom_histogram(bins=10)
## Warning: Removed 3 rows containing non-finite values (stat_bin).
# Try playing with changing the number of bins
When we look at a histogram, we want to describe the following characteristics:
Shape
Outliers
ggplot(orAQI, aes(x=AQI))+
geom_density()
## Warning: Removed 3 rows containing non-finite values (stat_density).
ggplot(orAQI, aes(x=AQI))+
geom_boxplot()
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
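Quantiles are cut points that divide an ordered data set into equal-sized parts. The quartiles are the three cut points that split the ordered data into four parts, each holding about a quarter of the observations.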
## summary() gives the five-number summary, plus the mean and the number of NA's
summary(orAQI$AQI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.00   35.75   62.50   81.02  115.50  245.00       3
## If we only want a given quantile, use quantile()
quantile(orAQI$AQI, 0.25, na.rm = TRUE)
## 25%
## 35.75
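To request several quantiles at once, `quantile()` accepts a vector of probabilities. The sketch below uses a small made-up vector (not the AirNow data) with one missing value, and relies on R's default interpolation method (`type = 7`). The name `x` is just a placeholder.

```r
# Hypothetical toy values, not taken from the AQI data set
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)

# Quartiles: the 25th, 50th, and 75th percentiles.
# na.rm = TRUE drops the missing value before computing.
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
quartiles
# 25% 50% 75%
#   3   5   7
```

Without `na.rm = TRUE`, `quantile()` stops with an error when the vector contains missing values.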
q1<-35.75
q3<-115.50
iqr<-q3-q1
iqr
## [1] 79.75
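Typing the quartiles in by hand works, but the numbers have to be updated whenever the data change. As a rough alternative, the quartiles can be pulled out of `quantile()` directly, or the built-in `IQR()` function can be used. The vector `x` below is made up for illustration.

```r
# Hypothetical toy values, not taken from the AQI data set
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)

# Compute Q1 and Q3 from the data instead of copying them by hand
q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_manual <- unname(q[2] - q[1])

# Built-in shortcut that should give the same value
iqr_builtin <- IQR(x, na.rm = TRUE)

iqr_manual
iqr_builtin
```

Applied to the lesson's data, the same idea would be `IQR(orAQI$AQI, na.rm = TRUE)`.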
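We create “fences” to highlight possible outliers in our data.
A data point is highlighted as a possible outlier if it falls below the lower fence, \(Q_1 - 1.5 \times IQR\), or above the upper fence, \(Q_3 + 1.5 \times IQR\).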
## upper fence
upper<-q3+1.5*iqr
upper
## [1] 235.125
## lower fence
lower<-q1-1.5*iqr
lower
## [1] -83.875
Can we find any outliers?
orAQI%>%
filter(AQI < lower | AQI > upper)
##    State     City AQI          Level
## 1 Oregon Oakridge 245 Very Unhealthy
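The fence calculation can be wrapped in a small function so it can be reused on other variables. This is only a sketch: `find_fence_outliers()` is a hypothetical helper name (not part of any package), the multiplier `k = 1.5` follows the rule used above, and the `toy` values are made up.

```r
# Hypothetical helper: returns the values that fall outside the 1.5 * IQR fences
find_fence_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr   # lower fence
  upper <- q[2] + k * iqr   # upper fence
  # Keep non-missing values that are beyond either fence
  x[!is.na(x) & (x < lower | x > upper)]
}

# Made-up values with one unusually large point and one missing value
toy <- c(10, 12, 13, 15, 16, 18, 20, 95, NA)
find_fence_outliers(toy)
# [1] 95
```

For the toy vector the quartiles are 12.75 and 18.5, so the fences are 4.125 and 27.125, and only 95 lies outside them.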
The average (or mean) is the most commonly used metric for center.
Math notation:
\[\bar{x}=\frac{1}{n} \sum_{i=1}^n x_i\]
Where the data points are denoted with \(x_i\) and \(n\) indicates the sample size.
mean(orAQI$AQI, na.rm=TRUE)
## [1] 81.02273
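To connect the formula to the code, here is a minimal sketch that computes the mean "by hand" and compares it with `mean()`. The vector `x` is made up for illustration, and it includes one missing value to show why `na.rm = TRUE` is needed.

```r
# Hypothetical toy values, not taken from the AQI data set
x <- c(4, 8, 15, 16, 23, 42, NA)

# Drop missing values, then apply the formula: sum of x_i divided by n
x_obs <- x[!is.na(x)]
n <- length(x_obs)
mean_manual <- sum(x_obs) / n

# Built-in version; na.rm = TRUE removes the NA first
mean_builtin <- mean(x, na.rm = TRUE)

mean_manual    # 18
mean_builtin   # 18
mean(x)        # NA, because one value is missing
```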
The standard deviation is the most common metric for spread. It is in the same units as the data and gives a rough sense of how far data points typically are from the sample mean.
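Math notation:
\[s=\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}\]
Where \(\bar{x}\) is the sample mean, the data points are denoted with \(x_i\), and \(n\) indicates the sample size.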
sd(orAQI$AQI, na.rm=TRUE)
## [1] 57.85969
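As a check on what `sd()` is doing, the sketch below computes the sample standard deviation step by step, assuming the usual definition with \(n-1\) in the denominator (which is what R's `sd()` uses). The vector `x` is made up for illustration.

```r
# Hypothetical toy values, not taken from the AQI data set
x <- c(4, 8, 15, 16, 23, 42, NA)
x_obs <- x[!is.na(x)]
n <- length(x_obs)

# Squared distances of each point from the sample mean
sq_dev <- (x_obs - mean(x_obs))^2

# Sample variance divides by n - 1; the standard deviation is its square root
var_manual <- sum(sq_dev) / (n - 1)
sd_manual <- sqrt(var_manual)

sd_manual
sd(x, na.rm = TRUE)   # should match sd_manual
```

Here the squared deviations add up to 910, so the variance is 910 / 5 = 182 and the standard deviation is about 13.5.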
How do we know when to use mean or median as a measure of center?
There is a relationship between the mean, the median, and the shape of the data.
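IF THE DATA ARE SYMMETRIC (or approximately symmetric): the mean and the median are approximately equal, so the mean (paired with the standard deviation) is a reasonable summary.
IF THE DATA ARE SKEWED: the mean is pulled toward the long tail while the median is not, so the median (paired with the IQR) is usually the better summary. In the Oregon data the mean (81.02) is well above the median (62.50), which is consistent with a right-skewed distribution.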
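A small made-up example illustrates how a single extreme value moves the mean but leaves the median alone. The vectors `symmetric` and `skewed` are hypothetical and differ only in their largest value.

```r
# Hypothetical toy values, not taken from the AQI data set
symmetric <- c(1, 2, 3, 4, 5)
skewed    <- c(1, 2, 3, 4, 50)   # same data, but the largest value is extreme

mean(symmetric)     # 3
median(symmetric)   # 3

mean(skewed)        # 12, pulled toward the extreme value
median(skewed)      # 3, unchanged
```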
How does the air quality compare in Oregon, Washington, and Colorado?
aqi%>%
  group_by(State)%>%
  summarise(n=n(),
            medAQI=median(AQI, na.rm = TRUE),
            avgAQI=mean(AQI, na.rm = TRUE),
            sdAQI=sd(AQI, na.rm = TRUE))
## # A tibble: 3 × 5
##   State          n medAQI avgAQI sdAQI
##   <fct>      <int>  <dbl>  <dbl> <dbl>
## 1 Colorado      11   36     35.6  7.07
## 2 Oregon        47   62.5   81.0 57.9 
## 3 Washington    52   84    117.  72.0 
What do you observe?
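One detail to keep in mind when reading the table: `n()` counts rows, including rows where AQI is missing (Oregon has 47 rows, 3 of which are NA). The sketch below shows the difference between counting rows and counting non-missing readings. It uses base R and a made-up miniature data frame named `toy` so that it runs without any packages.

```r
# Made-up miniature data frame (hypothetical values, not the AirNow data)
toy <- data.frame(
  State = c("A", "A", "A", "B", "B"),
  AQI   = c(40, NA, 60, 30, 50)
)

# Rows per group, missing values included (analogous to n())
n_rows <- tapply(toy$AQI, toy$State, length)

# Non-missing AQI readings per group
n_obs <- tapply(toy$AQI, toy$State, function(v) sum(!is.na(v)))

n_rows
n_obs
```

In the dplyr pipeline above, the analogous addition would be a line such as `nObs = sum(!is.na(AQI))` inside `summarise()`, where `nObs` is just a suggested column name.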
Here is a histogram where the y-axis is count.
### Histogram (Counts)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_histogram()+
facet_grid(State~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
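We can change the y-axis to density, which rescales the bars so that their total area is 1 within each state. This makes groups with different numbers of observations easier to compare.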
### Histogram (Density)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_histogram(aes(y=after_stat(density)))+
facet_grid(State~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
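To see what the density scale means, the sketch below uses base R's `hist()` with `plot = FALSE` to get the bin counts and densities without drawing anything. The values in `x` and the bin breaks are made up for illustration. Each density is the count divided by (number of observations × bin width), so the bar areas add up to 1.

```r
# Hypothetical toy values, not taken from the AQI data set
x <- c(13, 20, 25, 31, 36, 40, 52, 61, 75, 90)

# Five bins of width 20; plot = FALSE returns the numbers only
h <- hist(x, breaks = seq(0, 100, by = 20), plot = FALSE)

h$counts    # observations per bin
h$density   # counts / (length(x) * bin width)

# Total area of the bars: height (density) times width, summed over bins
total_area <- sum(h$density * diff(h$breaks))
total_area  # 1
```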
The spike in the Colorado distribution stretches the shared y-axis, which compresses the Oregon and Washington panels and makes them hard to read. We can fix this by allowing the y-axis to vary freely from panel to panel.
### Histogram (Density - Free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_histogram(aes(y=after_stat(density)))+
facet_grid(State~., scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
We can also make a density plot!
### Density Plot (free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
geom_density()+
facet_grid(State~., scales = "free_y")
## Warning: Removed 4 rows containing non-finite values (stat_density).
But my favorite is a side-by-side boxplot.
# BOXPLOT
ggplot(aqi, aes(x=State, y=AQI, fill=State))+
geom_boxplot()
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).