Learning Objectives

In this lesson students will learn to work with numeric data to create graphics and summaries.

Example: Oregon Air Quality Index (AQI)

These data were reported on AirNow on October 19, 2022 for the states of Oregon, Washington, and Colorado.

https://www.airnow.gov/state/?name=oregon

Step 0: Load the Tidyverse

library(tidyverse)

Step 1: Load the Data

aqi<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA151/main/Data/fireAQI_OrCoWa_10192022.csv", 
                   header=TRUE)

orAQI<-aqi%>%
  filter(State=="Oregon")

Step 2: Histogram

ggplot(orAQI, aes(x=AQI))+
  geom_histogram(bins=10)
## Warning: Removed 3 rows containing non-finite values (stat_bin).

# Try changing the number of bins

When we look at a histogram, we want to describe the following characteristics:

  • Shape
    • Skewness vs Symmetry
    • Modality: How many peaks
  • Center
    • Described by mean or median
  • Spread
    • Described by standard deviation or IQR
  • Outliers
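
The center and spread measures listed above map directly onto base R functions. Here is a minimal sketch on a made-up vector (the values are hypothetical, not taken from the AirNow file). It includes one NA because the Oregon data also contain missing values, which is why the plots warn about removed rows.

```r
# Hypothetical AQI-like readings with one missing value (not real AirNow data)
x <- c(20, 25, 30, NA, 35, 40, 45, 50, 120)

# Without na.rm = TRUE, a single NA makes the result NA
mean(x)

# Center: mean and median, ignoring the missing value
center <- c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))

# Spread: standard deviation and interquartile range
spread <- c(sd = sd(x, na.rm = TRUE), iqr = IQR(x, na.rm = TRUE))

center
spread
```

In this toy vector the single large reading (120) pulls the mean above the median, which hints at right skew.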

Step 3: Density Plot

ggplot(orAQI, aes(x=AQI))+
  geom_density()
## Warning: Removed 3 rows containing non-finite values (stat_density).

Step 4: Box Plot

ggplot(orAQI, aes(x=AQI))+
  geom_boxplot()
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

Step 5: Quantiles

Quantiles are cut points that divide an ordered data set into equal-sized parts. The quartiles (the 25%, 50%, and 75% quantiles) split the ordered data into four equal parts.

## summary() gives the five-number summary, plus the mean and a count of NAs
summary(orAQI$AQI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.00   35.75   62.50   81.02  115.50  245.00       3
## If we only want one specific quantile, use quantile()
quantile(orAQI$AQI, 0.25, na.rm = TRUE)
##   25% 
## 35.75
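
To request several quantiles at once, `quantile()` accepts a vector of probabilities. The sketch below uses a small made-up vector (hypothetical values, not the Oregon data) and compares the result with base R's `fivenum()`.

```r
# Hypothetical AQI-like readings (not real AirNow data)
x <- c(13, 20, 36, 50, 63, 80, 115, 150, 245)

# All three quartiles in one call
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q

# Five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
five
```

For this vector the hinges from `fivenum()` match the quartiles from `quantile()`; in general the two can differ slightly because they interpolate differently.
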

IQR (Interquartile range)

q1<-35.75
q3<-115.50
iqr<-q3-q1
iqr
## [1] 79.75
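
Typing the quartiles by hand works, but the IQR can also be computed straight from the data. This is a sketch on a made-up vector (hypothetical values); with the lesson's data, the analogous call would presumably be `IQR(orAQI$AQI, na.rm = TRUE)`.

```r
# Hypothetical AQI-like readings (not real AirNow data)
x <- c(13, 20, 36, 50, 63, 80, 115, 150, 245)

# Manual version: third quartile minus first quartile
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr_manual <- q3 - q1

# Built-in version
iqr_builtin <- IQR(x)

iqr_manual
iqr_builtin
```
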

Defining Outliers

We create “fences” to highlight possible outliers in our data.

A data point is highlighted as an outlier if

  • it is greater than \(Q_3+1.5\times IQR\)
  • it is less than \(Q_1-1.5\times IQR\)

## upper fence
upper<-q3+1.5*iqr
upper
## [1] 235.125
## lower fence
lower<-q1-1.5*iqr
lower
## [1] -83.875

Can we find any outliers?

orAQI%>%
  filter(AQI < lower | AQI > upper)
##    State     City AQI          Level
## 1 Oregon Oakridge 245 Very Unhealthy
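
The fence calculation can be wrapped in a small helper so it works on any numeric vector. The function name `find_outliers` and the test values below are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: flag values outside the 1.5 * IQR fences
find_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # Drop NAs before comparing so they are not reported as outliers
  list(lower = lower,
       upper = upper,
       outliers = x[!is.na(x) & (x < lower | x > upper)])
}

# Hypothetical AQI-like readings with one missing value
x <- c(13, 20, 36, 50, 63, 80, 115, 150, 245, NA)
res <- find_outliers(x)
res
```

A negative lower fence, as in this toy example and in the Oregon data, simply means no value can be flagged as a low outlier, since AQI cannot be negative.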

Step 7: Critical Thinking about Metrics of Center

How do we know when to use mean or median as a measure of center?

There is a relationship between the mean, the median, and the shape of the data.

  • IF THE DATA ARE SYMMETRIC (or approximately symmetric):
    • In perfectly symmetric data, the mean and the median are equal.
    • In data that are approximately symmetric, the mean and the median are close to the same value.
  • IF THE DATA ARE SKEWED:
    • The mean is heavily influenced by very large (or small) values relative to the rest of the data, so it is usually more appropriate to use the median when describing the center of skewed data. For example, Oregon's mean AQI (81.02) sits well above its median (62.50) because a few high readings pull the mean upward.
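
A quick way to see this relationship is to change a single value in a symmetric data set and watch what happens to each measure. The two vectors below are made up for illustration.

```r
# Symmetric toy data: mean and median agree
symmetric <- c(10, 20, 30, 40, 50)
sym_stats <- c(mean = mean(symmetric), median = median(symmetric))

# Same data, but the largest value is replaced by an extreme one
skewed <- c(10, 20, 30, 40, 500)
skew_stats <- c(mean = mean(skewed), median = median(skewed))

sym_stats
skew_stats
```

Only one value changed, yet the mean moved from 30 to 120 while the median stayed at 30.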

Step 8: Comparing Groups

Numeric Summaries

How does the air quality compare in Oregon, Washington, and Colorado?

aqi%>%
  group_by(State)%>%
  summarise(n=n(), 
            medAQI=median(AQI, na.rm = TRUE),
            avgAQI=mean(AQI, na.rm = TRUE), 
            sdAQI=sd(AQI, na.rm = TRUE))
## # A tibble: 3 × 5
##   State          n medAQI avgAQI sdAQI
##   <fct>      <int>  <dbl>  <dbl> <dbl>
## 1 Colorado      11   36     35.6  7.07
## 2 Oregon        47   62.5   81.0 57.9 
## 3 Washington    52   84    117.  72.0

What do you observe?

Graphics

Here is a histogram where the y-axis is count.

### Histogram (Counts)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_histogram()+
  facet_grid(State~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).

We can change the y-axis to density (proportions).

### Histogram (Density)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_histogram(aes(y=after_stat(density)))+  # after_stat() replaces the deprecated ..density..
  facet_grid(State~.)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).
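
To see what the density scale means, it helps to compute it by hand: each bar's height is its count divided by (total count × bin width), so the bar areas sum to 1. The sketch below uses base R's `hist()` with `plot = FALSE` on made-up values; the vector and break points are assumptions chosen for illustration.

```r
# Hypothetical AQI-like readings (not real AirNow data)
x <- c(12, 18, 25, 33, 38, 41, 47, 55, 62, 90)

# Five bins of width 20
breaks <- seq(0, 100, by = 20)
h <- hist(x, breaks = breaks, plot = FALSE)

# Density by hand: count / (n * bin width)
binwidth <- diff(breaks)
density_manual <- h$counts / (sum(h$counts) * binwidth)

h$counts
density_manual

# Bar areas (height * width) add up to 1
total_area <- sum(h$density * binwidth)
total_area
```

Because each group's densities are scaled by that group's own size, a state with few, tightly clustered readings can show much taller bars than a state with many spread-out readings.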

Because Colorado's AQI values fall in a narrow range, its density spikes, and the shared y-axis squashes the distributions for Oregon and Washington. We can fix this by letting the y-axis vary freely in each panel.

### Histogram (Density - Free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_histogram(aes(y=after_stat(density)))+  # after_stat() replaces the deprecated ..density..
  facet_grid(State~., scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).

We can also make a density plot!

### Density Plot (free_y)
ggplot(aqi, aes(x=AQI, fill=State))+
  geom_density()+
  facet_grid(State~., scales = "free_y")
## Warning: Removed 4 rows containing non-finite values (stat_density).

But my favorite is a side-by-side boxplot.

# BOXPLOT

ggplot(aqi, aes(x=State, y=AQI, fill=State))+
  geom_boxplot()
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
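
To read the boxes numerically, we can compute the values a boxplot draws for each group. The sketch below uses base R's `boxplot.stats()` on a made-up two-group data frame (the name `aqi_toy` and its values are hypothetical). Note that `boxplot.stats()` uses hinges, which can differ slightly from `quantile()` results.

```r
# Hypothetical two-group data (not real AirNow data)
aqi_toy <- data.frame(
  State = rep(c("A", "B"), each = 9),
  AQI = c(13, 20, 36, 50, 63, 80, 115, 150, 245,
          30, 32, 33, 35, 36, 37, 38, 40, 41)
)

# Split AQI by group, then get whisker ends, hinges, and median for each
by_state <- split(aqi_toy$AQI, aqi_toy$State)
box_stats <- lapply(by_state, function(v) boxplot.stats(v)$stats)

# Points beyond the whiskers (drawn as individual dots)
box_out <- lapply(by_state, function(v) boxplot.stats(v)$out)

box_stats
box_out
```

In group A the upper whisker stops at 150, the largest value inside the upper fence, and 245 is drawn as a separate point. Group B has no points beyond its whiskers.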