Through this project, a linear regression model, a k-NN model, and a decision tree will be used to attempt to predict the average annual fine particulate matter (PM2.5) at various locations within the United States, excluding Alaska and Hawaii. Three potential explanatory variables were chosen for the models, based on how linear their relationships with the PM2.5 values appeared. Various scatter plots were created to visually select these predictors. For the three models, an RMSE of around 1 would indicate a well-performing model, since the lowest PM2.5 value in the data is about 3.
# Load packages and data
library(tidyverse)
library(plotROC)
library(tidymodels)
airqual_dat <- read_csv("https://github.com/rdpeng/stat322E_public/raw/main/data/pm25_data.csv.gz")
In order to select the best predictors, I computed a few correlations and created scatter plots to see whether there was any visible relationship between each candidate variable and PM2.5.
head(airqual_dat)
## # A tibble: 6 × 50
## id value fips lat lon state county city CMAQ zcta zcta_area
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1003. 9.60 1003 30.5 -87.9 Alabama Baldwin Fairhope 8.10 36532 190980522
## 2 1027. 10.8 1027 33.3 -85.8 Alabama Clay Ashland 9.77 36251 374132430
## 3 1033. 11.2 1033 34.8 -87.7 Alabama Colbert Muscle Sh… 9.40 35660 16716984
## 4 1049. 11.7 1049 34.3 -86.0 Alabama DeKalb Crossville 8.53 35962 203836235
## 5 1055. 12.4 1055 34.0 -86.0 Alabama Etowah Gadsden 9.24 35901 154069359
## 6 1069. 10.5 1069 31.2 -85.4 Alabama Houston Dothan 9.12 36303 162685124
## # ℹ 39 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## # imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## # county_pop <dbl>, log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
## # log_pri_length_10000 <dbl>, log_pri_length_15000 <dbl>,
## # log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
## # log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
## # log_prisec_length_10000 <dbl>, log_prisec_length_15000 <dbl>, …
summary(airqual_dat$value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.024 9.268 11.153 10.808 12.369 23.161
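As a quick reference point for the RMSE values reported below, consider a baseline that always predicts the mean PM2.5; any useful model should beat its error. A minimal sketch using the tidyverse already loaded (output not shown):
# baseline RMSE: the error of a model that always predicts the mean PM2.5
airqual_dat %>%
  summarise(baseline_rmse = sqrt(mean((value - mean(value))^2)))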
colnames(airqual_dat)
## [1] "id" "value"
## [3] "fips" "lat"
## [5] "lon" "state"
## [7] "county" "city"
## [9] "CMAQ" "zcta"
## [11] "zcta_area" "zcta_pop"
## [13] "imp_a500" "imp_a1000"
## [15] "imp_a5000" "imp_a10000"
## [17] "imp_a15000" "county_area"
## [19] "county_pop" "log_dist_to_prisec"
## [21] "log_pri_length_5000" "log_pri_length_10000"
## [23] "log_pri_length_15000" "log_pri_length_25000"
## [25] "log_prisec_length_500" "log_prisec_length_1000"
## [27] "log_prisec_length_5000" "log_prisec_length_10000"
## [29] "log_prisec_length_15000" "log_prisec_length_25000"
## [31] "log_nei_2008_pm25_sum_10000" "log_nei_2008_pm25_sum_15000"
## [33] "log_nei_2008_pm25_sum_25000" "log_nei_2008_pm10_sum_10000"
## [35] "log_nei_2008_pm10_sum_15000" "log_nei_2008_pm10_sum_25000"
## [37] "popdens_county" "popdens_zcta"
## [39] "nohs" "somehs"
## [41] "hs" "somecollege"
## [43] "associate" "bachelor"
## [45] "grad" "pov"
## [47] "hs_orless" "urc2013"
## [49] "urc2006" "aod"
# selecting the best radius for the log_nei_2008_pm25 predictor
cor(airqual_dat$value, airqual_dat$log_nei_2008_pm25_sum_10000)
## [1] 0.3611346
cor(airqual_dat$value, airqual_dat$log_nei_2008_pm25_sum_15000)
## [1] 0.3582421
cor(airqual_dat$value, airqual_dat$log_nei_2008_pm25_sum_25000)
## [1] 0.377129
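Rather than checking variables one at a time, every numeric column's correlation with the outcome can also be ranked at once. A screening sketch using purrr (part of the tidyverse); note that ID-like numeric columns such as fips and zcta are still included, since this is only a first pass:
# rank all numeric columns by absolute correlation with value
airqual_dat %>%
  select(where(is.numeric), -id, -value) %>%
  map_dbl(~ cor(.x, airqual_dat$value, use = "complete.obs")) %>%
  abs() %>%
  sort(decreasing = TRUE) %>%
  head(10)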
# creating various scatterplots to identify relationships between possible predictors
airqual_dat %>%
ggplot(aes(x = CMAQ, y = value)) +
geom_point()
airqual_dat %>%
ggplot(aes(x = log_nei_2008_pm25_sum_25000, y = value)) +
geom_point()
airqual_dat %>%
ggplot(aes(x = lat, y = value)) +
geom_point()
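The same three comparisons can also be drawn in a single faceted figure, which makes the relative strength of each relationship easier to compare side by side (a sketch of an alternative, not the plots used above):
# all three candidate predictors against PM2.5 in one faceted plot
airqual_dat %>%
  select(pm25 = value, CMAQ, log_nei_2008_pm25_sum_25000, lat) %>%
  pivot_longer(-pm25, names_to = "predictor", values_to = "x") %>%
  ggplot(aes(x = x, y = pm25)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ predictor, scales = "free_x")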
The data was split into training and testing subsets to get accurate, unbiased metrics judging the quality of the models.
# first select the variables needed and then scale them
# (note: scale() returns a one-column matrix, which is why scaled
# columns print as name[,1] in later output)
airqual_dat_sub = airqual_dat %>%
  select(id, CMAQ, log_nei_2008_pm25_sum_25000, lat, value) %>%
  mutate(across(-c(value, id), scale))
# split data into training and testing
airqual_dat_split <- initial_split(airqual_dat_sub)
airqual_dat_train <- training(airqual_dat_split)
airqual_dat_test <- testing(airqual_dat_split)
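One caveat: initial_split() draws a random 75/25 split by default, so the metrics below will shift slightly from run to run. Setting a seed beforehand makes the split reproducible (a sketch; the seed value itself is arbitrary):
# reproducible version of the split above (seed value is arbitrary)
set.seed(322)
airqual_dat_split <- initial_split(airqual_dat_sub, prop = 0.75)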
The linear regression model was created by first creating a recipe that is reused by the later models. Then a workflow specific to a linear regression model was created and used to fit the model. The fitted model was then used to make predictions on the test dataset, and a root mean squared error (RMSE) was calculated. For this model, as well as those that follow, a plot was created showing the predicted values against the actual PM2.5 values.
# create recipe; value ~ . treats every remaining column (including id) as a predictor
rec <- airqual_dat_train %>%
  recipe(value ~ .)
# create model
model <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# define workflow with given recipe and model
wf <- workflow() %>%
add_recipe(rec) %>%
add_model(model)
# fit model
res <- fit(wf, data = airqual_dat_train)
# make predictions and plot
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.23
## 2 rsq standard 0.297
## 3 mae standard 1.69
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
ggplot(aes(x = value, y = pred)) +
geom_point() +
labs(title = "Figure 1")
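The fitted linear model's coefficients can also be pulled out of the workflow to see how each predictor is weighted (an inspection sketch; output not shown):
# extract the underlying lm fit and tidy its coefficients
res %>%
  extract_fit_parsnip() %>%
  tidy()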
The k-NN model was first evaluated on the training set with k = 10 using cross-validation; the number of neighbors was then tuned over a small grid. The final model uses k = 15, whose cross-validated RMSE was essentially tied with the best value. That model was then used to make predictions on the test dataset, and a root mean squared error (RMSE) was calculated.
# define recipe, model, and workflow, then evaluate the model with 5-fold cross-validation on the training set
rec <- airqual_dat_train %>%
recipe(value ~.) %>%
step_normalize(all_predictors())
model <- nearest_neighbor(neighbors = 10) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
folds <- vfold_cv(airqual_dat_train, v = 5)
res <- fit_resamples(wf, resamples = folds)
# tune for the optimal number of neighbors
model <- nearest_neighbor(neighbors = tune("k")) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
folds <- vfold_cv(airqual_dat_train, v = 5)
res <- tune_grid(wf, resamples = folds,
grid = tibble(k = c(3, 5, 10, 15, 20, 25)))
res %>%
show_best(metric = "rmse")
## # A tibble: 5 × 7
## k .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 rmse standard 1.98 5 0.118 Preprocessor1_Model2
## 2 10 rmse standard 1.99 5 0.129 Preprocessor1_Model3
## 3 15 rmse standard 1.99 5 0.127 Preprocessor1_Model4
## 4 20 rmse standard 2.00 5 0.127 Preprocessor1_Model5
## 5 3 rmse standard 2.01 5 0.120 Preprocessor1_Model1
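The full tuning results can also be visualized; autoplot() on the tune_grid() output draws the cross-validated metrics against the candidate values of k (a quick sketch; output not shown):
# plot cross-validated metrics across the candidate k values
autoplot(res)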
# create model again with k = 15
rec <- airqual_dat_train %>%
recipe(value ~.) %>%
step_normalize(all_predictors())
model <- nearest_neighbor(neighbors = 15) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
res <- fit(wf, airqual_dat_train)
knn_predictions = airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
mutate(delta = abs(pred - value))
knn_predictions %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.81
## 2 rsq standard 0.536
## 3 mae standard 1.26
knn_predictions %>%
ggplot(aes(x = value, y = pred)) +
geom_point() +
labs(title = "Figure 2")
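For plots like Figure 2, a 45-degree reference line makes over- and under-prediction easier to see (a cosmetic sketch on the same predictions):
# same plot with a y = x reference line; points below the line are under-predictions
knn_predictions %>%
  ggplot(aes(x = value, y = pred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")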
The final model was created by setting the workflow to use a decision tree model. The model was then used to make predictions on the test dataset, and a root mean squared error (RMSE) was calculated.
# create model as decision tree
model = decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
# reassign workflow to follow the new decision tree model
wf = workflow() %>%
add_model(model) %>%
add_recipe(rec)
# fit model
res = wf %>%
fit(data = airqual_dat_train)
# create predictions based on model
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.92
## 2 rsq standard 0.469
## 3 mae standard 1.36
# create plot
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
ggplot(aes(x = value, y = pred)) +
geom_point() +
labs(title = "Figure 3")
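The fitted tree itself can be inspected to see which splits it chose; a sketch assuming the rpart.plot package (not loaded above) is installed:
# visualize the splits of the fitted rpart tree (assumes rpart.plot is installed)
library(rpart.plot)
res %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)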
Despite all being given the same information, performance varied. With the same predictors, the k-NN model did the best, with an RMSE of 1.81. The linear model fared the worst, with an RMSE of 2.23, and the decision tree did only slightly worse than k-NN, but better than the linear regression, with an RMSE of 1.92.
knn_predictions %>%
  arrange(id) %>%
  left_join(airqual_dat, by = "id")
## # A tibble: 219 × 56
## id CMAQ.x[,1] log_nei_2008_pm25_sum_25000…¹ lat.x[,1] value.x pred delta
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1049. 0.0406 -0.372 -0.907 11.7 11.7 0.0806
## 2 1055. 0.279 -0.717 -0.970 12.4 11.0 1.36
## 3 1073. 0.613 1.01 -1.11 12.4 13.2 0.818
## 4 1089. 0.313 -0.336 -0.820 12.2 11.7 0.512
## 5 1117. 0.613 1.10 -1.12 11.6 13.5 1.90
## 6 4005. -1.11 -0.789 -0.708 5.93 9.97 4.05
## 7 4013. 1.03 -0.709 -1.10 10.9 10.5 0.382
## 8 4014. 1.28 -0.671 -1.08 8.85 10.6 1.80
## 9 4021. 1.04 -0.554 -1.18 19.5 11.4 8.04
## 10 4025. -0.572 -2.11 -0.840 5.83 8.79 2.96
## # ℹ 209 more rows
## # ℹ abbreviated name: ¹log_nei_2008_pm25_sum_25000.x[,1]
## # ℹ 49 more variables: value.y <dbl>, fips <dbl>, lat.y <dbl>, lon <dbl>,
## # state <chr>, county <chr>, city <chr>, CMAQ.y <dbl>, zcta <dbl>,
## # zcta_area <dbl>, zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## # imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## # county_pop <dbl>, log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>, …
To quantify where the model performs best and worst, the mean absolute difference between the predicted and true PM2.5 values was computed for each state.
# mean absolute difference between the true PM2.5 and the predicted value, by state
knn_predictions %>%
  left_join(airqual_dat, by = "id") %>%
  group_by(state) %>%
  summarise(N = mean(delta)) %>%
  arrange(N)
## # A tibble: 47 × 2
## state N
## <chr> <dbl>
## 1 South Dakota 0.0995
## 2 Washington 0.213
## 3 Maryland 0.289
## 4 Iowa 0.395
## 5 New Hampshire 0.413
## 6 Missouri 0.476
## 7 South Carolina 0.527
## 8 Michigan 0.575
## 9 Massachusetts 0.598
## 10 New Jersey 0.618
## # ℹ 37 more rows
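Since the table above is sorted in ascending order, it only shows the states where the model does best; flipping the sort surfaces the worst ones (same pipeline, output omitted):
# states with the largest mean absolute prediction error
knn_predictions %>%
  left_join(airqual_dat, by = "id") %>%
  group_by(state) %>%
  summarise(N = mean(delta)) %>%
  arrange(desc(N))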
The worst predictions appear to be made in the West: the mean differences in Arizona, California, Idaho, and New Mexico are among the highest. The mean difference is lowest in South Dakota, Washington, and Maryland. I think it's interesting that prediction quality varies geographically, especially since latitude was one of the predictors I included. It would likely help to include more climate-related information: things like rainfall and heat greatly impact the flora and fauna of an area, which can increase allergens or perhaps even reduce pollution.
# first select CMAQ and aod variables and then scale them
airqual_dat_sub = airqual_dat %>%
select(id, CMAQ, aod, value) %>%
mutate(across(-c(value, id), scale))
# split data into training and testing
airqual_dat_split <- initial_split(airqual_dat_sub)
airqual_dat_train <- training(airqual_dat_split)
airqual_dat_test <- testing(airqual_dat_split)
# create recipe, model, wf, then fit the model, then make predictions
rec <- airqual_dat_train %>%
recipe(value ~.) %>%
step_normalize(all_predictors())
model <- nearest_neighbor(neighbors = 15) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
res <- fit(wf, airqual_dat_train)
knn_predictions = airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
mutate(delta = abs(pred - value))
knn_predictions %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.33
## 2 rsq standard 0.362
## 3 mae standard 1.65
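For consistency with the earlier models, the same predicted-versus-actual plot can be drawn for this fit (a sketch; output not shown):
# predicted vs. actual PM2.5 for the CMAQ + aod model
knn_predictions %>%
  ggplot(aes(x = value, y = pred)) +
  geom_point() +
  labs(title = "CMAQ + aod model")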
With just CMAQ and aod, this model performed worse than the one built on my selection of variables: it had an RMSE of 2.33, compared to 1.81 for my original k-NN model. CMAQ and aod on their own were not enough here; overall, including the right number of parameters that are well correlated with PM2.5 is what improves performance.
Prediction models are a very useful tool, but by far the most difficult part of the process is making the numerous adjustments needed to get a model as accurate as possible. The model didn't do entirely as well as I had expected, but an RMSE of 1.81 is still relatively low.