Through this project, a linear regression model, a k-NN model, and a decision tree will be used to attempt to predict the average annual fine particulate matter (PM2.5) at various locations within the United States, excluding Alaska and Hawaii. Three potential explanatory variables were chosen for the models, based on how linear their relationships with the PM2.5 values appeared. Various scatter plots were created to visually select these predictors. For the three models, an RMSE of around 1 would indicate a well-performing model, since the lowest PM2.5 value in the data is about 3.
# Load packages and data
library(tidyverse)
library(plotROC)
library(tidymodels)
airqual_dat <- read_csv("https://github.com/rdpeng/stat322E_public/raw/main/data/pm25_data.csv.gz")
In order to select the best predictors, I computed a few correlations and created scatter plots to see whether there was any visible relationship between each candidate variable and PM2.5.
head(airqual_dat)
## # A tibble: 6 × 50
## id value fips lat lon state county city CMAQ zcta zcta_area
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1003. 9.60 1003 30.5 -87.9 Alabama Baldwin Fairhope 8.10 36532 190980522
## 2 1027. 10.8 1027 33.3 -85.8 Alabama Clay Ashland 9.77 36251 374132430
## 3 1033. 11.2 1033 34.8 -87.7 Alabama Colbert Muscle Sh… 9.40 35660 16716984
## 4 1049. 11.7 1049 34.3 -86.0 Alabama DeKalb Crossville 8.53 35962 203836235
## 5 1055. 12.4 1055 34.0 -86.0 Alabama Etowah Gadsden 9.24 35901 154069359
## 6 1069. 10.5 1069 31.2 -85.4 Alabama Houston Dothan 9.12 36303 162685124
## # ℹ 39 more variables: zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## # imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## # county_pop <dbl>, log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
## # log_pri_length_10000 <dbl>, log_pri_length_15000 <dbl>,
## # log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
## # log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
## # log_prisec_length_10000 <dbl>, log_prisec_length_15000 <dbl>, …
summary(airqual_dat$value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.024 9.268 11.153 10.808 12.369 23.161
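As a quick reference point for the RMSE values reported below, consider a baseline that always predicts the mean PM2.5; any useful model should beat its error. A minimal sketch using the tidyverse already loaded (output not shown):
# baseline RMSE: the error of a model that always predicts the mean PM2.5
airqual_dat %>%
  summarise(baseline_rmse = sqrt(mean((value - mean(value))^2)))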
colnames(airqual_dat)
## [1] "id" "value"
## [3] "fips" "lat"
## [5] "lon" "state"
## [7] "county" "city"
## [9] "CMAQ" "zcta"
## [11] "zcta_area" "zcta_pop"
## [13] "imp_a500" "imp_a1000"
## [15] "imp_a5000" "imp_a10000"
## [17] "imp_a15000" "county_area"
## [19] "county_pop" "log_dist_to_prisec"
## [21] "log_pri_length_5000" "log_pri_length_10000"
## [23] "log_pri_length_15000" "log_pri_length_25000"
## [25] "log_prisec_length_500" "log_prisec_length_1000"
## [27] "log_prisec_length_5000" "log_prisec_length_10000"
## [29] "log_prisec_length_15000" "log_prisec_length_25000"
## [31] "log_nei_2008_pm25_sum_10000" "log_nei_2008_pm25_sum_15000"
## [33] "log_nei_2008_pm25_sum_25000" "log_nei_2008_pm10_sum_10000"
## [35] "log_nei_2008_pm10_sum_15000" "log_nei_2008_pm10_sum_25000"
## [37] "popdens_county" "popdens_zcta"
## [39] "nohs" "somehs"
## [41] "hs" "somecollege"
## [43] "associate" "bachelor"
## [45] "grad" "pov"
## [47] "hs_orless" "urc2013"
## [49] "urc2006" "aod"
# selecting the best radius for the log_nei_2008_pm25 predictor
cor(airqual_dat$value, airqual_dat$log_nei_2008_pm25_sum_10000)
## [1] 0.3611346
cor(airqual_dat$value, airqual_dat$log_nei_2008_pm25_sum_15000)
## [1] 0.3582421
cor(airqual_dat$value, airqual_dat$log_nei_2008_pm25_sum_25000)
## [1] 0.377129
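Rather than checking variables one at a time, every numeric column's correlation with the outcome can also be ranked at once. A screening sketch using purrr (part of the tidyverse); note that ID-like numeric columns such as fips and zcta are still included, since this is only a first pass:
# rank all numeric columns by absolute correlation with value
airqual_dat %>%
  select(where(is.numeric), -id, -value) %>%
  map_dbl(~ cor(.x, airqual_dat$value, use = "complete.obs")) %>%
  abs() %>%
  sort(decreasing = TRUE) %>%
  head(10)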
# creating various scatterplots to identify relationships between possible predictors
airqual_dat %>%
ggplot(aes(x = CMAQ, y = value)) +
geom_point()
airqual_dat %>%
ggplot(aes(x = log_nei_2008_pm25_sum_25000, y = value)) +
geom_point()
airqual_dat %>%
ggplot(aes(x = lat, y = value)) +
geom_point()
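The same three comparisons can also be drawn in a single faceted figure, which makes the relative strength of each relationship easier to compare side by side (a sketch of an alternative, not the plots used above):
# all three candidate predictors against PM2.5 in one faceted plot
airqual_dat %>%
  select(pm25 = value, CMAQ, log_nei_2008_pm25_sum_25000, lat) %>%
  pivot_longer(-pm25, names_to = "predictor", values_to = "x") %>%
  ggplot(aes(x = x, y = pm25)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ predictor, scales = "free_x")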
The data was split into training and testing subsets to get accurate, unbiased metrics judging the quality of the models.
# first select the variables needed and then scale them
# (note: scale() returns a one-column matrix, which is why scaled
# columns print as name[,1] in later output)
airqual_dat_sub = airqual_dat %>%
  select(id, CMAQ, log_nei_2008_pm25_sum_25000, lat, value) %>%
  mutate(across(-c(value, id), scale))
# split data into training and testing
airqual_dat_split <- initial_split(airqual_dat_sub)
airqual_dat_train <- training(airqual_dat_split)
airqual_dat_test <- testing(airqual_dat_split)
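One caveat: initial_split() draws a random 75/25 split by default, so the metrics below will shift slightly from run to run. Setting a seed beforehand makes the split reproducible (a sketch; the seed value itself is arbitrary):
# reproducible version of the split above (seed value is arbitrary)
set.seed(322)
airqual_dat_split <- initial_split(airqual_dat_sub, prop = 0.75)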
The linear regression model was created by first creating a recipe that is reused by the later models. Then a workflow specific to a linear regression model was created and used to fit the model. The fitted model was then used to make predictions on the test dataset, and a root mean squared error (RMSE) was calculated. For this model, as well as those that follow, a plot was created showing the predicted values against the actual PM2.5 values.
# create recipe; value ~ . treats every remaining column (including id) as a predictor
rec <- airqual_dat_train %>%
  recipe(value ~ .)
# create model
model <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# define workflow with given recipe and model
wf <- workflow() %>%
add_recipe(rec) %>%
add_model(model)
# fit model
res <- fit(wf, data = airqual_dat_train)
# make predictions and plot
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.23
## 2 rsq standard 0.297
## 3 mae standard 1.69
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
ggplot(aes(x = value, y = pred)) +
geom_point() +
labs(title = "Figure 1")
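The fitted linear model's coefficients can also be pulled out of the workflow to see how each predictor is weighted (an inspection sketch; output not shown):
# extract the underlying lm fit and tidy its coefficients
res %>%
  extract_fit_parsnip() %>%
  tidy()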
The k-NN model was first evaluated on the training set with k = 10 using cross-validation; the number of neighbors was then tuned over a small grid. The final model uses k = 15, whose cross-validated RMSE was essentially tied with the best value. That model was then used to make predictions on the test dataset, and a root mean squared error (RMSE) was calculated.
# define recipe, model, and workflow, then evaluate the model with 5-fold cross-validation on the training set
rec <- airqual_dat_train %>%
recipe(value ~.) %>%
step_normalize(all_predictors())
model <- nearest_neighbor(neighbors = 10) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
folds <- vfold_cv(airqual_dat_train, v = 5)
res <- fit_resamples(wf, resamples = folds)
# tune for the optimal number of neighbors
model <- nearest_neighbor(neighbors = tune("k")) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
folds <- vfold_cv(airqual_dat_train, v = 5)
res <- tune_grid(wf, resamples = folds,
grid = tibble(k = c(3, 5, 10, 15, 20, 25)))
res %>%
show_best(metric = "rmse")
## # A tibble: 5 × 7
## k .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 rmse standard 1.98 5 0.118 Preprocessor1_Model2
## 2 10 rmse standard 1.99 5 0.129 Preprocessor1_Model3
## 3 15 rmse standard 1.99 5 0.127 Preprocessor1_Model4
## 4 20 rmse standard 2.00 5 0.127 Preprocessor1_Model5
## 5 3 rmse standard 2.01 5 0.120 Preprocessor1_Model1
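The full tuning results can also be visualized; autoplot() on the tune_grid() output draws the cross-validated metrics against the candidate values of k (a quick sketch; output not shown):
# plot cross-validated metrics across the candidate k values
autoplot(res)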
# create model again with k = 15
rec <- airqual_dat_train %>%
recipe(value ~.) %>%
step_normalize(all_predictors())
model <- nearest_neighbor(neighbors = 15) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
res <- fit(wf, airqual_dat_train)
knn_predictions = airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
mutate(delta = abs(pred - value))
knn_predictions %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.81
## 2 rsq standard 0.536
## 3 mae standard 1.26
knn_predictions %>%
ggplot(aes(x = value, y = pred)) +
geom_point() +
labs(title = "Figure 2")
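For plots like Figure 2, a 45-degree reference line makes over- and under-prediction easier to see (a cosmetic sketch on the same predictions):
# same plot with a y = x reference line; points below the line are under-predictions
knn_predictions %>%
  ggplot(aes(x = value, y = pred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")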
The final model was created by setting the workflow to use a decision tree model. The model was then used to make predictions on the test dataset, and a root mean squared error (RMSE) was calculated.
# create model as decision tree
model = decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
# reassign workflow to follow the new decision tree model
wf = workflow() %>%
add_model(model) %>%
add_recipe(rec)
# fit model
res = wf %>%
fit(data = airqual_dat_train)
# create predictions based on model
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.92
## 2 rsq standard 0.469
## 3 mae standard 1.36
# create plot
airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
ggplot(aes(x = value, y = pred)) +
geom_point() +
labs(title = "Figure 3")
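The fitted tree itself can be inspected to see which splits it chose; a sketch assuming the rpart.plot package (not loaded above) is installed:
# visualize the splits of the fitted rpart tree (assumes rpart.plot is installed)
library(rpart.plot)
res %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)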
Despite all being given the same information, performance varied. With the same predictors, the k-NN model did the best, with an RMSE of 1.81. The linear model fared the worst, with an RMSE of 2.23, and the decision tree did only slightly worse than k-NN, but better than the linear regression, with an RMSE of 1.92.
knn_predictions %>%
  arrange(id) %>%
  left_join(airqual_dat, by = "id")
## # A tibble: 219 × 56
## id CMAQ.x[,1] log_nei_2008_pm25_sum_25000…¹ lat.x[,1] value.x pred delta
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1049. 0.0406 -0.372 -0.907 11.7 11.7 0.0806
## 2 1055. 0.279 -0.717 -0.970 12.4 11.0 1.36
## 3 1073. 0.613 1.01 -1.11 12.4 13.2 0.818
## 4 1089. 0.313 -0.336 -0.820 12.2 11.7 0.512
## 5 1117. 0.613 1.10 -1.12 11.6 13.5 1.90
## 6 4005. -1.11 -0.789 -0.708 5.93 9.97 4.05
## 7 4013. 1.03 -0.709 -1.10 10.9 10.5 0.382
## 8 4014. 1.28 -0.671 -1.08 8.85 10.6 1.80
## 9 4021. 1.04 -0.554 -1.18 19.5 11.4 8.04
## 10 4025. -0.572 -2.11 -0.840 5.83 8.79 2.96
## # ℹ 209 more rows
## # ℹ abbreviated name: ¹log_nei_2008_pm25_sum_25000.x[,1]
## # ℹ 49 more variables: value.y <dbl>, fips <dbl>, lat.y <dbl>, lon <dbl>,
## # state <chr>, county <chr>, city <chr>, CMAQ.y <dbl>, zcta <dbl>,
## # zcta_area <dbl>, zcta_pop <dbl>, imp_a500 <dbl>, imp_a1000 <dbl>,
## # imp_a5000 <dbl>, imp_a10000 <dbl>, imp_a15000 <dbl>, county_area <dbl>,
## # county_pop <dbl>, log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>, …
To quantify where the model performs best and worst, the mean absolute difference between the predicted and true PM2.5 values was computed for each state.
# mean absolute difference between the true PM2.5 and the predicted value, by state
knn_predictions %>%
  left_join(airqual_dat, by = "id") %>%
  group_by(state) %>%
  summarise(N = mean(delta)) %>%
  arrange(N)
## # A tibble: 47 × 2
## state N
## <chr> <dbl>
## 1 South Dakota 0.0995
## 2 Washington 0.213
## 3 Maryland 0.289
## 4 Iowa 0.395
## 5 New Hampshire 0.413
## 6 Missouri 0.476
## 7 South Carolina 0.527
## 8 Michigan 0.575
## 9 Massachusetts 0.598
## 10 New Jersey 0.618
## # ℹ 37 more rows
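Since the table above is sorted in ascending order, it only shows the states where the model does best; flipping the sort surfaces the worst ones (same pipeline, output omitted):
# states with the largest mean absolute prediction error
knn_predictions %>%
  left_join(airqual_dat, by = "id") %>%
  group_by(state) %>%
  summarise(N = mean(delta)) %>%
  arrange(desc(N))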
The worst predictions appear to be made in the West: the mean differences in Arizona, California, Idaho, and New Mexico are among the highest. The mean difference is lowest in South Dakota, Washington, and Maryland. I think it's interesting that prediction quality varies geographically, especially since latitude was one of the predictors I included. It would likely help to include more climate-related information: things like rainfall and heat greatly impact the flora and fauna of an area, which can increase allergens or perhaps even reduce pollution.
# first select CMAQ and aod variables and then scale them
airqual_dat_sub = airqual_dat %>%
select(id, CMAQ, aod, value) %>%
mutate(across(-c(value, id), scale))
# split data into training and testing
airqual_dat_split <- initial_split(airqual_dat_sub)
airqual_dat_train <- training(airqual_dat_split)
airqual_dat_test <- testing(airqual_dat_split)
# create recipe, model, wf, then fit the model, then make predictions
rec <- airqual_dat_train %>%
recipe(value ~.) %>%
step_normalize(all_predictors())
model <- nearest_neighbor(neighbors = 15) %>%
set_engine("kknn") %>%
set_mode("regression")
wf <- workflow() %>%
add_model(model) %>%
add_recipe(rec)
res <- fit(wf, airqual_dat_train)
knn_predictions = airqual_dat_test %>%
mutate(pred = predict(res, airqual_dat_test)$.pred) %>%
mutate(delta = abs(pred - value))
knn_predictions %>%
metrics(truth = value, estimate = pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.33
## 2 rsq standard 0.362
## 3 mae standard 1.65
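For consistency with the earlier models, the same predicted-versus-actual plot can be drawn for this fit (a sketch; output not shown):
# predicted vs. actual PM2.5 for the CMAQ + aod model
knn_predictions %>%
  ggplot(aes(x = value, y = pred)) +
  geom_point() +
  labs(title = "CMAQ + aod model")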
With just CMAQ and aod, this model performed worse than the one built on my selection of variables: it had an RMSE of 2.33, compared to 1.81 for my original k-NN model. CMAQ and aod on their own were not enough here; overall, including the right number of parameters that are well correlated with PM2.5 is what improves performance.
Prediction models are a very useful tool, but by far the most difficult part of the process is making the numerous adjustments needed to get a model as accurate as possible. The model didn't do entirely as well as I had expected, but an RMSE of 1.81 is still relatively low.