The dataset I chose comes from the U.S. Environmental Protection Agency (EPA) and covers pollution levels in select counties of select U.S. states (it also includes Mexico), specifically between 2000 and 2010. The pollution levels are based on several gases that are harmful to the environment and atmosphere. The dataset includes variables such as State Code, County Code, Site Num, Address, State, County, City, and Date Local. It also includes the Units, Mean, 1st Max Value, 1st Max Hour, and AQI for each pollutant (NO2, O3, SO2, and CO). I plan to explore the highest mean NO2 value within each state.
Loaded the necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
Loaded in the dataset (csv) using the read_csv function
New names:
• `` -> `...1`
Rows: 1048575 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): Address, State, County, City, Date Local, NO2 Units, O3 Units, SO2...
dbl (20): ...1, State Code, County Code, Site Num, NO2 Mean, NO2 1st Max Val...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
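The loading call itself is not shown in this report. As a minimal sketch of what it could look like, the example below writes a tiny made-up CSV to a temporary file and reads it back with read_csv. The file, the values, and the name demo_data are all hypothetical. It also shows where the `...1` name in the message above comes from: a blank first header cell gets repaired to `...1`.

```r
library(readr)

# Hypothetical stand-in for the EPA export: a tiny CSV whose first header
# cell is empty (for example, a saved row-number column). Values are made up.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  ",State,NO2 Mean,O3 Mean",
  "0,Arizona,19.04,0.0225",
  "1,California,12.10,0.0310"
), tmp)

# read_csv() repairs the blank column name to `...1` and guesses column types.
demo_data <- read_csv(tmp, show_col_types = FALSE)
names(demo_data)
```

For the real data, the same call would simply point at the downloaded CSV file instead of the temporary one.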
Used a combination of the group_by and slice functions (I researched how to use them and looked at examples on Stack Overflow) to keep, for each state, only the row with the maximum NO2 mean, because I want to focus on the maximum NO2 value in each state. The rest of that row, including the values for the other pollutants, is kept as well.
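The code for this step is not shown, so here is a minimal sketch of one way group_by and slice can be combined to get one row per state. The data frame toy and its values are made up for illustration, and slice(which.max(...)) is an assumption about the approach, not necessarily the exact call used in this report.

```r
library(dplyr)

# Made-up rows standing in for the real data (values are illustrative only).
toy <- data.frame(
  State      = c("Arizona", "Arizona", "Texas", "Texas", "Texas"),
  `NO2 Mean` = c(20.1, 35.4, 12.0, 28.7, 9.3),
  `O3 Mean`  = c(0.031, 0.018, 0.040, 0.022, 0.035),
  check.names = FALSE
)

# For each state, keep the single row with the largest NO2 Mean.
toy_maxes <- toy %>%
  group_by(State) %>%
  slice(which.max(`NO2 Mean`)) %>%
  ungroup()

toy_maxes
```

Note that the O3 Mean kept for each state is the value from the same row as the maximum NO2 mean, not that state's own maximum O3 mean.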
Used the ifelse function to group the NO2 values into ranges. Since the graph's y-axis will be in increments of 25, I set the color ranges to match. Consulted ChatGPT on how to make multiple ranges for the legend.
pollution_data_maxes <- pollution_data_maxes %>%
  mutate(legend1 = ifelse(`NO2 Mean` <= 25, "0 to 25",
                   ifelse(`NO2 Mean` <= 50, "25 to 50",
                   ifelse(`NO2 Mean` <= 75, "50 to 75", "75 or above"))))
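As a possible alternative to the nested ifelse calls, base R's cut function can produce the same ranges in one call. This is only a sketch on a made-up vector (no2 is hypothetical); it checks that both approaches agree, including for values that fall exactly on a boundary such as 25 or 75.

```r
# Made-up NO2 means, including values that sit exactly on a boundary.
no2 <- c(10, 25, 25.1, 50, 74.9, 75, 120)

# Nested ifelse(), mirroring the logic used in the report.
by_ifelse <- ifelse(no2 <= 25, "0 to 25",
             ifelse(no2 <= 50, "25 to 50",
             ifelse(no2 <= 75, "50 to 75", "75 or above")))

# cut() with right-closed intervals reproduces the same bins and returns a
# factor whose levels follow the order of the breaks.
by_cut <- cut(no2,
              breaks = c(-Inf, 25, 50, 75, Inf),
              labels = c("0 to 25", "25 to 50", "50 to 75", "75 or above"),
              right = TRUE)

identical(as.character(by_cut), by_ifelse)
```

Because the comparisons use <=, a value of exactly 75 lands in "50 to 75", so the last label effectively means "above 75".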
Created the first scatterplot. I used the scale_color_brewer function to change the colors of the plot points. I decided to use the Set2 palette because when the graph is organized from least to greatest NO2 mean value, the colors run from green at the low end to pink at the high end (associating green with good and red tones with bad). I also had to rotate the state names (x values) using angle = because they were overlapping. Once they were rotated, the names weren't flush with the x-axis, so I used vjust and hjust to move them up. Used a minimal theme to lighten the background.
plot1 <- ggplot(pollution_data_maxes, aes(x = State, y = `NO2 Mean`, color = legend1)) +
  geom_point() +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Highest Mean NO2 Level In Each Given State",
       x = "State",
       y = "Max NO2 Mean (Parts Per Billion)",
       caption = "Source: EPA") +
  theme_minimal() + # minimal theme in order to keep the focus on the colors of the plot points (aesthetic decision)
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) # Rotates the state names (x values)
plot1
Used the order and factor functions to order the plot points from least to greatest mean NO2 value. If I set decreasing = TRUE, the graph would instead be ordered from greatest to least mean NO2 value.
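The reordering code is not shown, so this is a minimal sketch of how order and factor can work together for this purpose. The data frame toy_maxes, the name toy_resort, and the values are hypothetical. The idea is to sort the rows by NO2 mean, then turn State into a factor whose levels follow that sorted order, since ggplot2 places a discrete x-axis in level order.

```r
# Made-up per-state maxima (illustrative values only).
toy_maxes <- data.frame(
  State      = c("Arizona", "Texas", "Ohio"),
  `NO2 Mean` = c(35.4, 28.7, 31.2),
  check.names = FALSE
)

# Sort rows from least to greatest NO2 Mean; decreasing = TRUE would reverse it.
toy_resort <- toy_maxes[order(toy_maxes$`NO2 Mean`, decreasing = FALSE), ]

# Fix the factor levels to the sorted order so the plot's x-axis follows it
# instead of the default alphabetical order.
toy_resort$State <- factor(toy_resort$State, levels = toy_resort$State)

levels(toy_resort$State)
```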
Same code as the first ifelse function, but it’s used for pollution_data_resort instead of pollution_data_maxes. This is so I can have individual legends for both graphs.
pollution_data_resort <- pollution_data_resort %>%
  mutate(legend2 = ifelse(`NO2 Mean` <= 25, "0 to 25",
                   ifelse(`NO2 Mean` <= 50, "25 to 50",
                   ifelse(`NO2 Mean` <= 75, "50 to 75", "75 or above"))))
Same code as the first plot, but the states are ordered from least to greatest mean NO2 value instead of alphabetically.
plot2 <- ggplot(pollution_data_resort, aes(x = State, y = `NO2 Mean`, color = legend2)) +
  geom_point() +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Highest Mean NO2 Level In Each Given State",
       x = "State",
       y = "Max NO2 Mean (Parts Per Billion)",
       caption = "Source: EPA") +
  theme_minimal() + # minimal theme in order to keep the focus on the colors of the plot points (aesthetic decision)
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
plot2
Created a linear regression model using the lm() function. The two variables that I chose to compare were the NO2 mean (response) and the O3 mean (predictor).
poll_data <-lm(`NO2 Mean`~`O3 Mean`, data = pollution_data_maxes)
Calculated the correlation between the two variables using the cor() function. Got a value of -0.1577671 which means that this is a weak negative correlation (the linear relationship between NO2 mean and O3 mean is minimal at best).
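The cor() call is not shown, so here is a small sketch of how the correlation relates to the regression. The vectors no2 and o3 are made up for illustration. For a regression with one predictor, the squared correlation equals the model's multiple R-squared, which gives a quick consistency check between the value reported here and the summary output.

```r
# Made-up paired values (illustrative only, not the EPA data).
no2 <- c(35.4, 28.7, 31.2, 18.9, 42.0, 25.3)
o3  <- c(0.018, 0.022, 0.030, 0.035, 0.020, 0.027)

# Pearson correlation between the two variables.
r <- cor(no2, o3)

# With a single predictor, r squared matches the model's R-squared.
fit <- lm(no2 ~ o3)
all.equal(r^2, summary(fit)$r.squared)

# Same check with the correlation reported in this report.
reported_r <- -0.1577671
reported_r^2
```

Squaring the reported correlation gives a value that rounds to the Multiple R-squared of 0.02489 in the model summary.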
Created a summary of the linear regression model 'poll_data' that I built earlier.
summary(poll_data)
Call:
lm(formula = `NO2 Mean` ~ `O3 Mean`, data = pollution_data_maxes)

Residuals:
    Min      1Q  Median      3Q     Max
-32.520 -17.269  -5.571  12.914  95.237

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   52.825      6.612   7.990 2.12e-09 ***
`O3 Mean`   -363.849    384.944  -0.945    0.351
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.12 on 35 degrees of freedom
Multiple R-squared: 0.02489, Adjusted R-squared: -0.00297
F-statistic: 0.8934 on 1 and 35 DF, p-value: 0.351
The equation for my model is NO2 Mean = (-363.849 x O3 Mean) + 52.825. The p-value of 0.351 is greater than 0.05, so the slope is not statistically distinguishable from zero, and the adjusted R^2 of -0.00297 shows that the model explains essentially none of the variation in the NO2 mean. Based on the p-value and the adjusted R^2, this model gives no evidence that the O3 mean affects the NO2 mean in these state-level maximum rows.
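To make the significance reasoning concrete, the sketch below recomputes the t value, the p-value, and a 95% confidence interval for the slope using only the numbers printed in the summary output. The variable names are my own, and the inputs are the rounded values from the table, so the results are approximate.

```r
# Values copied from the coefficient table of the model summary.
estimate  <- -363.849   # slope for `O3 Mean`
std_error <- 384.944    # its standard error
df_resid  <- 35         # residual degrees of freedom

# t statistic and two-sided p-value for the slope.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df_resid)

# 95% confidence interval for the slope.
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

t_value
p_value
ci
```

The interval contains zero, which is another way of seeing that the data are compatible with no linear relationship between the two means.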
With this dataset, I wanted to focus on the maximum mean values of NO2 by state. To do that, I cleaned the dataset by removing all unwanted rows (anything that isn't the maximum mean NO2 value within each state) using the group_by and slice functions. I was left with the maximum mean NO2 value of each state. Although there are some NA values within the cleaned dataset, I decided to keep them as a visual reminder of certain values for myself. I was also able to keep them because they didn't impact the topic I was focusing on.
The two visualizations I created show essentially the same thing. The only difference is that the first is sorted alphabetically by state (x values) and the second is sorted from least to greatest mean NO2 value (y values). Once I sorted the values from least to greatest, I was surprised to see a steady increase in the mean NO2 values up until the last three plot points. I found it interesting that out of all the states included in this dataset, Arizona was the state with the highest mean NO2 value. I would've assumed it would be somewhere like California (which had the second highest value) or New York (which ranked at number 8). In terms of the colors, I decided to color the plot points based on the range they were in. The ranges were based on the y-axis values, which are in increments of 25 (0-25 is the first color range, 25-50 is the second, etc.).
I originally wanted to include a scatter plot of the relationship between NO2 and O3, but I ultimately decided against it after graphing it and running the linear regression analysis. It didn't provide much information or context for the rest of the project, so I didn't think it would be beneficial to include.