Like other large cities, Los Angeles experiences periods of high levels of ozone, especially during the summer months. Because ground level ozone is a known human health hazard, local governments and health agencies would like to be able to alert the public of unhealthy air conditions. This project will explore two questions. Is it possible to predict the concentration of ground level ozone based solely on local meteorological forecasts? Once meteorology is considered, do measurements of chemical precursors improve the predictions?
Ground level ozone (O\(_3\)) is produced when nitrogen oxides (NO\(_x\)) and volatile organic compounds (VOC’s) are mixed in the atmosphere in the presence of sunlight. These chemical precusors of ozone, NO\(_x\) and VOC’s, are emitted by automobiles and industrial processes, and to a very rough first approximation, these emissions are constant on a day to day basis. However, ozone levels fluctuate. Certain atmospheric conditions produce environments that are more conducive to ozone production than others. In general, ozone production increases when city air stagnates on sunny summer days.
The subject of this project is to examine the relatioship between ozone levels and commonly measured meteorlogical variables: local temperature, wind direction, wind speed, and humidity levels. Can a basic model of atmospheric conditions predict ozone levels? Once the basic model is established, measurements of NO\(_x\) and carbon monoxide (CO - a proxy for VOC’s) are added to examine their utility.
Los Angeles suffers periods of high ozone levels and there are several air quality monitoring sites in the region. The location chosen for this study is at 1630 N. Main St, in Los Angeles. This site is northeast of the downtown, between Dodger Stadium and the Los Angeles River. It is close to the geographical center of the Los Angeles basin and is considered to be urban. 1995 was randomly chosen as the time period for consideration.
This project will utilize “daily data” published by the US Environmental Protection Agency (EPA). The extensive EPA data base, available at https://www.epa.gov/outdoor-air-quality-data, includes meteorological and chemical measurements from locations all over the country, from 1980 to the present. The daily data values are the mean values measured over 24 hour periods.
The resultant variable: O\(_3\) concentration
The explanatory meteorological variables: temperature, relative humidity, wind direction, and wind speed
The explanatory chemical variables: NO\(_x\) concentration and CO concentration
CO concentration is used as a proxy for VOC concentration because VOC is measured much more infrequently than CO. In 1995, there are 358 days of CO measurements versus 60 days of VOC measurements. In the interest of maximizing available data, CO was chosen as a test explanatory variable. Thorough verification of CO as a valid proxy is outside the scope of this report.
In this section, the raw data files, which include measurements for the entire US, will be read and the relavant observations at the study site will be extracted. The annual cycle of each variable over the period Jan - Dec, 1995 will be plotted.
First, load the necessary libraries
library(rmarkdown)
library(knitr)
There is a separate file for each parameter of interest. Each file has the same general format.
#Read in the data for each variable: Ozone, Temperature, Humidity, Wind Direction + Speed, NOx, and CO.
OzoneUSA <- read.csv("./Air Data 1995/daily_OZONE_1995.csv")
TempUSA <- read.csv("./Air Data 1995/daily_TEMP_1995.csv")
HumidUSA <- read.csv("./Air Data 1995/daily_RH_DP_1995.csv")
WindUSA <- read.csv("./Air Data 1995/daily_WIND_1995.csv")
NOxUSA <- read.csv("./Air Data 1995/daily_NONOxNOy_1995.csv")
COUSA <- read.csv("./Air Data 1995/daily_CO_1995.csv")
#How big are these files and what do they look like? Look at ozone as an example.
dim(OzoneUSA)
## [1] 279013 29
head(OzoneUSA,3)
## State.Code County.Code Site.Num Parameter.Code POC Latitude Longitude
## 1 01 27 1 44201 1 33.28126 -85.80218
## 2 01 27 1 44201 1 33.28126 -85.80218
## 3 01 27 1 44201 1 33.28126 -85.80218
## Datum Parameter.Name Sample.Duration Pollutant.Standard
## 1 WGS84 Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-Hour 2008
## 2 WGS84 Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-Hour 2008
## 3 WGS84 Ozone 8-HR RUN AVG BEGIN HOUR Ozone 8-Hour 2008
## Date.Local Units.of.Measure Event.Type Observation.Count
## 1 1995-01-01 Parts per million None 24
## 2 1995-01-02 Parts per million None 24
## 3 1995-01-03 Parts per million None 24
## Observation.Percent Arithmetic.Mean X1st.Max.Value X1st.Max.Hour AQI
## 1 100 0.010375 0.018 14 15
## 2 100 0.024208 0.029 10 25
## 3 100 0.021167 0.027 8 23
## Method.Code Method.Name Local.Site.Name Address State.Name
## 1 NA - ASHLAND ROUTE 1, ASHLAND, AL Alabama
## 2 NA - ASHLAND ROUTE 1, ASHLAND, AL Alabama
## 3 NA - ASHLAND ROUTE 1, ASHLAND, AL Alabama
## County.Name City.Name CBSA.Name Date.of.Last.Change
## 1 Clay Ashland 2016-04-30
## 2 Clay Ashland 2016-04-30
## 3 Clay Ashland 2016-04-30
The OzoneUSA file is a collection of 270,913 observations. The rows are sorted first by location and then by calendar date. For example, the first ~365 rows are ozone measurements collected at Ashland, Alabama. The next ~365 rows correspond to measurements from Colbert, Alabama. The actual number of rows associated with each location is less than 365 because of missing data. This file contains the entire collection of US ozone measurements in the EPA database for the year 1995.
There are 29 columns associated with the observations. The most important pieces of information for this study are the State.Code (col 1), County.Code (col 2), Site.Num (col 3), Date.Local (col 12), and Arithmetic.Mean (col 17). The State, County, and Site codes will be used to extract data for 1630 N. Main, Los Angeles. Date.Local will be used to synchronize the different data files. Arithmetic.Mean is the value of the variable, in this case, ozone concentration. Other columns contain important information that doesn’t vary between measurements. For example, Units.of.Measure tells us that the ozone concentration is measured in parts per million. Here’s a list of the units of measure from each file.
Ozone (units = parts per million)
Temperature (\(^{\circ}\)F)
Relative Humidity (%)
Wind Speed (knots)
Wind Direction (\(^{\circ}\)Compass)
NO\(_x\) (parts per billion)
CO (parts per million)
The structure of each file is the same:
str(OzoneUSA)
## 'data.frame': 279013 obs. of 29 variables:
## $ State.Code : Factor w/ 52 levels "01","02","04",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County.Code : int 27 27 27 27 27 27 27 27 27 27 ...
## $ Site.Num : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Parameter.Code : int 44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
## $ POC : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Latitude : num 33.3 33.3 33.3 33.3 33.3 ...
## $ Longitude : num -85.8 -85.8 -85.8 -85.8 -85.8 ...
## $ Datum : Factor w/ 4 levels "NAD27","NAD83",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Parameter.Name : Factor w/ 1 level "Ozone": 1 1 1 1 1 1 1 1 1 1 ...
## $ Sample.Duration : Factor w/ 1 level "8-HR RUN AVG BEGIN HOUR": 1 1 1 1 1 1 1 1 1 1 ...
## $ Pollutant.Standard : Factor w/ 1 level "Ozone 8-Hour 2008": 1 1 1 1 1 1 1 1 1 1 ...
## $ Date.Local : Factor w/ 365 levels "1995-01-01","1995-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Units.of.Measure : Factor w/ 1 level "Parts per million": 1 1 1 1 1 1 1 1 1 1 ...
## $ Event.Type : Factor w/ 2 levels "Included","None": 2 2 2 2 2 2 2 2 2 2 ...
## $ Observation.Count : int 24 24 24 24 24 24 24 24 24 24 ...
## $ Observation.Percent: num 100 100 100 100 100 100 100 100 100 100 ...
## $ Arithmetic.Mean : num 0.0104 0.0242 0.0212 0.0221 0.0159 ...
## $ X1st.Max.Value : num 0.018 0.029 0.027 0.027 0.026 0.026 0.021 0.032 0.043 0.039 ...
## $ X1st.Max.Hour : int 14 10 8 11 8 15 21 9 9 9 ...
## $ AQI : int 15 25 23 23 22 22 18 27 36 33 ...
## $ Method.Code : logi NA NA NA NA NA NA ...
## $ Method.Name : Factor w/ 1 level " - ": 1 1 1 1 1 1 1 1 1 1 ...
## $ Local.Site.Name : Factor w/ 758 levels ""," Davidsonville",..: 55 55 55 55 55 55 55 55 55 55 ...
## $ Address : Factor w/ 1010 levels " 6100 ARLINGTON BLVD MONTG WARD",..: 880 880 880 880 880 880 880 880 880 880 ...
## $ State.Name : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County.Name : Factor w/ 475 levels "Abbeville","Adams",..: 87 87 87 87 87 87 87 87 87 87 ...
## $ City.Name : Factor w/ 618 levels "Adams","Agawam",..: 27 27 27 27 27 27 27 27 27 27 ...
## $ CBSA.Name : Factor w/ 305 levels "","Adrian, MI",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date.of.Last.Change: Factor w/ 10 levels "2012-08-03","2012-08-04",..: 7 7 7 7 7 7 7 7 7 7 ...
In order to ease data manipulation, the first three columns (State.Code, County.Code, and Site.Num) are used to extract the observations corresponding to 1630 N. Main st., Los Angeles. These data will be put in new data frames.
The other thing that needs to be done to these data frames is convert the Date.Local column from “class = factor”" to “class = date”. This is done for two reasons. Firstly, R creates better plots when the class is “date”. The more important reason to convert this column is to facilitate merging the data into one data frame. Some of the original raw data files have the date recorded in the format YYYY-MM-DD and some have date as MM/DD/YYYY. These differing formats are incompatible when trying to merge.
#Extract the rows specfic to the site 1630 N. Main St. Its unique identifier is State.Code = 6, County.Code = 37, Site.Num = 1103. In the second step for every file, convert the class for the date from factor to date.
#The State.Code data in the original file for ozone is character rather than numerical, so the logical for the state has quotes. All the other files have state as numerical
OzoneLA <- subset(OzoneUSA, OzoneUSA$State.Code=="06" &
OzoneUSA$County.Code==37 & OzoneUSA$Site.Num==1103)
#Convert the class of Date.Local to date
OzoneLA$Date.Local <- as.Date(OzoneLA$Date.Local, "%Y-%m-%d")
TempLA <- subset(TempUSA, TempUSA$State.Code==6 &
TempUSA$County.Code==37 & TempUSA$Site.Num==1103)
TempLA$Date.Local <- as.Date(TempLA$Date.Local, "%m/%d/%Y")
HumidLA <- subset(HumidUSA, HumidUSA$State.Code==6 &
HumidUSA$County.Code==37 & HumidUSA$Site.Num==1103)
HumidLA$Date.Local <- as.Date(HumidLA$Date.Local, "%Y-%m-%d")
#The WindLA dataframe contains both Wind Direction and Wind Speed. These must be split into two separate data frames.
WindDirLA <- subset(WindUSA, WindUSA$State.Code==6 &
WindUSA$County.Code==37 & WindUSA$Site.Num==1103 &
WindUSA$Parameter.Name == "Wind Direction - Resultant")
WindDirLA$Date.Local <- as.Date(WindDirLA$Date.Local, "%Y-%m-%d")
WindSpLA <- subset(WindUSA, WindUSA$State.Code==6 &
WindUSA$County.Code==37 & WindUSA$Site.Num==1103 &
WindUSA$Parameter.Name == "Wind Speed - Resultant")
WindSpLA$Date.Local <- as.Date(WindSpLA$Date.Local, "%Y-%m-%d")
NOxLA <- subset(NOxUSA, NOxUSA$State.Code==6 &
NOxUSA$County.Code==37 & NOxUSA$Site.Num==1103 &
NOxUSA$Parameter.Name == "Oxides of nitrogen (NOx)")
NOxLA$Date.Local <- as.Date(NOxLA$Date.Local, "%Y-%m-%d")
COLA <- subset(COUSA, COUSA$State.Code==6 &
COUSA$County.Code==37 & COUSA$Site.Num==1103)
COLA$Date.Local <- as.Date(COLA$Date.Local, "%m/%d/%Y")
The data frames for each variable are of dimension ~365 rows X 29 columns. Missing data reduces the number of rows from 365.
dim(OzoneLA)
## [1] 365 29
dim(WindDirLA)
## [1] 355 29
#Ozone
plot(Arithmetic.Mean ~ Date.Local, data = OzoneLA, main = "Ozone at 1630 N. Main St., Los Angeles", xlab = "Date (1995)", ylab = "Mean Ozone Conc (ppm)")
#Temperature
plot(Arithmetic.Mean ~ Date.Local, data =TempLA, main = "Temperature at 1630 N. Main St., Los Angeles", xlab = "Date (1995)", ylab = "Mean Daily Temp (degrees F)")
#Relative Humidity
plot(Arithmetic.Mean ~ Date.Local, data = HumidLA, main = "Relative Humidity at 1630 N. Main St., Los Angeles", xlab = "Date (1995)", ylab = "Mean Daily Relative Humidity (%)")
#Wind Direction
plot(Arithmetic.Mean ~ Date.Local, data = WindDirLA, main = "Wind Direction at 1630 N. Main St., Los Angeles", xlab = "Date (1995)", ylab = "Most Common Wind Dir (degrees Compass)")
#Wind Speed
plot(Arithmetic.Mean ~ Date.Local, data = WindSpLA, main = "Wind Speed at 1630 N. Main St., Los Angeles", xlab = "Date (1995)", ylab = "Mean Daily Wind Speed (knots)")
#NOx
plot(Arithmetic.Mean ~ Date.Local, data = NOxLA, main = "NOx at 1630 N. Main St., Los Angeles", xlab = "Date (1995)", ylab = "Mean Daily NOx conc (ppb)")
#CO
plot(Arithmetic.Mean ~ Date.Local, data = COLA, main = "CO at 1630 N. Main St., Los Angeles", xlab = "Date (1995)", ylab = "Mean Daily CO conc (ppm)")
All of the variables, except wind speed, exhibit a difference in concentration or behavior between winter and summer. Ozone concentration and temperature increase during the warmer months. The wind direction shifts from southerly in the winter to westerly in July, August, and September. Relative humidity, NO\(_x\) concentrations and CO concentrations decrease during the summer months. It makes sense that NO\(_x\) and CO levels might vary inversely with O\(_3\) levels because they are some of the precusors necessary for ozone production and are consumed during the reactions. Wind speed doesn’t appear to have a trend during the year, other than reduced variability during the summer.
At this point, there is a data frame for the resultant variable: Ozone, and there are six separate data frames for the explanatory variables: Temperature, Humidity, Wind Direction, Wind Speed, NO\(_x\), and CO. The goal of this section is to combine the seven data frames into one data frame for efficient analysis.
Only two columns from the XXXLA data frames are needed in the analysis. These are column 12 - the measurement date and column 17 - the measurement value. These are extracted from each XXXLA file in preparation for creating a combined data frame. Column 17 is named Arithmetic.Mean in each of the XXXLA files. These columns are renamed with the approprate variable name.
#Ozone: Extract the date and daily mean ozone concentration
Ozone_cleaned <- OzoneLA[ ,c(12,17)]
names(Ozone_cleaned)[2] <- "Ozone"
#Convert the ozone units from parts per million to parts per billion because it makes the coefficients in the fits nicer.
Ozone_cleaned$Ozone <- Ozone_cleaned$Ozone *1000
#Repeat the same procedure with all the variables
Temp_cleaned <- TempLA[ ,c(12,17)]
names(Temp_cleaned)[2] <- "Temp"
Humid_cleaned <- HumidLA[ ,c(12,17)]
names(Humid_cleaned)[2] <- "RelHum"
WindDir_cleaned <- WindDirLA[ ,c(12,17)]
names(WindDir_cleaned)[2] <- "WindDir"
WindSp_cleaned <- WindSpLA[ ,c(12,17)]
names(WindSp_cleaned)[2] <- "WindSp"
NOx_cleaned <- NOxLA[ ,c(12,17)]
names(NOx_cleaned)[2] <- "NOx"
CO_cleaned <- COLA[ ,c(12,17)]
names(CO_cleaned)[2] <- "CO"
There is still a separate data frame for each variable, but the number of columns is two, reduced from 29.
Using the merge function with the Date.Local column as the common identifier, the individual variables are combined into one data frame named LA_1995. Dates that are missing from any of the individual data frames are dropped by merge.
#First, combine the ozone data and the temperature data.
LA_1995 <- merge(Ozone_cleaned, Temp_cleaned, by = "Date.Local")
#Add the humidity data
LA_1995 <- merge(LA_1995, Humid_cleaned, by = c("Date.Local"))
#Add Wind Direction
LA_1995 <- merge(LA_1995, WindDir_cleaned, by = c("Date.Local"))
#Add Wind Speed
LA_1995 <- merge(LA_1995, WindSp_cleaned, by = c("Date.Local"))
#Add NOx
LA_1995 <- merge(LA_1995, NOx_cleaned, by = c("Date.Local"))
#Add CO
LA_1995 <- merge(LA_1995, CO_cleaned, by = c("Date.Local"))
str(LA_1995)
## 'data.frame': 339 obs. of 8 variables:
## $ Date.Local: Date, format: "1995-01-03" "1995-01-04" ...
## $ Ozone : num 4.21 8.08 10.75 3.5 8.79 ...
## $ Temp : num 53.1 53.2 55.4 52.6 53.8 ...
## $ RelHum : num 67.4 75.8 61.9 61.4 72.7 ...
## $ WindDir : num 69.3 101.7 266.7 90.6 79.2 ...
## $ WindSp : num 5.67 8.93 5.67 4.08 7.02 ...
## $ NOx : num 91.5 63.4 48.5 131 56.2 ...
## $ CO : num 1.756 1.229 0.708 1.567 1.033 ...
Relative humidity may not be an appropriate variable for the fit because it is not independent of temperature. It would be interesting to test absolute humidity, which is related to relative humidity, as a different explanatory variable.
Relative humidity, also known as percent saturation, is the measured absolute humidity divided by the saturation absolute humidity (which is temperature dependent and a constant at a given temperature). \[RelHum = \frac{AbsHum}{AbsHum_{sat}(T)}\] Absolute humidity is found by multiplying relative humidity by the saturation humidity at the average daily temperature.
The formula for saturation absolute humidity in millibar is \[AbsHum_{sat}(T) = 0.06116 \cdot 10^{\left(\frac{7.591 T}{T + 240.7}\right)} \] where T is temperature in \(^\circ\)C.
#First convert temperature in Fahrenheit to Celsius
LA_1995$Celsius <- 5/9*(LA_1995$Temp-32)
#Now calculate the absolute humidity
LA_1995$AbsHum <- LA_1995$RelHum*0.06116*10^(7.591*LA_1995$Celsius/(LA_1995$Celsius + 240.7))
#Let's look at a plot of absolute humidity
plot(LA_1995$Date.Local,LA_1995$AbsHum, main = "Absolute Humidity", xlab = "1995", ylab = "Absolute Humidity (millibar)")
There is a seasonal cycle in absolute humidity with larger values in summer than in winter. Perhaps there will be a connection with ozone.
With the data in one data frame, it is easy to plot ozone against each explanatory variable.
#Temperature
plot(Ozone ~ Temp, data = LA_1995, main = "Ozone v. Temperature", xlab = "Mean Daily Temperature (degrees F)", ylab = "Mean Daily Ozone Conc (ppb)")
#Relative Humidity
plot(Ozone ~ RelHum, data = LA_1995, main = "Ozone v. Relative Humidity", xlab = "Mean Daily Relative Humidity (%)", ylab = "Mean Daily Ozone Conc (ppb)")
#Absolute Humidity
plot(Ozone ~ AbsHum, data = LA_1995, main = "Ozone v. Absolute Humidity", xlab = "Mean Daily Absolute Humidity (millibar)", ylab = "Mean Daily Ozone Conc (ppb)")
#Wind Direction
plot(Ozone ~ WindDir, data = LA_1995, main = "Ozone v. Wind Direction", xlab = "Most Common Wind Dir (degrees Compass)", ylab = "Mean Daily Ozone Conc (ppb)")
#Wind Speed
plot(Ozone ~ WindSp, data = LA_1995, main = "Ozone v. Wind Speed", xlab = "Mean Daily Wind Speed (knots)", ylab = "Mean Daily Ozone Conc (ppb)")
#NOx
plot(Ozone ~ NOx, data = LA_1995, main = "Ozone v. NOx", xlab = "Mean Daily NOx Conc (ppb)", ylab = "Mean Daily Ozone Conc (ppb)")
#CO
plot(Ozone ~ CO, data = LA_1995, main = "Ozone v. CO", xlab = "Mean Daily CO Conc (ppm)", ylab = "Mean Daily Ozone Conc (ppb)")
Ozone has a clear increasing relationship with temperature over the entire range of the explanatory variable. Ozone increases with absolute humidity, but only at values above 12 millibar. Ozone levels are low at high NO\(_x\) and CO concentrations. There appears to be no relationship between ozone and NO\(_x\) at NO\(_x\) concentrations below 150ppb and no relationship between ozone and CO at CO concentrations below 3ppm.
In this model, wind direction must be treated differently than the other variables. Ozone levels are not expected to increase or decrease linearly with wind direction. In the simplest case, we expect to see a maximum at a particular wind direction, with decreases as wind direction moves away from that compass point. The decrease is gradual, rather than abrupt, because of the diffusive nature of transport in the air.
Because the shape of the most basic relationship between ozone concentration and wind direction is expected to increase and decrease, a sinusoidal relationship of the form \(\beta_nsin\left(\frac{2\pi}{360}WindDir\right) + \beta_{n+1}cos\left(\frac{2\pi}{360}WindDir\right)\) will be used in the analysis.
#Let's see what a sample fit looks like
#Sort the data by Wind Direction for plotting
LA_1995Wind <- LA_1995[order(LA_1995$WindDir),]
mlrWind <- lm(Ozone ~ sin(2*pi/360*WindDir) + cos(2*pi/360*WindDir), data = LA_1995Wind)
summary(mlrWind)
##
## Call:
## lm(formula = Ozone ~ sin(2 * pi/360 * WindDir) + cos(2 * pi/360 *
## WindDir), data = LA_1995Wind)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.8986 -5.3977 -0.5215 5.5787 28.1224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.549 1.092 14.240 <2e-16 ***
## sin(2 * pi/360 * WindDir) -7.824 0.721 -10.853 <2e-16 ***
## cos(2 * pi/360 * WindDir) -3.553 1.409 -2.521 0.0122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.182 on 336 degrees of freedom
## Multiple R-squared: 0.3007, Adjusted R-squared: 0.2966
## F-statistic: 72.25 on 2 and 336 DF, p-value: < 2.2e-16
plot(LA_1995Wind$WindDir,LA_1995Wind$Ozone, main = "Ozone with best fit curve (mlrWind)", xlab = "WindDir", ylab = "Ozone conc (ppb)")
lines(LA_1995Wind$WindDir,fitted.values(mlrWind))
The sinusoidal fit returns the expected shape.
This analysis is broken into two parts. First, regression is performed with the meteorlogical variables, beginning with the full set, folowed by backward elimination to find the relative importance of each parameter. Relative humidity and absolute humidity are not used in the same fit.
After the best combination of explanatory meteorological variables is established, the two chemical species are added to discover their utility in developing a model.
Generate a best fit with the original meteorological variables (Temperature + Wind Speed + Relative Humidity + Wind Direction)
mlr1 <- lm(Ozone ~ Temp + WindSp + RelHum + sin(2*pi/360*WindDir) + cos(2*pi/360*WindDir), data = LA_1995)
summary(mlr1)
##
## Call:
## lm(formula = Ozone ~ Temp + WindSp + RelHum + sin(2 * pi/360 *
## WindDir) + cos(2 * pi/360 * WindDir), data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4007 -4.9154 -0.4591 4.2647 23.4400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.11683 6.90864 -3.780 0.000186 ***
## Temp 0.59630 0.07139 8.353 1.82e-15 ***
## WindSp 2.11702 0.39198 5.401 1.26e-07 ***
## RelHum -0.08093 0.04035 -2.006 0.045700 *
## sin(2 * pi/360 * WindDir) -5.73929 0.70452 -8.146 7.64e-15 ***
## cos(2 * pi/360 * WindDir) -4.68061 1.25440 -3.731 0.000224 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.844 on 333 degrees of freedom
## Multiple R-squared: 0.515, Adjusted R-squared: 0.5077
## F-statistic: 70.72 on 5 and 333 DF, p-value: < 2.2e-16
The adjusted R-squared is 0.508. According to the Pr column, relative humidity is the least important variable in the fit. The coefficients for temperature and wind speed are positive, indicating a positive relationship between ozone and these variables.
Replace relative humidity with absolute humidity.
mlr2 <- lm(Ozone ~ Temp + WindSp + AbsHum + sin(2*pi/360*WindDir) + cos(2*pi/360*WindDir), data = LA_1995)
summary(mlr2)
##
## Call:
## lm(formula = Ozone ~ Temp + WindSp + AbsHum + sin(2 * pi/360 *
## WindDir) + cos(2 * pi/360 * WindDir), data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9146 -4.7662 -0.4794 4.4282 23.2290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.11612 4.50548 -8.016 1.87e-14 ***
## Temp 0.71805 0.07136 10.062 < 2e-16 ***
## WindSp 2.23024 0.39264 5.680 2.94e-08 ***
## AbsHum -0.24090 0.20004 -1.204 0.22935
## sin(2 * pi/360 * WindDir) -5.58860 0.73543 -7.599 3.05e-13 ***
## cos(2 * pi/360 * WindDir) -4.68025 1.26053 -3.713 0.00024 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.871 on 333 degrees of freedom
## Multiple R-squared: 0.5113, Adjusted R-squared: 0.5039
## F-statistic: 69.68 on 5 and 333 DF, p-value: < 2.2e-16
The adjusted R-squared decreased slightly to 0.504. Again the humidity variable is least significant.
Remove the humidity data
mlr3 <- lm(Ozone ~ Temp + WindSp + sin(2*pi/360*WindDir) + cos(2*pi/360*WindDir), data = LA_1995)
summary(mlr3)
##
## Call:
## lm(formula = Ozone ~ Temp + WindSp + sin(2 * pi/360 * WindDir) +
## cos(2 * pi/360 * WindDir), data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2247 -4.7943 -0.5557 4.5619 22.4490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.69451 4.48283 -8.186 5.78e-15 ***
## Temp 0.67269 0.06065 11.091 < 2e-16 ***
## WindSp 2.39466 0.36838 6.500 2.92e-10 ***
## sin(2 * pi/360 * WindDir) -5.16642 0.64693 -7.986 2.27e-14 ***
## cos(2 * pi/360 * WindDir) -4.76688 1.25933 -3.785 0.000182 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.875 on 334 degrees of freedom
## Multiple R-squared: 0.5092, Adjusted R-squared: 0.5033
## F-statistic: 86.62 on 4 and 334 DF, p-value: < 2.2e-16
The adjusted R-squared value didn’t changefrom the mlr2 (absolute humidity) and decreased very little from mlr1 (relative humidity). Relative humidity and absolute humidity aren’t important in the fit.
Now, cycle through the remaining three variables in combinations of two. How good is the fit based on only temperature and wind speed?
mlr4 <- lm(Ozone ~ Temp + WindSp, data = LA_1995)
summary(mlr4)
##
## Call:
## lm(formula = Ozone ~ Temp + WindSp, data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.091 -5.649 -1.028 5.556 24.637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46.28604 4.59085 -10.082 < 2e-16 ***
## Temp 0.88024 0.06382 13.794 < 2e-16 ***
## WindSp 2.17558 0.39313 5.534 6.31e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.769 on 336 degrees of freedom
## Multiple R-squared: 0.3695, Adjusted R-squared: 0.3657
## F-statistic: 98.45 on 2 and 336 DF, p-value: < 2.2e-16
The adjusted R-squared drops to 0.366.
Replace wind speed with wind direction.
mlr5 <- lm(Ozone ~ Temp + sin(2 * pi/360 * WindDir) +
cos(2 * pi/360 * WindDir), data = LA_1995)
summary(mlr5)
##
## Call:
## lm(formula = Ozone ~ Temp + sin(2 * pi/360 * WindDir) + cos(2 *
## pi/360 * WindDir), data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1131 -5.0243 -0.6142 4.9230 23.3451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.65696 4.06941 -5.322 1.88e-07 ***
## Temp 0.59257 0.06294 9.416 < 2e-16 ***
## sin(2 * pi/360 * WindDir) -5.73561 0.67930 -8.443 9.46e-16 ***
## cos(2 * pi/360 * WindDir) -2.13557 1.26379 -1.690 0.092 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.286 on 335 degrees of freedom
## Multiple R-squared: 0.4471, Adjusted R-squared: 0.4421
## F-statistic: 90.28 on 3 and 335 DF, p-value: < 2.2e-16
This is a little better with an adjusted R-squared of 0.442, but isn’t as good as the fit with all three variables.
What about wind speed + wind direction?
mlr6 <- lm(Ozone ~ WindSp + sin(2 * pi/360 * WindDir) +
cos(2 * pi/360 * WindDir), data = LA_1995)
summary(mlr6)
##
## Call:
## lm(formula = Ozone ~ WindSp + sin(2 * pi/360 * WindDir) + cos(2 *
## pi/360 * WindDir), data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.900 -4.989 -0.792 5.379 27.797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0114 2.0611 4.372 1.64e-05 ***
## WindSp 1.5644 0.4213 3.713 0.000240 ***
## sin(2 * pi/360 * WindDir) -7.6369 0.7094 -10.765 < 2e-16 ***
## cos(2 * pi/360 * WindDir) -5.3966 1.4694 -3.673 0.000279 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.03 on 335 degrees of freedom
## Multiple R-squared: 0.3284, Adjusted R-squared: 0.3224
## F-statistic: 54.6 on 3 and 335 DF, p-value: < 2.2e-16
Again, this adjusted R-squared (0.322) isn’t as good as the adjusted R-squared (0.503) from the fit with the three variables: temperature, wind direction, and wind speed. The best model uses all three.
Add the two chemical measurements to the three meteorological variables. (Temperature + Wind Speed + Wind Direction + NO\(_x\) + CO)
mlr7 <- lm(Ozone ~ Temp + WindSp + sin(2 * pi/360 * WindDir) +
cos(2 * pi/360 * WindDir) + NOx + CO, data = LA_1995)
summary(mlr7)
##
## Call:
## lm(formula = Ozone ~ Temp + WindSp + sin(2 * pi/360 * WindDir) +
## cos(2 * pi/360 * WindDir) + NOx + CO, data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8698 -4.5242 0.0866 4.2547 20.2761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.06444 4.35094 -8.519 5.71e-16 ***
## Temp 0.80903 0.06132 13.195 < 2e-16 ***
## WindSp 1.64009 0.37900 4.327 2.00e-05 ***
## sin(2 * pi/360 * WindDir) -2.27737 0.76100 -2.993 0.00297 **
## cos(2 * pi/360 * WindDir) -2.38972 1.24745 -1.916 0.05626 .
## NOx -0.05818 0.01372 -4.239 2.91e-05 ***
## CO 0.75817 0.79209 0.957 0.33918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.504 on 332 degrees of freedom
## Multiple R-squared: 0.5634, Adjusted R-squared: 0.5555
## F-statistic: 71.4 on 6 and 332 DF, p-value: < 2.2e-16
Adding NO\(_x\) and CO improved the adjusted R-squared to the highest value so far, 0.556.
Are both species required? First drop NO\(_x\).
mlr8 <- lm(Ozone ~ Temp + WindSp + sin(2 * pi/360 * WindDir) +
cos(2 * pi/360 * WindDir) + CO, data = LA_1995)
summary(mlr8)
##
## Call:
## lm(formula = Ozone ~ Temp + WindSp + sin(2 * pi/360 * WindDir) +
## cos(2 * pi/360 * WindDir) + CO, data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.252 -4.826 -0.136 4.229 21.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.80801 4.39034 -7.701 1.56e-13 ***
## Temp 0.73643 0.06036 12.201 < 2e-16 ***
## WindSp 1.67690 0.38843 4.317 2.09e-05 ***
## sin(2 * pi/360 * WindDir) -3.38466 0.73274 -4.619 5.51e-06 ***
## cos(2 * pi/360 * WindDir) -3.43122 1.25379 -2.737 0.00654 **
## CO -2.06634 0.43904 -4.706 3.70e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.667 on 333 degrees of freedom
## Multiple R-squared: 0.5398, Adjusted R-squared: 0.5329
## F-statistic: 78.11 on 5 and 333 DF, p-value: < 2.2e-16
The adjusted R-squared decreased to 0.533, indicating that NO\(_x\) might be important.
Replace CO with NO\(_x\) to confirm that NO\(_x\) is important.
mlr9 <- lm(Ozone ~ Temp + WindSp + sin(2 * pi/360 * WindDir) +
cos(2 * pi/360 * WindDir) + NOx, data = LA_1995)
summary(mlr9)
##
## Call:
## lm(formula = Ozone ~ Temp + WindSp + sin(2 * pi/360 * WindDir) +
## cos(2 * pi/360 * WindDir) + NOx, data = LA_1995)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5880 -4.6847 0.0235 4.3122 20.2953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.13626 4.24097 -8.521 5.58e-16 ***
## Temp 0.80208 0.06088 13.176 < 2e-16 ***
## WindSp 1.57010 0.37183 4.223 3.12e-05 ***
## sin(2 * pi/360 * WindDir) -2.29659 0.76063 -3.019 0.00273 **
## cos(2 * pi/360 * WindDir) -2.44430 1.24599 -1.962 0.05063 .
## NOx -0.04713 0.00742 -6.352 6.99e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.503 on 333 degrees of freedom
## Multiple R-squared: 0.5622, Adjusted R-squared: 0.5556
## F-statistic: 85.52 on 5 and 333 DF, p-value: < 2.2e-16
Without CO, the adjusted R-squared returned to 0.556. CO is unimportant in the fit. The negative coefficient associated with NO\(_x\) indicates a negative relationship between ozone and NO\(_x\). The best fit is found with the explanatory variables of temperature, wind speed, wind direction, and NO\(_x\) concentration.
What does a general correlation matrix look like?
#Create a data frame that contains only the explanatory variables, i.e. excludes ozone, date and degrees Celsius
LA_1995_stripped <- LA_1995[,c(-1,-2,-9)]
round(cor(LA_1995_stripped), 2)
## Temp RelHum WindDir WindSp NOx CO AbsHum
## Temp 1.00 -0.36 0.36 -0.22 0.09 0.05 0.63
## RelHum -0.36 1.00 0.16 -0.20 -0.35 -0.15 0.48
## WindDir 0.36 0.16 1.00 -0.01 -0.53 -0.47 0.52
## WindSp -0.22 -0.20 -0.01 1.00 -0.27 -0.33 -0.35
## NOx 0.09 -0.35 -0.53 -0.27 1.00 0.90 -0.23
## CO 0.05 -0.15 -0.47 -0.33 0.90 1.00 -0.11
## AbsHum 0.63 0.48 0.52 -0.35 -0.23 -0.11 1.00
The two variables eliminated, absolute humidity and CO, have relatively strong correlations with other variables and may not be truly independent. The correlation between temperature and absolute humidity is 0.63. The correlation between NO\(_x\) and CO is 0.90. Relative humidity doesn’t have a strong correlation with any other variables and yet, it wasn’t important in the overall fit to the data.
A plot of ozone levels with the fit generated in mlr9 is below.
plot(LA_1995$Date.Local,LA_1995$Ozone, main = "Ozone with best fit curve (mlr9)", xlab = "1995", ylab = "Ozone conc (ppb)")
lines(LA_1995$Date.Local,fitted.values(mlr9))
The general features of ozone levels are reproduced by the model. The predicted ozone levels increase during the summer months with a maximum in August. The timing of the minimum is harder to distinguish because the fit is noisier in the winter months.
Based on 1995 meteorological data from 1630 N. Main St., Los Angeles, a predictive model was developed based on multiple regression. In the model, the combination of temperature, wind speed, and wind direction account for approximately 50% of the variability in average daily ground level ozone concentrations at the observation site. When measurements of NO\(_x\) concentrations are added to the regression, approximately 56% of the variablility in ozone concentration is explained. Humidity levels (relative and absolute) and CO concentrations did not appear to be meaningful explanatory variables.
Considering the complexity of atmospheric chemistry and physics, it is surprising that half of the variation in ozone concentrations can be predicted by daily averages of a few variables. It would be interesting to investigate similar models at other urban centers to find out if the results can be applied universally.
This is a very simple model of ozone chemistry in Los Angeles – a good start to understanding conditions that affect ozone pollution. To expand the model, several of its limitations must be addressed.
Sunlight is one of the key ingredients in ozone production and it was not included in this model. Two new variables would be helpful: number of daylight hours and cloud cover conditions, neither of which are recorded in the EPA data base, but are likely to be available in other public data sets.
There is a daily cycle to ozone production that is not accounted for in this model. NO\(_x\) and VOC’s build up during the night and the morning commute, when ozone concentrations are low. During the middle of the day and the afternoon, as the sunlight intensifies and the temperature increases, ozone concentrations increase. Incorporation of variations throughout the day may be important to understanding the true behavior of ozone.
The assumption of linear relationships between the explanatory variables and the resultant variable may not be the most appropriate. As seen with wind direction, there may be physical reasons to assume different forms for the best fit.
Extrapolation of the results of this study to other locations and time periods is inappropriate. Ozone contamination is a regional phenomenon, often with production happening in one place and ozone transported to another. Regional studies that improve on these results are needed in order to form recommendations for the general public. In addition, predictions based on conditions from 1995 may no longer be appropriate. Since recognizing the connection between the chemical precusors and ozone formation, governments have formulated policies to reduce NO\(_x\) and VOC emissions. The relationships between explanatory and resultant variables found here may not apply at these lower levels.
All and all, there is still a lot to do.
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Melissa Hendricks
Semester: Spring 2018