Motivation

Ground-level ozone, the main component of smog, is a major air pollutant formed by chemical reactions between oxides of nitrogen (NOx) and volatile organic compounds (VOCs) in the presence of sunlight. Major emission sources of NOx and VOCs include industrial facilities, motor vehicle exhaust, electric utilities, chemical solvents, and gasoline vapors. Exposure to ozone can trigger a range of health problems, has been linked to premature death, and is detrimental to vegetation and ecosystems.

For this project, I wish to conduct a geostatistical analysis of ozone data in California for the year 2015.

The Data

The Environmental Protection Agency (EPA) provides ozone data from monitoring locations across the USA. For each monitor, an 8-hour ozone average is calculated for every clock hour. Only complete averages (i.e., those with 6 or more valid hourly samples in the 8-hour block, or 75% completeness) are included.
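To make the completeness rule concrete, here is a minimal sketch of how such averages could be computed from a series of hourly readings. It assumes each 8-hour block starts at the clock hour it is assigned to and that missing readings are represented as None; the function name and parameters are illustrative, not part of the EPA files, and the EPA's actual procedure may differ in details such as rounding.

```python
from typing import List, Optional, Sequence


def eight_hour_averages(
    hourly: Sequence[Optional[float]],
    window: int = 8,
    min_valid: int = 6,
) -> List[Optional[float]]:
    """Return one 8-hour average per starting hour.

    hourly[i] is the reading for hour i, or None if it is missing or invalid.
    A block yields None when it has fewer than `min_valid` valid readings
    (6 of 8 = 75% completeness) or when it runs past the end of the series.
    Assumption: blocks look forward from their starting hour.
    """
    averages: List[Optional[float]] = []
    for start in range(len(hourly)):
        block = hourly[start:start + window]
        valid = [v for v in block if v is not None]
        if len(block) < window or len(valid) < min_valid:
            averages.append(None)  # incomplete block: excluded
        else:
            averages.append(sum(valid) / len(valid))
    return averages
```

A block with two missing hours still produces an average of its six valid readings, while a block with three missing hours is dropped.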

Dataset Name: 8hour_44201_2015.csv

Download Link: http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/8hour_44201_2015.zip

Variables: For variable names and descriptions please visit the EPA AirData Download Files Documentation (Sections 7.1 and 7.2) at http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/FileFormats.html#_8_hour_average_files

Size: After reading in the 2015 ozone data and subsetting it to observations in California, there are 759,105 observations and 23 variables. I originally thought there were 70 monitoring locations, but this turned out not to be the case: in numerous records the monitor ID and the date are the same, yet the latitude and longitude coordinates differ. One of the first steps of the project will be figuring out why.
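The coordinate mismatch can be investigated with a short script like the sketch below. The column names ("State Name", "State Code", "County Code", "Site Num", "Date Local", "Latitude", "Longitude") are assumptions about the CSV header and should be checked against the EPA documentation; the helper names are hypothetical. One possibility worth testing, though not confirmed here, is that the monitor ID was built from the site number alone, since AQS site numbers are only unique within a state and county.

```python
import csv
from collections import defaultdict

# Assumed header names; verify against the EPA file-format documentation.
ID_FIELDS = ("State Code", "County Code", "Site Num")


def load_state_rows(path, state="California", state_field="State Name"):
    """Read the CSV and keep only the rows for one state."""
    with open(path, newline="") as f:
        return [row for row in csv.DictReader(f) if row[state_field] == state]


def find_coordinate_conflicts(rows, id_fields=ID_FIELDS, date_field="Date Local",
                              lat_field="Latitude", lon_field="Longitude"):
    """Return {(monitor_id, date): [(lat, lon), ...]} for every monitor/date
    pair that appears with more than one distinct coordinate pair."""
    coords = defaultdict(set)
    for row in rows:
        monitor_id = tuple(row[f] for f in id_fields)
        point = (float(row[lat_field]), float(row[lon_field]))
        coords[(monitor_id, row[date_field])].add(point)
    return {key: sorted(points) for key, points in coords.items() if len(points) > 1}
```

Running find_coordinate_conflicts once with id_fields=("Site Num",) and once with the full three-part ID would show whether the conflicts disappear when state and county codes are included.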

Additional Data Sources: Ideally, I would also like to integrate population density data, land cover data, and elevation data to enhance my analysis. I haven’t located these data sources yet, and thus data integration will be a large component of the project as well.

Plots

Plot of Ozone Values at Monitor Locations

To get an idea of the distribution of ozone values across the various monitoring locations in California, here I plot the mean of the 8-hour average ozone values on July 13, 2015.
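The per-location daily mean behind such a plot could be computed along the following lines. This is a sketch under assumptions: the rows are dictionaries as produced by csv.DictReader, the 8-hour average is stored in a column named "Mean Including All Data", and dates are formatted like "2015-07-13". Locations are keyed by their coordinates rather than by monitor ID, which sidesteps the ID ambiguity described above.

```python
from collections import defaultdict


def daily_site_means(rows, date, value_field="Mean Including All Data",
                     date_field="Date Local", lat_field="Latitude",
                     lon_field="Longitude"):
    """Return {(lat, lon): mean of the 8-hour averages} for a single date.

    Column names are assumptions about the CSV header.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[date_field] != date:
            continue  # keep only the requested day
        site = (float(row[lat_field]), float(row[lon_field]))
        totals[site] += float(row[value_field])
        counts[site] += 1
    return {site: totals[site] / counts[site] for site in totals}
```

The resulting dictionary gives one value per location, which can be passed directly to a mapping or scatter-plot routine.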

Empirical Variogram for July 13, 2015

These plots suggest that narrowing the scope of the analysis to either Northern or Southern California may be appropriate, or that I may want to analyze each separately.
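For reference, an empirical variogram of the kind plotted above can be sketched with the classical (Matheron) estimator, which for each distance bin averages half the squared difference between all pairs of values whose separation falls in that bin. This is an illustration and not necessarily the routine used to produce the plot; the function names, the great-circle distance, and the binning scheme are all assumptions.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def empirical_variogram(points, values, bin_width_km, n_bins):
    """Classical estimator: gamma(h) = sum((z_i - z_j)^2) / (2 * N(h)).

    Returns a list of (bin midpoint in km, semivariance or None, pair count).
    Pairs farther apart than bin_width_km * n_bins are ignored.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (p, zp), (q, zq) in combinations(zip(points, values), 2):
        b = int(haversine_km(p, q) // bin_width_km)
        if b < n_bins:
            sums[b] += (zp - zq) ** 2
            counts[b] += 1
    return [((b + 0.5) * bin_width_km,
             sums[b] / (2 * counts[b]) if counts[b] else None,
             counts[b])
            for b in range(n_bins)]
```

Computing the variogram separately for Northern and Southern California locations would be one way to check whether the two regions show different spatial structure.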

Research Questions

Methods

The number of methods to be used within data pre-processing, exploratory spatial data analysis, modeling/prediction, etc. is very large. Some main examples include:

Discussion/Limitations

The best way to analyze ozone data is with spatio-temporal statistics, because ozone concentrations vary considerably from season to season and from year to year.

Since I am not familiar with spatio-temporal theory, I plan to analyze each day separately. I acknowledge that this is not the ideal analysis, but it should still yield valuable information and provide plenty of work for the project.