# load data
library(tidyverse)

data("nasa")

Part 1 - Introduction

For this project I chose to look at NASA spatio-temporal data. The data are geographic and atmospheric measurements of temperature (surface and air), ozone, air pressure, cloud cover and etc. across the Central American region. Climate change being a growing global concern and a hot topic in recent times, here I thought to use an opportunity to investigate what may affect or explain levels of ozone, which is essential for healthy and safe living environment.

Part 2 - Data

These data were obtained from the NASA Langley Research Center Atmospheric Sciences Data Center. It can also be conveniently loaded in R-Studio as part of the dplyr package. There are 41472 total observations, where variables, such as temperature and ozone, are monthly averages, with observations spanning 6 year period from Jan 1995 to Dec 2000. The Central American region was divided into very coarse 24 by 24 grid and so the observations were made over each grid region. Therefore, the data is represented in a tbl_cube structure [24x24x12x6] in order to record observation for each grid region, for each month, and over the span of 6 years.

The temperature variable is measurement of air temperature and is numerical in Kelvin units. For this study I chose to analyze this variable in relation to the ozone level. There is also surftemp variable for the surface temperature measures. However this variable will not be considered here, since it is likely to be hightly correlated with the air temperature and is therefore not likely to prove significant in the outcome of the study.

Here is a small snapshot of the dataset:

head(as.data.frame(nasa))
##        lat   long month year cloudhigh cloudlow cloudmid ozone pressure
## 1 36.20000 -113.8     1 1995      26.0      7.5     34.5   304      835
## 2 33.70435 -113.8     1 1995      20.0     11.5     32.5   304      940
## 3 31.20870 -113.8     1 1995      16.0     16.5     26.0   298      960
## 4 28.71304 -113.8     1 1995      13.0     20.5     14.5   276      990
## 5 26.21739 -113.8     1 1995       7.5     26.0     10.5   274     1000
## 6 23.72174 -113.8     1 1995       8.0     30.0      9.5   264     1000
##   surftemp temperature
## 1    272.7       272.1
## 2    279.5       282.2
## 3    284.7       285.2
## 4    289.3       290.7
## 5    292.2       292.7
## 6    294.1       293.6

The measurement of ozone appear to be in Doubson Units (DU). Information about ozone and its measurements can be found on the NASA Ozone Watch site. Ozone levels of less than 220 Dobson Units are considered to be abnormal and categorized to be at “ozone hole” levels.

Part 3 - Exploratory data analysis

The tables below show summary of statistics for the temperature and ozone measurements.

Summary table for temperature

summary(nasa[["mets"]][["temperature"]])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   269.1   295.5   299.2   297.9   301.4   310.0

Summary table for ozone

summary(nasa[["mets"]][["ozone"]])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   232.0   254.0   264.0   267.2   276.0   390.0

The below two histograms show distribution of temperature and ozone measurements. Both appear to have quite evident outliers and skew compared to the overlaid blue normal curve. However, given that the sampling size is quite large, these deviations from the normal distribution may be acceptable. Interestingly, the two graphs appear to be near mirror images of each other, which intuitively may [but not necessarily correctly] suggest that there could be a strong association between the two variables, with possibly an inverse linear relationship.

hist(nasa[["mets"]][["temperature"]], main = "Distribution of Temperature", xlab = "Temperature (Kelvin)", probability = TRUE)
t_sd <- sd(nasa[["mets"]][["temperature"]])
t_mean <- mean(nasa[["mets"]][["temperature"]])
t_x <- 270:340
t_y <- dnorm(x = t_x, mean = t_mean, sd = t_sd)
lines(x = t_x, y = t_y, col = "blue")

hist(nasa[["mets"]][["ozone"]], main = "Distribution of Ozone Levels", xlab = "Ozone (DU)", probability = TRUE)
o_sd <- sd(nasa[["mets"]][["ozone"]])
o_mean <- mean(nasa[["mets"]][["ozone"]])
o_x <- 230:310
o_y <- dnorm(x = o_x, mean = o_mean, sd = o_sd)
lines(x = o_x, y = o_y, col = "blue")

Part 4 - Inference

A simple linear regression model can be used, in order to show if there is a statistically significant association between these two variables.

Summary of the Linear Model

m <- lm(nasa[["mets"]][["ozone"]] ~ nasa[["mets"]][["temperature"]])
summary(m)
## 
## Call:
## lm(formula = nasa[["mets"]][["ozone"]] ~ nasa[["mets"]][["temperature"]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.091 -12.343  -3.839   8.717 101.651 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     681.51073    5.50697  123.75   <2e-16 ***
## nasa[["mets"]][["temperature"]]  -1.39074    0.01848  -75.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.79 on 41470 degrees of freedom
## Multiple R-squared:  0.1201, Adjusted R-squared:  0.1201 
## F-statistic:  5662 on 1 and 41470 DF,  p-value: < 2.2e-16

The below plot shows levels of ozone plotted against temperature measurements with a negative slope line, fitted by the model. The equation for the line is shown below the plot.

plot(nasa[["mets"]][["ozone"]] ~ nasa[["mets"]][["temperature"]], xlab = "Temperature", ylab = "Ozone")
abline(m)

\[\widehat{ozone} = 681.51 -1.39 \times temperature\] Based on this formula, we can say that for every temperature increase by one degree Kelvin, the level of ozone will drop by \(-1.39\) DU. The model also indicates that the confidence level for the strength of the association is well beyond 95% (given that p-value is much less than the \(\alpha\) of 0.05). Therefore, we can confidently state that the linear relationship does indeed exist, according to the model.

To trust this model, however, it is necessary to perfom model diagnostics in order to check for required conditions, which are outlined below:

1. Linearity and Constant Variability: The plot of residuals against the temperature does show a nearly linear relationship and for the most part constant variability around the dotted line at zero.

d <- as.data.frame(nasa)
plot(m$residuals ~ d$temperature)
abline(h = 0, lty = 3)

2. Nearly Normal Residuals: To check this condition, we can look at a histogram of residuals and a normal probability plot.

hist(m$residuals)

qqnorm(m$residuals)
qqline(m$residuals) 

The two plots show some skew and deviation from the normal distribution, but given that the sample size if very large, the degree of deviation may be acceptable.

Part 5 - Conclusion

Another important indicator to conside from the model is R-squared - R\(^2\). This tells us that only 12% variability in temperature accounts for variability in ozone observations. Even though there is statistically strong evidence that a linear relation exists, the R\(^2\) level is relatively low in order to say that temperature plays a significant enough role in determining ozone level. Perhaps the association simply reinforces a physical phenomenon, that temperature effects gas density and thus the amount of ozone measured because of that. Studies show that other factors, such as concentration of certain chemicals in the air may have a much stronger effect on the ozone. A better future research idea may perhaps be running a multi-variable regression model, considering measurements of numerous chemicals present in the air.

References

This data comes from the ASA 2007 data expo, http://stat-computing.org/dataexpo/2006/ and from dplyr package.