# load data
library(tidyverse)
data("nasa")
For this project I chose to look at NASA spatio-temporal data. The data are geographic and atmospheric measurements of temperature (surface and air), ozone, air pressure, cloud cover and etc. across the Central American region. Climate change being a growing global concern and a hot topic in recent times, here I thought to use an opportunity to investigate what may affect or explain levels of ozone, which is essential for healthy and safe living environment.
These data were obtained from the NASA Langley Research Center Atmospheric Sciences Data Center. It can also be conveniently loaded in R-Studio as part of the dplyr package. There are 41472 total observations, where variables, such as temperature and ozone, are monthly averages, with observations spanning 6 year period from Jan 1995 to Dec 2000. The Central American region was divided into very coarse 24 by 24 grid and so the observations were made over each grid region. Therefore, the data is represented in a tbl_cube structure [24x24x12x6] in order to record observation for each grid region, for each month, and over the span of 6 years.
The temperature variable is measurement of air temperature and is numerical in Kelvin units. For this study I chose to analyze this variable in relation to the ozone level. There is also surftemp variable for the surface temperature measures. However this variable will not be considered here, since it is likely to be hightly correlated with the air temperature and is therefore not likely to prove significant in the outcome of the study.
Here is a small snapshot of the dataset:
head(as.data.frame(nasa))
## lat long month year cloudhigh cloudlow cloudmid ozone pressure
## 1 36.20000 -113.8 1 1995 26.0 7.5 34.5 304 835
## 2 33.70435 -113.8 1 1995 20.0 11.5 32.5 304 940
## 3 31.20870 -113.8 1 1995 16.0 16.5 26.0 298 960
## 4 28.71304 -113.8 1 1995 13.0 20.5 14.5 276 990
## 5 26.21739 -113.8 1 1995 7.5 26.0 10.5 274 1000
## 6 23.72174 -113.8 1 1995 8.0 30.0 9.5 264 1000
## surftemp temperature
## 1 272.7 272.1
## 2 279.5 282.2
## 3 284.7 285.2
## 4 289.3 290.7
## 5 292.2 292.7
## 6 294.1 293.6
The measurement of ozone appear to be in Doubson Units (DU). Information about ozone and its measurements can be found on the NASA Ozone Watch site. Ozone levels of less than 220 Dobson Units are considered to be abnormal and categorized to be at “ozone hole” levels.
The tables below show summary of statistics for the temperature and ozone measurements.
Summary table for temperature
summary(nasa[["mets"]][["temperature"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 269.1 295.5 299.2 297.9 301.4 310.0
Summary table for ozone
summary(nasa[["mets"]][["ozone"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 232.0 254.0 264.0 267.2 276.0 390.0
The below two histograms show distribution of temperature and ozone measurements. Both appear to have quite evident outliers and skew compared to the overlaid blue normal curve. However, given that the sampling size is quite large, these deviations from the normal distribution may be acceptable. Interestingly, the two graphs appear to be near mirror images of each other, which intuitively may [but not necessarily correctly] suggest that there could be a strong association between the two variables, with possibly an inverse linear relationship.
hist(nasa[["mets"]][["temperature"]], main = "Distribution of Temperature", xlab = "Temperature (Kelvin)", probability = TRUE)
t_sd <- sd(nasa[["mets"]][["temperature"]])
t_mean <- mean(nasa[["mets"]][["temperature"]])
t_x <- 270:340
t_y <- dnorm(x = t_x, mean = t_mean, sd = t_sd)
lines(x = t_x, y = t_y, col = "blue")
hist(nasa[["mets"]][["ozone"]], main = "Distribution of Ozone Levels", xlab = "Ozone (DU)", probability = TRUE)
o_sd <- sd(nasa[["mets"]][["ozone"]])
o_mean <- mean(nasa[["mets"]][["ozone"]])
o_x <- 230:310
o_y <- dnorm(x = o_x, mean = o_mean, sd = o_sd)
lines(x = o_x, y = o_y, col = "blue")
A simple linear regression model can be used, in order to show if there is a statistically significant association between these two variables.
Summary of the Linear Model
m <- lm(nasa[["mets"]][["ozone"]] ~ nasa[["mets"]][["temperature"]])
summary(m)
##
## Call:
## lm(formula = nasa[["mets"]][["ozone"]] ~ nasa[["mets"]][["temperature"]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.091 -12.343 -3.839 8.717 101.651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 681.51073 5.50697 123.75 <2e-16 ***
## nasa[["mets"]][["temperature"]] -1.39074 0.01848 -75.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.79 on 41470 degrees of freedom
## Multiple R-squared: 0.1201, Adjusted R-squared: 0.1201
## F-statistic: 5662 on 1 and 41470 DF, p-value: < 2.2e-16
The below plot shows levels of ozone plotted against temperature measurements with a negative slope line, fitted by the model. The equation for the line is shown below the plot.
plot(nasa[["mets"]][["ozone"]] ~ nasa[["mets"]][["temperature"]], xlab = "Temperature", ylab = "Ozone")
abline(m)
\[\widehat{ozone} = 681.51 -1.39 \times temperature\] Based on this formula, we can say that for every temperature increase by one degree Kelvin, the level of ozone will drop by \(-1.39\) DU. The model also indicates that the confidence level for the strength of the association is well beyond 95% (given that p-value is much less than the \(\alpha\) of 0.05). Therefore, we can confidently state that the linear relationship does indeed exist, according to the model.
To trust this model, however, it is necessary to perfom model diagnostics in order to check for required conditions, which are outlined below:
1. Linearity and Constant Variability: The plot of residuals against the temperature does show a nearly linear relationship and for the most part constant variability around the dotted line at zero.
d <- as.data.frame(nasa)
plot(m$residuals ~ d$temperature)
abline(h = 0, lty = 3)
2. Nearly Normal Residuals: To check this condition, we can look at a histogram of residuals and a normal probability plot.
hist(m$residuals)
qqnorm(m$residuals)
qqline(m$residuals)
The two plots show some skew and deviation from the normal distribution, but given that the sample size if very large, the degree of deviation may be acceptable.
Another important indicator to conside from the model is R-squared - R\(^2\). This tells us that only 12% variability in temperature accounts for variability in ozone observations. Even though there is statistically strong evidence that a linear relation exists, the R\(^2\) level is relatively low in order to say that temperature plays a significant enough role in determining ozone level. Perhaps the association simply reinforces a physical phenomenon, that temperature effects gas density and thus the amount of ozone measured because of that. Studies show that other factors, such as concentration of certain chemicals in the air may have a much stronger effect on the ozone. A better future research idea may perhaps be running a multi-variable regression model, considering measurements of numerous chemicals present in the air.
This data comes from the ASA 2007 data expo, http://stat-computing.org/dataexpo/2006/ and from dplyr package.