The atmos data set resides in the nasaweather package of the R programming language. It contains a collection of atmospheric variables measured between 1995 and 2000 on a grid of 576 coordinates in the western hemisphere. The data set comes from the 2006 ASA Data Expo.
Some of the variables in the atmos data set are:
temp - The mean monthly air temperature near the surface of the Earth (measured in degrees kelvin (K))
pressure - The mean monthly air pressure at the surface of the Earth (measured in millibars (mb))
ozone - The mean monthly abundance of atmospheric ozone (measured in Dobson units (DU))
You can convert the temperature unit from Kelvin to Celsius with the formula
\[ celsius = kelvin - 273.15 \]
And you can convert the result to Fahrenheit with the formula
\[ fahrenheit = celsius \times \frac{5}{9} + 32 \]
For the remainder of the report, we will look only at data from the year 1995. We aggregate our data by location, using the R code below.
load(url("http://assets.datacamp.com/course/rmarkdown/atmos.RData")) # working with a subset
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
library(ggvis)
## Warning: package 'ggvis' was built under R version 3.2.5
year <- 1995
means <- atmos %>%
filter(year == year) %>%
group_by(long, lat) %>%
summarize(temp = mean(temp, na.rm = TRUE),
pressure = mean(pressure, na.rm = TRUE),
ozone = mean(ozone, na.rm = TRUE),
cloudlow = mean(cloudlow, na.rm = TRUE),
cloudmid = mean(cloudmid, na.rm = TRUE),
cloudhigh = mean(cloudhigh, na.rm = TRUE)) %>%
ungroup()
Is the relationship between ozone and temperature useful for understanding fluctuations in ozone? A scatterplot of the variables shows a strong, but unusual relationship.
means %>%
ggvis(~temp, ~ozone) %>%
layer_points()
We suspect that group level effects are caused by environmental conditions that vary by locale. To test this idea, we sort each data point into one of four geographic regions:
means$locale <- "north america"
means$locale[means$lat < 10] <- "south pacific"
means$locale[means$long > -80 & means$lat < 10] <- "south america"
means$locale[means$long > -80 & means$lat > 10] <- "north atlantic"
We suggest that ozone is highly correlated with temperature, but that a different relationship exists for each geographic region. We capture this relationship with a second order linear model of the form
\[ ozone = \alpha + \beta_{1} temperature + \sum_{locales} \beta_{i} locale_{i} + \sum_{locales} \beta_{j} interaction_{j} + \epsilon\]
This yields the following coefficients and model lines.
lm(ozone ~ temp + locale + temp:locale, data = means)
##
## Call:
## lm(formula = ozone ~ temp + locale + temp:locale, data = means)
##
## Coefficients:
## (Intercept) temp
## 917.7961 -2.1519
## localenorth atlantic localesouth america
## 495.7904 -633.7339
## localesouth pacific temp:localenorth atlantic
## -52.2260 -1.6498
## temp:localesouth america temp:localesouth pacific
## 2.0587 0.1126
means %>%
group_by(locale) %>%
ggvis(~temp, ~ozone) %>%
layer_points(fill = ~locale) %>%
layer_model_predictions(model = "lm", stroke = ~locale) %>%
hide_legend("stroke") %>%
scale_nominal("stroke", range = c("darkorange", "darkred", "darkgreen", "darkblue"))
## Guessing formula = ozone ~ temp
An anova test suggests that both locale and the interaction effect of locale and temperature are useful for predicting ozone (i.e., the p-value that compares the full model to the reduced models is statistically significant).
mod <- lm(ozone ~ temp, data = means)
mod2 <- lm(ozone ~ temp + locale, data = means)
mod3 <- lm(ozone ~ temp + locale + temp:locale, data = means)
anova(mod, mod2, mod3)
## Analysis of Variance Table
##
## Model 1: ozone ~ temp
## Model 2: ozone ~ temp + locale
## Model 3: ozone ~ temp + locale + temp:locale
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 559 132709
## 2 556 72584 3 60124 191.368 < 2.2e-16 ***
## 3 553 57914 3 14670 46.694 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Have you ever wondered whether there is a clear correlation between the gas consumption of a car and its weight? To answer this question, we first have to load the dplyr and ggvis packages.
library(dplyr)
library(ggvis)
chainedmtcars %>%
group_by(factor(cyl)) %>%
ggvis(~mpg, ~wt, fill = ~cyl) %>%
layer_points()
The ggvis plot gives us a nice visualization of the mtcars data set: