Data

The atmos data set resides in the nasaweather package of the R programming language. It contains a collection of atmospheric variables measured between 1995 and 2000 on a grid of 576 coordinates in the western hemisphere. The data set comes from the 2006 ASA Data Expo.

Some of the variables in the atmos data set are:

You can convert the temperature unit from Kelvin to Celsius with the formula

\[ celsius = kelvin - 273.15 \]

And you can convert the result to Fahrenheit with the formula

\[ fahrenheit = celsius \times \frac{5}{9} + 32 \]

Cleaning

For the remainder of the report, we will look only at data from the year 1995. We aggregate our data by location, using the R code below.

load(url("http://assets.datacamp.com/course/rmarkdown/atmos.RData")) # working with a subset
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
library(ggvis)
## Warning: package 'ggvis' was built under R version 3.2.5
year <- 1995
means <- atmos %>% 
  filter(year == year) %>%
  group_by(long, lat) %>%
  summarize(temp = mean(temp, na.rm = TRUE), 
         pressure = mean(pressure, na.rm = TRUE),
         ozone = mean(ozone, na.rm = TRUE),
         cloudlow = mean(cloudlow, na.rm = TRUE),
         cloudmid = mean(cloudmid, na.rm = TRUE),
         cloudhigh = mean(cloudhigh, na.rm = TRUE)) %>%
  ungroup()

Ozone and temperature

Is the relationship between ozone and temperature useful for understanding fluctuations in ozone? A scatterplot of the variables shows a strong, but unusual relationship.

means %>% 
  ggvis(~temp, ~ozone) %>%
  layer_points() 

We suspect that group level effects are caused by environmental conditions that vary by locale. To test this idea, we sort each data point into one of four geographic regions:

means$locale <- "north america"  
means$locale[means$lat < 10] <- "south pacific"
means$locale[means$long > -80 & means$lat < 10] <- "south america"
means$locale[means$long > -80 & means$lat > 10] <- "north atlantic"

Model

We suggest that ozone is highly correlated with temperature, but that a different relationship exists for each geographic region. We capture this relationship with a second order linear model of the form

\[ ozone = \alpha + \beta_{1} temperature + \sum_{locales} \beta_{i} locale_{i} + \sum_{locales} \beta_{j} interaction_{j} + \epsilon\]

This yields the following coefficients and model lines.

lm(ozone ~ temp + locale + temp:locale, data = means)
## 
## Call:
## lm(formula = ozone ~ temp + locale + temp:locale, data = means)
## 
## Coefficients:
##               (Intercept)                       temp  
##                  917.7961                    -2.1519  
##      localenorth atlantic        localesouth america  
##                  495.7904                  -633.7339  
##       localesouth pacific  temp:localenorth atlantic  
##                  -52.2260                    -1.6498  
##  temp:localesouth america   temp:localesouth pacific  
##                    2.0587                     0.1126
means %>% 
  group_by(locale) %>%
  ggvis(~temp, ~ozone) %>%
  layer_points(fill = ~locale) %>%
  layer_model_predictions(model = "lm", stroke = ~locale) %>%
  hide_legend("stroke") %>%
  scale_nominal("stroke", range = c("darkorange", "darkred", "darkgreen", "darkblue"))
## Guessing formula = ozone ~ temp

Diagnostics

An anova test suggests that both locale and the interaction effect of locale and temperature are useful for predicting ozone (i.e., the p-value that compares the full model to the reduced models is statistically significant).

mod <- lm(ozone ~ temp, data = means)
mod2 <- lm(ozone ~ temp + locale, data = means)
mod3 <- lm(ozone ~ temp + locale + temp:locale, data = means)

anova(mod, mod2, mod3)
## Analysis of Variance Table
## 
## Model 1: ozone ~ temp
## Model 2: ozone ~ temp + locale
## Model 3: ozone ~ temp + locale + temp:locale
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1    559 132709                                   
## 2    556  72584  3     60124 191.368 < 2.2e-16 ***
## 3    553  57914  3     14670  46.694 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exploring the mtcars data set

Have you ever wondered whether there is a clear correlation between the gas consumption of a car and its weight? To answer this question, we first have to load the dplyr and ggvis packages.

library(dplyr)
library(ggvis)
mtcars %>% 
  group_by(factor(cyl)) %>%
  ggvis(~mpg, ~wt, fill = ~cyl) %>% 
  layer_points()

The ggvis plot gives us a nice visualization of the mtcars data set: