Topics

In this session we’ll be covering:

Linear regression

We use linear regression to quantify the relationship between commonly measured parameters (e.g. temperature, solar radiation, humidity, wind) and chemicals of interest (e.g. ozone).

So we may want to know: what is the relationship between ozone and temperature for our Chicago data? We will use linear regression, namely the lm() function in R, to explore this relationship.

First, we should plot our variables of interest and inspect them visually to see if there appears to be a relationship. The plot(x, y) function, a tool we previously learned about, can quickly do this.

library(region5air)
data(chicago_air)
chicago_air$ozone <- chicago_air$ozone*1000  ##Convert ozone to ppb for easier interpretation

plot(chicago_air$temp, chicago_air$ozone, 
     xlab = expression("Temperature (" * degree * "F)"), 
     ylab ='Ozone (ppb)')

From this plot we can see that there appears to be a linear relationship between these two variables. The lm() function is used to fit linear models, and it can perform regression analysis on simple (one explanatory variable) or multivariate (more than one explanatory variable) models. The function takes the form lm(formula = response variable ~ explanatory variable), so we can fit a linear model to our data using this formula. The summary() function, which we have seen before, will give us summary information about our model.

my.mod <- lm(chicago_air$ozone ~ chicago_air$temp)
summary(my.mod)
## 
## Call:
## lm(formula = chicago_air$ozone ~ chicago_air$temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.489  -6.561   0.509   6.795  32.868 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      21.70646    1.75110    12.4   <2e-16 ***
## chicago_air$temp  0.35710    0.03051    11.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.761 on 239 degrees of freedom
##   (124 observations deleted due to missingness)
## Multiple R-squared:  0.3643, Adjusted R-squared:  0.3617 
## F-statistic:   137 on 1 and 239 DF,  p-value: < 2.2e-16

The slope estimate tells us that ozone increases by about 0.36 ppb per degree Fahrenheit, and the R-squared value indicates that temperature explains about 36% of the variance in ozone. We can now add the regression line from our model to our original scatterplot:

plot(chicago_air$temp, chicago_air$ozone, xlab = 'Temp (deg F)', ylab = 'Ozone (ppb)')
abline(my.mod, col="red")

From the summary output above we can grab the intercept and slope information, as well as the standard errors and significance of the estimates. This information can be added to the plot as a text box.
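For example, here is a minimal sketch that pulls the estimates out of the model object and writes them onto the open plot (the x/y placement coordinates below are arbitrary choices for this particular plot):

coefs <- coef(summary(my.mod))  # coefficient table: estimates, std. errors, t values, p values
eq <- paste0("ozone = ", round(coefs[1, 1], 2), " + ", round(coefs[2, 1], 3), " * temp")
r2 <- paste0("R-squared = ", round(summary(my.mod)$r.squared, 2))
text(x = 30, y = 70, labels = paste(eq, r2, sep = "\n"), adj = 0)  # write the text box on the plot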

Additionally, if you plot the linear model output (which we saved as the variable my.mod), you get four diagnostic plots:

- Residuals vs Fitted values
- Normal Quantile-Quantile (Q-Q) plot of the residuals
- Scale-Location (square root of the standardized residuals vs Fitted values)
- Residuals vs Leverage

par(mfrow = c(2, 2))  # arrange all four diagnostic plots in a 2x2 grid
plot(my.mod)

A few other plots that will help you visually inspect your model residuals are:

hist(my.mod$residuals)  #Histogram of model residuals

d <- density(my.mod$residuals)  #Save the kernel density plot information of model residuals
plot(d, main="Kernel Density of model 1 residuals")  #plot kernel density
polygon(d, col="red", border="blue") 
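As an optional sketch, you can overlay a normal curve with the same mean and standard deviation as the residuals, which makes departures from normality easier to see:

res <- residuals(my.mod)  # model residuals (lm already dropped the missing rows)
curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE, lty = 2)  # normal curve for comparison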

You can also obtain the 95% confidence intervals around your model parameters and plot them.
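In base R, confint() returns these intervals directly from the fitted model:

confint(my.mod, level = 0.95)  # 95% confidence intervals for the intercept and slope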

An easy way to plot the confidence band around the fitted line is the stat_smooth() function in ggplot2, which shades the 95% confidence interval by default:

library(ggplot2)
p <- ggplot(chicago_air, aes(x = temp, y = ozone)) + geom_point()
print(p)

p + stat_smooth(method = "lm", formula = y ~ x, size = 1)
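To convince yourself that this line matches your model, a quick sketch overlays a line built from the lm() coefficients themselves:

p + geom_abline(intercept = coef(my.mod)[1], slope = coef(my.mod)[2], color = "red")  # line from the model coefficients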

You can spot-check whether your model parameters give the same approximation as ggplot.

temp <- 50
ozone <- (0.35710 * temp) + 21.70646  # intercept and slope from the summary output
ozone
## [1] 39.56146
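Rather than typing the coefficients in by hand, predict() computes this from the model object. A small sketch, refitting the model with the data argument so that newdata can supply a bare temp column:

mod2 <- lm(ozone ~ temp, data = chicago_air)  # same model, fit using the data argument
predict(mod2, newdata = data.frame(temp = 50))  # predicted ozone (ppb) at 50 degrees F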

Diagnosing non-normality

To see what non-normal data looks like, we can generate a small random sample from an exponential distribution and inspect it with the same plots and tests:

data_exp <- rexp(10)  # 10 random draws from an exponential distribution (results vary between runs)
plot(data_exp)

hist(data_exp)

d <- density(data_exp, na.rm=T)
plot(d, main="Kernel Density plot of data")
polygon(d, col="red", border="blue") 

qqnorm(data_exp)  # normal Q-Q plot
qqline(data_exp)  # reference line: points far off this line suggest non-normality

There are some statistical tests we can use to diagnose non-normality of our data.

shapiro.test(data_exp)  ## Shapiro-Wilk test
## 
##  Shapiro-Wilk normality test
## 
## data:  data_exp
## W = 0.90639, p-value = 0.2571
# See also the Kolmogorov-Smirnov test, ks.test(), and the Anderson-Darling test, ad.test() in the nortest package
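As a quick sketch of the Kolmogorov-Smirnov version, we can compare the sample to a normal distribution whose mean and standard deviation are estimated from the data (note that estimating the parameters from the same sample makes the reported p-value approximate):

ks.test(data_exp, "pnorm", mean = mean(data_exp), sd = sd(data_exp))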

Transforming Data

We can transform our data and then use the same plots and tests to check, visually or statistically, whether the transformed data is approximately normal and therefore suitable for methods that assume normality.

log.data <- log(data_exp)
plot(log.data)

hist(log.data)

shapiro.test(log.data)
## 
##  Shapiro-Wilk normality test
## 
## data:  log.data
## W = 0.91829, p-value = 0.3429
d <- density(log.data)
plot(d, main="Kernel Density plot of data with natural log transformation")
polygon(d, col="red", border="blue") 

qqnorm(log.data)  # normal Q-Q plot of the transformed data
qqline(log.data)  # reference line for comparison


Now let’s try some exercises to test our understanding of using linear regression in R.

Exercise 7

http://rpubs.com/kfrost14/Ex7