Back to covering material for the second setup, this time focusing on fitting nonlinear curves using the least-squares criterion. Beyond straight lines, I’ll consider the parameters of a statistical model, the least-squares criterion, fitted values and residuals in a linear model, their interpretation, and the overall goals of the analysis. I’ll be going through a few data sets: one on utility bills versus temperature, another on infant mortality, and one on a market model.
Simple linear regressions are easy to fit, so let’s learn how to fit nonlinear models by least squares using a simple trick: adding powers (squared, cubed, and so on) of the original predictor variable (X) to the model. The new variables are defined entirely in terms of the old ones.
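As a quick preview, here’s a minimal sketch of the idea with simulated data (the names and numbers are made up purely for illustration):
# Simulate a curved relationship, then fit it with an ordinary linear model
# by adding x-squared as a second predictor
x <- runif(100, 0, 10)
y <- 2 + 3*x - 0.5*x^2 + rnorm(100)
df <- data.frame(x, y)
lm(y ~ x + I(x^2), data = df)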
To get started, let’s load the utilities data set:
utilities <- read.csv("http://jgscott.github.io/teaching/data/utilities.csv")
As good practice, I’ll take a quick look at a summary of the variables:
summary(utilities)
## month day year temp
## Min. : 1.000 Min. :23.00 Min. :1999 Min. : 9.00
## 1st Qu.: 4.000 1st Qu.:26.00 1st Qu.:2002 1st Qu.:30.00
## Median : 7.000 Median :26.00 Median :2005 Median :51.00
## Mean : 6.564 Mean :26.59 Mean :2005 Mean :48.66
## 3rd Qu.: 9.000 3rd Qu.:28.00 3rd Qu.:2007 3rd Qu.:69.00
## Max. :12.000 Max. :36.00 Max. :2010 Max. :78.00
##
## kwh ccf thermsPerDay billingDays
## Min. : 160 Min. : 0.00 Min. :0.000 Min. :10.00
## 1st Qu.: 583 1st Qu.: 16.00 1st Qu.:0.500 1st Qu.:29.00
## Median : 790 Median : 60.00 Median :2.000 Median :30.00
## Mean : 751 Mean : 83.44 Mean :2.746 Mean :30.29
## 3rd Qu.: 893 3rd Qu.:144.00 3rd Qu.:4.700 3rd Qu.:32.00
## Max. :1213 Max. :242.00 Max. :8.100 Max. :36.00
##
## totalbill gasbill elecbill
## Min. : 31.55 Min. : 3.42 Min. : 17.43
## 1st Qu.:106.65 1st Qu.: 23.42 1st Qu.: 53.56
## Median :134.50 Median : 55.74 Median : 78.82
## Mean :157.74 Mean : 81.24 Mean : 76.67
## 3rd Qu.:192.67 3rd Qu.:124.18 3rd Qu.: 94.51
## Max. :332.09 Max. :240.90 Max. :152.20
##
## notes
## :102
## housesitters : 4
## empty house : 2
## 24.05 interim elec refund : 1
## 5.46 credit for cost of gas: 1
## 9.57 escrow refund : 1
## (Other) : 6
Okay, looks good. But notice the billingDays variable? We’ll need to define a new variable because of the nature of this data set.
The goal with this data set is to model the monthly gas bill in terms of temperature: the lower the temperature, the higher the bill. There’s a curveball here, however. Different billing periods have different numbers of billing days. See?
hist(utilities$billingDays, breaks = 20)
We’ll probably want to measure gas usage per day, rather than over the entire billing period. To do so, we define a new variable called daily.average.gasbill:
utilities$daily.average.gasbill <- utilities$gasbill/utilities$billingDays
Now that we’ve done that, let’s plot the daily average gas bill against temperature, fit a linear model, add the fitted values and line to the plot, and check the residuals:
plot(daily.average.gasbill ~ temp, data = utilities)
lm1 <- lm(daily.average.gasbill ~ temp, data = utilities)
points(fitted(lm1) ~ temp, data = utilities, col = "red", pch = 19)
abline(lm1)
plot(resid(lm1) ~ temp, data = utilities)
Look again, because I’d say this model doesn’t do a good job. That is, we’re fitting a linear model to obviously nonlinear data (see the curvature at the ends?). The residual plot makes it even more obvious: there is still systematic variation in the residuals as a function of the predictor, temp.
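To make that leftover pattern easier to see, we could add a lowess smooth to the residual plot. This is just an optional diagnostic sketch, reusing the lm1 and utilities objects from above:
# Optional diagnostic: a lowess smooth highlights the remaining curvature
plot(resid(lm1) ~ temp, data = utilities)
lines(lowess(utilities$temp, resid(lm1)), col = "red", lwd = 2)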
One approach to address this shortcoming is to fit a parabola (remember those from middle school?): that is, regress y on both x and x^2. To fit a model with a quadratic term:
lm2 <- lm(daily.average.gasbill ~ temp + I(temp^2), data = utilities)
plot(daily.average.gasbill ~ temp, data = utilities)
points(fitted(lm2) ~ temp, data = utilities, col = "blue", pch = 19)
Nice! In the model statement above, I(temp^2) is how we tell R to treat temperature-squared as an additional variable in the model. We could also add higher powers of temperature (a cubic sketch appears after the next code chunk), although the quadratic fit looks sensible here. As a next step, we can draw a nice smooth curve by plugging the model coefficients directly into the curve function:
plot(daily.average.gasbill ~ temp, data = utilities)
mybeta <- coef(lm2)
curve(mybeta[1] + mybeta[2]*x + mybeta[3]*x^2, col = "blue", add = TRUE)
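As mentioned above, the same I() trick extends to higher powers. Here’s a sketch of what a cubic fit would look like, purely for comparison; the model name lm2_cubic is my own, not part of the original analysis:
# A cubic fit, just to see whether the extra term changes much
lm2_cubic <- lm(daily.average.gasbill ~ temp + I(temp^2) + I(temp^3), data = utilities)
plot(daily.average.gasbill ~ temp, data = utilities)
points(fitted(lm2_cubic) ~ temp, data = utilities, col = "darkgreen", pch = 19)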
Now on to the next data set, which looks at economic development by transforming variables.
In this walkthrough, I show how to fit a power law to data using linear least squares and log transformations. The log transform is useful for data that are bounded below by 0, or for data that span many orders of magnitude. First let’s import the file and take a look:
infmort <- read.csv("http://jgscott.github.io/teaching/data/infmort.csv")
summary(infmort)
## country mortality gdp
## Afghanistan : 1 Min. : 2.00 Min. : 36
## Albania : 1 1st Qu.: 12.00 1st Qu.: 442
## Algeria : 1 Median : 30.00 Median : 1779
## American.Samoa: 1 Mean : 43.48 Mean : 6262
## Andorra : 1 3rd Qu.: 66.00 3rd Qu.: 7272
## Angola : 1 Max. :169.00 Max. :42416
## (Other) :201 NA's :6 NA's :10
Okay, perfect. As we can see, the variables are infant deaths per 1,000 live births and GDP per capita in US dollars. We’ll start by plotting the data.
plot(mortality ~ gdp, data = infmort)
What do we observe? Well, for starters, there’s some unusual bunching of the data on the left. This is because GDP is a highly skewed variable: there are many small economies and only a few large ones. This is fairly obvious in a histogram:
hist(infmort$gdp, breaks = 20)
Above, we observe a long right tail: a few countries have very large GDPs, while a much larger number have small GDPs. This suggests that we should try using the logarithm of GDP, which has the main effect of un-bunching the data.
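As an optional aside (not part of the original analysis), we can confirm the un-bunching directly with a histogram of log GDP:
hist(log(infmort$gdp), breaks = 20)
With that in mind, I’ll plot infant mortality versus log GDP, specifying that the X variable should be transformed to the log scale: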
plot(mortality ~ log(gdp), data = infmort)
Okay, so the plot looks a little better, but notice how the points are bunched toward the bottom? This suggests that we might try taking the log of the Y variable, too:
plot(log(mortality) ~ log(gdp), data = infmort)
See how well a straight line would fit here?! Let’s go ahead and fit the model, get its coefficients, and add the fitted line straight to the plot on the log-log scale:
lm1 <- lm(log(mortality) ~ log(gdp), data = infmort)
coef(lm1)
## (Intercept) log(gdp)
## 7.0452008 -0.4932026
plot(log(mortality) ~ log(gdp), data = infmort)
abline(lm1)
Suppose we wanted to show the model on the original scale. We know that a linear model on the log-log scale corresponds to a power law on the original scale: if log(y) = b0 + b1*log(x), then exponentiating both sides gives y = exp(b0) * x^b1. Let’s exploit this to generate a plot of the fitted curve. First, let’s remember what the data look like on the original scale.
plot(mortality ~ gdp, data = infmort)
Okay, so the above is the original. Remember the bunching in the left corner? Next, let’s get the fitted values on the log-log scale and then transform them back to the original scale. Because the exponential is the inverse of the log transform, we do this by exponentiating the fitted values. Here’s the math:
#Extract the coefficients and compute the fitted values "by hand"
mybeta <- coef(lm1)
logmort.pred <- mybeta[1] + mybeta[2]*log(infmort$gdp)
# Transform the fitted values back to the original scale
mort.pred <- exp(logmort.pred)
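As an aside, we could get essentially the same predictions using R’s predict function and then exponentiating. This is just a shortcut sketch (the name mort.pred2 is mine); note that predict with newdata returns NA for countries with missing GDP:
# Equivalent shortcut: predict on the log scale, then transform back
mort.pred2 <- exp(predict(lm1, newdata = infmort))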
Now we can go ahead and add the predicted points to the plot in a different color and point style:
plot(mortality ~ gdp, data = infmort)
points(mort.pred ~ gdp, data = infmort, col = "blue", pch = 18)
If we type ?points into the console, we can see all the options for the point style (pch). Last but not least, we can also add the fitted curve directly to the scatter plot using the curve function. Based on what we know about power laws and log transformations, we do the following:
plot(mortality ~ gdp, data = infmort)
curve(exp(mybeta[1]) * x^(mybeta[2]), add = TRUE, col = "blue")
For the final exercise, we look at the marketmodel.csv data set. Each row is a week of data on the stock market, and each column is an asset:
SPY: the S&P 500
AAPL: Apple
GOOG: Google
MRK: Merck
JNJ: Johnson and Johnson
WMT: Wal Mart
TGT: Target
Let’s get started:
Market <- read.csv("https://jgscott.github.io/STA371H_Spring2018/data/marketmodel.csv")
summary(Market)
## X SPY AAPL
## 2007-01-08: 1 Min. :-0.197934 Min. :-0.243060
## 2007-01-15: 1 1st Qu.:-0.008800 1st Qu.:-0.020245
## 2007-01-22: 1 Median : 0.002589 Median : 0.007736
## 2007-01-29: 1 Mean : 0.001532 Mean : 0.005615
## 2007-02-05: 1 3rd Qu.: 0.014090 3rd Qu.: 0.030951
## 2007-02-12: 1 Max. : 0.132923 Max. : 0.141157
## (Other) :571
## GOOG MRK JNJ
## Min. :-0.153506 Min. :-0.1746382 Min. :-0.155834
## 1st Qu.:-0.018324 1st Qu.:-0.0154512 1st Qu.:-0.010080
## Median : 0.004251 Median : 0.0006289 Median : 0.002352
## Mean : 0.003507 Mean : 0.0011347 Mean : 0.001532
## 3rd Qu.: 0.024088 3rd Qu.: 0.0181433 3rd Qu.: 0.013959
## Max. : 0.269368 Max. : 0.1904134 Max. : 0.121755
##
## WMT TGT
## Min. :-0.146995 Min. :-0.1638304
## 1st Qu.:-0.011828 1st Qu.:-0.0171057
## Median : 0.001897 Median : 0.0009128
## Mean : 0.001728 Mean : 0.0012941
## 3rd Qu.: 0.016162 3rd Qu.: 0.0210000
## Max. : 0.096456 Max. : 0.2187121
##
The individual entries are the weekly returns: that is, the relative change in that stock’s price from the close of one Monday to the close of the next. Returns are expressed as decimal fractions rather than percentages, so that 0.1 is a 10% return, and so on.
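Before running any regressions, it can help to eyeball one stock against the market. As a quick sketch, using Apple as an example:
# A quick look at Apple's weekly returns versus the S&P 500
plot(AAPL ~ SPY, data = Market)
Now let’s regress each stock’s returns on the S&P 500, one at a time: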
#1) Apple
lm1 <- lm(AAPL ~ SPY, data = Market)
coef(lm1)
## (Intercept) SPY
## 0.004110659 0.982373131
#2) Google
lm2 <- lm(GOOG ~ SPY, data = Market)
coef(lm2)
## (Intercept) SPY
## 0.001968038 1.004653848
#3) Merck
lm3 <- lm(MRK ~ SPY, data = Market)
coef(lm3)
## (Intercept) SPY
## -3.969396e-05 7.668062e-01
#4) Johnson and Johnson
lm4 <- lm(JNJ ~ SPY, data = Market)
coef(lm4)
## (Intercept) SPY
## 0.0007200754 0.5303379369
#5) Walmart
lm5 <- lm(WMT ~ SPY, data = Market)
coef(lm5)
## (Intercept) SPY
## 0.0009503926 0.5076431002
#6) Target
lm6 <- lm(TGT ~ SPY, data = Market)
coef(lm6)
## (Intercept) SPY
## -0.0002622751 1.0162332841
Using these data, we can ask ourselves a series of questions, such as:
Which stock seems to be the most tightly coupled to the movements of the wider market? (And what does “most tightly coupled” mean?)
What do we notice about the intercepts? Are they mostly small, or mostly large? How would I interpret these intercepts in terms of whether any of the individual stocks appear to be outperforming the market on a systematic basis?
Does the above estimate of the slope for Walmart versus the S&P 500 agree (roughly) with the “beta” reported by Yahoo Finance? If you notice a discrepancy, what would be a possible explanation?
Consider the six models fit above. Each model leads to a set of residuals for one particular stock regressed against the S&P 500. Which set of residuals has the largest correlation with the Walmart residuals (that is, the residuals from the model with Wal-Mart as the response variable)? Why would this be?
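To explore these questions, here is one possible sketch (certainly not the only approach). R-squared from each regression is one natural reading of “tightly coupled,” and the correlation matrix of the residuals shows which stocks still move together after the market effect is removed. The names fits and resids are mine, not from the original write-up:
# Collect the six fitted models (lm1 through lm6 from above)
fits <- list(AAPL = lm1, GOOG = lm2, MRK = lm3, JNJ = lm4, WMT = lm5, TGT = lm6)
# R-squared for each regression: one way to measure "tightly coupled"
sapply(fits, function(f) summary(f)$r.squared)
# Correlation of every stock's residuals with Walmart's residuals
# (the WMT entry is trivially 1, so ignore it)
resids <- sapply(fits, resid)
cor(resids)["WMT", ]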