Linear Regression

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # optional: ggplot2 is already loaded as part of the tidyverse

Regression, much like t-tests and correlations, is all about relationships. What is the relationship between X and Y? Or between X, Y and Z?

For very simple data, this is easy enough to see. You can just plot it:

x<-(1:50)
y<-(100:149)
example_data<-data.frame(x=x,y=y)

ggplot(example_data,aes(x=x,y=y)) + geom_point()

We see here that as “X” goes up, “Y” goes up. In fact, they seem to go with each other exactly. Just for fun, we can add a regression model to this data just to make sure:

ggplot(example_data,aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm',color='blue')
## `geom_smooth()` using formula = 'y ~ x'

In this model, there are no errors: every point falls exactly on the expected model (the line).

Remember last week when we did some work in regression? What would this perfect relationship look like, mathematically?

example_model<-lm(x~y,data=example_data)

summary(example_model)
## Warning in summary.lm(example_model): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y, data = example_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.502e-14 -2.417e-15  2.218e-15  5.553e-15  9.230e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -9.900e+01  1.802e-14 -5.492e+15   <2e-16 ***
## y            1.000e+00  1.438e-16  6.953e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.468e-14 on 48 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.835e+31 on 1 and 48 DF,  p-value: < 2.2e-16

The first thing you’ll notice is that we get a warning: “essentially perfect fit: summary may be unreliable”. This is the computer telling you that the data has such a perfect relationship that you should be deeply suspicious of it. And you should be: we made it up, after all. In real life, data is never this perfectly related.

The second thing you might notice is that there is an adjusted R-squared of “1” – this means that the model (the line) explains 100% of the variance in the data. Or, to put it another way, the model perfectly explains the data. Again, in real life, this is unlikely-to-impossible.

Note: adjusted R-squared tells us what percentage of the variance in the dependent variable can be predicted (or “explained”) by the model.
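Just to drive the point home, here is a quick sketch (entirely made up, like the original data) that adds random noise to our perfect y values and refits the model. The standard deviation of 5 is an arbitrary choice; the point is simply that the R-squared drops below 1 as soon as the dots stop sitting exactly on the line.

# a sketch: add random noise to the perfect y values and refit
set.seed(42) # so the random noise is the same every time you knit
noisy_data<-example_data %>% mutate(y_noisy=y+rnorm(n(),mean=0,sd=5))

noisy_model<-lm(y_noisy~x,data=noisy_data)
summary(noisy_model)$r.squared # still high, but no longer exactly 1

ggplot(noisy_data,aes(x=x,y=y_noisy)) + geom_point() + geom_smooth(method='lm',color='blue')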

How about data that looks a bit more realistic:

ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + ggtitle("MPG vs. HP in MtCars")

This is real-world data. What do you notice? The relationship between mpg and hp is broadly negative: as HP goes up, MPG tends to go down. But it’s not a perfectly straight line, is it?

ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line")
## `geom_smooth()` using formula = 'y ~ x'

Notice that the regression line touches only a few of the dots. This means that the model perfectly predicts only a few values; the rest sit above or below the predicted line. When a dot is above or below the blue line, something other than mpg is pushing that car’s horsepower away from the prediction: the car is over- or under-performing relative to the model. (Note that geom_smooth() with method='lm' always draws a straight line; this is also how you will visualize linear models moving forward.)

We can do this mathematically as well:

hp_mpg_model<-lm(hp~mpg,data=mtcars)
summary(hp_mpg_model)
## 
## Call:
## lm(formula = hp ~ mpg, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.26 -28.93 -13.45  25.65 143.36 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   324.08      27.43  11.813 8.25e-13 ***
## mpg            -8.83       1.31  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

The adjusted R-squared is only 59%, so this regression explains about 59% of the variance in hp. That’s not bad, but it’s far from perfect. Also note the mpg estimate of -8.83: it means that as mpg increases by 1, hp decreases by almost 9.
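To see what those two numbers mean together, here is a small sketch that plugs a made-up value (a car that gets 20 mpg; the 20 is just an example) into the fitted equation by hand, and then checks the result against R’s predict() function:

# predicted hp for a hypothetical car that gets 20 mpg, computed by hand from the fitted equation:
# hp = intercept + (mpg estimate * 20)
coef(hp_mpg_model)[1] + coef(hp_mpg_model)[2]*20

# predict() does the same arithmetic for us:
predict(hp_mpg_model,newdata=data.frame(mpg=20))

Both give roughly 147 horsepower. Next, we let predict() do this for every car in mtcars at once.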

We can further visualize the prediction and error of the model by attaching it to the mtcars dataframe:

cars_test<-mtcars %>% mutate(predicted_hp_value=predict(hp_mpg_model)) %>% select(hp,mpg,predicted_hp_value)

# and now the error:

cars_test<-cars_test %>% mutate(predicted_error=hp-predicted_hp_value)

head(cars_test)
##                    hp  mpg predicted_hp_value predicted_error
## Mazda RX4         110 21.0           138.6580       -28.65796
## Mazda RX4 Wag     110 21.0           138.6580       -28.65796
## Datsun 710         93 22.8           122.7644       -29.76445
## Hornet 4 Drive    110 21.4           135.1261       -25.12607
## Hornet Sportabout 175 18.7           158.9663        16.03366
## Valiant           105 18.1           164.2642       -59.26418

If you want to view your residuals (the prediction errors), you don’t even need to calculate them by hand like we did above. You can just call “residuals()” on the model that we created:

cars_test$model_error<-residuals(hp_mpg_model)
head(cars_test)
##                    hp  mpg predicted_hp_value predicted_error model_error
## Mazda RX4         110 21.0           138.6580       -28.65796   -28.65796
## Mazda RX4 Wag     110 21.0           138.6580       -28.65796   -28.65796
## Datsun 710         93 22.8           122.7644       -29.76445   -29.76445
## Hornet 4 Drive    110 21.4           135.1261       -25.12607   -25.12607
## Hornet Sportabout 175 18.7           158.9663        16.03366    16.03366
## Valiant           105 18.1           164.2642       -59.26418   -59.26418

They’re the same!

The errors can now be visualized like so:

ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed') 
## `geom_smooth()` using formula = 'y ~ x'

You may be wondering: how does R “know” which line fits the data the best? Without getting too technical, R picks the line that makes the total squared error (the combined gap between the dots and the line) as small as possible.
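For a single predictor there is actually a simple formula for that best-fitting line, and it is worth seeing once. This is just a sketch to demystify where the coefficients come from; the formulas (slope = covariance of x and y divided by the variance of x, intercept = mean of y minus slope times mean of x) are the standard ordinary-least-squares results.

# the "least squares" slope and intercept, computed by hand:
ols_slope<-cov(mtcars$mpg,mtcars$hp)/var(mtcars$mpg)
ols_intercept<-mean(mtcars$hp)-ols_slope*mean(mtcars$mpg)
c(ols_intercept,ols_slope)

# compare with what lm() gave us (should match: roughly 324.08 and -8.83):
coef(hp_mpg_model)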

The “R squared” measurement we have been talking about can be derived manually. I do that here just to give you an idea of where the measurement is coming from:

# let's add the square of the model error, along with the mean horsepower for the whole dataset

cars_test<-cars_test %>% mutate(residual_sq=model_error*model_error)
cars_test<-cars_test %>% mutate(mean_hp=mean(hp))

# now we can get the total sum of squares: subtract mean HP from each car's actual HP, square those differences, and add them up:

total_sum_of_squares<-sum((cars_test$hp-cars_test$mean_hp)^2)

# now lets get the residual sum of squares (hp minus the predicted HP value, squared):

residual_sum_of_squares<-sum((cars_test$hp-cars_test$predicted_hp_value)^2)

# one minus the ratio of the residual sum of squares to the total sum of squares gives the r-squared: a measurement of how much variance the model "explains":

hp_r_squared<-1-(residual_sum_of_squares/total_sum_of_squares)
print(hp_r_squared)
## [1] 0.6024373

Of course, there is an easier way to get it:

summary(hp_mpg_model)$r.squared
## [1] 0.6024373

This is another way of saying “the regression model here explains 60% of the variance”, which can be said even more simply: “the regression model explains 60% of the data”.

So, practically speaking, 60% of horsepower can be “explained” by miles per gallon (when only factoring in these two variables).
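A side note: when there is only one predictor, R-squared really is the square of the correlation coefficient between the two variables (which is where the name comes from). A quick check:

cor(mtcars$hp,mtcars$mpg)^2     # the correlation between hp and mpg, squared
summary(hp_mpg_model)$r.squared # the model's r-squared -- the same number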

MULTIPLE REGRESSION

The really useful thing about regression (as we discussed in the last lecture) is that we can start to fit models not just on one variable (hp vs. mpg, for example), but upon multiple variables (hp vs. mpg, wt, disp, and so forth).

Think about this: does mpg really explain hp? That is, does a car’s mpg really cause the horsepower to go up or down?

It makes more sense when you realize that mpg is probably a proxy for other variables: weight, or the size and power of the engine. More weight and a bigger, more powerful engine tend to make mpg go down while making HP go up. The engine itself is probably the biggest contributor to HP, while weight and HP are only somewhat related.

But how can we get this from the data?

# let's add "wt" and "cyl" (number of cylinders) to the linear regression. They appear after the tilde (~).

cars_multiple<-lm(hp~mpg+wt+cyl,data=mtcars)

summary(cars_multiple)
## 
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.041 -21.622  -7.177  15.235 125.753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  115.662    113.207   1.022  0.31567   
## mpg           -4.220      2.778  -1.519  0.14002   
## wt           -12.135     14.382  -0.844  0.40596   
## cyl           25.025      7.486   3.343  0.00237 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared:  0.7165, Adjusted R-squared:  0.6861 
## F-statistic: 23.58 on 3 and 28 DF,  p-value: 8.064e-08

As we expected earlier, mpg and wt have totally evaporated in significance (they have p-values far larger than 0.05). The only variable that is still significant is cyl (number of cylinders, basically a stand-in for engine power). We also see that the adjusted R-squared has increased from about 59% to almost 69%.

Again, this is because “mpg” was actually a bad variable to predict HP. MPG doesn’t really explain HP – mpg was actually just a substitute for CYL – the variable that actually causes HP.
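One quick way to see this in the data (a rough check, not a formal test) is to look at how strongly the predictors are correlated with each other. Because mpg is highly correlated with cyl and wt, the model cannot cleanly separate their effects, and mpg’s apparent significance disappears once the others are included:

# correlations among the three predictors we used (rounded to 2 decimal places):
round(cor(mtcars[,c("mpg","wt","cyl")]),2)

This overlap between predictors is closely related to the “multicollinearity” issue listed under the regression assumptions below.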

How do you know if a variable is causing something, or is just a proxy for the actual cause? This is extremely complex and there are no great answers; ultimately, though, this question gets settled in your literature review. You have to convince the reader that “variables x, y and z” are actually causing “effect a”, rather than just being associated with it or standing in as a proxy for something else.

The whole kitchen sink

Here’s an idea: why not just throw the whole kitchen sink into the linear regression, and see what “sticks”? Isn’t this a good way to come up with cause and effect?

Short answer: no.

Longer answer: eh. maybe. sort of. it depends on who you ask.

Even longer answer:

kitchen_sink_model<-lm(hp~wt+cyl+disp+drat+qsec+vs+am+gear+carb+mpg,data=mtcars)

summary(kitchen_sink_model)
## 
## Call:
## lm(formula = hp ~ wt + cyl + disp + drat + qsec + vs + am + gear + 
##     carb + mpg, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.681 -15.558   0.799  18.106  34.718 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  79.0484   184.5041   0.428  0.67270   
## wt          -27.6600    19.2704  -1.435  0.16591   
## cyl           8.2037    10.0861   0.813  0.42513   
## disp          0.4390     0.1492   2.942  0.00778 **
## drat         -4.6185    16.0829  -0.287  0.77680   
## qsec         -1.7844     7.3639  -0.242  0.81089   
## vs           25.8129    19.8512   1.300  0.20758   
## am            9.4863    20.7599   0.457  0.65240   
## gear          7.2164    14.6160   0.494  0.62662   
## carb         18.7487     7.0288   2.667  0.01441 * 
## mpg          -2.0631     2.0906  -0.987  0.33496   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.97 on 21 degrees of freedom
## Multiple R-squared:  0.9028, Adjusted R-squared:  0.8565 
## F-statistic:  19.5 on 10 and 21 DF,  p-value: 1.898e-08

On the surface, it appears that we’ve created a better model of HP: when we throw everything into the linear model, we discover that the only two significant variables are “disp” (engine displacement, i.e., engine size) and “carb” (number of carburetors in the engine). The adjusted R-squared has jumped to about 86% (the multiple R-squared to about 90%). So what’s the problem?

In this specific case, engine size and number of carburetors are still just proxies for engine power. If you think about it, increasing an engine’s size doesn’t automatically increase HP: imagine building a huge engine 100 feet long but, instead of adding extra cylinders, filling it with sand. You get the idea. Size is still a proxy for “something else” that is not being fully captured by the measurements.

Probably the best thing we can say about this model is that “disp” and “carb” are the best proxies for what HP actually is: the ability of the engine to move the weight of the car a certain distance.

tl;dr: if you try to “kitchen sink” everything, you might be fooled into thinking that the effect is being “caused” by something that is itself a proxy for something else.

The best thing to do is have a good theoretical understanding of which variables are most important and why. This typically comes out in your literature review.

A Note on Interpreting Coefficients

In addition to seeing which variables are most significant, multiple regression allows us to “estimate” the size of a significant effect upon a dependent variable, as our cars_multiple model shows:

summary(cars_multiple)
## 
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.041 -21.622  -7.177  15.235 125.753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  115.662    113.207   1.022  0.31567   
## mpg           -4.220      2.778  -1.519  0.14002   
## wt           -12.135     14.382  -0.844  0.40596   
## cyl           25.025      7.486   3.343  0.00237 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared:  0.7165, Adjusted R-squared:  0.6861 
## F-statistic: 23.58 on 3 and 28 DF,  p-value: 8.064e-08

“CYL” has an “Estimate” (sometimes called a “Beta” or “Beta coefficient”) of 25.025. Because CYL is significant, this means that, all things being equal, if you add an extra cylinder to a car, it increases the HP by 25 points.

“WT” is not significant in this model, but let’s pretend it was. WT has an Estimate of -12.1. This means that every time “WT” goes up by 1, “HP” goes down by 12.1.

In mtcars, “WT” is the car’s weight in thousands of pounds (not tons), so the interpretation would be that each extra 1,000 lbs of weight is associated with about 12 fewer HP.

An easy way to interpret the Estimate is by saying: “when [independent variable] goes up by 1, the [dependent variable] goes [up/down] by [Estimate]”.
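Here is a sketch of that interpretation in action, using the cars_multiple model. The two “cars” below are made up and identical except that one has an extra cylinder, so the difference in their predicted hp should equal the cyl Estimate (about 25); the specific mpg and wt values are arbitrary.

# two hypothetical cars, identical except for one extra cylinder:
two_cars<-data.frame(mpg=c(20,20),wt=c(3,3),cyl=c(6,7))
predict(cars_multiple,newdata=two_cars)
diff(predict(cars_multiple,newdata=two_cars)) # should equal the cyl Estimate, ~25

Now let’s practice interpreting Estimates on a bigger dataset: the diamonds data that comes built into ggplot2.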

diamonds_price<-lm(price~carat+depth+table,data=diamonds)
summary(diamonds_price)
## 
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18288.0   -785.9    -33.2    527.2  12486.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13003.441    390.918   33.26   <2e-16 ***
## carat        7858.771     14.151  555.36   <2e-16 ***
## depth        -151.236      4.820  -31.38   <2e-16 ***
## table        -104.473      3.141  -33.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared:  0.8537, Adjusted R-squared:  0.8537 
## F-statistic: 1.049e+05 on 3 and 53936 DF,  p-value: < 2.2e-16

An increase of 1 carat (a measurement of weight) tends to make the price of a diamond go up by how much? ~$7859

An increase in the depth measurement by “1” tends to make the price of a diamond go down by how much? ~$151
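If you don’t want to read those numbers off the summary() table, you can pull them straight out of the model object with coef() (the names in brackets just pick out individual coefficients):

coef(diamonds_price)["carat"] # ~7859: predicted price change for one extra carat
coef(diamonds_price)["depth"] # ~-151: predicted price change for one extra unit of depth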

Testing assumptions

Linear Regression (single and multiple) has some assumptions that must be met for the p-values and R squared to be accurate. They are:

  1. Linearity of variables; the relationship between the variables is a straight line (as one goes up/down, the other goes up/down at a steady rate)
  2. Independence of errors; one observation’s error tells you nothing about another’s
  3. Homoscedasticity; the errors in the model have to be of similar size across the whole model
  4. Normality of residuals; the errors are roughly normally distributed
  5. No multicollinearity; the independent variables are not too closely related to one another

We could spend a whole semester discussing these, and we will spend most of the next class discussing them, but for now here is the most important thing to know:

  1. Linearity of variables: the variables in the model should have a linear relationship with each other. This is easy to visualize with two variables (just do a dot-plot), but with multiple variables it’s a bit harder to visualize. The best way is calling “plot(x,which=1)” where “x” is the linear model.

Here is the residuals-versus-fitted plot for our cars_multiple model:

plot(cars_multiple,which=1)

Here we see that the line of residuals versus fitted (the red line) is mostly straight. This suggests that the relationship between variables is mostly linear.

For an example of one that isn’t linear, we can look at the “kitchen sink” model we created earlier:

plot(kitchen_sink_model,which=1)

The residuals-versus-fitted line here is clearly curved (and much more curved than in the previous model). This means that the assumption of linearity is probably violated. There is no strict mathematical definition; it’s more a matter of “eyeballing” it. Consequently, because the “kitchen sink” model does not meet the linearity assumption, its (higher) r-squared value should not be trusted.
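We will deal with the other assumptions properly next class, but as a preview, the same plot() command can give you a quick “eyeball” check on two more of them. These are standard base-R diagnostic plots for a fitted lm object:

plot(cars_multiple,which=2) # Normal Q-Q plot: points hugging the line suggest normally distributed residuals
plot(cars_multiple,which=3) # Scale-Location plot: a roughly flat red line suggests homoscedasticity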

ADVANCED TECHNIQUE

BONUS: there is probably a method which fits the data more correctly than a linear model:

# fit a model where hp depends on mpg AND mpg squared (a degree-2 polynomial):
hp_mpg_poly<-lm(hp~poly(mpg,degree = 2),data=mtcars)
cars_test$pred_poly<-predict(hp_mpg_poly,newdata=cars_test)


ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed') 
## `geom_smooth()` using formula = 'y ~ x'

ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='gam',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With polynomial regression and errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=pred_poly),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

This is called a polynomial model. It’s similar to a linear model, but it accounts for curvature in the relationship between hp and mpg. (One note on the second plot: the blue curve there is drawn by geom_smooth() with a flexible “gam” smoother for visualization, while the dashed error segments come from our degree-2 polynomial model.) You can tell that the error segments are a bit shorter in the polynomial plot, which means the curve fits the dots a bit better than a straight line does.

There is also a mathematical way to determine how well a model fits the data:

AIC(hp_mpg_model,hp_mpg_poly)
##              df      AIC
## hp_mpg_model  3 336.8553
## hp_mpg_poly   4 335.3593

The AIC for hp_mpg_poly is slightly lower than the AIC for hp_mpg_model, which suggests that hp_mpg_poly is a slightly better model (with AIC, lower is better).

The downside to polynomial models is that they are more difficult to interpret than a linear model, so don’t worry too much about them right now.

HOMEWORK– due April 1st

  1. Load your chosen dataset into Rmarkdown
  2. Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable
  3. Create a linear model using the “lm()” command and save it to an object
  4. Call “summary()” on your new model
  5. Interpret the model’s r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?
  6. Choose some significant independent variables and interpret their Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable?
  7. Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”