library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
Regression, much like t-tests and correlations, is all about relationships. What is the relationship between X and Y? Or between X, Y and Z?
For very simple data, this is easy enough to see. You can just plot it:
x<-(1:50)
y<-(100:149)
example_data<-data.frame(x=x,y=y)
ggplot(example_data,aes(x=x,y=y)) + geom_point()
We see here that as “X” goes up, “Y” goes up. In fact, they seem to go with each other exactly. Just for fun, we can add a regression model to this data just to make sure:
ggplot(example_data,aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm',color='blue')
## `geom_smooth()` using formula = 'y ~ x'
In this model, there are no errors: both X and Y match the expected model (the line) perfectly.
Remember last week when we did some work in regression? What would this perfect relationship look like, mathematically?
example_model<-lm(x~y,data=example_data)
summary(example_model)
## Warning in summary.lm(example_model): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y, data = example_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.502e-14 -2.417e-15 2.218e-15 5.553e-15 9.230e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.900e+01 1.802e-14 -5.492e+15 <2e-16 ***
## y 1.000e+00 1.438e-16 6.953e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.468e-14 on 48 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.835e+31 on 1 and 48 DF, p-value: < 2.2e-16
#The first thing you’ll notice is that we get a warning: “essentially perfect fit: summary may be unreliable”. This is the computer telling you that the data has such a perfect relationship, that you should be deeply suspicious of it. And you should – we made it up, after all. In real life, data is never this perfectly related.
The second thing you might notice is that there is an adjusted R-squared of “1” – this means that the model (the line) explains 100% of the variance in the data. Or, to put it another way, the model perfectly explains the data. Again, in real life, this is unlikely-to-impossible.
#adjusted R squared explains what % of the variance can be predicted by the data
How about data that looks a bit more realistic:
ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + ggtitle("MPG vs. HP in MtCars")
This is real-world data. What do you notice? The relationship between mpg and hp is broadly negative: as HP goes up, MPG tends to go down. But it’s not a perfectly straight line, is it?
ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line")
## `geom_smooth()` using formula = 'y ~ x'
Notice that the regression line barely touches on only a few dots. This means that the model only perfectly predicts a few values. The rest of the values are sitting outside the predicted line. #for a dot to be above/below the blue line, there is another variable affecting it meeting the prediciton on the blue line.either it is over or underperforming #geom smooth will always insert a straight line; this is also how you will visualize linear models moving forward.
We can do this mathematically as well:
hp_mpg_model<-lm(hp~mpg,data=mtcars)
summary(hp_mpg_model)
##
## Call:
## lm(formula = hp ~ mpg, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.26 -28.93 -13.45 25.65 143.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 324.08 27.43 11.813 8.25e-13 ***
## mpg -8.83 1.31 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
The adjusted R-squared is only 59% (so this regression explains 59% of the data). It’s not bad, but its far from perfect. #mgp estimate of -8.83 means that as MPG increases by 1, HP decreases by almost 9
We can further visualize the prediction and error of the model by attaching it to the mtcars dataframe:
cars_test<-mtcars %>% mutate(predicted_hp_value=predict(hp_mpg_model)) %>% select(hp,mpg,predicted_hp_value)
# and now the error:
cars_test<-cars_test %>% mutate(predicted_error=hp-predicted_hp_value)
head(cars_test)
## hp mpg predicted_hp_value predicted_error
## Mazda RX4 110 21.0 138.6580 -28.65796
## Mazda RX4 Wag 110 21.0 138.6580 -28.65796
## Datsun 710 93 22.8 122.7644 -29.76445
## Hornet 4 Drive 110 21.4 135.1261 -25.12607
## Hornet Sportabout 175 18.7 158.9663 16.03366
## Valiant 105 18.1 164.2642 -59.26418
If you wanted to view your residuals (the predicted error), you don’t even need to calculate them (like we did above). You can just call “residuals()” on the model that we created
cars_test$model_error<-residuals(hp_mpg_model)
head(cars_test)
## hp mpg predicted_hp_value predicted_error model_error
## Mazda RX4 110 21.0 138.6580 -28.65796 -28.65796
## Mazda RX4 Wag 110 21.0 138.6580 -28.65796 -28.65796
## Datsun 710 93 22.8 122.7644 -29.76445 -29.76445
## Hornet 4 Drive 110 21.4 135.1261 -25.12607 -25.12607
## Hornet Sportabout 175 18.7 158.9663 16.03366 16.03366
## Valiant 105 18.1 164.2642 -59.26418 -59.26418
They’re the same!
The errors can now be visualized like so:
ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ x'
You may be wondering: how does R “know” which line fits the data the best? Without getting too technical, R tries to calculate a bunch of different lines and then chooses the one that has the least amount of errors.
The “R squared” measurement we have been talking about can be derived manually. I do that here just to give you an idea of where the measurement is coming from:
# lets add the square of the model error, along with the mean horsepower for the whole dataset
cars_test<-cars_test %>% mutate(residual_sq=model_error*model_error)
cars_test<-cars_test %>% mutate(mean_hp=mean(hp))
# now we can get the total sum of squares by subtracting mean HP from total HP, squared:
total_sum_of_squares<-sum((cars_test$hp-cars_test$mean_hp)^2)
# now lets get the residual sum of squares (hp minus the predicted HP value, squared):
residual_sum_of_squares<-sum((cars_test$hp-cars_test$predicted_hp_value)^2)
# by dividing the residual by the total sum of squares, we get the r-squared: a measurement of how much variance the model "explains":
hp_r_squared<-1-(residual_sum_of_squares/total_sum_of_squares)
print(hp_r_squared)
## [1] 0.6024373
Of course, there is an easier way to get it:
summary(hp_mpg_model)$r.squared
## [1] 0.6024373
This is another way of saying “the regression model here explains 60% of the variance”, which can be said even more simply: “the regression model explains 60% of the data”.
So, practically speaking, 60% of horsepower can be “explained” by miles per gallon (when only factoring in these two variables).
The really useful thing about regression (as we discussed in the last lecture) is that we can start to fit models not just on one variable (hp vs. mpg, for example), but upon multiple variables (hp vs. mpg, wt, disp, and so forth).
Think about this: does mpg really explain hp? That is, does a car’s mpg really cause the horsepower to go up or down?
It makes more sense when you realize that mpg is probably a proxy for different variables: weight, or maybe engine power. More weight and more engine power tend to make mpg go down, while making HP go up. Engine power is probably the biggest contributor to HP, while weight and HP are only somewhat related.
But how can we get this from the data?
# lets add "wt" and "cyl" (number of cylinders) to the linear regression. They appear after the tilde (~).
cars_multiple<-lm(hp~mpg+wt+cyl,data=mtcars)
summary(cars_multiple)
##
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.041 -21.622 -7.177 15.235 125.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.662 113.207 1.022 0.31567
## mpg -4.220 2.778 -1.519 0.14002
## wt -12.135 14.382 -0.844 0.40596
## cyl 25.025 7.486 3.343 0.00237 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.6861
## F-statistic: 23.58 on 3 and 28 DF, p-value: 8.064e-08
As we expected earlier, mpg and wt have totally evaporated in significance (they have p-values far larger than 0.05). The only variable that is still significant is cyl (number of cylinders – basically, engine power). We also see that the r-squared has increased from around 60% to almost 70%.
Again, this is because “mpg” was actually a bad variable to predict HP. MPG doesn’t really explain HP – mpg was actually just a substitute for CYL – the variable that actually causes HP.
How do you know if a variable is causing something – or just a proxy for the actual cause? This is extremely complex and there are no great answers, however this question will ultimately be answered in your literature review. You have to convince the reader that “variables x, y and z” are actually causing “effect a” rather than just being associated with it, or rather than being a proxy for something else.
Here’s an idea: why not just throw the whole kitchen sink into the linear regression, and see what “sticks”? Isn’t this a good way to come up with cause and effect?
Short answer: no.
Longer answer: eh. maybe. sort of. it depends on who you ask.
Even longer answer:
kitchen_sink_model<-lm(hp~wt+cyl+disp+drat+qsec+vs+am+gear+carb+mpg,data=mtcars)
summary(kitchen_sink_model)
##
## Call:
## lm(formula = hp ~ wt + cyl + disp + drat + qsec + vs + am + gear +
## carb + mpg, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.681 -15.558 0.799 18.106 34.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.0484 184.5041 0.428 0.67270
## wt -27.6600 19.2704 -1.435 0.16591
## cyl 8.2037 10.0861 0.813 0.42513
## disp 0.4390 0.1492 2.942 0.00778 **
## drat -4.6185 16.0829 -0.287 0.77680
## qsec -1.7844 7.3639 -0.242 0.81089
## vs 25.8129 19.8512 1.300 0.20758
## am 9.4863 20.7599 0.457 0.65240
## gear 7.2164 14.6160 0.494 0.62662
## carb 18.7487 7.0288 2.667 0.01441 *
## mpg -2.0631 2.0906 -0.987 0.33496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.97 on 21 degrees of freedom
## Multiple R-squared: 0.9028, Adjusted R-squared: 0.8565
## F-statistic: 19.5 on 10 and 21 DF, p-value: 1.898e-08
On the surface, it appears that we’ve created a better model of HP: when we throw everything into the linear model, we discover that the only two significant variables are “disp” (engine displacement - size) and “carb” (number of carburetors in the engine). R-squared has increased to almost 90%. So what’s the problem?
In this specific case, engine size and number of carburetors are still just a proxy for engine power. If you think about it, increasing an engine’s size doesn’t automatically increase HP. Imagine if you made a huge engine 100 feet long but instead of extra cylinders, you just filled it up full of sand. You get the idea. Size is still a proxy for “something else” that is not being fully captured by the measurements.
Probably the best thing we can say about this model is that “disp” and “carb” are the best proxies for what HP actually is: the ability of the engine to move the weight of the car a certain distance.
tl;dr: if you try to “kitchen sink” everything, you might be fooled into thinking that the effect is being “caused” by something that is itself a proxy for something else.
The best thing to do is have a good theoretical understanding of which variables are most important and why. This typically comes out in your literature review.
In addition to seeing which variables are most significant, multiple regression allows us to “estimate” the size of a significant effect upon a dependent variable. Our as our cars_multiple model shows:
summary(cars_multiple)
##
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.041 -21.622 -7.177 15.235 125.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.662 113.207 1.022 0.31567
## mpg -4.220 2.778 -1.519 0.14002
## wt -12.135 14.382 -0.844 0.40596
## cyl 25.025 7.486 3.343 0.00237 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.6861
## F-statistic: 23.58 on 3 and 28 DF, p-value: 8.064e-08
“CYL” has an “Estimate” (sometimes called a “Beta” or “Beta coefficient”) of 25.025. Because CYL is significant, this means that, all things being equal, if you add an extra cylinder to a car, it increases the HP by 25 points.
“WT” is not significant in this model, but lets pretend it was. WT has an Estimate of -12.1. This means that every time “WT” goes up by 1, “HP” goes down by 12.1.
In this case, “WT” is the weight in tons, so it makes sense that an extra ton of weight reduces HP.
An easy way to interpret the Estimate is by saying “when [independent variable] does up by”1”, the [dependent variable] goes [up/down] by [Estimate]“.
diamonds_price<-lm(price~carat+depth+table,data=diamonds)
summary(diamonds_price)
##
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18288.0 -785.9 -33.2 527.2 12486.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13003.441 390.918 33.26 <2e-16 ***
## carat 7858.771 14.151 555.36 <2e-16 ***
## depth -151.236 4.820 -31.38 <2e-16 ***
## table -104.473 3.141 -33.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared: 0.8537, Adjusted R-squared: 0.8537
## F-statistic: 1.049e+05 on 3 and 53936 DF, p-value: < 2.2e-16
An increase of 1 carat (a measurement of weight), tends to make the price of a diamond go up by how much? ~$7858
An increase in the depth measurement by “1” tends to make the price of a diamond go down by how much? ~$151
Linear Regression (single and multiple) has some assumptions that must be met for the p-values and R squared to be accurate. They are:
We could spend a whole semester discussing these, and we will spend most of the next class discussing them, but for now here is the most important thing to know:
Here is the hp_mpg_model:
plot(cars_multiple,which=1)
Here we see that the line of residuals versus fitted (the red line) is mostly straight. This suggests that the relationship between variables is mostly linear.
For an example of one that isn’t linear, we can look at the “kitchen sink” model we created earlier:
plot(kitchen_sink_model,which=1)
The residuals versus fitted is clearly curved (and much more curved than the previous model). This means that the assumption of linearity is probably violated. There is no strict mathematically definition – it’s more a matter of “eyeing” it. Consequently, because the “kitchen sink” model does not meet the linearity assumption, the (higher) r-squared values will be inaccurate.
BONUS: there is probably a method which fits the data more correctly than a linear model:
hp_mpg_poly<-lm(hp~poly(mpg,degree = 2),data=mtcars)
cars_test$pred_poly<-predict(hp_mpg_poly,newdata=cars_test)
ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ x'
ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='gam',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With polynomial regression and errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=pred_poly),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
This is called a polynomial model. It’s similar to a linear model but it accounts for curvature in the relationship between hp and mpg.You can tell that the error bars are a bit shorter in the polynomial model, which means the curve fits the dots a bit better than just a straight line.
There is also a mathematical way to determine how well a model fits the data:
AIC(hp_mpg_model,hp_mpg_poly)
## df AIC
## hp_mpg_model 3 336.8553
## hp_mpg_poly 4 335.3593
hp_mpg_poly is slightly lower than hp_mpg_model, which means that hp_mpg_poly is a slightly better model.
The downside to polynomial models is that they are more difficult to interpret than a linear model, so don’t worry too much about them right now.