Previously, you learned how to run a simple linear regression.
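The output below comes from a call like this (the object name model matches the summary(model) call used later in this lab):

model <- lm(Feature_1 ~ Agreeableness, data = df)
summary(model)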
Call:
lm(formula = Feature_1 ~ Agreeableness, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.52482 -0.24147 -0.00173 0.24892 0.51633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47397 0.01254 37.802 < 2e-16 ***
Agreeableness 0.06119 0.02196 2.787 0.00538 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2868 on 1998 degrees of freedom
Multiple R-squared: 0.003871, Adjusted R-squared: 0.003373
F-statistic: 7.765 on 1 and 1998 DF, p-value: 0.005378
What if you wanted to see how your outcome variable relates to more than one predictor?
That is what we will be answering in today’s lab section.
Data
Go to the Canvas page and download the handwriting.csv file. This dataset was obtained from Kaggle (link here) and measures five personality traits and fifteen handwriting features.
This dataset is quite a bit larger than what we used previously (n = 2000). However, it should still not take too long to run analyses on.
For this class, I will be exploring the relationship between the personality traits and Feature_1. However, feel free to choose whatever traits you want!
In addition, copy over body_image_data.csv from either a previous assignment or the Canvas page. We will also be using the iris dataset, which comes built into R, so there is no need to download any additional files for it.
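As a quick sketch of the setup (assuming handwriting.csv is in your working directory; the object name df matches the model output in this lab):

df <- read.csv("handwriting.csv")   # personality traits and handwriting features
nrow(df)                            # should be 2000
head(df)                            # peek at the columns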
Formula
In R, adding predictors to a model is pretty easy. In a simple regression model, the formula looks like this:
y ~ x1
If you wanted to add another predictor (let’s say x2), simply use the + symbol.
y ~ x1 + x2
This works with as many predictors as you would like:
y ~ x1 + x2 + x3 + x4 ...
Let’s see this in action in the next slide.
Model and Output (multiple predictors)
Let’s say that I’m interested in seeing the effect of all five personality traits on Feature_1.
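The call below produces the output that follows (model_full is just my choice of object name; the formula comes straight from the Call line in the output):

model_full <- lm(Feature_1 ~ Openness + Conscientiousness + Extraversion + Agreeableness + Neuroticism, data = df)
summary(model_full)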
Call:
lm(formula = Feature_1 ~ Openness + Conscientiousness + Extraversion +
Agreeableness + Neuroticism, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.52692 -0.23912 -0.00349 0.24649 0.52444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.52111 0.02547 20.463 <2e-16 ***
Openness -0.02980 0.02231 -1.336 0.1818
Conscientiousness -0.02057 0.02198 -0.936 0.3493
Extraversion -0.02339 0.02189 -1.069 0.2853
Agreeableness 0.06010 0.02198 2.735 0.0063 **
Neuroticism -0.01933 0.02192 -0.882 0.3780
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2868 on 1994 degrees of freedom
Multiple R-squared: 0.006178, Adjusted R-squared: 0.003686
F-statistic: 2.479 on 5 and 1994 DF, p-value: 0.03009
Compared to the original model:
Did the overall p-value go up or down?
What about the Multiple R-squared?
Which one should you use?
For reference, here is the output of the original simple regression model again:
summary(model)
Call:
lm(formula = Feature_1 ~ Agreeableness, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.52482 -0.24147 -0.00173 0.24892 0.51633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47397 0.01254 37.802 < 2e-16 ***
Agreeableness 0.06119 0.02196 2.787 0.00538 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2868 on 1998 degrees of freedom
Multiple R-squared: 0.003871, Adjusted R-squared: 0.003373
F-statistic: 7.765 on 1 and 1998 DF, p-value: 0.005378
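If you would rather pull these numbers out of R than read them off the printed output, the summary object stores them directly (a sketch; model is the simple regression and model_full is the five-predictor model from earlier):

summary(model)$r.squared          # R-squared for the simple model
summary(model_full)$r.squared     # R-squared for the five-predictor model
summary(model)$adj.r.squared      # adjusted R-squared penalizes extra predictors
summary(model_full)$adj.r.squared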
Correlation Matrix
Review
When doing multiple linear regression (MLR), it’s especially helpful to analyze the correlations between variables to make sure that our predictors are actually measuring different constructs. Previously you used cor() to accomplish this.
bi_data <- read.csv("body_image_data.csv")
bi_data1 <- bi_data[, 4:18]   # this extracts only the 'aa' columns
cor(bi_data1)
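A full 15 x 15 correlation matrix can be hard to scan, so rounding it is often helpful (round() is base R; two decimals is just my choice):

round(cor(bi_data1), 2)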
== compares the values on either side to see if they are equal. This allows us to check every row and keep only those rows in which iris$Species is equal to "setosa".
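For example (setosa_only is just an illustrative name):

setosa_only <- iris[iris$Species == "setosa", ]   # keep rows where the comparison is TRUE
head(setosa_only)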