We need a little bit of algebra, but don’t worry: the computer will do the arithmetic. As we have done before, let us build our own “small” data set. Choose a variable \(y\) and two \(x\) variables. For instance:
y<-c(7,3,9,1,8,2,1,5,4,3,9)
x1<-c(1,2,3,1,0,3,1,8,7,1,1)
x2<-c(0,1,1,0,2,1,-1,2,1,0,3)
Well, we want to estimate the coefficients for the model
\[ y_i=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+u_i \]
Before doing it with [R], let us do it “manually”. We can write the model using matrices:
\[ \underbrace{\left[\begin{array}{c} 7\\ 3\\ 9\\ 1\\ 8\\ 2\\ 1\\ 5\\ 4\\ 3\\ 9 \end{array}\right]}_{\mathbf{Y}}=\underbrace{\left[\begin{array}{ccc} 1 & 1 & 0\\ 1 & 2 & 1\\ 1 & 3 & 1\\ 1 & 1 & 0\\ 1 & 0 & 2\\ 1 & 3 & 1\\ 1 & 1 & -1\\ 1 & 8 & 2\\ 1 & 7 & 1\\ 1 & 1 & 0\\ 1 & 1 & 3 \end{array}\right]}_{\mathbf{X}}\underbrace{\left[\begin{array}{c} \beta_{0}\\ \beta_{1}\\ \beta_{2} \end{array}\right]}_{\mathbf{\beta}}+\underbrace{\left[\begin{array}{c} u_{1}\\ u_{2}\\ u_{3}\\ u_{4}\\ u_{5}\\ u_{6}\\ u_{7}\\ u_{8}\\ u_{9}\\ u_{10}\\ u_{11} \end{array}\right]}_{\mathbf{U}} \]
In “compact form”, the model can be written as:
\[ \mathbf{Y}=\mathbf{X}\beta+\mathbf{U} \]
It can be shown that the ordinary least squares estimate of the parameters is given by the expression \[ \hat{\beta}=\left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{Y} \]
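For readers who want to see where this expression comes from, here is a brief sketch of the standard argument; it assumes that \(\mathbf{X}'\mathbf{X}\) is invertible. Least squares chooses \(\beta\) to minimise the sum of squared residuals \[ S(\beta)=\left(\mathbf{Y}-\mathbf{X}\beta\right)'\left(\mathbf{Y}-\mathbf{X}\beta\right) \]
Setting the derivative with respect to \(\beta\) equal to zero gives the so-called normal equations \[ -2\mathbf{X}'\left(\mathbf{Y}-\mathbf{X}\hat{\beta}\right)=0\quad\Longrightarrow\quad\mathbf{X}'\mathbf{X}\hat{\beta}=\mathbf{X}'\mathbf{Y} \]
and multiplying both sides on the left by \(\left(\mathbf{X}'\mathbf{X}\right)^{-1}\) yields the formula for \(\hat{\beta}\).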
Don’t worry, let’s use [R] to get it:
library(matlib)
y<-c(7,3,9,1,8,2,1,5,4,3,9)
x1<-c(1,2,3,1,0,3,1,8,7,1,1)
x2<-c(0,1,1,0,2,1,-1,2,1,0,3)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)
X<-matrix(cbind(ones,x1,x2),11,3)
Y<-matrix(cbind(y),11,1)
beta<-inv(t(X)%*%X)%*%t(X)%*%Y
beta
## [,1]
## [1,] 3.6794047
## [2,] -0.2743896
## [3,] 1.9209462
mod1<-lm(y~x1+x2)
summary(mod1)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7772 -1.2678 -0.3262 0.3995 4.2228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6794 1.1594 3.174 0.0131 *
## x1 -0.2744 0.3096 -0.886 0.4014
## x2 1.9209 0.7144 2.689 0.0275 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.483 on 8 degrees of freedom
## Multiple R-squared: 0.4763, Adjusted R-squared: 0.3453
## F-statistic: 3.637 on 2 and 8 DF, p-value: 0.07524
As you can see, both approaches give the same coefficients! We will come back to this matrix formula later to check some properties.
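As a cross-check that does not depend on the matlib package, the same estimate can be sketched with base [R] only. The names `beta_hat` and `fit` are chosen here for illustration; `solve(A, b)` solves the system \(\mathbf{X}'\mathbf{X}\hat{\beta}=\mathbf{X}'\mathbf{Y}\) directly instead of forming the inverse explicitly.

```r
# Same small data set as above
y  <- c(7, 3, 9, 1, 8, 2, 1, 5, 4, 3, 9)
x1 <- c(1, 2, 3, 1, 0, 3, 1, 8, 7, 1, 1)
x2 <- c(0, 1, 1, 0, 2, 1, -1, 2, 1, 0, 3)

# Design matrix: a column of ones for the intercept plus the two regressors
X <- cbind(1, x1, x2)

# crossprod(X) is X'X and crossprod(X, y) is X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the coefficients estimated by lm()
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
```

Solving the linear system is usually numerically safer than inverting \(\mathbf{X}'\mathbf{X}\), although for a data set this small the difference is immaterial.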
Qualitative factors often come in the form of binary information: a person is female or male; a person does or does not own a personal computer. In these examples, the relevant information can be captured by defining a binary variable or a zero-one variable. In Data Science, binary variables are most commonly called dummy variables, although this name is not especially descriptive. In defining a dummy variable, we must decide which event is assigned the value one and which is assigned the value zero. For example,
\[ \mathrm{female}=\begin{cases} 1 & \mathrm{the\:individual\:is\:female}\\ 0 & \mathrm{otherwise} \end{cases} \]
In the same way, we can define
\[ \mathrm{male}=\begin{cases} 1 & \mathrm{the\:individual\:is\:male}\\ 0 & \mathrm{otherwise} \end{cases} \]
The name of the variable, in this case, indicates the event with the value one. The real benefit of capturing qualitative information using zero-one variables is that it leads to regression models where the parameters have very natural interpretations, as we will see now.
| Important |
| We can use several dummy independent variables in the same equation and, of course, qualitative variables with more than two possible values. For instance, let’s say that we have a variable called “eye color”. You should encode this variable as a set of dummy variables: |
| \[ \mathrm{blue}=\begin{cases} 1 & \mathrm{the\:individual\:has\:blue\:eyes}\\ 0 & \mathrm{otherwise} \end{cases} \] |
| \[ \mathrm{brown}=\begin{cases} 1 & \mathrm{the\:individual\:has\:brown\:eyes}\\ 0 & \mathrm{otherwise} \end{cases} \] |
| \[ \mathrm{grey}=\begin{cases} 1 & \mathrm{the\:individual\:has\:grey\:eyes}\\ 0 & \mathrm{otherwise} \end{cases} \] |
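The encoding described in the box can be sketched in [R] with `model.matrix()`. The vector `eye` below is made-up data for five hypothetical individuals, used only to show the mechanics.

```r
# Hypothetical eye colours for five individuals (illustrative data only)
eye <- factor(c("blue", "brown", "grey", "brown", "blue"))

# "- 1" removes the intercept so that every category gets its own 0/1 column
dummies <- model.matrix(~ eye - 1)

colnames(dummies)  # one column per eye colour
dummies            # each row has a single 1, in the column of its category
rowSums(dummies)   # every individual belongs to exactly one category
```

Note that the rows always sum to one, which is exactly why all the dummies cannot be used together with an intercept.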
Try to fit a linear model with the following data. What happens? Once you have fitted it with [R], try to use the linear algebra method we explained. What happens? Try to give an explanation.
library(matlib)
y<-c(7,3,9,1,8,2,1,5,4,3,9)
male<- c(1,0,0,1,0,1,0,0,1,1,1)
female<-c(0,1,1,0,1,0,1,1,0,0,0)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)
X<-matrix(cbind(ones,male,female),11,3)
Y<-matrix(cbind(y),11,1)
beta<-inv(t(X)%*%X)%*%t(X)%*%Y
beta
As you can see, you get an error: the matrix \(\mathbf{X}'\mathbf{X}\) cannot be inverted. You run into the same problem when you try to estimate the effect with [R]:
y<-c(7,3,9,1,8,2,1,5,4,3,9)
male<- c(1,0,0,1,0,1,0,0,1,1,1)
female<-c(0,1,1,0,1,0,1,1,0,0,0)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)
mod2<-lm(y~male+female)
summary(mod2)
##
## Call:
## lm(formula = y ~ male + female)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2000 -2.2667 -0.3333 2.7333 4.6667
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2000 1.4309 3.634 0.00545 **
## male -0.8667 1.9374 -0.447 0.66521
## female NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.2 on 9 degrees of freedom
## Multiple R-squared: 0.02175, Adjusted R-squared: -0.08694
## F-statistic: 0.2001 on 1 and 9 DF, p-value: 0.6652
The coefficient of female is reported as “NA”. What if we “eliminate” one of the two dummies?
library(matlib)
y<-c(7,3,9,1,8,2,1,5,4,3,9)
male<- c(1,0,0,1,0,1,0,0,1,1,1)
female<-c(0,1,1,0,1,0,1,1,0,0,0)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)
X<-matrix(cbind(ones,male),11,2)
Y<-matrix(cbind(y),11,1)
beta<-inv(t(X)%*%X)%*%t(X)%*%Y
beta
## [,1]
## [1,] 5.2000000
## [2,] -0.8666666
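As you can see, the problem disappears. Why?
If you can build one variable (male/female) from the other (female/male), you actually have only one variable: here female = 1 - male. Together with the intercept, the columns satisfy ones = male + female, so the columns of \(\mathbf{X}\) are perfectly collinear and \(\mathbf{X}'\mathbf{X}\) has no inverse.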
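The redundancy can be checked numerically. The sketch below rebuilds the design matrix from the example and looks at its rank; `qr()` is base [R], and the variable names are the same as in the code above.

```r
male   <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1)
female <- 1 - male   # identical to the female vector typed by hand above

# Design matrix with the intercept column and both dummies
X <- cbind(ones = 1, male, female)

# The intercept column is the sum of the two dummy columns
all(X[, "ones"] == X[, "male"] + X[, "female"])

# Three columns, but only two of them are linearly independent
qr(X)$rank

# After dropping one dummy, the matrix has full column rank
qr(X[, c("ones", "male")])$rank
```

A matrix with three columns and rank two makes \(\mathbf{X}'\mathbf{X}\) singular, which is the situation usually called the dummy variable trap.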
Let’s see an example with real data, using the data set called “Credit”. In the following regression, where we want to explain the balance in the current account, we use the variable Gender, a binary variable that takes the value 1 if the individual is female and 0 if the individual is male.
library(ISLR)
attach(Credit)
mod1<-lm(Balance~Gender)
summary(mod1)
##
## Call:
## lm(formula = Balance ~ Gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## GenderFemale 19.73 46.05 0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
- How do you interpret the coefficient associated with “Gender”?
On average, a female has $19.73 more than a male in the current account.
- How do you interpret the coefficient associated with the “intercept”?
It is the average balance that a male has in the current account ($509.80).
- Using a statistical test: is there a statistically significant difference between females and males?
Since the p-value associated with the coefficient of “Gender” is 0.669, we do not reject the null hypothesis that the coefficient is zero: the difference is not statistically significant.
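The reading of the intercept and of the dummy coefficient as group averages can be verified directly. The sketch below uses the small male/female data set from earlier rather than Credit, so it runs without any package; note that there the reference group (dummy equal to zero) is female.

```r
y    <- c(7, 3, 9, 1, 8, 2, 1, 5, 4, 3, 9)
male <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1)

fit <- lm(y ~ male)

# Average of y within each group; the names of the result are "0" and "1"
group_means <- tapply(y, male, mean)

# Intercept = average of the reference group (male == 0)
coef(fit)[["(Intercept)"]]
group_means[["0"]]

# Dummy coefficient = difference between the two group averages
coef(fit)[["male"]]
group_means[["1"]] - group_means[["0"]]
```

The same logic applies to the Credit regression: the intercept is the average balance of the reference group and the dummy coefficient is the gap between the two group averages.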
Let’s use the variable called Ethnicity. First of all, we can summarize the possible outcomes
attach(Credit)
summary(Ethnicity)
## African American Asian Caucasian
## 99 102 199
We have three possible values that can be coded as
\[ \mathrm{African\:American}=\begin{cases} 1 & \mathrm{if\:the\:person\:is\:African\:American}\\ 0 & \mathrm{otherwise} \end{cases} \]
\[ \mathrm{Asian}=\begin{cases} 1 & \mathrm{if\:the\:person\:is\:Asian}\\ 0 & \mathrm{otherwise} \end{cases} \]
\[ \mathrm{Caucasian}=\begin{cases} 1 & \mathrm{if\:the\:person\:is\:Caucasian}\\ 0 & \mathrm{otherwise} \end{cases} \]
Again, we need to get rid of one of them. [R], by default, eliminates one of them.
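Which category is dropped depends on the order of the factor levels: with the default treatment contrasts, the first level becomes the reference. The sketch below uses a short made-up vector `eth` (not the Credit data) with the same three labels to show this.

```r
# Hypothetical observations with the same three labels as Ethnicity
eth <- factor(c("Asian", "Caucasian", "African American", "Asian", "Caucasian"))

# Levels are sorted alphabetically; the first one is the reference level
levels(eth)

# The design matrix has an intercept plus one dummy per non-reference level
colnames(model.matrix(~ eth))
```

This is why the regression output for Ethnicity shows coefficients for Asian and Caucasian but none for African American.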
attach(Credit)
library(ISLR)
attach(Credit)
mod2<-lm(Balance~Ethnicity)
summary(mod2)
##
## Call:
## lm(formula = Balance ~ Ethnicity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## EthnicityAsian -18.69 65.02 -0.287 0.774
## EthnicityCaucasian -12.50 56.68 -0.221 0.826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
How do we interpret each coefficient?
- Intercept: 531.00 is the average balance in the current account for the reference level, an African American.
- An Asian has, on average, $18.69 less than an African American.
- A Caucasian has, on average, $12.50 less than an African American.
If you want to change the reference level, you can proceed as follows:
attach(Credit)
mod3<-lm(Balance ~ relevel(Ethnicity , ref = "Caucasian"))
summary(mod3)
##
## Call:
## lm(formula = Balance ~ relevel(Ethnicity, ref = "Caucasian"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 518.497 32.670
## relevel(Ethnicity, ref = "Caucasian")African American 12.503 56.681
## relevel(Ethnicity, ref = "Caucasian")Asian -6.184 56.122
## t value Pr(>|t|)
## (Intercept) 15.871 <2e-16 ***
## relevel(Ethnicity, ref = "Caucasian")African American 0.221 0.826
## relevel(Ethnicity, ref = "Caucasian")Asian -0.110 0.912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
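The coefficients are different but, as you can check, the fitted model is the same: the residuals, the R-squared and the F-statistic are identical, and only the baseline has changed. For instance, the new intercept plus the African American coefficient, 518.497 + 12.503 = 531.00, recovers the African American average of the previous regression.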
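That changing the reference level only re-expresses the same fit can be sketched on a toy example. The factor `group` below is hypothetical (three arbitrary labels attached to the small `y` vector used earlier); it is not the Credit data.

```r
y     <- c(7, 3, 9, 1, 8, 2, 1, 5, 4, 3, 9)
group <- factor(c("a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b"))  # hypothetical labels

fit_a <- lm(y ~ group)                        # reference level "a" (the default)
fit_c <- lm(y ~ relevel(group, ref = "c"))    # reference level "c"

# The coefficients differ because they are measured from a different baseline ...
coef(fit_a)
coef(fit_c)

# ... but the fitted values and the R-squared are the same
all.equal(fitted(fit_a), fitted(fit_c))
all.equal(summary(fit_a)$r.squared, summary(fit_c)$r.squared)
```

In both parametrisations the intercept is the average of the reference group, so `coef(fit_c)[1]` equals the mean of `y` in group "c".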
attach(Credit)
mod3<-lm(Balance ~ log(Income)+Age+Gender+Ethnicity )
summary(mod3)
##
## Call:
## lm(formula = Balance ~ log(Income) + Age + Gender + Ethnicity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -820.95 -352.71 -51.85 336.52 1128.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -382.525 130.562 -2.930 0.00359 **
## log(Income) 280.632 30.897 9.083 < 2e-16 ***
## Age -1.840 1.241 -1.483 0.13900
## GenderFemale 18.392 42.098 0.437 0.66243
## EthnicityAsian -4.177 59.540 -0.070 0.94410
## EthnicityCaucasian -6.960 51.779 -0.134 0.89313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 420.6 on 394 degrees of freedom
## Multiple R-squared: 0.1737, Adjusted R-squared: 0.1632
## F-statistic: 16.57 on 5 and 394 DF, p-value: 7.461e-15
Exercise Explain the meaning of all the coefficients.
Exercise
Run the following model
\[ Balance_{i}=\beta_0+\beta_1 female_i+\beta_2 Asian_i+\beta_3 Caucasian_i+\beta_4 female_i\times Asian_i+\beta_5 female_i\times Caucasian_i+u_i \]
Interpret the meaning of each coefficient in the model
Answer
library(ISLR)
attach(Credit)
mod4<-lm(Balance ~ Gender+Ethnicity+Gender*Ethnicity )
summary(mod4)
##
## Call:
## lm(formula = Balance ~ Gender + Ethnicity + Gender * Ethnicity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -563.11 -445.84 -57.75 333.87 1483.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 553.45 65.95 8.392 8.65e-16 ***
## GenderFemale -44.45 92.80 -0.479 0.632
## EthnicityAsian -100.58 94.25 -1.067 0.287
## EthnicityCaucasian -38.11 80.91 -0.471 0.638
## GenderFemale:EthnicityAsian 154.69 130.46 1.186 0.236
## GenderFemale:EthnicityCaucasian 50.61 113.57 0.446 0.656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 461.6 on 394 degrees of freedom
## Multiple R-squared: 0.004472, Adjusted R-squared: -0.008161
## F-statistic: 0.354 on 5 and 394 DF, p-value: 0.8796
The interpretation is as follows:
- \(\beta_0\) is the average balance of the “reference group”. The omitted categories are male and African American, so we can say that, on average, an African-American male has a balance of $553.45.
- \(\beta_1\) is the difference in the balance of an African-American female with respect to the reference group (-$44.45).
- \(\beta_2\) is the difference in the balance of an Asian male with respect to the reference group, an African-American male (-$100.58).
- \(\beta_3\) is the difference in the balance of a Caucasian male with respect to an African-American male (-$38.11).
- \(\beta_4\) is an interaction term: it measures how much the female-male difference among Asians departs from the female-male difference among African Americans ($154.69). It is not the gap between an Asian female and an African-American male, which is \(\beta_1+\beta_2+\beta_4\), that is, -44.45 - 100.58 + 154.69 = $9.66.
- \(\beta_5\) is the corresponding interaction term for Caucasians ($50.61). The gap between a Caucasian female and an African-American male is \(\beta_1+\beta_3+\beta_5\), that is, -44.45 - 38.11 + 50.61 = -$31.95.
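In a model with interactions, the predicted average of each group is the sum of all the coefficients whose dummies are switched on for that group. The sketch below applies this to the estimates printed in the output above; the vector `b` simply copies those six numbers by hand, and the object names are chosen here for illustration.

```r
# Estimates copied from the summary(mod4) output above
b <- c(b0 = 553.45, b1 = -44.45, b2 = -100.58, b3 = -38.11, b4 = 154.69, b5 = 50.61)

# Predicted average balance for each Gender x Ethnicity cell
cell_means <- with(as.list(b), rbind(
  Male   = c(b0,      b0 + b2,           b0 + b3),
  Female = c(b0 + b1, b0 + b1 + b2 + b4, b0 + b1 + b3 + b5)
))
colnames(cell_means) <- c("African American", "Asian", "Caucasian")
cell_means

# Female-male difference within each ethnicity: b1, b1 + b4 and b1 + b5
cell_means["Female", ] - cell_means["Male", ]
```

The last line shows what the interaction coefficients add: the female-male gap is \(\beta_1\) for African Americans, \(\beta_1+\beta_4\) for Asians and \(\beta_1+\beta_5\) for Caucasians.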
First of all, we need to define a measure that appears in the regression outputs: the R-squared. It is the proportion of the variance of the target variable that we can explain with the “predictor” variables we have chosen.
Could it be a measure of predictability? We’ll see today that we need to be cautious.
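To make the definition concrete, the R-squared can be computed by hand as one minus the ratio between the residual sum of squares and the total sum of squares. The sketch below does this for the small data set from the beginning of the section; the names `rss`, `tss`, `r2` and `adj_r2` are chosen here for illustration.

```r
y  <- c(7, 3, 9, 1, 8, 2, 1, 5, 4, 3, 9)
x1 <- c(1, 2, 3, 1, 0, 3, 1, 8, 7, 1, 1)
x2 <- c(0, 1, 1, 0, 2, 1, -1, 2, 1, 0, 3)

fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)    # variation left unexplained by the model
tss <- sum((y - mean(y))^2)     # total variation of y around its mean
r2  <- 1 - rss / tss

# The adjusted version penalises the number of regressors k
n <- length(y)
k <- 2
adj_r2 <- 1 - (rss / (n - k - 1)) / (tss / (n - 1))

c(r2 = r2, adj_r2 = adj_r2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # same values from lm()
```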
Exercise
Run the following models and explain the R-squared. What do you notice about the behaviour of the R-squared?
M1 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+ \beta_2 Income_i+u_i \]
M2 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+\beta_2 Income_i+\beta_3 Student_i+ u_i \]
M3 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+\beta_2 Income_i+\beta_3 Student_i+\beta_4 Age_i+ u_i \]
M4 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+\beta_2 Income_i+\beta_3 Student_i+\beta_4 Cards_i+\beta_5 Age_i+\beta_6 Education_i+\beta_7 Married_i+\beta_8 Ethnicity_i+\beta_9 Gender_i+\beta_{10} Gender_i \times Ethnicity_i+\beta_{11} Gender_i\times Education_i+\beta_{12} Gender_i\times Rating_i+\beta_{13}Gender_i\times Income_i+ u_i \]
library(ISLR)
attach(Credit)
mod1<-lm(Balance ~ Rating+Income)
summary(mod1)
##
## Call:
## lm(formula = Balance ~ Rating + Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -278.57 -112.69 -36.21 57.92 575.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -534.81215 21.60270 -24.76 <2e-16 ***
## Rating 3.94926 0.08621 45.81 <2e-16 ***
## Income -7.67212 0.37846 -20.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162.9 on 397 degrees of freedom
## Multiple R-squared: 0.8751, Adjusted R-squared: 0.8745
## F-statistic: 1391 on 2 and 397 DF, p-value: < 2.2e-16
mod2<-lm(Balance ~ Rating + Income+Student )
summary(mod2)
##
## Call:
## lm(formula = Balance ~ Rating + Income + Student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -226.126 -80.445 -5.018 65.192 293.234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -581.07889 13.83463 -42.00 <2e-16 ***
## Rating 3.98747 0.05471 72.89 <2e-16 ***
## Income -7.87493 0.24021 -32.78 <2e-16 ***
## StudentYes 418.76028 17.23025 24.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.3 on 396 degrees of freedom
## Multiple R-squared: 0.9499, Adjusted R-squared: 0.9495
## F-statistic: 2502 on 3 and 396 DF, p-value: < 2.2e-16
mod3<-lm(Balance ~ Rating + Income + Student+Age)
summary(mod3)
##
## Call:
## lm(formula = Balance ~ Rating + Income + Student + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -217.606 -79.887 -8.163 62.680 292.009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -547.30470 21.46064 -25.503 <2e-16 ***
## Rating 3.98073 0.05458 72.927 <2e-16 ***
## Income -7.79773 0.24218 -32.198 <2e-16 ***
## StudentYes 417.50564 17.17164 24.314 <2e-16 ***
## Age -0.62418 0.30407 -2.053 0.0408 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.9 on 395 degrees of freedom
## Multiple R-squared: 0.9504, Adjusted R-squared: 0.9499
## F-statistic: 1892 on 4 and 395 DF, p-value: < 2.2e-16
mod4<-lm(Balance~Rating+Income+Student+Cards+Age+Education+Married+Ethnicity+Gender+Gender*Ethnicity+Gender*Education+Gender*Rating+Gender*Income)
summary(mod4)
##
## Call:
## lm(formula = Balance ~ Rating + Income + Student + Cards + Age +
## Education + Married + Ethnicity + Gender + Gender * Ethnicity +
## Gender * Education + Gender * Rating + Gender * Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -199.53 -79.26 -15.28 63.10 312.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -526.85016 46.78319 -11.262 <2e-16 ***
## Rating 3.89643 0.07925 49.168 <2e-16 ***
## Income -7.51981 0.35562 -21.145 <2e-16 ***
## StudentYes 417.51539 17.44407 23.935 <2e-16 ***
## Cards 3.72291 3.84268 0.969 0.3332
## Age -0.66694 0.31073 -2.146 0.0325 *
## Education -0.43612 2.42660 -0.180 0.8575
## MarriedYes -15.79724 10.90514 -1.449 0.1483
## EthnicityAsian 21.10327 21.28005 0.992 0.3220
## EthnicityCaucasian 7.60749 18.11693 0.420 0.6748
## GenderFemale -49.54959 57.16849 -0.867 0.3866
## EthnicityAsian:GenderFemale -0.03646 29.75393 -0.001 0.9990
## EthnicityCaucasian:GenderFemale 5.16678 25.50986 0.203 0.8396
## Education:GenderFemale 0.06164 3.35862 0.018 0.9854
## Rating:GenderFemale 0.16065 0.11010 1.459 0.1454
## Income:GenderFemale -0.47704 0.48604 -0.981 0.3270
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.2 on 384 degrees of freedom
## Multiple R-squared: 0.9515, Adjusted R-squared: 0.9496
## F-statistic: 501.9 on 15 and 384 DF, p-value: < 2.2e-16
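Which model predicts better?
- If we say “the model with the highest R-squared”, we would choose the fourth one (0.9515). But note that the R-squared never decreases when we add variables to a model, so the largest model always wins this comparison. The adjusted R-squared, which penalises extra variables, is highest for the third model (0.9499).
- Remember that a “healthy” way to test the predictive power of a model is cross validation.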
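The fact that the R-squared cannot go down when a regressor is added can be illustrated with a deliberately useless variable. In the sketch below, `noise` is pure random noise generated for the occasion (an assumption of this example, not part of the Credit data), added to the small data set used at the beginning.

```r
set.seed(1)

y  <- c(7, 3, 9, 1, 8, 2, 1, 5, 4, 3, 9)
x1 <- c(1, 2, 3, 1, 0, 3, 1, 8, 7, 1, 1)
x2 <- c(0, 1, 1, 0, 2, 1, -1, 2, 1, 0, 3)
noise <- rnorm(length(y))   # unrelated to y by construction

small <- lm(y ~ x1 + x2)
big   <- lm(y ~ x1 + x2 + noise)

# The R-squared of the larger model is never lower than that of the smaller one
c(small = summary(small)$r.squared, big = summary(big)$r.squared)

# The adjusted R-squared can move in either direction
c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)
```

Because the larger model contains the smaller one as a special case, least squares can always reproduce the smaller fit by setting the extra coefficient to zero, so the in-sample fit cannot get worse.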
library(ISLR)
attach(Credit)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(lattice)
train_control<- trainControl(method="cv", number=20,p=0.75, savePredictions = TRUE)
model1_cv<- train(Balance ~ Rating + Income, data=Credit, trControl=train_control, method = "lm" )
model1_cv
## Linear Regression
##
## 400 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 380, 380, 380, 380, 380, 381, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 159.0672 0.8793192 121.2058
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model2_cv<- train(Balance ~ Rating + Income+ Student, data=Credit, trControl=train_control, method = "lm" )
model2_cv
## Linear Regression
##
## 400 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 102.9035 0.9510608 83.98769
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model3_cv<- train(Balance ~ Rating + Income + Student+Age, data=Credit, trControl=train_control, method = "lm" )
model3_cv
## Linear Regression
##
## 400 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 102.5357 0.9498921 83.56684
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model4_cv<- train(Balance~Rating+Income+Student+Cards+Age+Education+Married+Ethnicity+Gender+Gender*Ethnicity+Gender*Education+Gender*Rating+Gender*Income, data=Credit, trControl=train_control, method = "lm" )
model4_cv
## Linear Regression
##
## 400 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 380, 380, 380, 381, 380, 380, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 104.7672 0.9500272 84.84585
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Errors<-data.frame(model1_cv$results$RMSE,model2_cv$results$RMSE,model3_cv$results$RMSE,model4_cv$results$RMSE)
Errors
## model1_cv.results.RMSE model2_cv.results.RMSE model3_cv.results.RMSE
## 1 159.0672 102.9035 102.5357
## model4_cv.results.RMSE
## 1 104.7672
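- As you can see, a model with more variables does not always predict better: the fourth model has a higher cross-validated RMSE (104.77) than the third one (102.54). This is just one random example; we’ll see more throughout this course.
- Compare the error of each fitted model with its cross-validated RMSE (Hint: the “residual standard error” in the R output is the in-sample counterpart of the RMSE, except that it divides the sum of squared residuals by the degrees of freedom instead of the number of observations). For the fourth model, the in-sample error (103.2) is lower than the cross-validated one (104.77).
What is happening with the fourth model is called overfitting.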
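To see what cross validation does under the hood, here is a minimal sketch of k-fold cross validation written in base [R], without the caret package. Everything in it is an assumption made for illustration: the data are simulated, the helper `cv_rmse()` is a hypothetical function defined here, and averaging the per-fold RMSEs is just one common convention.

```r
set.seed(123)

# Simulated data: y depends linearly on x plus noise
n   <- 100
x   <- rnorm(n)
dat <- data.frame(x = x, y = 1 + 2 * x + rnorm(n))

# Assign each observation at random to one of k folds of equal size
k    <- 5
fold <- sample(rep(1:k, length.out = n))

# Hypothetical helper: fit on k - 1 folds, predict the held-out fold,
# and average the out-of-sample RMSE over the k folds
cv_rmse <- function(formula, data, fold) {
  response  <- all.vars(formula)[1]
  fold_rmse <- sapply(sort(unique(fold)), function(j) {
    fit  <- lm(formula, data = data[fold != j, ])
    pred <- predict(fit, newdata = data[fold == j, ])
    sqrt(mean((data[fold == j, response] - pred)^2))
  })
  mean(fold_rmse)
}

cv_simple   <- cv_rmse(y ~ x, dat, fold)             # the correct, simple model
cv_flexible <- cv_rmse(y ~ poly(x, 10), dat, fold)   # a much more flexible model
c(simple = cv_simple, flexible = cv_flexible)

# In-sample counterparts for the simple model
fit_all <- lm(y ~ x, data = dat)
c(rmse = sqrt(mean(residuals(fit_all)^2)), rse = summary(fit_all)$sigma)
```

The last line shows the point made in the hint: the residual standard error divides by the degrees of freedom, so it is always slightly larger than the in-sample RMSE computed with the number of observations.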