Week 5: Class 07/03/2023


Question: How does the computer get the coefficients?

We need a little bit of algebra. But don’t panic, we’ll let the computer do the work. As we have done before, let us build our own “small” data set. Choose a variable \(y\) and two \(x\) variables, for instance:

y<-c(7,3,9,1,8,2,1,5,4,3,9)
x1<-c(1,2,3,1,0,3,1,8,7,1,1)
x2<-c(0,1,1,0,2,1,-1,2,1,0,3)

Well, we want to estimate the coefficients for the model

\[ y_i=\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+u_i \]

Before doing it with [R], let us do it “manually”. We can write the model using matrices:

\[ \underbrace{\left[\begin{array}{c} 7\\ 3\\ 9\\ 1\\ 8\\ 2\\ 1\\ 5\\ 4\\ 3\\ 9 \end{array}\right]}_{\mathbf{Y}}=\underbrace{\left[\begin{array}{ccc} 1 & 1 & 0\\ 1 & 2 & 1\\ 1 & 3 & 1\\ 1 & 1 & 0\\ 1 & 0 & 2\\ 1 & 3 & 1\\ 1 & 1 & -1\\ 1 & 8 & 2\\ 1 & 7 & 1\\ 1 & 1 & 0\\ 1 & 1 & 3 \end{array}\right]}_{\mathbf{X}}\underbrace{\left[\begin{array}{c} \beta_{0}\\ \beta_{1}\\ \beta_{2} \end{array}\right]}_{\mathbf{\beta}}+\underbrace{\left[\begin{array}{c} u_{1}\\ u_{2}\\ u_{3}\\ u_{4}\\ u_{5}\\ u_{6}\\ u_{7}\\ u_{8}\\ u_{9}\\ u_{10}\\ u_{11} \end{array}\right]}_{\mathbf{U}} \]

so, in “compact form”, the model can be expressed as:

\[ \mathbf{Y}=\mathbf{X}\beta+\mathbf{U} \]

It can be proven that minimizing the sum of squared residuals \(\mathbf{U}'\mathbf{U}\) leads to the normal equations \(\mathbf{X}'\mathbf{X}\hat{\beta}=\mathbf{X}'\mathbf{Y}\), so we obtain the estimates of the parameters with the expression \[ \hat{\beta}=\left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{Y} \]

Don’t worry, let’s use [R] to get it:

library(matlib)   # provides inv() for matrix inversion

y<-c(7,3,9,1,8,2,1,5,4,3,9)
x1<-c(1,2,3,1,0,3,1,8,7,1,1)
x2<-c(0,1,1,0,2,1,-1,2,1,0,3)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)     # column of ones for the intercept
X<-matrix(cbind(ones,x1,x2),11,3)  # 11 x 3 design matrix
Y<-matrix(cbind(y),11,1)           # 11 x 1 response vector

beta<-inv(t(X)%*%X)%*%t(X)%*%Y     # beta_hat = (X'X)^{-1} X'Y
beta
##            [,1]
## [1,]  3.6794047
## [2,] -0.2743896
## [3,]  1.9209462
mod1<-lm(y~x1+x2)
summary(mod1)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7772 -1.2678 -0.3262  0.3995  4.2228 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.6794     1.1594   3.174   0.0131 *
## x1           -0.2744     0.3096  -0.886   0.4014  
## x2            1.9209     0.7144   2.689   0.0275 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.483 on 8 degrees of freedom
## Multiple R-squared:  0.4763, Adjusted R-squared:  0.3453 
## F-statistic: 3.637 on 2 and 8 DF,  p-value: 0.07524

As you can see, we get the same estimates! Later we’ll come back to this tool to check some properties.
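For instance, a first property we can check (a minimal sketch, assuming the objects X, Y and beta from the chunk above are still in memory) is that the OLS residuals are orthogonal to the columns of \(\mathbf{X}\) and, since \(\mathbf{X}\) contains a column of ones, they sum to zero:

u_hat<-Y-X%*%beta   # residual vector
t(X)%*%u_hat        # (numerically) zero: residuals are orthogonal to each column of X
sum(u_hat)          # (numerically) zero: residuals sum to zero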

Regression with qualitative variables

Qualitative factors often come in the form of binary information: a person is female or male; a person does or does not own a personal computer… In these examples, the relevant information can be captured by defining a binary (zero-one) variable. In Data Science, binary variables are most commonly called dummy variables, although this name is not especially descriptive. In defining a dummy variable, we must decide which event is assigned the value one and which is assigned the value zero. For example,

\[ \mathrm{Female}\begin{cases} 1 & \mathrm{the\:individual\:is\:female}\\ 0 & \mathrm{Otherwise} \end{cases} \]

In the same way, we can define

\[ \mathrm{male}\begin{cases} 1 & \mathrm{the\:individual\:is\:male}\\ 0 & \mathrm{Otherwise} \end{cases} \]

The name of the variable, in this case, indicates the event with the value one. The real benefit of capturing qualitative information using zero-one variables is that it leads to regression models where the parameters have very natural interpretations, as we will see now.
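As a minimal sketch with made-up data, such a dummy can be built by hand with ifelse(); in practice, storing the variable as a factor lets lm() build the dummy for us:

sex<-c("female","male","female","female","male")   # hypothetical qualitative variable
female<-ifelse(sex=="female",1,0)                  # 1 if female, 0 otherwise
female
factor(sex)   # stored as a factor, [R] creates the dummy automatically inside lm()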

Important
We can use several dummy independent variables in the same equation and, of course, qualitative variables with more than two possible values. For instance, let’s say that we have a variable called “eye color”. You should encode this variable as a set of dummy variables (see the sketch after this box):
\[ \mathrm{blue}\begin{cases} 1 & \mathrm{the\:individual\:has\:blue\:eyes}\\ 0 & \mathrm{Otherwise} \end{cases} \]
\[ \mathrm{brown}\begin{cases} 1 & \mathrm{the\:individual\:has\:brown\:eyes}\\ 0 & \mathrm{Otherwise} \end{cases} \]
\[ \mathrm{grey}\begin{cases} 1 & \mathrm{the\:individual\:has\:grey\:eyes}\\ 0 & \mathrm{Otherwise} \end{cases} \]
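A quick sketch with hypothetical data shows how [R] would expand such a factor into dummy columns; model.matrix() displays the design matrix that lm() builds:

eye<-factor(c("blue","brown","grey","brown","blue"))   # made-up eye colours
model.matrix(~eye)      # with an intercept, one level ("blue") is dropped as the reference
model.matrix(~eye-1)    # without an intercept, all three dummies appear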

Exercise

Try to fit a linear model with the following data. What happens? Once you have fitted it with [R], try the linear-algebra method we explained. What happens? Try to provide an explanation.

library(matlib)

y<-c(7,3,9,1,8,2,1,5,4,3,9)
male<- c(1,0,0,1,0,1,0,0,1,1,1)
female<-c(0,1,1,0,1,0,1,1,0,0,0)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)


X<-matrix(cbind(ones,male,female),11,3)
Y<-matrix(cbind(y),11,1)

beta<-inv(t(X)%*%X)%*%t(X)%*%Y
beta

As you can see, you get an error: it is not possible to compute the inverse of the matrix product \(\mathbf{X}'\mathbf{X}\). You find the same problem when you try to estimate the model with [R]:

y<-c(7,3,9,1,8,2,1,5,4,3,9)
male<- c(1,0,0,1,0,1,0,0,1,1,1)
female<-c(0,1,1,0,1,0,1,1,0,0,0)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)


mod2<-lm(y~male+female)
summary(mod2)
## 
## Call:
## lm(formula = y ~ male + female)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2000 -2.2667 -0.3333  2.7333  4.6667 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   5.2000     1.4309   3.634  0.00545 **
## male         -0.8667     1.9374  -0.447  0.66521   
## female            NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.2 on 9 degrees of freedom
## Multiple R-squared:  0.02175,    Adjusted R-squared:  -0.08694 
## F-statistic: 0.2001 on 1 and 9 DF,  p-value: 0.6652

You get “NA” for female, for instance. What if we “eliminate” one of the two dummies?

library(matlib)

y<-c(7,3,9,1,8,2,1,5,4,3,9)
male<- c(1,0,0,1,0,1,0,0,1,1,1)
female<-c(0,1,1,0,1,0,1,1,0,0,0)
ones<-c(1,1,1,1,1,1,1,1,1,1,1)


X<-matrix(cbind(ones,male),11,2)
Y<-matrix(cbind(y),11,1)

beta<-inv(t(X)%*%X)%*%t(X)%*%Y
beta
##            [,1]
## [1,]  5.2000000
## [2,] -0.8666666

As you can see, we get rid of the problem. Why?


If you can build one variable (male/female) from the other (female/male), you effectively have only one variable: together with the column of ones, the two dummies are perfectly collinear (ones = male + female), so \(\mathbf{X}'\mathbf{X}\) is singular. This is the well-known dummy variable trap: with an intercept in the model, one category must be left out as the reference.
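A minimal check, assuming the vectors ones, male and female from the exercise are still in memory: the columns of the design matrix are exactly linearly dependent, which is why \(\mathbf{X}'\mathbf{X}\) has no inverse.

all(male+female==ones)             # TRUE: exact linear dependence (the dummy variable trap)
qr(cbind(ones,male,female))$rank   # 2 instead of 3, so t(X)%*%X is singular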


Let’s see an example with real data using the data set called “Credit”. In the following regression, where we want to explain the balance in the current account, we use the variable Gender, a qualitative variable that [R] turns into a dummy taking the value 1 if the individual is female and 0 if male.

library(ISLR)
attach(Credit)

mod1<-lm(Balance~Gender)
summary(mod1)
## 
## Call:
## lm(formula = Balance ~ Gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -529.54 -455.35  -60.17  334.71 1489.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    509.80      33.13  15.389   <2e-16 ***
## GenderFemale    19.73      46.05   0.429    0.669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared:  0.0004611,  Adjusted R-squared:  -0.00205 
## F-statistic: 0.1836 on 1 and 398 DF,  p-value: 0.6685
  • How do you interpret the coefficient associated with “Gender”?

On average, a female holds $19.73 more than a male in the current account.

  • How do you interpret the coefficient associated with the intercept?

It is the average balance that a male has in the current account ($509.80).

  • Using a statistical test: is there any statistically significant difference between females and males?

Since the p-value associated with the “Gender” coefficient is 0.669, we do not reject the null hypothesis that the coefficient is zero (a quick check with the group means is sketched below).
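A quick sanity check (assuming Credit is still attached): with a single dummy, the regression simply reproduces the two group means.

tapply(Balance,Gender,mean)   # male mean = intercept (509.80); female mean = 509.80+19.73, about 529.5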

What happens if we have more than two categories?

Let’s use the variable called Ethnicity. First of all, we can summarize the possible outcomes

attach(Credit)
summary(Ethnicity)
## African American            Asian        Caucasian 
##               99              102              199

We have three possible values that can be coded as

\[ African\:American\begin{cases} 1 & if\:the\:person\:is\:African\:American\\ 0 & otherwise \end{cases} \]

\[ Asian\begin{cases} 1 & if\:the\:person\:is\:from\:Asia\\ 0 & otherwise \end{cases} \]

\[ Caucasian\begin{cases} 1 & if\:the\:person\:is\:Caucasian\\ 0 & otherwise \end{cases} \]

Again, we need to drop one of them; by default, [R] does this for us, as the sketch below illustrates.
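A quick look (assuming Credit is attached): the first level of the factor becomes the reference and its dummy is dropped, which we can see in the design matrix that lm() builds.

levels(Ethnicity)                # "African American" comes first, so it is the reference level
head(model.matrix(~Ethnicity))   # only the Asian and Caucasian dummies appear, plus the intercept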

attach(Credit)
mod2<-lm(Balance~Ethnicity)
summary(mod2)
## 
## Call:
## lm(formula = Balance ~ Ethnicity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -531.00 -457.08  -63.25  339.25 1480.50 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          531.00      46.32  11.464   <2e-16 ***
## EthnicityAsian       -18.69      65.02  -0.287    0.774    
## EthnicityCaucasian   -12.50      56.68  -0.221    0.826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared:  0.0002188,  Adjusted R-squared:  -0.004818 
## F-statistic: 0.04344 on 2 and 397 DF,  p-value: 0.9575

How to interpret each coefficient?

  • Intercept: 531.00 is the average balance in the current account for the reference level, an African American.
  • An Asian has, on average, $18.69 less than an African American.
  • A Caucasian has, on average, $12.50 less than an African American.

If you want to change the reference level, you can proceed as follows:

attach(Credit)
mod3<-lm(Balance ~  relevel(Ethnicity , ref = "Caucasian")) 
summary(mod3)
## 
## Call:
## lm(formula = Balance ~ relevel(Ethnicity, ref = "Caucasian"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -531.00 -457.08  -63.25  339.25 1480.50 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                            518.497     32.670
## relevel(Ethnicity, ref = "Caucasian")African American   12.503     56.681
## relevel(Ethnicity, ref = "Caucasian")Asian              -6.184     56.122
##                                                       t value Pr(>|t|)    
## (Intercept)                                            15.871   <2e-16 ***
## relevel(Ethnicity, ref = "Caucasian")African American   0.221    0.826    
## relevel(Ethnicity, ref = "Caucasian")Asian             -0.110    0.912    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared:  0.0002188,  Adjusted R-squared:  -0.004818 
## F-statistic: 0.04344 on 2 and 397 DF,  p-value: 0.9575

The figures are different but, as you can check, the two parameterizations describe exactly the same fitted model.
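In fact, you can recover one parameterization from the other: the new intercept is the Caucasian mean, \(531.00-12.50=518.50\); the African American coefficient is the old Caucasian coefficient with the sign flipped, \(+12.50\); and the Asian coefficient is \(-18.69-(-12.50)=-6.19\), all matching the table above up to rounding.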

Finally, you can combine quantitative and qualitative variables as predictors in a linear regression:

attach(Credit)
mod3<-lm(Balance ~ log(Income)+Age+Gender+Ethnicity ) 
summary(mod3)
## 
## Call:
## lm(formula = Balance ~ log(Income) + Age + Gender + Ethnicity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -820.95 -352.71  -51.85  336.52 1128.45 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -382.525    130.562  -2.930  0.00359 ** 
## log(Income)         280.632     30.897   9.083  < 2e-16 ***
## Age                  -1.840      1.241  -1.483  0.13900    
## GenderFemale         18.392     42.098   0.437  0.66243    
## EthnicityAsian       -4.177     59.540  -0.070  0.94410    
## EthnicityCaucasian   -6.960     51.779  -0.134  0.89313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 420.6 on 394 degrees of freedom
## Multiple R-squared:  0.1737, Adjusted R-squared:  0.1632 
## F-statistic: 16.57 on 5 and 394 DF,  p-value: 7.461e-15

Exercise: Explain the meaning of all the coefficients.

Week 5: Class 09/03/2023


Exercise

Run the following model

\[ Balance_{i}=\beta_0+\beta_1 female_i+\beta_2 Asian_i+\beta_3 Caucasian_i+\beta_4 female_i\times Asian_i+\beta_5 female_i\times Caucasian_i+u_i \]

Interpret the meaning of each coefficient in the model


Answer

library(ISLR)
attach(Credit)
mod4<-lm(Balance ~ Gender+Ethnicity+Gender*Ethnicity ) 
summary(mod4)
## 
## Call:
## lm(formula = Balance ~ Gender + Ethnicity + Gender * Ethnicity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -563.11 -445.84  -57.75  333.87 1483.66 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       553.45      65.95   8.392 8.65e-16 ***
## GenderFemale                      -44.45      92.80  -0.479    0.632    
## EthnicityAsian                   -100.58      94.25  -1.067    0.287    
## EthnicityCaucasian                -38.11      80.91  -0.471    0.638    
## GenderFemale:EthnicityAsian       154.69     130.46   1.186    0.236    
## GenderFemale:EthnicityCaucasian    50.61     113.57   0.446    0.656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 461.6 on 394 degrees of freedom
## Multiple R-squared:  0.004472,   Adjusted R-squared:  -0.008161 
## F-statistic: 0.354 on 5 and 394 DF,  p-value: 0.8796

The interpretation is as follows:

  • \(\beta_0\) is the mean of the “reference group”, the category we omit: an African-American male. So we can say that, on average, an African-American male has a balance of $553.45.
  • \(\beta_1\) is the difference in the balance of an African-American female with respect to the reference group (-$44.45).
  • \(\beta_2\) is the difference in the balance of an Asian male with respect to the reference group, the African-American male (-$100.58).
  • \(\beta_3\) is the difference in the balance of a Caucasian male with respect to an African-American male (-$38.11).
  • \(\beta_4\) is the interaction coefficient for Asian females: the extra difference beyond the separate female and Asian effects ($154.69). The total difference of an Asian female with respect to an African-American male is therefore \(\beta_1+\beta_2+\beta_4=-44.45-100.58+154.69\approx 9.66\) dollars.
  • \(\beta_5\) is the interaction coefficient for Caucasian females ($50.61); the total difference of a Caucasian female with respect to the reference group is \(\beta_1+\beta_3+\beta_5\approx -31.95\) dollars (a quick check with the cell means is sketched below).
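A quick numerical check (assuming Credit is still attached): because the fully interacted model has one parameter per Gender-by-Ethnicity cell, its fitted values are just the cell means, and the coefficients are differences between those means.

tapply(Balance,list(Gender,Ethnicity),mean)   # the African-American male cell equals the intercept (553.45)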

Today’s question #1: the difference between fitting a model and its predictive ability.

First of all, we need to define a measure that appears in the regression outputs: the R-squared. It is the proportion (often read as a percentage) of the variance of the target variable that we can explain with the “predictor” variables we have chosen.
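In symbols, writing \(\hat{y}_i\) for the fitted values and \(\bar{y}\) for the sample mean of the target,

\[ R^2=1-\frac{\sum_{i}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i}\left(y_{i}-\bar{y}\right)^{2}} \]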

Could it be a measure of predictability? We’ll see today that we need to be cautious.


Exercise

Run the following models and report their R-squared. What do you notice about the behaviour of the R-squared?

M1 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+ \beta_2 Income_i+u_i \]

M2 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+\beta_2 Income_i+\beta_3 Student_i+ u_i \]

M3 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+\beta_2 Income_i+\beta_3 Student_i+\beta_4 Age_i+ u_i \]

M4 \[ Balance_{i}=\beta_0+\beta_1 Rating_i+\beta_2 Income_i+\beta_3 Student_i+\beta_4 Age_i+\beta_5 Cards_i+\beta_6 Educ_i+\beta_7 Married_i+\beta_8 Ethnicity_i+\beta_9 Gender_i+\beta_{10} Gender_i \times Ethnicity_i+\beta_{11} Gender_i\times Educ_i+\beta_{12} Gender_i\times Rating_i+\beta_{13}Gender_i\times Income_i+ u_i \]


library(ISLR)
attach(Credit)
mod1<-lm(Balance ~ Rating+Income)
summary(mod1)
## 
## Call:
## lm(formula = Balance ~ Rating + Income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -278.57 -112.69  -36.21   57.92  575.24 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -534.81215   21.60270  -24.76   <2e-16 ***
## Rating         3.94926    0.08621   45.81   <2e-16 ***
## Income        -7.67212    0.37846  -20.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162.9 on 397 degrees of freedom
## Multiple R-squared:  0.8751, Adjusted R-squared:  0.8745 
## F-statistic:  1391 on 2 and 397 DF,  p-value: < 2.2e-16
mod2<-lm(Balance ~ Rating + Income+Student ) 
summary(mod2)
## 
## Call:
## lm(formula = Balance ~ Rating + Income + Student)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -226.126  -80.445   -5.018   65.192  293.234 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -581.07889   13.83463  -42.00   <2e-16 ***
## Rating         3.98747    0.05471   72.89   <2e-16 ***
## Income        -7.87493    0.24021  -32.78   <2e-16 ***
## StudentYes   418.76028   17.23025   24.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.3 on 396 degrees of freedom
## Multiple R-squared:  0.9499, Adjusted R-squared:  0.9495 
## F-statistic:  2502 on 3 and 396 DF,  p-value: < 2.2e-16
mod3<-lm(Balance ~ Rating + Income + Student+Age) 
summary(mod3)
## 
## Call:
## lm(formula = Balance ~ Rating + Income + Student + Age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -217.606  -79.887   -8.163   62.680  292.009 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -547.30470   21.46064 -25.503   <2e-16 ***
## Rating         3.98073    0.05458  72.927   <2e-16 ***
## Income        -7.79773    0.24218 -32.198   <2e-16 ***
## StudentYes   417.50564   17.17164  24.314   <2e-16 ***
## Age           -0.62418    0.30407  -2.053   0.0408 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.9 on 395 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9499 
## F-statistic:  1892 on 4 and 395 DF,  p-value: < 2.2e-16
mod4<-lm(Balance~Rating+Income+Student+Cards+Age+Education+Married+Ethnicity+Gender+Gender*Ethnicity+Gender*Education+Gender*Rating+Gender*Income)
summary(mod4)
## 
## Call:
## lm(formula = Balance ~ Rating + Income + Student + Cards + Age + 
##     Education + Married + Ethnicity + Gender + Gender * Ethnicity + 
##     Gender * Education + Gender * Rating + Gender * Income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -199.53  -79.26  -15.28   63.10  312.52 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -526.85016   46.78319 -11.262   <2e-16 ***
## Rating                             3.89643    0.07925  49.168   <2e-16 ***
## Income                            -7.51981    0.35562 -21.145   <2e-16 ***
## StudentYes                       417.51539   17.44407  23.935   <2e-16 ***
## Cards                              3.72291    3.84268   0.969   0.3332    
## Age                               -0.66694    0.31073  -2.146   0.0325 *  
## Education                         -0.43612    2.42660  -0.180   0.8575    
## MarriedYes                       -15.79724   10.90514  -1.449   0.1483    
## EthnicityAsian                    21.10327   21.28005   0.992   0.3220    
## EthnicityCaucasian                 7.60749   18.11693   0.420   0.6748    
## GenderFemale                     -49.54959   57.16849  -0.867   0.3866    
## EthnicityAsian:GenderFemale       -0.03646   29.75393  -0.001   0.9990    
## EthnicityCaucasian:GenderFemale    5.16678   25.50986   0.203   0.8396    
## Education:GenderFemale             0.06164    3.35862   0.018   0.9854    
## Rating:GenderFemale                0.16065    0.11010   1.459   0.1454    
## Income:GenderFemale               -0.47704    0.48604  -0.981   0.3270    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.2 on 384 degrees of freedom
## Multiple R-squared:  0.9515, Adjusted R-squared:  0.9496 
## F-statistic: 501.9 on 15 and 384 DF,  p-value: < 2.2e-16

Which model predicts better?

  • If we go by “the model with the highest R-squared”, we would pick the fourth one (multiple R-squared 0.9515). But note that the more variables a model has, the higher its multiple R-squared tends to be
  • Remember that a “healthy” way to test the predictive power of a model is cross-validation, as in the code below
library(ISLR)
attach(Credit)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(lattice)

train_control<- trainControl(method="cv", number=20,p=0.75, savePredictions = TRUE)  # 20-fold cross-validation

model1_cv<- train(Balance ~ Rating + Income, data=Credit, trControl=train_control, method = "lm" )  
model1_cv
## Linear Regression 
## 
## 400 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 380, 380, 380, 380, 380, 381, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   159.0672  0.8793192  121.2058
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
model2_cv<- train(Balance ~ Rating + Income+ Student, data=Credit, trControl=train_control, method = "lm" )  
model2_cv
## Linear Regression 
## 
## 400 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   102.9035  0.9510608  83.98769
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
model3_cv<- train(Balance ~ Rating + Income + Student+Age, data=Credit, trControl=train_control, method = "lm" )  
model3_cv
## Linear Regression 
## 
## 400 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   102.5357  0.9498921  83.56684
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
model4_cv<- train(Balance~Rating+Income+Student+Cards+Age+Education+Married+Ethnicity+Gender+Gender*Ethnicity+Gender*Education+Gender*Rating+Gender*Income, data=Credit, trControl=train_control, method = "lm" )  
model4_cv
## Linear Regression 
## 
## 400 samples
##   9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 380, 380, 380, 381, 380, 380, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   104.7672  0.9500272  84.84585
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
Errors<-data.frame(model1_cv$results$RMSE,model2_cv$results$RMSE,model3_cv$results$RMSE,model4_cv$results$RMSE)
Errors
##   model1_cv.results.RMSE model2_cv.results.RMSE model3_cv.results.RMSE
## 1               159.0672               102.9035               102.5357
##   model4_cv.results.RMSE
## 1               104.7672
  • As you can see, a model with more variables does not always predict better (and this is just one random example; we will see more along this course)
  • Compare the in-sample RMSE of the fitted models with the cross-validated RMSE (Hint: in the lm() output, the in-sample counterpart of the RMSE is called the “residual standard error”); a quick comparison is sketched below
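A rough comparison, assuming the fitted models mod1 to mod4 and the *_cv objects from above are still in memory:

data.frame(in_sample_RSE=c(summary(mod1)$sigma,summary(mod2)$sigma,summary(mod3)$sigma,summary(mod4)$sigma),
           cv_RMSE=c(model1_cv$results$RMSE,model2_cv$results$RMSE,model3_cv$results$RMSE,model4_cv$results$RMSE),
           row.names=c("M1","M2","M3","M4"))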

What is happening here is called overfitting: a model with many parameters starts fitting the noise of the training sample, so the in-sample fit keeps improving while the out-of-sample error stops improving or even gets worse.