KITADA

Lesson #21

Multiple Linear Regression

Motivation:

We’ve spent a lot of time discussing simple linear regression. In many situations, there is more than one explanatory variable that helps “explain” the response variable. When we have more than one explanatory variable, a Multiple Linear Regression analysis is performed. Many of the steps in a Multiple Linear Regression analysis are the same as in a Simple Linear Regression analysis, but there are some differences. In this lesson, we’ll start by assuming all conditions are met (we’ll talk more about those conditions in Lesson 22) and learn how to interpret the output from a multiple linear regression analysis.

What you need to know from this lesson:

After completing this lesson, you should be able to

explain the difference between simple and multiple linear regression
write the least-squares regression equation given computer output
interpret the coefficients and constant term in the equation in the context of the problem
construct a confidence interval for a coefficient in the population regression equation
predict the response variable for certain values of the explanatory variables
perform an F-test to determine if at least one of the explanatory variables is a significant predictor of the response variable
perform t-tests on each explanatory variable to determine the significance of each explanatory variable as a predictor of the response variable after accounting for the effects of the other explanatory variables in the model
perform a backwards model selection to determine a model containing only significant predictors of the response variable
compute and interpret $ R^2 $ in the context of the problem
compute and interpret the estimate of the standard deviation of the residuals

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Sections 10.1 and 10.3 (pages 581-584) in the text
3. Do the Lesson 21 questions at the end of the lesson notes

The Lesson

The Literacy Rate Example

Literacy rate is a reflection of the educational facilities and quality of education available in a country, and mass communication plays a large part in the educational process. In an effort to relate the literacy rate of a country to various mass communication outlets, a demographer has proposed to relate literacy rate to the following variables: number of daily newspaper copies (per 1000 population), number of radios (per 1000 population), and number of TV sets (per 1000 population).

Here are the data:

LITERACY

##                      Country newspapers radios tv.sets literacy.rate
## 1  Czech Republic / Slovakia        280    266     228          0.98
## 2                      Italy        142    230     201          0.93
## 3                      Kenya         10    114       2          0.25
## 4                     Norway        391    313     227          0.99
## 5                     Panama         86    329      82          0.79
## 6                Philippines         17     42      11          0.72
## 7                    Tunisia         21     49      16          0.32
## 8                        USA        314   1695     472          0.99
## 9                     Russia        333    430     185          0.99
## 10                 Venezuala         91    182      89          0.82
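The notes do not show how LITERACY was read into R. As a minimal sketch, assuming the ten rows printed above are the complete data set, the data frame can be entered by hand so that the commands in the rest of the lesson can be run. The column names and values (including the country spellings) are copied from the printout.

```r
# Enter the printed table by hand; values are copied from the output above.
LITERACY <- data.frame(
  Country       = c("Czech Republic / Slovakia", "Italy", "Kenya", "Norway",
                    "Panama", "Philippines", "Tunisia", "USA", "Russia",
                    "Venezuala"),
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)

# Quick check of the structure: 10 countries, 5 columns
str(LITERACY)
```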

1. What is the response variable? What are the explanatory variables?

Response: Literacy Rate
Explanatory:
- Number of Newspapers (per 1000 population)
- Number of Radios (per 1000 population)
- Number of TV sets (per 1000 population)

In Lesson 22, we’ll discuss the conditions (and checks of those conditions) that must be met for conclusions from a multiple linear regression analysis to be valid to a population of interest. For now, let’s assume all conditions are satisfied.

The output from the multiple linear regression analysis is given below:

### MULTIPLE LIN REG MOD ###
lit_mod<-with(LITERACY, lm(literacy.rate~newspapers+radios+tv.sets))
summary(lit_mod)

## 
## Call:
## lm(formula = literacy.rate ~ newspapers + radios + tv.sets)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.233963 -0.069603 -0.007276  0.127095  0.188900 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.5148602  0.0936762   5.496  0.00152 **
## newspapers   0.0005421  0.0008653   0.626  0.55410   
## radios      -0.0003535  0.0003285  -1.076  0.32330   
## tv.sets      0.0019882  0.0015503   1.282  0.24699   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1865 on 6 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.5482 
## F-statistic:  4.64 on 3 and 6 DF,  p-value: 0.05255

### ANOVA ###
anova(lit_mod)

## Analysis of Variance Table
## 
## Response: literacy.rate
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## newspapers  1 0.42615 0.42615 12.2579 0.01281 *
## radios      1 0.00063 0.00063  0.0182 0.89704  
## tv.sets     1 0.05718 0.05718  1.6448 0.24699  
## Residuals   6 0.20859 0.03477                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that in the output above, a sum of squares is given for each variable listed in the model. These are Type I sums of squares and are “sequential”: each one measures the additional variation explained by that variable after the variables listed before it are already in the model. In essence, the variables are tested in the order they are listed in the model.

2. To obtain the sum of squares for the model, you must add the sums of squares for all of the variables. What is the sum of squares for the model containing the variables newspaper copies, radios and television sets?

### SSM ###
0.42615+0.00063+0.05718

## [1] 0.48396
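As a sketch of what “sequential” means in practice, the same sum can be pulled from the ANOVA table rather than typed in, and the model can be refit with the variables in a different order. The individual rows change with the order, but their total does not. The data frame is re-entered from the printed table so the block runs on its own; the object names `tab`, `tab_rev`, `SSM_rev` are illustrative and not from the notes.

```r
# Data re-entered from the printed table (assumed to be the full data set)
LITERACY <- data.frame(
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)

# Model in the order used in the notes
lit_mod <- lm(literacy.rate ~ newspapers + radios + tv.sets, data = LITERACY)
tab <- anova(lit_mod)
SSM <- sum(tab[c("newspapers", "radios", "tv.sets"), "Sum Sq"])  # model SS
SSE <- tab["Residuals", "Sum Sq"]                                # error SS

# Same variables, reversed order: rows differ, total model SS is unchanged
tab_rev <- anova(lm(literacy.rate ~ tv.sets + radios + newspapers,
                    data = LITERACY))
SSM_rev <- sum(tab_rev[c("tv.sets", "radios", "newspapers"), "Sum Sq"])

c(SSM = SSM, SSM_rev = SSM_rev, SSE = SSE)
```

With TV sets listed first, its row is the sum of squares for TV sets alone, which is the value that appears later in the ANOVA table for the one-variable model.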

3. Write the least-squares regression equation and explain what each term in the regression equation represents in the context of the problem.

LITERACY=0.5148602+0.0005421*NEWSPAPER-0.0003535*RADIOS+0.0019882*TV

4. Interpret the coefficient of each term in the context of the problem.

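NEWSPAPER: With all other variables held constant, each additional newspaper copy (per 1000 population) is associated with an increase of 0.0005421 in predicted literacy rate (about 0.05 percentage points).
RADIOS: With all other variables held constant, each additional radio (per 1000 population) is associated with a decrease of 0.0003535 in predicted literacy rate (about 0.04 percentage points).
TV SETS: With all other variables held constant, each additional TV set (per 1000 population) is associated with an increase of 0.0019882 in predicted literacy rate (about 0.20 percentage points).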

5. What is the “constant” term equivalent to in the Simple Linear Regression equation? Interpret the constant term in the context of the problem.

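IN SLR: The constant term is the y-intercept, the predicted response when x=0
IN MLR: The constant term is the predicted response when $ x_1=0, x_2=0, ..., x_k=0 $
HERE: A country with 0 newspaper copies, 0 radios, and 0 TV sets (per 1000 population) has a predicted literacy rate of 0.5148602 (about 51.5%).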

6. Construct and interpret a 95% confidence interval for the coefficient of TV sets.

### CONFIDENCE INTERVAL FOR THE EFFECT OF TV ###
point_est<-0.0019882

critical_val<-qt(0.975, df=6)
critical_val

## [1] 2.446912

std_err<-0.0015503 

point_est+c(-1,1)*critical_val*std_err

## [1] -0.001805247  0.005781647
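As a cross-check on the hand calculation, base R's `confint()` should return the same interval directly from the fitted model. This is a sketch that re-enters the printed data so it runs on its own; `ci_tv` is an illustrative name.

```r
# Data re-entered from the printed table (assumed to be the full data set)
LITERACY <- data.frame(
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)
lit_mod <- lm(literacy.rate ~ newspapers + radios + tv.sets, data = LITERACY)

# 95% confidence interval for the tv.sets coefficient only
ci_tv <- confint(lit_mod, "tv.sets", level = 0.95)
ci_tv
```

One way to read the interval: we are 95% confident that, with newspapers and radios held constant, each additional TV set (per 1000 population) changes the mean literacy rate by between about -0.0018 and 0.0058. Because the interval contains 0, it is consistent with the t-test p-value of 0.24699 for TV sets in the full model.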

7. Predict literacy rate for a country that has 200 newspapers (per 1000 in the population), 800 radios (per 1000 in the population), and 250 TV sets (per 1000 in the population).

### PREDICTION ###
0.5148602+
  0.0005421*200+
  -0.0003535*800+
  0.0019882*250

## [1] 0.8375302
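The same prediction can be obtained with `predict()`, which avoids retyping rounded coefficients. The sketch below re-enters the printed data so it runs on its own; `new_country` and `pred` are illustrative names. The result should agree with the hand calculation up to rounding of the coefficients.

```r
# Data re-entered from the printed table (assumed to be the full data set)
LITERACY <- data.frame(
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)
lit_mod <- lm(literacy.rate ~ newspapers + radios + tv.sets, data = LITERACY)

# Column names in newdata must match the explanatory variables in the model
new_country <- data.frame(newspapers = 200, radios = 800, tv.sets = 250)
pred <- predict(lit_mod, newdata = new_country)
pred
```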

8. What percent of the variation in literacy rate is being explained by this regression model?

### R SQUARED ###
summary(lit_mod)$r.squared

## [1] 0.698808
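To see where this number comes from, $ R^2 $ can be computed by hand from the ANOVA table as the model sum of squares divided by the total sum of squares. The sketch below uses the rounded values printed earlier, so it matches the output only to about four decimal places; the adjusted value uses n = 10 countries and 3 explanatory variables.

```r
# Sums of squares copied from the ANOVA table above
SSM <- 0.42615 + 0.00063 + 0.05718   # model
SSE <- 0.20859                       # residuals
SST <- SSM + SSE                     # total

# Proportion of variation in literacy rate explained by the model
r_sq <- SSM / SST
r_sq

# Adjusted R-squared penalizes for the number of explanatory variables
n <- 10   # countries
p <- 3    # explanatory variables
adj_r_sq <- 1 - (1 - r_sq) * (n - 1) / (n - p - 1)
adj_r_sq
```

So about 69.9% of the variation in literacy rate is explained by the model containing newspapers, radios, and TV sets.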

9. One goal of a multiple linear regression analysis is to determine which (if any) explanatory variables are “statistically significant predictors” of the response variable. A multi-step process is used to determine this.

Step 1: Determine if at least one of the explanatory variables helps to predict the literacy rate by performing an F-test.

State the null and alternative hypotheses.

$ H_0: \beta_1 = \beta_2 = \beta_3 = 0 $

$ H_A: $ at least one of $ \beta_1, \beta_2, \beta_3 $ is not equal to 0

Calculate the F-statistic

### F STATISTIC ###
SSM = 0.42615+0.00063+0.05718
df_SSM = 3
MSM = SSM/df_SSM
MSM

## [1] 0.16132

SSE = 0.20859 
df_SSE = 6
MSE = SSE/df_SSE
MSE

## [1] 0.034765

F_stat = MSM/MSE
F_stat

## [1] 4.640299

What are the degrees of freedom for the F-statistic?

### F DEGREES OF FREEDOM ###
df_SSM = 3

df_SSE = 6

Determine the p-value and state a conclusion in the context of the problem.

### F TEST ###
pf(F_stat, df1=df_SSM, df2=df_SSE, lower.tail=FALSE)

## [1] 0.05255035

With a p-value of 0.0526, there is only weak evidence that at least one of the explanatory variables (newspapers, radios, or TV sets) is a useful predictor of literacy rate.

Step 2: If there is even a little evidence to reject the claim in the null hypothesis from the F-test, perform t-tests on each explanatory variable.

Let’s perform a t-test by hand for the coefficient of newspaper copies.

State the null and alternative hypotheses in notation and in words

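$ H_0: \beta_1 = 0 $, there is no effect of the number of newspaper copies on literacy rate after accounting for radios and TV sets

$ H_A: \beta_1 \neq 0 $, there is an effect of the number of newspaper copies on literacy rate after accounting for radios and TV sets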

Calculate the t-statistic

### T TEST-STAT ###
# 0.626

How many degrees of freedom does the t-statistic have?

### T DF ###
# 6

Find the p-value and state a conclusion about newspaper copies in the context of the problem.

### P-VAL ###
# 0.55410
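The three values above are read straight from the output. As a sketch of the by-hand version, the t-statistic is the estimate divided by its standard error, the degrees of freedom are n minus the number of estimated coefficients (10 - 3 - 1 = 6), and the p-value is the two-sided tail area. The inputs are the rounded values from the coefficient table, so results match the output only to about three decimal places.

```r
# Estimate and standard error for newspapers, copied from the output
est <- 0.0005421
se  <- 0.0008653

# t-statistic for H0: beta_1 = 0
t_stat <- est / se
t_stat

# Residual degrees of freedom: n - (number of slopes) - 1
df_resid <- 10 - 3 - 1

# Two-sided p-value
p_val <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
p_val
```

With a p-value of about 0.554, there is no evidence that newspaper copies is a useful predictor of literacy rate once radios and TV sets are in the model.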

Here is the output. From the output, perform the t-test on radios and TV sets. For each test,

State the null and alternative hypotheses
Give the t-statistic with degrees of freedom
State a conclusion in the context of the problem supported by a p-value.

summary(lit_mod)

## 
## Call:
## lm(formula = literacy.rate ~ newspapers + radios + tv.sets)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.233963 -0.069603 -0.007276  0.127095  0.188900 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.5148602  0.0936762   5.496  0.00152 **
## newspapers   0.0005421  0.0008653   0.626  0.55410   
## radios      -0.0003535  0.0003285  -1.076  0.32330   
## tv.sets      0.0019882  0.0015503   1.282  0.24699   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1865 on 6 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.5482 
## F-statistic:  4.64 on 3 and 6 DF,  p-value: 0.05255
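For the two remaining tests, the t-statistics and p-values can be pulled from the coefficient table instead of being read off the printout. This is a sketch that re-enters the printed data so it runs on its own; `coef_tab` is an illustrative name. Both tests use 6 degrees of freedom, the residual degrees of freedom of the full model.

```r
# Data re-entered from the printed table (assumed to be the full data set)
LITERACY <- data.frame(
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)
lit_mod <- lm(literacy.rate ~ newspapers + radios + tv.sets, data = LITERACY)

# Coefficient table as a numeric matrix
coef_tab <- coef(summary(lit_mod))

# t-statistics and two-sided p-values for radios and TV sets
coef_tab[c("radios", "tv.sets"), c("t value", "Pr(>|t|)")]

# Degrees of freedom shared by all the t-tests in this model
df.residual(lit_mod)
```

Neither p-value (about 0.323 for radios, about 0.247 for TV sets) is below 0.05, so in the full model neither variable shows evidence of being a useful predictor after accounting for the other two.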

Step 3: Use a backwards selection process to determine a “final model” that contains only statistically significant predictors of the response variable.

If all p-values from the t-test are less than 0.05 (or so), all variables are deemed “statistically significant predictors” of the response variable and all stay in the model.
If one or more of the p-values from the t-test are greater than 0.05 (or so):
- Remove the variable with the highest p-value from the t-tests as this variable is deemed the “least statistically significant predictor” of the response variable.
- Re-do the multiple linear regression analysis with the remaining explanatory variables and recalculate the t-statistics and p-values from the t-test.
- If at least one of the explanatory variables still has a p-value from the t-tests greater than 0.05 (or so), remove the variable with the highest p-value.
- Continue this process of removing the explanatory variable with the highest p-value from the t-tests and re-running the analyses with the remaining variables until all remaining explanatory variables have p-values from the t-tests $ \leq $ 0.05 (or so).
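The steps above can be sketched as a short loop. This is only an illustration of the procedure, not code from the notes: the function name `backward_select` and its arguments are hypothetical, the cutoff is fixed at 0.05 even though the notes say “or so”, and the data are re-entered from the printed table.

```r
# Data re-entered from the printed table (assumed to be the full data set)
LITERACY <- data.frame(
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)

# Hypothetical helper: drop the least significant variable until every
# remaining variable has a t-test p-value at or below alpha.
backward_select <- function(data, response, predictors, alpha = 0.05) {
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    p_vals <- coef(summary(fit))[predictors, "Pr(>|t|)"]
    names(p_vals) <- predictors          # keep names when one variable is left
    if (max(p_vals) <= alpha) return(fit)
    worst <- names(which.max(p_vals))    # variable with the highest p-value
    message("Removing ", worst, " (p = ", round(max(p_vals), 4), ")")
    predictors <- setdiff(predictors, worst)
  }
  # Nothing survived: fall back to the intercept-only model
  lm(reformulate("1", response = response), data = data)
}

final_mod <- backward_select(LITERACY, "literacy.rate",
                             c("newspapers", "radios", "tv.sets"))
attr(terms(final_mod), "term.labels")
```

Each pass refits the model before looking at p-values again, which matters because a variable's p-value can change a great deal once another variable is removed.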

a. Which variable would be removed first (if any)?

Newspapers should be removed first because it has the highest p-value from the t-tests (0.55410).

After removing newspaper copies, run the analysis again with the remaining explanatory variables and re-do the t-tests. The output is given below.

### REMOVE NEWSPAPER###
lit_mod2<-with(LITERACY, lm(literacy.rate~radios+tv.sets))
summary(lit_mod2)

## 
## Call:
## lm(formula = literacy.rate ~ radios + tv.sets)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23164 -0.05620 -0.03659  0.14394  0.18769 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5300764  0.0864566   6.131 0.000476 ***
## radios      -0.0004736  0.0002548  -1.859 0.105418    
## tv.sets      0.0027812  0.0008551   3.253 0.014006 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1782 on 7 degrees of freedom
## Multiple R-squared:  0.6791, Adjusted R-squared:  0.5874 
## F-statistic: 7.407 on 2 and 7 DF,  p-value: 0.01872

b. Which variable, if any, would be removed?

Radios, because its p-value (0.105418) is the highest and is greater than 0.05.

Once again, run the analysis with the remaining variable after radios has been eliminated. The output is given below:

### REMOVE RADIO###
lit_mod3<-with(LITERACY, lm(literacy.rate~tv.sets))
summary(lit_mod3)

## 
## Call:
## lm(formula = literacy.rate ~ tv.sets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3207 -0.1542  0.1012  0.1234  0.1652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.567902   0.096057   5.912 0.000357 ***
## tv.sets     0.001389   0.000471   2.948 0.018473 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2037 on 8 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.4608 
## F-statistic: 8.693 on 1 and 8 DF,  p-value: 0.01847

c. Is television sets a statistically significant predictor of literacy rate? Explain.

There is convincing evidence (p-value = 0.018473) that TV sets is a statistically significant predictor of literacy rate. Therefore, we reject the null hypothesis that its coefficient is 0.

10. Once a “final model” is determined with only statistically significant predictors, use the output from that model to answer the following questions:

### FINAL MOD ###
summary(lit_mod3)

## 
## Call:
## lm(formula = literacy.rate ~ tv.sets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3207 -0.1542  0.1012  0.1234  0.1652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.567902   0.096057   5.912 0.000357 ***
## tv.sets     0.001389   0.000471   2.948 0.018473 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2037 on 8 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.4608 
## F-statistic: 8.693 on 1 and 8 DF,  p-value: 0.01847

anova(lit_mod3)

## Analysis of Variance Table
## 
## Response: literacy.rate
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## tv.sets    1 0.36065 0.36065  8.6925 0.01847 *
## Residuals  8 0.33191 0.04149                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a. What is the least-squares regression equation?

LITERACY = 0.567902+0.001389*TVSETS

b. What is the estimate of the standard deviation of the residuals, $ \sigma $?

# Residual standard error: 0.2037

c. What percent of the variation in literacy rate is explained by the regression model?

# Multiple R-squared:  0.5207

d. Predict literacy rate for the same country as in #7.

### PREDICT TV=250 ###
newdata<-data.frame(tv.sets=c(250))
predict(lit_mod3, newdata, interval="prediction")

##         fit       lwr      upr
## 1 0.9150565 0.4108944 1.419219
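The `lwr` and `upr` columns are a 95% prediction interval for an individual country. As a sketch of where they come from, the interval can be rebuilt by hand from the standard simple linear regression formula, fit ± t* × s × sqrt(1 + 1/n + (x0 - x̄)² / Sxx). The data are re-entered from the printed table; the names `se_pred`, `t_crit`, and `pi_hand` are illustrative.

```r
# Only the two columns used by the final model are needed
LITERACY <- data.frame(
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)
lit_mod3 <- lm(literacy.rate ~ tv.sets, data = LITERACY)

x  <- LITERACY$tv.sets
n  <- length(x)
x0 <- 250                                  # TV sets for the new country

s   <- summary(lit_mod3)$sigma             # residual standard error
fit <- sum(coef(lit_mod3) * c(1, x0))      # intercept + slope * x0

# Standard error for predicting a single new observation at x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
t_crit  <- qt(0.975, df = n - 2)

pi_hand <- fit + c(-1, 1) * t_crit * se_pred
c(fit = fit, lwr = pi_hand[1], upr = pi_hand[2])
```

Note that the upper limit is above 1, which is not a possible literacy rate. A straight-line model places no bounds on its predictions, so the interval should be read with that limitation in mind.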