KITADA

Lesson #21

Multiple Linear Regression

Motivation:

We’ve spent a lot of time discussing simple linear regression. In many situations, there is more than one explanatory variable that helps “explain” the response variable. When we have more than one explanatory variable, a Multiple Linear Regression analysis is performed. Many of the steps in performing a Multiple Linear Regression Analysis are the same as a Simple Linear Regression analysis, but there are some differences. In this lesson, we’ll start by assuming all conditions are met (we’ll talk more about those conditions in Lesson 22) and learn how to interpret the output from a multiple linear regression analysis.

What you need to know from this lesson:

After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

The Lesson

The Literacy Rate Example

Literacy rate is a reflection of the educational facilities and quality of education available in a country, and mass communication plays a large part in the educational process. In an effort to relate the literacy rate of a country to various mass communication outlets, a demographer has proposed to relate literacy rate to the following variables: number of daily newspaper copies (per 1000 population), number of radios (per 1000 population), and number of TV sets (per 1000 population).

Here are the data:

LITERACY
##                      Country newspapers radios tv.sets literacy.rate
## 1  Czech Republic / Slovakia        280    266     228          0.98
## 2                      Italy        142    230     201          0.93
## 3                      Kenya         10    114       2          0.25
## 4                     Norway        391    313     227          0.99
## 5                     Panama         86    329      82          0.79
## 6                Philippines         17     42      11          0.72
## 7                    Tunisia         21     49      16          0.32
## 8                        USA        314   1695     472          0.99
## 9                     Russia        333    430     185          0.99
## 10                 Venezuela         91    182      89          0.82

1. What is the response variable? What are the explanatory variables?

In Lesson 22, we’ll discuss the conditions (and checks of those conditions) that must be met for conclusions from a multiple linear regression analysis to be valid for a population of interest. For now, let’s assume all conditions are satisfied.

The output from the multiple linear regression analysis is given below:

### MULTIPLE LIN REG MOD ###
lit_mod<-with(LITERACY, lm(literacy.rate~newspapers+radios+tv.sets))
summary(lit_mod)
## 
## Call:
## lm(formula = literacy.rate ~ newspapers + radios + tv.sets)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.233963 -0.069603 -0.007276  0.127095  0.188900 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.5148602  0.0936762   5.496  0.00152 **
## newspapers   0.0005421  0.0008653   0.626  0.55410   
## radios      -0.0003535  0.0003285  -1.076  0.32330   
## tv.sets      0.0019882  0.0015503   1.282  0.24699   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1865 on 6 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.5482 
## F-statistic:  4.64 on 3 and 6 DF,  p-value: 0.05255
### ANOVA ###
anova(lit_mod)
## Analysis of Variance Table
## 
## Response: literacy.rate
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## newspapers  1 0.42615 0.42615 12.2579 0.01281 *
## radios      1 0.00063 0.00063  0.0182 0.89704  
## tv.sets     1 0.05718 0.05718  1.6448 0.24699  
## Residuals   6 0.20859 0.03477                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

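Note that in the output above, a sum of squares is given for each variable listed in the model. These are Type I sums of squares and are “sequential”: each one measures the additional variation explained by that variable after the variables listed before it are already in the model. In essence, the variables are tested in the order they are listed in the model, so changing the order can change these values.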
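To see what “sequential” means in practice, here is a minimal sketch that refits the same model with the variables listed in a different order. It assumes the data are entered by hand exactly as printed in the table above; the object names `mod_news_first` and `mod_tv_first` are ours and do not appear in the lesson output.

```r
# Data typed in from the printed table (counts are per 1000 population).
LITERACY <- data.frame(
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)

# Same three explanatory variables, listed in two different orders.
mod_news_first <- lm(literacy.rate ~ newspapers + radios + tv.sets, data = LITERACY)
mod_tv_first   <- lm(literacy.rate ~ tv.sets + radios + newspapers, data = LITERACY)

# Type I (sequential) tables: the rows for individual variables differ.
aov_news_first <- anova(mod_news_first)
aov_tv_first   <- anova(mod_tv_first)
print(aov_news_first)
print(aov_tv_first)

# Sequential sum of squares for tv.sets when it is listed last vs. first.
ss_tv_last  <- aov_news_first["tv.sets", "Sum Sq"]
ss_tv_first <- aov_tv_first["tv.sets", "Sum Sq"]

# The model sum of squares (everything except the Residuals row)
# does not depend on the order.
ssm_news_first <- sum(head(aov_news_first[["Sum Sq"]], -1))
ssm_tv_first   <- sum(head(aov_tv_first[["Sum Sq"]], -1))
c(ssm_news_first, ssm_tv_first)
```

If the data match the table, the `tv.sets` row should change between the two tables (0.05718 when listed last, 0.36065 when listed first), while the total for the model stays the same.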

2. To obtain the sum of squares for the model, add the sums of squares for the individual variables. What is the sum of squares for the model containing the variables newspaper copies, radios, and television sets?

### SSM ###
0.42615+0.00063+0.05718
## [1] 0.48396

3. Write the least-squares regression equation and explain what each term in the regression equation represents in the context of the problem.

LITERACY = 0.5148602 + 0.0005421*NEWSPAPERS - 0.0003535*RADIOS + 0.0019882*TVSETS

4. Interpret the coefficient of each term in the context of the problem.

5. What is the “constant” term equivalent to in the Simple Linear Regression equation? Interpret the constant term in the context of the problem.

6. Construct and interpret a 95% confidence interval for the coefficient of TV sets.

### CONFIDENCE INTERVAL FOR THE EFFECT OF TV ###
point_est<-0.0019882

critical_val<-qt(0.975, df=6)
critical_val
## [1] 2.446912
std_err<-0.0015503 

point_est+c(-1,1)*critical_val*std_err
## [1] -0.001805247  0.005781647
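The same recipe (estimate plus or minus critical value times standard error) works for every slope. The sketch below applies it to all three coefficients at once, using the estimates and standard errors copied from the summary output; the vector names `est`, `se`, and `ci` are our own. If the fitted object `lit_mod` is available, `confint(lit_mod)` should give the same intervals up to rounding.

```r
# Estimates and standard errors copied from the summary(lit_mod) output.
est <- c(newspapers = 0.0005421, radios = -0.0003535, tv.sets = 0.0019882)
se  <- c(newspapers = 0.0008653, radios =  0.0003285, tv.sets = 0.0015503)

# Residual degrees of freedom: n - (number of slopes) - 1 = 10 - 3 - 1.
df_resid <- 6
t_crit   <- qt(0.975, df = df_resid)

# 95% confidence interval for each slope.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci

# TRUE when the interval contains 0, i.e. the slope is not
# significantly different from 0 at the 5% level.
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0
contains_zero
```

An interval that contains 0 tells the same story as a t-test p-value above 0.05 for that coefficient.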

7. Predict literacy rate for a country that has 200 daily newspaper copies (per 1000 in the population), 800 radios (per 1000 in the population), and 250 TV sets (per 1000 in the population).

### PREDICTION ###
0.5148602+
  0.0005421*200+
  -0.0003535*800+
  0.0019882*250
## [1] 0.8375302

8. What percent of the variation in literacy rate is being explained by this regression model?

### R SQUARED ###
summary(lit_mod)$r.squared
## [1] 0.698808
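The value reported by `summary()` can also be recovered from the ANOVA table, since R-squared is the model sum of squares divided by the total sum of squares. The sketch below does this with the printed sums of squares and also reproduces the adjusted R-squared, which penalizes for the number of explanatory variables. The names `SST`, `n`, and `p` are ours.

```r
# Sums of squares copied from anova(lit_mod).
SSM <- 0.42615 + 0.00063 + 0.05718   # model
SSE <- 0.20859                       # residuals
SST <- SSM + SSE                     # total

# Proportion of variation in literacy rate explained by the model.
r_squared <- SSM / SST
r_squared

# Adjusted R-squared: compares mean squares instead of sums of squares.
n <- 10   # countries
p <- 3    # explanatory variables
adj_r_squared <- 1 - (SSE / (n - p - 1)) / (SST / (n - 1))
adj_r_squared
```

Both values should agree with the `Multiple R-squared` and `Adjusted R-squared` lines of the summary output, up to rounding of the sums of squares.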

9. One goal of a multiple linear regression analysis is to determine which (if any) explanatory variables are “statistically significant predictors” of the response variable. A multi-step process is used to determine this.

Step 1: Determine if at least one of the explanatory variables helps to predict the literacy rate by performing an F-test.

\( H_0: \beta_1 = \beta_2 = \beta_3 = 0 \)

\( H_A: \text{at least one } \beta_j \neq 0 \)

### F STATISTIC ###
SSM = 0.42615+0.00063+0.05718
df_SSM = 3
MSM = SSM/df_SSM
MSM 
## [1] 0.16132
SSE = 0.20859 
df_SSE = 6
MSE = SSE/df_SSE
MSE
## [1] 0.034765
F_stat = MSM/MSE
F_stat
## [1] 4.640299
### F DEGREES OF FREEDOM ###
df_SSM = 3

df_SSE = 6
### F TEST ###
pf(F_stat, df1=df_SSM, df2=df_SSE, lower.tail=FALSE)
## [1] 0.05255035

With a p-value of about 0.053, there is only weak evidence that at least one of the explanatory variables (newspapers, radios, or TV sets) has a significant effect on literacy rate; at the 0.05 level we would narrowly fail to reject the null hypothesis.
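The same decision can be reached with a critical value instead of a p-value. As a small sketch, assuming a 5% significance level (the lesson does not fix one explicitly), we compare the F statistic with the 95th percentile of the F distribution on 3 and 6 degrees of freedom.

```r
# F statistic computed above from the ANOVA table.
F_stat <- 4.640299

# Critical value: 95th percentile of F with 3 and 6 degrees of freedom.
F_crit <- qf(0.95, df1 = 3, df2 = 6)
F_crit

# Reject H0 at the 5% level only if the statistic exceeds the critical value.
reject_H0 <- F_stat > F_crit
reject_H0
```

The statistic falls just short of the critical value, which matches a p-value slightly above 0.05.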

Step 2: If there is even a little evidence to reject the claim in the null hypothesis from the F-test, perform t-tests on each explanatory variable.

Let’s perform a t-test by hand for the coefficient of newspaper copies.

\( H_0: \beta_1 = 0 \), there is no effect of amount of newspapers on literacy

\( H_A: \beta_1 \neq 0 \), there is an effect of amount of newspapers on literacy

### T TEST-STAT ###
# 0.626
### T DF ###
# 6
### P-VAL ###
# 0.55410
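The three numbers above can be reproduced from the estimate and standard error in the coefficient table. This is a minimal sketch using the rounded values printed in the output, so the results may differ from R's in the last digit; the names `t_stat` and `p_value` are ours.

```r
# Estimate and standard error for newspapers, from summary(lit_mod).
est <- 0.0005421
se  <- 0.0008653

# Test statistic: (estimate - hypothesized value) / standard error.
t_stat <- (est - 0) / se
t_stat

# Degrees of freedom: n - (number of slopes) - 1 = 10 - 3 - 1.
df_resid <- 6

# Two-sided p-value: area in both tails beyond |t|.
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
p_value
```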

Here is the output. From the output, perform the t-tests on radios and TV sets. For each test, state the hypotheses, the test statistic, the degrees of freedom, and the p-value.

summary(lit_mod)
## 
## Call:
## lm(formula = literacy.rate ~ newspapers + radios + tv.sets)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.233963 -0.069603 -0.007276  0.127095  0.188900 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.5148602  0.0936762   5.496  0.00152 **
## newspapers   0.0005421  0.0008653   0.626  0.55410   
## radios      -0.0003535  0.0003285  -1.076  0.32330   
## tv.sets      0.0019882  0.0015503   1.282  0.24699   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1865 on 6 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.5482 
## F-statistic:  4.64 on 3 and 6 DF,  p-value: 0.05255

Step 3: Use a backwards selection process to determine a “final model” that contains only statistically significant predictors of the response variable.

a. Which variable would be removed first (if any)?

Newspapers should be removed first, because it has the largest p-value (0.55410) among the three t-tests.

After removing newspaper copies, run the analysis again with the remaining explanatory variables and re-do the t-tests. The output is given below.

### REMOVE NEWSPAPER###
lit_mod2<-with(LITERACY, lm(literacy.rate~radios+tv.sets))
summary(lit_mod2)
## 
## Call:
## lm(formula = literacy.rate ~ radios + tv.sets)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23164 -0.05620 -0.03659  0.14394  0.18769 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5300764  0.0864566   6.131 0.000476 ***
## radios      -0.0004736  0.0002548  -1.859 0.105418    
## tv.sets      0.0027812  0.0008551   3.253 0.014006 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1782 on 7 degrees of freedom
## Multiple R-squared:  0.6791, Adjusted R-squared:  0.5874 
## F-statistic: 7.407 on 2 and 7 DF,  p-value: 0.01872

b. Which variable, if any, would be removed?

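Radios, because its p-value (0.105418) is still larger than the usual 0.05 cutoff, while the p-value for TV sets (0.014006) is not.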

Once again, run the analysis with the remaining variable after radios has been eliminated. The output is given below:

### REMOVE RADIO###
lit_mod3<-with(LITERACY, lm(literacy.rate~tv.sets))
summary(lit_mod3)
## 
## Call:
## lm(formula = literacy.rate ~ tv.sets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3207 -0.1542  0.1012  0.1234  0.1652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.567902   0.096057   5.912 0.000357 ***
## tv.sets     0.001389   0.000471   2.948 0.018473 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2037 on 8 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.4608 
## F-statistic: 8.693 on 1 and 8 DF,  p-value: 0.01847

c. Is television sets a statistically significant predictor of literacy rate? Explain.

There is convincing evidence that TV sets is a statistically significant predictor of literacy rate (t = 2.948, df = 8, p-value = 0.018473). Therefore, we reject the null hypothesis that the coefficient of TV sets is zero.
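Steps a through c can be written as a loop. The following is only a sketch of p-value-based backwards selection, not a built-in R procedure: it assumes a 0.05 cutoff, re-enters the data by hand from the table, and uses our own names (`alpha`, `predictors`, `removed`).

```r
# Data typed in from the printed table (counts are per 1000 population).
LITERACY <- data.frame(
  newspapers    = c(280, 142, 10, 391, 86, 17, 21, 314, 333, 91),
  radios        = c(266, 230, 114, 313, 329, 42, 49, 1695, 430, 182),
  tv.sets       = c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89),
  literacy.rate = c(0.98, 0.93, 0.25, 0.99, 0.79, 0.72, 0.32, 0.99, 0.99, 0.82)
)

alpha      <- 0.05                                   # assumed cutoff
predictors <- c("newspapers", "radios", "tv.sets")   # start with the full model
removed    <- character(0)                           # variables dropped, in order

while (length(predictors) > 0) {
  # Refit using only the variables still in the model.
  fit   <- lm(reformulate(predictors, response = "literacy.rate"), data = LITERACY)
  coefs <- summary(fit)$coefficients

  # t-test p-values for the slopes (the intercept row is skipped).
  p_vals <- setNames(coefs[predictors, "Pr(>|t|)"], predictors)

  # Stop once the least significant variable is still significant.
  worst <- names(which.max(p_vals))
  if (p_vals[worst] <= alpha) break

  # Otherwise drop it and go around again.
  removed    <- c(removed, worst)
  predictors <- setdiff(predictors, worst)
}

removed      # order in which variables were dropped
predictors   # variables left in the final model
```

With these data, the loop should drop newspapers, then radios, and stop with TV sets as the only remaining variable, the same path followed by hand above.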

10. Once a “final model” is determined with only statistically significant predictors, use the output from that model to answer the following questions:

### FINAL MOD ###
summary(lit_mod3)
## 
## Call:
## lm(formula = literacy.rate ~ tv.sets)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3207 -0.1542  0.1012  0.1234  0.1652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.567902   0.096057   5.912 0.000357 ***
## tv.sets     0.001389   0.000471   2.948 0.018473 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2037 on 8 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.4608 
## F-statistic: 8.693 on 1 and 8 DF,  p-value: 0.01847
anova(lit_mod3)
## Analysis of Variance Table
## 
## Response: literacy.rate
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## tv.sets    1 0.36065 0.36065  8.6925 0.01847 *
## Residuals  8 0.33191 0.04149                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a. What is the least-squares regression equation?

LITERACY = 0.567902+0.001389*TVSETS

b. What is the estimate of the standard deviation of the residuals, \( \sigma \)?

# Residual standard error: 0.2037 

c. What percent of the variation in literacy rate is explained by the regression model?

# Multiple R-squared:  0.5207
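Both answers can be checked against the ANOVA table for the final model: the residual standard error is the square root of the residual mean square, and R-squared is the model sum of squares over the total. A short sketch using the printed values (the names `s` and `r_squared` are ours):

```r
# Values copied from anova(lit_mod3).
SSM      <- 0.36065   # tv.sets
SSE      <- 0.33191   # residuals
df_resid <- 8

# Estimate of sigma: square root of the residual mean square.
s <- sqrt(SSE / df_resid)
s

# Proportion of variation in literacy rate explained by tv.sets alone.
r_squared <- SSM / (SSM + SSE)
r_squared
```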

d. Predict literacy rate for the same country as in #7.

### PREDICT TV=250 ###
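# New country with 250 TV sets per 1000 population
newdata<-data.frame(tv.sets=250)
# interval="prediction" gives a 95% prediction interval for one new country
predict(lit_mod3, newdata, interval="prediction")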
##         fit       lwr      upr
## 1 0.9150565 0.4108944 1.419219
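The prediction interval can be rebuilt by hand with the usual simple linear regression formula, fit plus or minus t times s times sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx). The sketch below assumes the TV set counts are entered exactly as in the data table and uses the rounded `fit` and residual standard error from the output, so it should match `lwr` and `upr` only to about three decimal places.

```r
# TV sets per 1000 population, typed in from the data table.
tv <- c(228, 201, 2, 227, 82, 11, 16, 472, 185, 89)
n  <- length(tv)

x0  <- 250         # new country's TV sets per 1000
fit <- 0.9150565   # predicted literacy rate, from the predict() output
s   <- 0.2037      # residual standard error, from summary(lit_mod3)

# Critical value on n - 2 = 8 degrees of freedom.
t_crit <- qt(0.975, df = n - 2)

# Standard error for predicting one new observation at x0.
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(tv))^2 / sum((tv - mean(tv))^2))

# 95% prediction interval.
pi_by_hand <- fit + c(-1, 1) * t_crit * se_pred
pi_by_hand

# The upper limit is above 1, which a literacy rate cannot exceed.
pi_by_hand[2] > 1
```

An upper limit above 1 is a reminder that the interval comes from a straight-line model fit to only ten countries, so it is wide and not constrained to the possible range of a rate.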