Q1: Check the first 3 rows of this dataset.

head(state.x77, 3)
        Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska         365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417

Q2: Create a new data frame using the variables “Murder”, “Population”, “Illiteracy”, “Income”, and “Frost”, and check it.

df <- as.data.frame(state.x77)
sub.df <- df[, c("Murder", "Population", "Illiteracy", "Income", "Frost")]
head(sub.df)
           Murder Population Illiteracy Income Frost
Alabama      15.1       3615        2.1   3624    20
Alaska       11.3        365        1.5   6315   152
Arizona       7.8       2212        1.8   4530    15
Arkansas     10.1       2110        1.9   3378    65
California   10.3      21198        1.1   5114    20
Colorado      6.8       2541        0.7   4884   166

Q3: Conduct a multiple regression to predict the dependent variable “Murder” using all other variables as the independent variables and show the results.

mlr <- lm(Murder ~ ., data = sub.df)
options(scipen = 999)  # print small p-values in fixed notation instead of scientific notation
summary(mlr)

Call:
lm(formula = Murder ~ ., data = sub.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 1.23456341 3.86611474   0.319    0.7510    
Population  0.00022368 0.00009052   2.471    0.0173 *  
Illiteracy  4.14283659 0.87435319   4.738 0.0000219 ***
Income      0.00006442 0.00068370   0.094    0.9253    
Frost       0.00058131 0.01005366   0.058    0.9541    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567, Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 0.00000009133
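
The coefficient table above can also be read programmatically. The following is a minimal sketch, assuming a 0.05 significance threshold (the threshold and the names coef.table, p.values, predictors, and sig.vars are illustrative choices, not part of the original answer). It rebuilds the model so that it runs on its own.

```r
# Rebuild the Q2/Q3 objects so this block is self-contained.
df <- as.data.frame(state.x77)
sub.df <- df[, c("Murder", "Population", "Illiteracy", "Income", "Frost")]
mlr <- lm(Murder ~ ., data = sub.df)

# coef(summary(...)) returns the coefficient table as a numeric matrix.
coef.table <- coef(summary(mlr))
p.values <- coef.table[, "Pr(>|t|)"]

# Keep predictors (not the intercept) whose p-value is below the assumed 0.05 cutoff.
predictors <- setdiff(rownames(coef.table), "(Intercept)")
sig.vars <- predictors[p.values[predictors] < 0.05]
print(sig.vars)
```

With the p-values shown in the summary above, only Population and Illiteracy fall below this cutoff.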

Q4: Use the stepwise method to select independent variables, then conduct a multiple regression with the selected variables.

mlr.step <- step(mlr, direction = "both")
Start:  AIC=97.75
Murder ~ Population + Illiteracy + Income + Frost

             Df Sum of Sq    RSS     AIC
- Frost       1     0.021 289.19  95.753
- Income      1     0.057 289.22  95.759
<none>                    289.17  97.749
- Population  1    39.238 328.41 102.111
- Illiteracy  1   144.264 433.43 115.986

Step:  AIC=95.75
Murder ~ Population + Illiteracy + Income

             Df Sum of Sq    RSS     AIC
- Income      1     0.057 289.25  93.763
<none>                    289.19  95.753
+ Frost       1     0.021 289.17  97.749
- Population  1    43.658 332.85 100.783
- Illiteracy  1   236.196 525.38 123.605

Step:  AIC=93.76
Murder ~ Population + Illiteracy

             Df Sum of Sq    RSS     AIC
<none>                    289.25  93.763
+ Income      1     0.057 289.19  95.753
+ Frost       1     0.021 289.22  95.759
- Population  1    48.517 337.76  99.516
- Illiteracy  1   299.646 588.89 127.311
summary(mlr.step)

Call:
lm(formula = Murder ~ Population + Illiteracy, data = sub.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7652 -1.6561 -0.0898  1.4570  7.6758 

Coefficients:
              Estimate Std. Error t value      Pr(>|t|)    
(Intercept) 1.65154974 0.81011208   2.039       0.04713 *  
Population  0.00022419 0.00007984   2.808       0.00724 ** 
Illiteracy  4.08073664 0.58481561   6.978 0.00000000883 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.481 on 47 degrees of freedom
Multiple R-squared:  0.5668,    Adjusted R-squared:  0.5484 
F-statistic: 30.75 on 2 and 47 DF,  p-value: 0.000000002893
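
The AIC values in the step() log are on the scale used by extractAIC(), which for a linear model is n * log(RSS / n) + 2 * k, where k counts the estimated coefficients including the intercept. The sketch below reproduces the starting AIC by hand; the names n, rss, k, aic.manual, and aic.step are illustrative.

```r
# Rebuild the full model from Q3 so this block is self-contained.
df <- as.data.frame(state.x77)
sub.df <- df[, c("Murder", "Population", "Illiteracy", "Income", "Frost")]
mlr <- lm(Murder ~ ., data = sub.df)

n   <- nrow(sub.df)            # number of observations (50 states)
rss <- sum(residuals(mlr)^2)   # residual sum of squares of the full model
k   <- length(coef(mlr))       # estimated coefficients, intercept included

aic.manual <- n * log(rss / n) + 2 * k
aic.step   <- extractAIC(mlr)[2]   # second element is the AIC that step() reports

print(c(manual = aic.manual, extractAIC = aic.step))
```

Both numbers should agree with the "Start:  AIC=97.75" line in the log. At each step, step() drops or adds the term that gives the lowest AIC and stops when <none> is the best option, which is why the final model keeps only Population and Illiteracy.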

Q5: Compare the results in Q3 and Q4.

summary(mlr)

Call:
lm(formula = Murder ~ ., data = sub.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 1.23456341 3.86611474   0.319    0.7510    
Population  0.00022368 0.00009052   2.471    0.0173 *  
Illiteracy  4.14283659 0.87435319   4.738 0.0000219 ***
Income      0.00006442 0.00068370   0.094    0.9253    
Frost       0.00058131 0.01005366   0.058    0.9541    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567, Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 0.00000009133
summary(mlr.step)

Call:
lm(formula = Murder ~ Population + Illiteracy, data = sub.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7652 -1.6561 -0.0898  1.4570  7.6758 

Coefficients:
              Estimate Std. Error t value      Pr(>|t|)    
(Intercept) 1.65154974 0.81011208   2.039       0.04713 *  
Population  0.00022419 0.00007984   2.808       0.00724 ** 
Illiteracy  4.08073664 0.58481561   6.978 0.00000000883 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.481 on 47 degrees of freedom
Multiple R-squared:  0.5668,    Adjusted R-squared:  0.5484 
F-statistic: 30.75 on 2 and 47 DF,  p-value: 0.000000002893

Item                    Q3 (all variables included)              Q4 (stepwise-selected variables)
Variables used          Population, Illiteracy, Income, Frost    Population, Illiteracy
Significant variables   Population, Illiteracy                   Population, Illiteracy
R-squared               0.567                                    0.5668
Model characteristics   May include unnecessary variables        Concise, easier-to-interpret model
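
The comparison can also be checked numerically. The sketch below is one possible way to do it, not part of the original answer: it runs a partial F-test between the two nested models with anova() and collects R-squared, adjusted R-squared, and AIC side by side. The names f.test and comparison are illustrative.

```r
# Rebuild both models so this block is self-contained.
df <- as.data.frame(state.x77)
sub.df <- df[, c("Murder", "Population", "Illiteracy", "Income", "Frost")]
mlr <- lm(Murder ~ ., data = sub.df)
mlr.step <- step(mlr, direction = "both", trace = 0)  # trace = 0 hides the step log

# Partial F-test: do Income and Frost jointly improve on the reduced model?
f.test <- anova(mlr.step, mlr)
print(f.test)

# Side-by-side fit statistics for the two models.
comparison <- data.frame(
  model  = c("Q3 full", "Q4 stepwise"),
  r2     = c(summary(mlr)$r.squared, summary(mlr.step)$r.squared),
  adj.r2 = c(summary(mlr)$adj.r.squared, summary(mlr.step)$adj.r.squared),
  aic    = c(extractAIC(mlr)[2], extractAIC(mlr.step)[2])
)
print(comparison)
```

In the summaries above, R-squared is nearly identical (0.567 vs 0.5668), while adjusted R-squared is higher for the stepwise model (0.5484 vs 0.5285) and its AIC is lower (93.76 vs 97.75). This supports the conclusion that Income and Frost add little once Population and Illiteracy are in the model.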