Assignment 1

Model 1 and 2

library("ISLR")
library("coefplot")  # attaches ggplot2 as a dependency
## Loading required package: ggplot2
data(Carseats)

## data attributes
dim(Carseats)
## [1] 400  11
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 
## linear regression models
## using Advertising, Price, Education as independent variables
lm.fit1 <- lm(Sales ~ Advertising + Price + Education, data = Carseats)
lm.fit2 <- lm(Sales ~ Advertising + Price + Education, data = Carseats[1:200, ])
summary(lm.fit1)
## 
## Call:
## lm(formula = Sales ~ Advertising + Price + Education, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8111 -1.5774 -0.0823  1.5176  6.2580 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.55277    0.87778  15.440  < 2e-16 ***
## Advertising  0.12257    0.01809   6.774 4.56e-11 ***
## Price       -0.05455    0.00508 -10.739  < 2e-16 ***
## Education   -0.03975    0.04588  -0.866    0.387    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.4 on 396 degrees of freedom
## Multiple R-squared:  0.2832, Adjusted R-squared:  0.2778 
## F-statistic: 52.16 on 3 and 396 DF,  p-value: < 2.2e-16
summary(lm.fit2)
## 
## Call:
## lm(formula = Sales ~ Advertising + Price + Education, data = Carseats[1:200, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1095 -1.4929 -0.1362  1.3478  6.3480 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.882594   1.285484  10.800  < 2e-16 ***
## Advertising  0.152776   0.028626   5.337 2.60e-07 ***
## Price       -0.057705   0.007311  -7.893 2.05e-13 ***
## Education   -0.054431   0.065058  -0.837    0.404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.447 on 196 degrees of freedom
## Multiple R-squared:  0.3201, Adjusted R-squared:  0.3097 
## F-statistic: 30.76 on 3 and 196 DF,  p-value: 2.432e-16
coefplot(lm.fit1)

coefplot(lm.fit2)

At the 95% confidence level, Advertising and Price have a statistically significant effect on Sales in both models, whereas the effect of Education is not significant (p = 0.387 in model 1 and p = 0.404 in model 2).

Advertising has a positive effect on Sales in both models, with an estimate of 0.12257 in model 1 and 0.152776 in model 2. Holding Price and Education constant, a one-unit increase in Advertising is associated with an increase in Sales of 0.12257 units in model 1 and 0.152776 units in model 2.

Price has a negative effect on Sales in both models, with an estimate of -0.05455 in model 1 and -0.057705 in model 2. Holding Advertising and Education constant, a one-unit increase in Price is associated with a decrease in Sales of 0.05455 units in model 1 and 0.057705 units in model 2.
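
The significance statements above can also be read off 95% confidence intervals for the coefficients. The following is a minimal sketch using base R's confint(); the names fit1, ci1, and excludes_zero are introduced here for illustration, and model 1 is refit so the block runs on its own.

```r
library(ISLR)
data(Carseats)

# Refit model 1 so the example is self-contained
fit1 <- lm(Sales ~ Advertising + Price + Education, data = Carseats)

# 95% confidence intervals for each coefficient (lower bound, upper bound)
ci1 <- confint(fit1, level = 0.95)
print(ci1)

# A coefficient is significant at the 5% level when its interval excludes zero
excludes_zero <- ci1[, 1] > 0 | ci1[, 2] < 0
print(excludes_zero)
```

Because the t-test and the confidence interval are equivalent for a single coefficient, Advertising and Price should come out TRUE here and Education FALSE, in line with the p-values in the summaries.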

Model 3

lm.fit3 <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)
## mean squared error (MSE) of the residuals
mse <- mean(residuals(lm.fit3) ^ 2)
## root MSE
rmse <- sqrt(mse)
## residual sum of squares
rss <- sum(residuals(lm.fit3) ^ 2)
## residual standard error
rse <- sqrt(sum(residuals(lm.fit3) ^ 2) / lm.fit3$df.residual)
## R squared
rsq <- summary(lm.fit3)$r.squared
## R squared is calculated as (TSS - RSS) / TSS. Rearranging gives
## RSS / TSS = 1 - R squared, so TSS = RSS / (1 - R squared)
tss <- rss / (1 - rsq)

## anova of model 3
anova(lm.fit3)
## Analysis of Variance Table
## 
## Response: Sales
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Income      1   73.48   73.48  12.8902 0.0003719 ***
## Age         1  169.97  169.97  29.8184 8.405e-08 ***
## Education   1    5.60    5.60   0.9823 0.3222376    
## Price       1  681.68  681.68 119.5896 < 2.2e-16 ***
## Residuals 395 2251.55    5.70                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
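## We see that Price accounts for the largest share of the variation in Sales
## (Sum Sq = 681.68), whereas Education accounts for the smallest (Sum Sq = 5.60)
## and is not significant (p = 0.322).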
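
As a sanity check on the quantities above, the total sum of squares recovered from R squared should match the one computed directly from Sales, and the Sum Sq column of the ANOVA table should add up to the same number. A minimal sketch follows; fit3, tss_from_rsq, tss_direct, tss_anova, and ss_share are illustrative names, and the model is refit so the block is self-contained.

```r
library(ISLR)
data(Carseats)

fit3 <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)

rss <- sum(residuals(fit3) ^ 2)
rsq <- summary(fit3)$r.squared

# TSS recovered from R squared, as in the text
tss_from_rsq <- rss / (1 - rsq)

# TSS computed directly from the response
tss_direct <- sum((Carseats$Sales - mean(Carseats$Sales)) ^ 2)

# TSS as the sum of the sequential sums of squares plus the residual row
aov3 <- anova(fit3)
tss_anova <- sum(aov3[["Sum Sq"]])

# Share of TSS attributed to each row of the ANOVA table
ss_share <- aov3[["Sum Sq"]] / tss_anova
names(ss_share) <- rownames(aov3)
print(round(ss_share, 3))

# Residual standard error by hand vs. the value stored by summary()
rse <- sqrt(rss / fit3$df.residual)
print(all.equal(rse, summary(fit3)$sigma))
```

Note that anova() reports sequential (Type I) sums of squares for an lm fit, so the share attributed to each predictor depends on the order of the terms in the formula.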

Bias-Variance Tradeoff

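Bias is the error introduced by approximating the true relationship with a simpler model, that is, the difference between the model's average prediction and the actual value. Variance is the amount by which the model's predictions would change if it were fitted to a different training sample.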

High bias can cause a model to underfit, so that its predicted values are systematically far from the actual values.

High variance can cause a model to overfit, so that it fits the noise in the training data rather than the underlying relationship.
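
For reference, the tradeoff is usually written as a decomposition of the expected squared prediction error at a point $x_0$. The notation below is not from the assignment and is assumed here for illustration: $f$ is the true function, $\hat{f}$ is the fitted model, and $\varepsilon$ is the irreducible noise in $y_0 = f(x_0) + \varepsilon$.

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\left(\mathbb{E}\left[\hat{f}(x_0)\right] - f(x_0)\right)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\left(\hat{f}(x_0)\right)}_{\text{variance}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}}
```

Adding predictors tends to shrink the first term and grow the second, which is the comparison made between the two models below.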

## model 4 uses Income and Age; model 5 adds Education and Price
lm.fit4 <- lm(Sales ~ Income + Age, data = Carseats)
lm.fit5 <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)

## training mean squared error, used here as a rough proxy for bias
model_bias <- function(predicted, actual) {
  mean((predicted - actual) ^ 2)
}

bias_fit4 <- model_bias(lm.fit4$fitted.values, Carseats$Sales)
bias_fit5 <- model_bias(lm.fit5$fitted.values, Carseats$Sales)

## total variance of the coefficient estimates (diagonal of the covariance matrix)
var4 <- sum(diag(vcov(lm.fit4)))
var5 <- sum(diag(vcov(lm.fit5)))

## model 4 has a higher bias but lower variance;
## model 5 has a lower bias but higher variance. 
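
The comparison above can be cross-checked at the level of predictions rather than coefficients. The sketch below is one possible approach, not the assignment's method: it uses the training MSE as a stand-in for bias and the average estimated variance of the fitted values (from predict() with se.fit = TRUE) as a stand-in for variance. The names fit4, fit5, and avg_fit_var are hypothetical and introduced only for this example.

```r
library(ISLR)
data(Carseats)

fit4 <- lm(Sales ~ Income + Age, data = Carseats)
fit5 <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)

# Average estimated variance of the fitted values on the training data.
# se.fit is the standard error of each fitted mean, so squaring it gives
# a per-observation variance.
avg_fit_var <- function(fit) {
  mean(predict(fit, se.fit = TRUE)$se.fit ^ 2)
}

avg_var4 <- avg_fit_var(fit4)
avg_var5 <- avg_fit_var(fit5)

# Training MSE as a rough proxy for bias
mse4 <- mean(residuals(fit4) ^ 2)
mse5 <- mean(residuals(fit5) ^ 2)

print(c(mse4 = mse4, mse5 = mse5))
print(c(avg_var4 = avg_var4, avg_var5 = avg_var5))
```

For a linear model the average fitted-value variance equals the residual variance times the number of coefficients divided by the number of observations, so it grows as predictors are added unless the residual variance falls fast enough to offset it.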

End of Homework 1.