library("ISLR")
library("coefplot")  # attaches ggplot2 as a dependency
## Loading required package: ggplot2
data(Carseats)
## data attributes
dim(Carseats)
## [1] 400 11
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
summary(Carseats)
##      Sales          CompPrice       Income        Advertising
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000
##    Population        Price        ShelveLoc        Age          Education
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0
##  Urban       US
##  No :118   No :142
##  Yes:282   Yes:258
## linear regression models
## using Advertising, Price, Education as independent variables
lm.fit1 <- lm(Sales ~ Advertising + Price + Education, data = Carseats)
lm.fit2 <- lm(Sales ~ Advertising + Price + Education, data = Carseats[1:200, ])
summary(lm.fit1)
##
## Call:
## lm(formula = Sales ~ Advertising + Price + Education, data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.8111 -1.5774 -0.0823  1.5176  6.2580
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.55277    0.87778  15.440  < 2e-16 ***
## Advertising  0.12257    0.01809   6.774 4.56e-11 ***
## Price       -0.05455    0.00508 -10.739  < 2e-16 ***
## Education   -0.03975    0.04588  -0.866    0.387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.4 on 396 degrees of freedom
## Multiple R-squared: 0.2832, Adjusted R-squared: 0.2778
## F-statistic: 52.16 on 3 and 396 DF, p-value: < 2.2e-16
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Advertising + Price + Education, data = Carseats[1:200,
## ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.1095 -1.4929 -0.1362  1.3478  6.3480
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.882594   1.285484  10.800  < 2e-16 ***
## Advertising  0.152776   0.028626   5.337 2.60e-07 ***
## Price       -0.057705   0.007311  -7.893 2.05e-13 ***
## Education   -0.054431   0.065058  -0.837    0.404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.447 on 196 degrees of freedom
## Multiple R-squared: 0.3201, Adjusted R-squared: 0.3097
## F-statistic: 30.76 on 3 and 196 DF, p-value: 2.432e-16
coefplot(lm.fit1)
coefplot(lm.fit2)
At the 5% significance level, Advertising and Price each have a statistically significant association with Sales in both models, whereas the Education coefficient is not significantly different from zero (p = 0.387 in model 1 and p = 0.404 in model 2).
The Advertising coefficient is positive in both models: holding Price and Education fixed, a one-unit increase in Advertising is associated with an increase in Sales of 0.12257 units in model 1 and 0.152776 units in model 2.
The Price coefficient is negative in both models: holding Advertising and Education fixed, a one-unit increase in Price is associated with a decrease in Sales of 0.05455 units in model 1 and 0.057705 units in model 2.
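The significance statements above can also be read off confidence intervals for the coefficients. The following is a minimal sketch using base R's `confint()`; the object names `fit1`, `ci`, and `excludes_zero` are introduced here for illustration and are not part of the original analysis.

```r
library(ISLR)  # provides the Carseats data

# Refit model 1 so that this block runs on its own
fit1 <- lm(Sales ~ Advertising + Price + Education, data = Carseats)

# 95% confidence intervals for every coefficient
ci <- confint(fit1, level = 0.95)
print(ci)

# A coefficient is significant at the 5% level exactly when
# its 95% interval does not contain zero
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
print(excludes_zero)
```

An interval that straddles zero, as expected for Education given its p-value, corresponds to a coefficient that is not significant at the 5% level.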
lm.fit3 <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)
## mean squared error (MSE) of the residuals
mse <- mean(residuals(lm.fit3) ^ 2)
## root MSE
rmse <- sqrt(mse)
## residual sum of squares
rss <- sum(residuals(lm.fit3) ^ 2)
## residual standard error
rse <- sqrt(sum(residuals(lm.fit3) ^ 2) / lm.fit3$df.residual)
## R squared
rsq <- summary(lm.fit3)$r.squared
## R squared is defined as (TSS - RSS) / TSS = 1 - RSS / TSS.
## Rearranging gives RSS / TSS = 1 - R squared, so TSS = RSS / (1 - R squared)
tss <- rss / (1 - rsq)
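As a sanity check on the quantities above, the total sum of squares can also be computed directly from the response, and the residual standard error compared with the value that `summary()` reports. This is a sketch under the assumption that the same four-predictor model is used; the names `fit3` and `y` are introduced here for illustration.

```r
library(ISLR)  # provides the Carseats data

fit3 <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)
y    <- Carseats$Sales

rss <- sum(residuals(fit3)^2)          # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares, computed directly
rse <- sqrt(rss / df.residual(fit3))   # residual standard error

# Each comparison should return TRUE up to floating-point error
all.equal(rse, summary(fit3)$sigma)
all.equal(1 - rss / tss, summary(fit3)$r.squared)
all.equal(tss, rss / (1 - summary(fit3)$r.squared))
```

The direct TSS should also match the sum of the "Sum Sq" column of the model's anova table, since that column partitions the total variation into the terms plus the residuals.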
## anova of model 3
anova(lm.fit3)
## Analysis of Variance Table
##
## Response: Sales
##            Df  Sum Sq Mean Sq  F value    Pr(>F)
## Income      1   73.48   73.48  12.8902 0.0003719 ***
## Age         1  169.97  169.97  29.8184 8.405e-08 ***
## Education   1    5.60    5.60   0.9823 0.3222376
## Price       1  681.68  681.68 119.5896 < 2.2e-16 ***
## Residuals 395 2251.55    5.70
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
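## We see that Price accounts for the largest share of the explained variation
## (Sum Sq = 681.68), while Education accounts for the smallest (Sum Sq = 5.60)
## and is the only term that is not significant.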
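The sums of squares that `anova()` reports for a single `lm` fit are sequential: each row measures what a term adds given the terms listed before it, so the values depend on the order of the formula. The sketch below refits the same model with the terms reversed and then uses `drop1()` as one order-independent alternative; `fit_a` and `fit_b` are names introduced here for illustration.

```r
library(ISLR)  # provides the Carseats data

# Same four predictors, entered in two different orders
fit_a <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)
fit_b <- lm(Sales ~ Price + Education + Age + Income, data = Carseats)

anova(fit_a)  # sequential sums of squares in the original order
anova(fit_b)  # per-term values change; the Residuals row does not

# Order-independent view: F test for dropping each term from the full model
drop1(fit_a, test = "F")
```

Because both fits contain the same terms, their residual sums of squares and the totals of the "Sum Sq" column agree; only the split between terms moves.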
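Bias is the error that comes from approximating the true relationship with a model that is too simple: it is the difference between the model's average prediction (over repeated training samples) and the true value. Variance is how much the model's predictions would change if it were fitted to a different training sample.
High bias leads to underfitting: the model misses real structure in the data, so its predictions are systematically off from the actual values.
High variance leads to overfitting: the model fits noise in the training data, so its predictions generalize poorly to new data.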
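Because bias and variance are defined over repeated training samples, a small simulation is one way to see them. The sketch below is illustrative only: the data are simulated, and the true mean function `f`, the noise level, the sample size, the number of replications, and the test point `x0` are all assumptions chosen for the example rather than anything taken from the Carseats analysis.

```r
set.seed(1)

n      <- 100    # training-set size (assumed)
n_sims <- 1000   # number of simulated training sets (assumed)

# Assumed true mean function and a fixed point at which to predict
f  <- function(x1, x2) 2 + 1.5 * x1 + 0.5 * x2
x0 <- data.frame(x1 = 1, x2 = 2)

pred_small <- numeric(n_sims)  # predictions from the model that omits x2
pred_full  <- numeric(n_sims)  # predictions from the correctly specified model

for (s in seq_len(n_sims)) {
  d   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- f(d$x1, d$x2) + rnorm(n, sd = 2)
  pred_small[s] <- predict(lm(y ~ x1, data = d), newdata = x0)
  pred_full[s]  <- predict(lm(y ~ x1 + x2, data = d), newdata = x0)
}

truth <- f(x0$x1, x0$x2)

# Squared bias: (average prediction - truth)^2
bias_sq <- c(small = (mean(pred_small) - truth)^2,
             full  = (mean(pred_full)  - truth)^2)

# Variance: spread of the predictions across training sets
variance <- c(small = var(pred_small), full = var(pred_full))

print(rbind(bias_sq, variance))
```

In this setup the smaller model should show the larger squared bias, because it ignores `x2`, while the fuller model should show the larger variance at `x0`, because it estimates one more coefficient.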
## model 4 uses Income and Age as independent variables;
## model 5 adds Education and Price
lm.fit4 <- lm(Sales ~ Income + Age, data = Carseats)
lm.fit5 <- lm(Sales ~ Income + Age + Education + Price, data = Carseats)
## training mean squared error, used here as a rough proxy for bias
model_bias <- function(predicted, actual) {
  mean((predicted - actual) ^ 2)
}
bias_fit4 <- model_bias(lm.fit4$fitted.values, Carseats$Sales)
bias_fit5 <- model_bias(lm.fit5$fitted.values, Carseats$Sales)
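## sum of all entries of the coefficient covariance matrix (the variance of
## the sum of the coefficient estimates), used here as a rough proxy for variance
var4 <- sum(vcov(lm.fit4))
var5 <- sum(vcov(lm.fit5))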
## on these measures, model 4 has the higher bias but the lower variance;
## model 5 has the lower bias but the higher variance.
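Training-set quantities only approximate bias and variance, so a complementary check is to estimate out-of-sample error, which reflects both. The sketch below uses k-fold cross-validation written in base R; `cv_mse` is a hypothetical helper introduced here, and the choice of 10 folds and the seed are assumptions, not part of the original analysis.

```r
library(ISLR)  # provides the Carseats data

# Hypothetical helper: k-fold cross-validated mean squared error for an lm formula
cv_mse <- function(formula, data, k = 10, seed = 1) {
  set.seed(seed)
  fold     <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]                               # name of the response
  sq_err   <- numeric(nrow(data))
  for (i in seq_len(k)) {
    held_out <- fold == i
    fit  <- lm(formula, data = data[!held_out, ])      # fit on the other folds
    pred <- predict(fit, newdata = data[held_out, ])   # predict the held-out fold
    sq_err[held_out] <- (data[[response]][held_out] - pred)^2
  }
  mean(sq_err)
}

cv4 <- cv_mse(Sales ~ Income + Age, Carseats)
cv5 <- cv_mse(Sales ~ Income + Age + Education + Price, Carseats)
print(c(model4 = cv4, model5 = cv5))
```

The model with the lower cross-validated error is the one whose combination of bias and variance is better for prediction on this data; using the same seed for both calls keeps the fold assignment identical, so the comparison is like for like.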