library(tidyverse)
library(car)
DATE <- c("324", "325", "326", "327", "328", "329", "330", "331", "401",
"402", "403", "404", "405", "406", "407", "408", "409", "410", "411"
, "412", "413", "414", "415", "416", "417", "418", "419")
Taipei_Temperature <- c(21, 28, 25, 17, 18, 24, 26, 23, 17, 15, 17, 21, 26, 26, 23, 24
, 28, 30, 29, 31, 30, 24, 20, 22, 20, 22, 18)
Taipei <- c(0, 0, 0, 1, 5, 1, 7, 10, 19, 13, 22, 29, 41, 84, 61, 74, 67, 96,
101, 142, 147, 145, 185, 221, 245, 270, 287)
Weekday <- c("Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y",
"Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y", "Y")
cvdta <- data.frame(DATE, Weekday, Taipei_Temperature, Taipei)
head(cvdta)
## DATE Weekday Taipei_Temperature Taipei
## 1 324 Y 21 0
## 2 325 Y 28 0
## 3 326 N 25 0
## 4 327 N 17 1
## 5 328 Y 18 5
## 6 329 Y 24 1
str(cvdta)
## 'data.frame': 27 obs. of 4 variables:
## $ DATE : chr "324" "325" "326" "327" ...
## $ Weekday : chr "Y" "Y" "N" "N" ...
## $ Taipei_Temperature: num 21 28 25 17 18 24 26 23 17 15 ...
## $ Taipei : num 0 0 0 1 5 1 7 10 19 13 ...
#convert char to factor variables
cvdta$DATE <- as.factor(cvdta$DATE)
cvdta$Weekday <- as.factor(cvdta$Weekday)
ggplot(aes(y = Taipei, x = Taipei_Temperature, color = Weekday), data = cvdta) +
geom_point() +
geom_smooth(method = lm, se = F) +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
由此圖可看出weekdayY與WeekdayN的斜率不相同。
#run multiple linear regression
cvmod <- lm(Taipei ~ Taipei_Temperature + Weekday, data = cvdta)
summary(cvmod)
##
## Call:
## lm(formula = Taipei ~ Taipei_Temperature + Weekday, data = cvdta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.86 -74.68 -23.07 54.75 206.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.662153 96.804435 0.689 0.498
## Taipei_Temperature 0.756913 4.178546 0.181 0.858
## WeekdayY 0.002713 40.597481 0.000 1.000
##
## Residual standard error: 94.29 on 24 degrees of freedom
## Multiple R-squared: 0.001425, Adjusted R-squared: -0.08179
## F-statistic: 0.01712 on 2 and 24 DF, p-value: 0.983
由於T值小於2(T= 0.181<2),且近乎零,以及P值大於0.05(P=0.858>0.05),則表示台北每日確診數量和台北最高溫的平均數差異不顯著,則兩者之間的相關不顯著。 故表示在95%的信心水準下,無法推翻虛無假設,也表示台北每日確診數量和台北最高溫之間無相關。且台北每日最高溫無法預測台北每日新增確診數量。
得出公式:Taipei=0.76(Taipei_Temperature)+0.002(WeekdayY)+94.29
Checking the normality of the residuals
hist( x = residuals(cvmod),
xlab = "Value of residual",
main = "",
breaks = 20)
從上圖可看出Residuals不是常態分布,且接近正偏態,表示預測值檢觀察值不集中於0,則此model是有很大誤差的。 Normal Q-Q - to examine whether the residuals are normally distributed. - It’s good if residuals points follow the straight dashed line.
# Plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals.
plot(cvmod, which = 2)
Checking the linearity of the relationship Residuals vs Fitted - A horizontal line, without distinct patterns is an indication for a linear relationship, what is good.
# Plot of the fitted values against the residuals for cvmod, with a line showing the relationship between the two.
plot(cvmod, which = 1)
從此圖可看出不是水平線,預測值的誤差會隨著預測值而變動,表示其中還有一些變異是無法被此model解釋的。 Advance version
library(car)
residualPlots(cvmod)
## Test stat Pr(>|Test stat|)
## Taipei_Temperature -0.3281 0.7458
## Weekday
## Tukey test -0.3284 0.7426
Checking the homogeneity of variance Scale-Location (or Spread-Location). - Horizontal line with equally spread points is a good indication of homoscedasticity. - If homogeneity of variance is violated - the standard error estimates associated with the regression coefficients are no longer entirely reliable - t tests for the coefficients aren’t quite right either.
# Plot of the fitted values (model predictions) against the square root of the abs standardised residuals.
plot(cvmod, which = 3)