library(tidyverse)
library(car)
DATE <- c("324", "325", "326", "327", "328", "329", "330", "331", "401",
"402", "403", "404", "405", "406", "407", "408", "409", "410", "411"
, "412", "413", "414", "415", "416", "417", "418", "419")
Taipei_Temperature <- c(21, 28, 25, 17, 18, 24, 26, 23, 17, 15, 17, 21, 26, 26,
23, 24, 28, 30, 29, 31, 30, 24, 20, 22, 20, 22, 18)
Taipei <- c(0, 0, 0, 1, 5, 1, 7, 10, 19, 13, 22, 29, 41, 84, 61, 74, 67, 96,
101, 142, 147, 145, 185, 221, 245, 270, 287)
Weekday <- c("Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y"
, "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y", "Y")
收集資料的日期從3/24-4/49,欲觀察台北每日最高氣溫與台北每日新增確診人數是否有相關。 以台北每日最高氣溫當作獨變項,而台北每日新增確診人數則為依變項。 且再多觀察平日與假日的差異是否對台北新增確診人數有所影響,則平日(weekdayY)與假日 (WeekdayN)為獨變項,而台北每日新增確診人數則為依變項。 由於預測變項有兩個,則使用多元回歸法去觀察其相關性。
cvdta <- data.frame(DATE, Weekday, Taipei_Temperature, Taipei)
head(cvdta)
## DATE Weekday Taipei_Temperature Taipei
## 1 324 Y 21 0
## 2 325 Y 28 0
## 3 326 N 25 0
## 4 327 N 17 1
## 5 328 Y 18 5
## 6 329 Y 24 1
str(cvdta)
## 'data.frame': 27 obs. of 4 variables:
## $ DATE : chr "324" "325" "326" "327" ...
## $ Weekday : chr "Y" "Y" "N" "N" ...
## $ Taipei_Temperature: num 21 28 25 17 18 24 26 23 17 15 ...
## $ Taipei : num 0 0 0 1 5 1 7 10 19 13 ...
#convert char to factor variables
cvdta$DATE <- as.factor(cvdta$DATE)
cvdta$Weekday <- as.factor(cvdta$Weekday)
ggplot(aes(y = Taipei, x = Taipei_Temperature, color = Weekday), data = cvdta) +
geom_point() +
geom_smooth(method = lm, se = F) +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
且由此圖可看出weekdayY與WeekdayN的斜率不相同。在平日的時候,台北最高溫與台北新增確診人數呈負相關;而在假日的時候,台北最高溫與台北新增確診人數呈正相關。
#run multiple linear regression
cvmod <- lm(Taipei ~ Taipei_Temperature + Weekday, data = cvdta)
summary(cvmod)
##
## Call:
## lm(formula = Taipei ~ Taipei_Temperature + Weekday, data = cvdta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.86 -74.68 -23.07 54.75 206.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.662153 96.804435 0.689 0.498
## Taipei_Temperature 0.756913 4.178546 0.181 0.858
## WeekdayY 0.002713 40.597481 0.000 1.000
##
## Residual standard error: 94.29 on 24 degrees of freedom
## Multiple R-squared: 0.001425, Adjusted R-squared: -0.08179
## F-statistic: 0.01712 on 2 and 24 DF, p-value: 0.983
台北最高溫與台北確診數的相關係數為0.76,顯示二者為正相關,然而,t值0.181,p值大於0.05,因此無法推翻虛無假設。換言之,台北最高溫與台北確診數之相關並未達顯著水準,未能顯著影響台北確診數。
得出公式: 台北確診數 =66.662 + 0.76台北最高溫 + 0.002週間 + 94.29
R-squared=0.001425,表示此模型的解釋力僅只0.1%,意即此模型只能解釋依變項中0.1%的變異。則此模型的解釋力偏低,對依變項的預測力不佳。
根據迴歸模型,週末(WeekdayN)是參照組。相較於週末,週間的台北確診平均數高0.002,然而,t值0.000,p值大於0.05,因此二者之間的平均數差異並為達顯著水準,也未能對台北確診數有顯著的影響。
Checking the normality of the residuals
hist( x = residuals(cvmod),
xlab = "Value of residual",
main = "",
breaks = 20)
從上圖可看出Residuals不是常態分布,且接近正偏態,表示預測值檢觀察值不集中於0,則此model是有很大誤差的。
Normal Q-Q - to examine whether the residuals are normally distributed. - It’s good if residuals points follow the straight dashed line.
# Plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals.
plot(cvmod, which = 2)
不管是第一個殘差圖或是此QQplot圖都顯示殘差並未服從常態分佈,顯示此模型與資料本身並不適配。 relationship Residuals vs Fitted - A horizontal line, without distinct patterns is an indication for a linear relationship, what is good.
# Plot of the fitted values against the residuals for cvmod, with a line showing the relationship between the two.
plot(cvmod, which = 1)
從此圖可看出不是水平線,預測值的誤差會隨著預測值而變動,表示其中還有一些變異是無法被此公式解釋的。 Advance version
library(car)
residualPlots(cvmod)
## Test stat Pr(>|Test stat|)
## Taipei_Temperature -0.3281 0.7458
## Weekday
## Tukey test -0.3284 0.7426
Checking the homogeneity of variance Scale-Location (or Spread-Location). - Horizontal line with equally spread points is a good indication of homoscedasticity. - If homogeneity of variance is violated - the standard error estimates associated with the regression coefficients are no longer entirely reliable - t tests for the coefficients aren’t quite right either.
# Plot of the fitted values (model predictions) against the square root of the abs standardised residuals.
plot(cvmod, which = 3)
上兩個圖中的紅線應該是貼近0,且筆直的水平線,顯示殘差並未隨著預測值的變動而有所改變,二者並無相關。 但圖中的紅線顯示為曲線,代表著台本確診案例尚有未被解釋的變異。可考慮納入其他更有預測力的變項。