library(tidyverse)
library(car)

DATE <- c("324", "325", "326", "327", "328", "329", "330", "331", "401",
          "402", "403", "404", "405", "406", "407", "408", "409", "410", "411"
          , "412", "413", "414", "415", "416", "417", "418", "419")
Taipei_Temperature <- c(21, 28, 25, 17, 18, 24, 26, 23, 17, 15, 17, 21, 26, 26, 
                        23, 24, 28, 30, 29, 31, 30, 24, 20, 22, 20, 22, 18)
Taipei <- c(0, 0, 0, 1, 5, 1, 7, 10, 19, 13, 22, 29, 41, 84, 61, 74, 67, 96, 
            101, 142, 147, 145, 185, 221, 245, 270, 287)
Weekday <- c("Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y"
             , "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "Y", "N", "N", "Y", "Y")

收集資料的日期從3/24-4/49，欲觀察台北每日最高氣溫與台北每日新增確診人數是否有相關。以台北每日最高氣溫當作獨變項，而台北每日新增確診人數則為依變項。且再多觀察平日與假日的差異是否對台北新增確診人數有所影響，則平日(weekdayY)與假日 (WeekdayN)為獨變項，而台北每日新增確診人數則為依變項。由於預測變項有兩個，則使用多元回歸法去觀察其相關性。

cvdta <- data.frame(DATE, Weekday, Taipei_Temperature, Taipei)

head(cvdta)

##   DATE Weekday Taipei_Temperature Taipei
## 1  324       Y                 21      0
## 2  325       Y                 28      0
## 3  326       N                 25      0
## 4  327       N                 17      1
## 5  328       Y                 18      5
## 6  329       Y                 24      1

str(cvdta)

## 'data.frame':    27 obs. of  4 variables:
##  $ DATE              : chr  "324" "325" "326" "327" ...
##  $ Weekday           : chr  "Y" "Y" "N" "N" ...
##  $ Taipei_Temperature: num  21 28 25 17 18 24 26 23 17 15 ...
##  $ Taipei            : num  0 0 0 1 5 1 7 10 19 13 ...

#convert char to factor variables
cvdta$DATE <- as.factor(cvdta$DATE)
cvdta$Weekday <- as.factor(cvdta$Weekday)

ggplot(aes(y = Taipei, x = Taipei_Temperature, color = Weekday), data = cvdta) +
  geom_point() +
  geom_smooth(method = lm, se = F) + 
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

且由此圖可看出weekdayY與WeekdayN的斜率不相同。在平日的時候，台北最高溫與台北新增確診人數呈負相關；而在假日的時候，台北最高溫與台北新增確診人數呈正相關。

跑線性回歸模型，看台北每日最高溫與平日假日的預測力

#run multiple linear regression
cvmod <- lm(Taipei ~ Taipei_Temperature + Weekday, data = cvdta)
summary(cvmod)

## 
## Call:
## lm(formula = Taipei ~ Taipei_Temperature + Weekday, data = cvdta)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -87.86 -74.68 -23.07  54.75 206.71 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        66.662153  96.804435   0.689    0.498
## Taipei_Temperature  0.756913   4.178546   0.181    0.858
## WeekdayY            0.002713  40.597481   0.000    1.000
## 
## Residual standard error: 94.29 on 24 degrees of freedom
## Multiple R-squared:  0.001425,   Adjusted R-squared:  -0.08179 
## F-statistic: 0.01712 on 2 and 24 DF,  p-value: 0.983

H0=台北每日確診數量和台北最高溫無相關

H1=台北每日確診數量和台北最高溫有相關

台北最高溫與台北確診數的相關係數為0.76，顯示二者為正相關，然而，t值0.181，p值大於0.05，因此無法推翻虛無假設。換言之，台北最高溫與台北確診數之相關並未達顯著水準，未能顯著影響台北確診數。

得出公式：台北確診數 =66.662 ＋ 0.76台北最高溫＋ 0.002週間＋ 94.29

R-squared=0.001425，表示此模型的解釋力僅只0.1%，意即此模型只能解釋依變項中0.1%的變異。則此模型的解釋力偏低，對依變項的預測力不佳。

H0=平日的台北每日確診平均數與假日的台北每日確診平均數無差異

H1=平日的台北每日確診平均數與假日的台北每日確診平均數有差異

根據迴歸模型，週末（WeekdayN)是參照組。相較於週末，週間的台北確診平均數高0.002，然而，t值0.000，p值大於0.05，因此二者之間的平均數差異並為達顯著水準，也未能對台北確診數有顯著的影響。

Diagnostic

Checking the normality of the residuals

hist( x = residuals(cvmod),
      xlab = "Value of residual",
      main = "",
      breaks = 20)

從上圖可看出Residuals不是常態分布，且接近正偏態，表示預測值檢觀察值不集中於0，則此model是有很大誤差的。

Normal Q-Q - to examine whether the residuals are normally distributed. - It’s good if residuals points follow the straight dashed line.

# Plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals.
plot(cvmod, which = 2)

不管是第一個殘差圖或是此QQplot圖都顯示殘差並未服從常態分佈，顯示此模型與資料本身並不適配。 relationship Residuals vs Fitted - A horizontal line, without distinct patterns is an indication for a linear relationship, what is good.

# Plot of the fitted values against the residuals for cvmod, with a line showing the relationship between the two.
plot(cvmod, which  = 1)

從此圖可看出不是水平線，預測值的誤差會隨著預測值而變動，表示其中還有一些變異是無法被此公式解釋的。 Advance version

library(car)
residualPlots(cvmod)

##                    Test stat Pr(>|Test stat|)
## Taipei_Temperature   -0.3281           0.7458
## Weekday                                      
## Tukey test           -0.3284           0.7426

Checking the homogeneity of variance Scale-Location (or Spread-Location). - Horizontal line with equally spread points is a good indication of homoscedasticity. - If homogeneity of variance is violated - the standard error estimates associated with the regression coefficients are no longer entirely reliable - t tests for the coefficients aren’t quite right either.

# Plot of the fitted values (model predictions) against the square root of the abs standardised residuals.
plot(cvmod, which = 3)

上兩個圖中的紅線應該是貼近0，且筆直的水平線，顯示殘差並未隨著預測值的變動而有所改變，二者並無相關。但圖中的紅線顯示為曲線，代表著台本確診案例尚有未被解釋的變異。可考慮納入其他更有預測力的變項。

HW8_108096027

CHING

2022-04-26