Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
urlfile<-"https://raw.githubusercontent.com/catcho1632/DATA605/main/Summary%20of%20Weather.csv"
weather<-read_csv(url(urlfile))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 119040 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Date, Precip, PoorWeather, PRCP, TSHDSBRSGF
## dbl (17): STA, WindGustSpd, MaxTemp, MinTemp, MeanTemp, Snowfall, YR, MO, DA...
## lgl  (9): FT, FB, FTI, ITH, SD3, RHX, RHN, RVG, WTE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The max temperature is plotted against the min temperature in a scatter plot, along with the best-fit line. The relationship shows a positive linear trend. Two lines of points do not follow this trend. There is more variability and scatter at negative temperatures for both MaxTemp and MinTemp, where the points are also less evenly distributed about the line.

# Subset the data to the max temp and min temp columns (columns 5 and 6)
weather <- weather[, c("MaxTemp", "MinTemp")]

ggplot(weather, aes(x = MinTemp, y = MaxTemp)) +
  geom_point()+
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

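In the model summary, the first and third quartiles of the residuals are about the same magnitude (-2.770 and 2.184), so the bulk of the residuals is roughly symmetric about zero. The minimum and maximum residuals differ more (-51.958 and 38.967), most likely because of the two lines of points that fall outside the visible linear trend.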

m <- lm(MaxTemp ~ MinTemp, data = weather)
summary(m)
## 
## Call:
## lm(formula = MaxTemp ~ MinTemp, data = weather)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.958  -2.770  -0.517   2.184  38.967 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.700567   0.028466   375.9   <2e-16 ***
## MinTemp      0.918774   0.001449   634.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.167 on 119038 degrees of freedom
## Multiple R-squared:  0.7716, Adjusted R-squared:  0.7716 
## F-statistic: 4.02e+05 on 1 and 119038 DF,  p-value: < 2.2e-16

These lines of points are removed to see whether the fit improves. They make up a negligible share of the data: the residual degrees of freedom in the two summaries differ by 70, so 70 of 119,040 rows are dropped.

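# Drop the two stray lines of points: rows with MinTemp > 0 but MaxTemp < -17,
# and rows with MinTemp < -17 but MaxTemp > 0
cleaned <- filter(weather, MinTemp <= 0 | MaxTemp >= -17, MinTemp >= -17 | MaxTemp <= 0)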
ggplot(cleaned, aes(x = MinTemp, y = MaxTemp)) +
  geom_point()+
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
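
The filter logic can be checked on a small example before trusting it on the full data. The sketch below uses a made-up toy data frame (the values are hypothetical, not taken from the weather file) and base R indexing in place of `filter()`; the two conditions are combined with `&` because `filter()` treats comma-separated conditions as a logical AND.

```r
# Toy data (hypothetical values chosen to hit each case)
toy <- data.frame(
  MinTemp = c(10,  5,     -17.8, -20, 15,    -5),
  MaxTemp = c(20, -17.8,  12,    -10, -17.8,  3)
)

# Same two conditions as the filter() call, joined with AND
keep <- with(toy, (MinTemp <= 0 | MaxTemp >= -17) &
                  (MinTemp >= -17 | MaxTemp <= 0))

kept    <- toy[keep, ]
removed <- toy[!keep, ]

# Inspect what would be dropped before dropping it
nrow(removed)
removed
```

Looking at the removed rows directly, rather than only at the cleaned plot, is one way to confirm that only the stray points are excluded.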

Removing those outliers leaves the remaining points more evenly centered about the fitted line.

m <- lm(MaxTemp ~ MinTemp, data = cleaned)
summary(m)
## 
## Call:
## lm(formula = MaxTemp ~ MinTemp, data = cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.2274  -2.7629  -0.5407   2.1929  20.6457 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.630273   0.027970   380.1   <2e-16 ***
## MinTemp      0.922897   0.001424   648.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.081 on 118968 degrees of freedom
## Multiple R-squared:  0.7794, Adjusted R-squared:  0.7794 
## F-statistic: 4.202e+05 on 1 and 118968 DF,  p-value: < 2.2e-16
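
To put the two fits side by side, the figures reported in the two summaries can be compared with a little arithmetic. This is a rough sketch that copies the printed (rounded) values by hand; the variable names are only illustrative.

```r
# Values copied from the two summary() outputs
df_full     <- 119038
df_cleaned  <- 118968
r2_full     <- 0.7716
r2_cleaned  <- 0.7794
rse_full    <- 4.167
rse_cleaned <- 4.081

# A simple regression estimates 2 coefficients, so n = residual df + 2
n_full    <- df_full + 2
n_cleaned <- df_cleaned + 2

rows_removed  <- n_full - n_cleaned
share_removed <- rows_removed / n_full

rows_removed                   # number of rows dropped by the filter
round(100 * share_removed, 3)  # the same, as a percentage of all rows
r2_cleaned - r2_full           # gain in R-squared
rse_full - rse_cleaned         # drop in residual standard error
```

The improvement in fit is small, which is consistent with so few rows being removed; the larger change is in the extreme residuals, which shrink from about -52 and 39 to about -21 and 21.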

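To assess whether the model is reliable, the residuals are plotted against the fitted values to check that they scatter evenly about zero, and a normal Q-Q plot is used to check whether they are approximately normally distributed.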

In the residual plot, the points are fairly evenly distributed about 0, except at the higher fitted values, where the cluster narrows and is biased toward positive residuals. This reflects the flattening seen in the original scatter plot.

ggplot(data = m, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

summary(m$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -21.2274  -2.7629  -0.5407   0.0000   2.1929  20.6457
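
The visual impression from the residual plot can be backed up with numbers by binning the fitted values and summarising the residuals in each bin. The sketch below runs on simulated data standing in for the weather data (an assumption made so the example is self-contained); the choice of 5 equal-width bins is arbitrary.

```r
set.seed(1)

# Simulated stand-in for the weather data (not the real values)
n <- 2000
min_temp <- runif(n, -10, 30)
max_temp <- 10.6 + 0.92 * min_temp + rnorm(n, sd = 4)
sim <- data.frame(MinTemp = min_temp, MaxTemp = max_temp)

fit <- lm(MaxTemp ~ MinTemp, data = sim)

# Split the fitted values into 5 equal-width bins
bins <- cut(fitted(fit), breaks = 5)

# Count, mean, and spread of the residuals within each bin
bin_summary <- data.frame(
  n    = as.vector(tapply(resid(fit), bins, length)),
  mean = as.vector(tapply(resid(fit), bins, mean)),
  sd   = as.vector(tapply(resid(fit), bins, sd)),
  row.names = levels(bins)
)
bin_summary
```

On well-behaved data the per-bin means stay near 0 and the standard deviations stay similar. Applied to the real model, a bin mean drifting away from 0 or a shrinking standard deviation at high fitted values would quantify the narrowing described above.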

The normal probability plot is relatively linear except for some of the residuals below -2.5. In the scatter plot with the fitted line, the majority of points lie on the negative side of the line (the median residual is -0.54), which explains this bias. The residuals follow the normal distribution more closely as they increase toward positive values.

ggplot(data = m, aes(sample = .resid)) +
  stat_qq()
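
A Q-Q plot is easier to judge against a reference line, and a skewness figure gives a one-number summary of the asymmetry. The sketch below uses simulated residuals with a deliberately heavier lower tail (an assumption for illustration, not the model's residuals) and base R; the reference line is the one through the first and third quartiles, which is what `qqline()` draws by default.

```r
set.seed(42)

# Simulated residuals with a longer lower tail (illustrative only)
res <- c(rnorm(950, mean = 0.2, sd = 3), rnorm(50, mean = -12, sd = 3))
res <- res - mean(res)

# Theoretical vs. sample quantiles, without drawing the plot
qq <- qqnorm(res, plot.it = FALSE)

# Reference line through the first and third quartiles
y_q <- quantile(res, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))
slope     <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])

# Vertical distance of each point from the reference line
departure <- qq$y - (intercept + slope * qq$x)

# Sample skewness: negative values indicate a heavier lower tail
skewness <- mean((res - mean(res))^3) / sd(res)^3

skewness
range(departure)
```

In ggplot2 the same reference line can be added to the plot with `stat_qq_line()` alongside `stat_qq()`.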

Conclusion: The relationship between max temp and min temp is relatively linear, and the model on the cleaned data explains about 78% of the variance in max temp (R-squared of 0.7794). The Q-Q plot shows that the residuals are approximately normal at the higher residual values, but there is some bias at the negative end of the line. The outlying points were removed since there were so few of them. Overall, a linear model is appropriate, and min temp can be used to predict max temp, especially at higher temperatures.
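
As an illustration of using the model for prediction, the fitted equation from the cleaned model can be applied by hand. This is a minimal sketch: the coefficients are copied from the summary output, the three minimum temperatures are hypothetical inputs, and the band of plus or minus 2 residual standard errors is only a rough approximation to a 95% prediction interval, since it ignores uncertainty in the coefficients.

```r
# Coefficients copied from the summary of the cleaned model
intercept <- 10.630273
slope     <- 0.922897
rse       <- 4.081      # residual standard error

# Hypothetical minimum temperatures to predict for
min_temp <- c(0, 10, 20)

# MaxTemp = intercept + slope * MinTemp
predicted_max <- intercept + slope * min_temp

# Rough band of +/- 2 residual standard errors around each prediction
predictions <- data.frame(
  MinTemp = min_temp,
  fit     = predicted_max,
  lower   = predicted_max - 2 * rse,
  upper   = predicted_max + 2 * rse
)
predictions
```

With the model object available, `predict(m, newdata = data.frame(MinTemp = c(0, 10, 20)), interval = "prediction")` gives the proper interval directly.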