Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
urlfile<-"https://raw.githubusercontent.com/catcho1632/DATA605/main/Summary%20of%20Weather.csv"
weather<-read_csv(url(urlfile))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 119040 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Date, Precip, PoorWeather, PRCP, TSHDSBRSGF
## dbl (17): STA, WindGustSpd, MaxTemp, MinTemp, MeanTemp, Snowfall, YR, MO, DA...
## lgl (9): FT, FB, FTI, ITH, SD3, RHX, RHN, RVG, WTE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The max temperature is plotted against the min temperature in a scatter plot, together with the best-fit line. The relationship shows a positive linear trend. Two lines of points do not follow this trend. There is also more scatter at negative values of both MaxTemp and MinTemp, where the points are less evenly distributed about the line.
# subset data to the max temp and min temp columns (columns 5 and 6), selected by name
weather <- select(weather, MaxTemp, MinTemp)
ggplot(weather, aes(x = MinTemp, y = MaxTemp)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
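In the residual summary, the first and third quartiles are about the
same magnitude (-2.77 and 2.18), so the middle half of the residuals is
roughly symmetric about zero. The minimum and maximum residuals (-51.96
and 38.97) differ in magnitude, which most likely reflects the two
lines of points outside the visible linear trend.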
m <- lm(MaxTemp ~ MinTemp, data = weather)
summary(m)
##
## Call:
## lm(formula = MaxTemp ~ MinTemp, data = weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.958 -2.770 -0.517 2.184 38.967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.700567 0.028466 375.9 <2e-16 ***
## MinTemp 0.918774 0.001449 634.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.167 on 119038 degrees of freedom
## Multiple R-squared: 0.7716, Adjusted R-squared: 0.7716
## F-statistic: 4.02e+05 on 1 and 119038 DF, p-value: < 2.2e-16
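These lines of points are removed to see whether the fit improves. They make up a negligible share of the data: 70 of the 119,040 observations, going by the change in residual degrees of freedom between the two models (119038 versus 118968).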
# drop rows where MinTemp > 0 with MaxTemp < -17, or MinTemp < -17 with MaxTemp > 0
cleaned <- filter(weather, MinTemp <= 0 | MaxTemp >= -17, MinTemp >= -17 | MaxTemp <= 0)
ggplot(cleaned, aes(x = MinTemp, y = MaxTemp)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
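Removing those outliers leaves the points grouped more evenly about the
fitted line. In the refitted model the extreme residuals shrink from
about -52 and 39 to about -21 and 21, and R-squared rises slightly from
0.7716 to 0.7794.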
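To make the effect of the filter concrete, here is a minimal sketch on a hypothetical five-row data set (the `toy` values and the `keep_rule` helper are illustrative assumptions, not part of the original analysis). It applies the same two conditions as the filter above and counts how many rows are dropped.

```r
library(dplyr)

# Hypothetical toy data: rows 3 and 4 mimic the two suspicious lines of points
toy <- tibble(
  MinTemp = c(10, -5,  15, -30, 0),
  MaxTemp = c(20,  2, -20,   5, 0)
)

# Same two conditions as the filter above; comma-separated conditions are ANDed
keep_rule <- function(df) {
  filter(df,
         MinTemp <= 0 | MaxTemp >= -17,   # drops MinTemp > 0 with MaxTemp < -17
         MinTemp >= -17 | MaxTemp <= 0)   # drops MinTemp < -17 with MaxTemp > 0
}

toy_cleaned <- keep_rule(toy)
n_removed   <- nrow(toy) - nrow(toy_cleaned)
n_removed
```

On the real data, `nrow(weather) - nrow(cleaned)` gives the corresponding count, which should match the drop in residual degrees of freedom between the two model summaries.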
m <- lm(MaxTemp ~ MinTemp, data = cleaned)
summary(m)
##
## Call:
## lm(formula = MaxTemp ~ MinTemp, data = cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2274 -2.7629 -0.5407 2.1929 20.6457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.630273 0.027970 380.1 <2e-16 ***
## MinTemp 0.922897 0.001424 648.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.081 on 118968 degrees of freedom
## Multiple R-squared: 0.7794, Adjusted R-squared: 0.7794
## F-statistic: 4.202e+05 on 1 and 118968 DF, p-value: < 2.2e-16
To assess whether the model is reliable, the residuals are plotted against the fitted values to check that they scatter evenly about zero, and a normal probability plot is then used to check whether they are approximately normally distributed.
The points seem fairly evenly distributed around 0 in the residual plot, except at the higher fitted values, where the cluster of points becomes narrower and is biased towards higher residuals. This is consistent with the flattening seen in the original scatter plot.
ggplot(data = m, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
summary(m$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -21.2274 -2.7629 -0.5407 0.0000 2.1929 20.6457
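One way to put numbers on the visual impression from the residual plot is to bin the fitted values and compare the mean and spread of the residuals per bin. The sketch below uses simulated data as a stand-in (the `x`, `y`, and `fit` objects are assumptions for illustration, not the weather data), so the spread is roughly constant by construction.

```r
set.seed(1)

# Simulated stand-in for the weather data (assumption: not the real dataset)
x   <- runif(2000, -20, 30)
y   <- 10 + 0.9 * x + rnorm(2000, sd = 4)
fit <- lm(y ~ x)

# Cut the fitted values into 5 equal-count bins
bins <- cut(fitted(fit),
            breaks = quantile(fitted(fit), probs = seq(0, 1, 0.2)),
            include.lowest = TRUE)

# Mean and standard deviation of the residuals within each bin
spread <- data.frame(
  bin        = levels(bins),
  mean_resid = as.vector(tapply(resid(fit), bins, mean)),
  sd_resid   = as.vector(tapply(resid(fit), bins, sd)),
  n          = as.vector(table(bins))
)
spread
```

Replacing `fit` with the model `m` would apply the same check to the weather data. A smaller `sd_resid` and a positive `mean_resid` in the top bin would support the narrowing and upward bias described above.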
The normal probability plot is relatively linear, except for some of the residuals below about -2.5. In the scatter plot with the fitted line, the majority of the points lie on the negative side of the line (the median residual is -0.54), which accounts for this bias. The residuals follow a normal distribution more closely towards the positive values.
ggplot(data = m, aes(sample = .resid)) +
  stat_qq()
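Departures from normality are easier to judge against a reference line. The sketch below adds `stat_qq_line()` to a Q-Q plot and computes a simple sample skewness as a numeric companion. The residuals here are simulated with a deliberately heavy lower tail (an assumption for illustration only), so they do not describe the weather model.

```r
library(ggplot2)

set.seed(2)

# Simulated residuals with a heavier lower tail (assumption: illustrative only)
res <- c(rnorm(950), rnorm(50, mean = -4))
df  <- data.frame(.resid = res)

# Q-Q plot with a reference line through the first and third quartiles
p <- ggplot(df, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

# Sample skewness: negative values indicate a longer lower tail
skewness <- mean((res - mean(res))^3) / sd(res)^3
skewness
```

For the fitted model, the same two steps can be run on `m$residuals`. Points that fall away from the line at one end show which tail departs from normality.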
Conclusion: The relationship between the max temp and min temp is relatively linear. The Q-Q plot shows that the residuals are close to normally distributed at the higher residual values, but there is some bias towards the negative end of the line. The outlying points were removed because there were so few of them. Overall, min temp can be used to predict max temp, especially at the higher temperature values.
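As a worked example of using the fitted line for prediction, the sketch below plugs a hypothetical minimum temperature into the coefficients reported for the cleaned model. The `predict_max` helper and the input value of 20 are assumptions for illustration, and the band of plus or minus two residual standard errors is only a rough guide, not a formal prediction interval.

```r
# Coefficients and residual standard error copied from the cleaned-model summary
intercept <- 10.630273
slope     <- 0.922897
rse       <- 4.081

# Point prediction of MaxTemp from MinTemp using the fitted line
predict_max <- function(min_temp) intercept + slope * min_temp

min_temp   <- 20                        # hypothetical minimum temperature
pred       <- predict_max(min_temp)     # 10.630273 + 0.922897 * 20
rough_band <- pred + c(-2, 2) * rse     # crude +/- 2 RSE band around the prediction
pred
rough_band
```

With the model object available, `predict(m, newdata = data.frame(MinTemp = 20), interval = "prediction")` returns the same point estimate along with a proper prediction interval.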