1. Build a linear (or generalized linear) model as you like
  1. Use the tools from previous weeks to diagnose the model
  1. Interpret at least one of the coefficients
# Read the dataset
weather_data <- read.csv("C:\\Users\\singh\\Documents\\StatsR\\dataset\\Final\\weather_repo.csv")
# Selecting predictor variables and handling missing values
predictors <- c("humidity", "wind_kph", "air_quality_PM2.5", "air_quality_PM10")
weather_data_cleaned <- na.omit(weather_data[, c("temperature_celsius", predictors)])
head(weather_data_cleaned)
##   temperature_celsius humidity wind_kph air_quality_PM2.5 air_quality_PM10
## 1                28.8       19     11.5               7.9             11.1
## 2                21.3       54      3.6              31.7             39.3
## 3                18.1       40      3.6               7.7             12.8
## 4                19.2       49      3.6              20.9             52.4
## 5                18.5       40      3.6              10.8             24.3
## 6                17.0       27      3.6              12.2             25.9
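
The effect of na.omit() on the row count can be checked on a toy data frame. The sketch below uses made-up values (not rows from weather_repo.csv) and the hypothetical names toy and toy_cleaned; it only illustrates that any row with at least one NA in the selected columns is dropped.

```r
# Toy data frame with made-up values (hypothetical, not taken from the dataset)
toy <- data.frame(
  temperature_celsius = c(28.8, NA, 18.1, 19.2),
  humidity            = c(19, 54, NA, 49),
  wind_kph            = c(11.5, 3.6, 3.6, 3.6)
)

# na.omit() removes every row that contains at least one NA
toy_cleaned <- na.omit(toy)

# Count the rows that were kept and the rows that were dropped
rows_kept    <- nrow(toy_cleaned)
rows_dropped <- nrow(toy) - rows_kept
print(c(kept = rows_kept, dropped = rows_dropped))
```

Comparing nrow(weather_data) with nrow(weather_data_cleaned) in the same way would show how many observations are excluded from the model.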

Linear Model Building

We build a linear model with temperature in degrees Celsius as the response variable and humidity, wind speed, and two air quality measures (PM2.5 and PM10) as explanatory variables.

# Selecting variables
response_var <- weather_data_cleaned$temperature_celsius
explanatory_vars <- weather_data_cleaned[, predictors]

# Building the linear model
model <- lm(temperature_celsius ~ humidity + wind_kph + air_quality_PM2.5 + air_quality_PM10, data = weather_data_cleaned)

summary(model)
## 
## Call:
## lm(formula = temperature_celsius ~ humidity + wind_kph + air_quality_PM2.5 + 
##     air_quality_PM10, data = weather_data_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.298  -4.614   0.737   5.165  17.764 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       25.053903   0.522208  47.977  < 2e-16 ***
## humidity          -0.059601   0.006151  -9.690  < 2e-16 ***
## wind_kph           0.136176   0.016840   8.086 9.43e-16 ***
## air_quality_PM2.5 -0.044661   0.006878  -6.493 1.01e-10 ***
## air_quality_PM10   0.039191   0.005165   7.587 4.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.16 on 2529 degrees of freedom
## Multiple R-squared:  0.09699,    Adjusted R-squared:  0.09556 
## F-statistic: 67.91 on 4 and 2529 DF,  p-value: < 2.2e-16

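Observations

All four predictors are statistically significant (p < 0.001), but the model explains only about 9.7% of the variance in temperature (multiple R-squared 0.097, adjusted 0.096). The residual standard error is 6.16 degrees Celsius on 2529 degrees of freedom.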
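
As a consistency check on the summary output, the adjusted R-squared and the F-statistic can be recomputed from the reported multiple R-squared and degrees of freedom. This is a minimal sketch that copies the numbers from the output by hand rather than reading them from the fitted model; the variable names are illustrative.

```r
# Values copied by hand from the summary(model) output
r2       <- 0.09699   # Multiple R-squared
k        <- 4         # number of predictors
df_resid <- 2529      # residual degrees of freedom
n        <- df_resid + k + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 4))
```

Both values should agree with the reported 0.09556 and 67.91 up to rounding of the inputs.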

Diagnostics of the Model

# Calculating MSE
predicted <- predict(model, weather_data_cleaned)
mse <- mean((weather_data_cleaned$temperature_celsius - predicted)^2)
print(paste("MSE:", mse))
## [1] "MSE: 37.8703498574756"

Interpretation

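The MSE of 37.87 corresponds to a root mean squared error (RMSE) of about 6.15 degrees Celsius (the square root of 37.87), so the model's predictions typically miss the observed temperature by roughly 6 degrees. Because the MSE is computed on the same data used to fit the model, it is an in-sample figure and likely understates the error on new data.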
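
The RMSE is closely related to the residual standard error of 6.16 reported by summary(model): the MSE divides the residual sum of squares by the number of observations, while the residual standard error divides it by the residual degrees of freedom. The sketch below reproduces both from the reported figures, which are copied by hand; the variable names are illustrative.

```r
# Values copied by hand from the output above
mse      <- 37.8703498574756  # in-sample mean squared error
df_resid <- 2529              # residual degrees of freedom
n_coef   <- 5                 # intercept + 4 slopes
n        <- df_resid + n_coef # observations used in the fit

# RMSE: square root of RSS / n
rmse <- sqrt(mse)

# Residual standard error: square root of RSS / (n - number of coefficients)
rse <- sqrt(mse * n / df_resid)

print(round(c(rmse = rmse, rse = rse), 3))
```

With this many observations the two figures differ only in the third significant digit.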

Multicollinearity Check

We check for multicollinearity using the Variance Inflation Factor (VIF).

# Checking for multicollinearity (VIF)
library(car)  # provides vif()
vif(model)
##          humidity          wind_kph air_quality_PM2.5  air_quality_PM10 
##          1.057367          1.021015          9.010306          9.038898

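Multicollinearity: The VIF values for humidity and wind speed are close to 1, so these predictors are nearly uncorrelated with the others. The VIF values for PM2.5 and PM10 are both about 9.0, above the common rule-of-thumb threshold of 5 and close to the stricter threshold of 10. A VIF of 9 means that roughly 89% of each variable's variance (1 - 1/9) is explained by the other predictors, so PM2.5 and PM10 carry largely overlapping information. Their individual coefficients, including the opposite signs, should therefore be interpreted with caution.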
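
To make the definition concrete, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes this in base R on simulated data; the helper manual_vif, the data frame sim, and its columns are hypothetical and are not part of the weather dataset or the car package.

```r
# Simulated data (hypothetical): pm10 is built from pm25, humidity is independent
set.seed(1)
n_sim <- 500
pm25  <- rnorm(n_sim, mean = 20, sd = 5)
sim <- data.frame(
  pm25     = pm25,
  pm10     = 1.2 * pm25 + rnorm(n_sim, mean = 0, sd = 2),
  humidity = runif(n_sim, min = 10, max = 90)
)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j is obtained by
# regressing predictor j on the remaining predictors
manual_vif <- function(data, vars) {
  sapply(vars, function(v) {
    others <- setdiff(vars, v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- manual_vif(sim, c("pm25", "pm10", "humidity"))
print(round(vifs, 2))
```

In this simulation the two correlated columns receive large VIFs while the independent column stays near 1, the same pattern seen for PM2.5, PM10, and humidity above.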

Coefficient Interpretation

Interpreting the coefficient of humidity.

coef(summary(model))["humidity", ]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -5.960139e-02  6.150547e-03 -9.690422e+00  7.895244e-22

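Humidity: The coefficient for humidity is negative (-0.0596) and highly significant (p < 0.001). Holding wind speed, PM2.5, and PM10 constant, each one-unit increase in humidity (presumably one percentage point) is associated with a temperature about 0.06 degrees Celsius lower, so a 10-unit increase corresponds to roughly 0.6 degrees. One possible explanation is the cooling effect of moisture in the air, but the model describes an association and does not establish a causal effect.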
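
The uncertainty around this estimate can be expressed as a 95% confidence interval, computed as estimate ± t critical value × standard error. The sketch below uses the estimate, standard error, and residual degrees of freedom copied by hand from the output above; calling confint(model, "humidity") on the fitted model should give essentially the same interval.

```r
# Values copied by hand from the summary(model) output
estimate  <- -0.059601  # humidity coefficient
std_error <- 0.006151   # its standard error
df_resid  <- 2529       # residual degrees of freedom

# Two-sided 95% interval based on the t distribution
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
names(ci) <- c("lower", "upper")

# The same quantities rescaled to a 10-unit increase in humidity
effect_per_10 <- 10 * c(estimate = estimate, ci)

print(round(ci, 4))
print(round(effect_per_10, 3))
```

The interval lies entirely below zero, which is consistent with the very small p-value reported for humidity.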