# Read the dataset
weather_data <- read.csv("C:\\Users\\singh\\Documents\\StatsR\\dataset\\Final\\weather_repo.csv")
# Selecting predictor variables and handling missing values
predictors <- c("humidity", "wind_kph", "air_quality_PM2.5", "air_quality_PM10")
weather_data_cleaned <- na.omit(weather_data[, c("temperature_celsius", predictors)])
head(weather_data_cleaned)
##   temperature_celsius humidity wind_kph air_quality_PM2.5 air_quality_PM10
## 1                28.8       19     11.5               7.9             11.1
## 2                21.3       54      3.6              31.7             39.3
## 3                18.1       40      3.6               7.7             12.8
## 4                19.2       49      3.6              20.9             52.4
## 5                18.5       40      3.6              10.8             24.3
## 6                17.0       27      3.6              12.2             25.9
We build a linear model with temperature in Celsius as the response variable and humidity, wind speed, and two particulate-matter air quality measures (PM2.5 and PM10) as explanatory variables.
# Selecting variables (for inspection only; lm() below reads the columns by name)
response_var <- weather_data_cleaned$temperature_celsius
explanatory_vars <- weather_data_cleaned[, predictors]
# Building the linear model
model <- lm(temperature_celsius ~ humidity + wind_kph + air_quality_PM2.5 + air_quality_PM10, data = weather_data_cleaned)
summary(model)
##
## Call:
## lm(formula = temperature_celsius ~ humidity + wind_kph + air_quality_PM2.5 +
##     air_quality_PM10, data = weather_data_cleaned)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.298  -4.614   0.737   5.165  17.764
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       25.053903   0.522208  47.977  < 2e-16 ***
## humidity          -0.059601   0.006151  -9.690  < 2e-16 ***
## wind_kph           0.136176   0.016840   8.086 9.43e-16 ***
## air_quality_PM2.5 -0.044661   0.006878  -6.493 1.01e-10 ***
## air_quality_PM10   0.039191   0.005165   7.587 4.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.16 on 2529 degrees of freedom
## Multiple R-squared: 0.09699, Adjusted R-squared: 0.09556
## F-statistic: 67.91 on 4 and 2529 DF, p-value: < 2.2e-16
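To make the fitted equation concrete, the following sketch plugs the first row of the cleaned data into the coefficients printed in the summary. The numbers are copied by hand from the output, so the result is accurate only to the printed precision, and the variable names are illustrative rather than part of the original analysis.

```r
# Coefficients copied from summary(model) (rounded as printed)
coefs <- c(intercept = 25.053903, humidity = -0.059601, wind_kph = 0.136176,
           pm25 = -0.044661, pm10 = 0.039191)

# First row of head(weather_data_cleaned), with a leading 1 for the intercept
row1 <- c(1, 19, 11.5, 7.9, 11.1)
observed <- 28.8

predicted_row1 <- sum(coefs * row1)          # about 25.57
residual_row1 <- observed - predicted_row1   # about 3.23
print(round(c(predicted = predicted_row1, residual = residual_row1), 2))
```

With the fitted model in memory, `predict(model, weather_data_cleaned[1, ])` should give the same prediction up to rounding.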
# Calculating MSE
predicted <- predict(model, weather_data_cleaned)
mse <- mean((weather_data_cleaned$temperature_celsius - predicted)^2)
print(paste("MSE:", mse))
## [1] "MSE: 37.8703498574756"
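The MSE of 37.87 corresponds to a root mean squared error (RMSE) of √37.87 ≈ 6.15 °C. This is the typical size of a prediction error rather than a simple average deviation, because squaring weights large errors more heavily. It is computed on the same data used to fit the model, so it is an in-sample figure, and it agrees with the residual standard error of 6.16 reported in the summary. Together with an R-squared of about 0.097, it shows that the four predictors explain less than 10% of the variation in temperature.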
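The link between the MSE and the residual standard error in the summary can be checked with a few lines of arithmetic. This is a sketch using only the numbers printed in this report; the row count is inferred from the residual degrees of freedom plus the five estimated coefficients.

```r
mse <- 37.8703498574756   # value printed by the MSE calculation
df_resid <- 2529          # residual degrees of freedom from summary(model)
n_coef <- 5               # intercept plus four slopes
n <- df_resid + n_coef    # inferred number of rows used in the fit

rmse <- sqrt(mse)                 # squared error averaged over n rows
rse <- sqrt(mse * n / df_resid)   # same sum of squares, divided by n - 5
print(round(c(RMSE = rmse, RSE = rse), 3))
```

The two values differ only in the denominator, which is why they are nearly identical for a sample of this size.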
We check for multicollinearity using the Variance Inflation Factor (VIF).
# Checking for multicollinearity (VIF)
library(car)  # provides vif(); assumed to be the package used here
vif(model)
##          humidity          wind_kph air_quality_PM2.5  air_quality_PM10
##          1.057367          1.021015          9.010306          9.038898
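Multicollinearity: The VIF values for PM2.5 and PM10 are high (about 9.0 each), while those for humidity and wind speed are close to 1. Values of this size exceed the commonly used warning threshold of 5 and approach the stricter threshold of 10, which indicates that PM2.5 and PM10 largely carry the same information. One symptom is visible in the model summary: the two coefficients have opposite signs and similar magnitudes (-0.0447 and 0.0392), so their individual estimates should be interpreted with caution.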
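For reference, the standard definition of the VIF for predictor j uses the R-squared obtained by regressing that predictor on all the other predictors. Rearranging it shows how much of each particulate measure is explained by the remaining predictors; the figures below are derived from the VIF values printed above, not from a separate model fit.

```latex
\begin{align*}
\mathrm{VIF}_j &= \frac{1}{1 - R_j^2}
  \quad\Longleftrightarrow\quad
  R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j} \\[4pt]
R_{\mathrm{PM2.5}}^2 &= 1 - \frac{1}{9.010306} \approx 0.889 \\
R_{\mathrm{PM10}}^2  &= 1 - \frac{1}{9.038898} \approx 0.889
\end{align*}
```

In other words, roughly 89% of the variation in each particulate measure can be predicted from the other three predictors.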
We now interpret the coefficient of humidity.
coef(summary(model))["humidity", ]
##      Estimate    Std. Error       t value      Pr(>|t|)
## -5.960139e-02  6.150547e-03 -9.690422e+00  7.895244e-22
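Humidity: The negative coefficient for humidity (-0.0596) means that, holding the other predictors constant, a one-unit increase in humidity is associated with a temperature about 0.06 °C lower, or roughly 0.6 °C lower for a 10-unit increase. The estimate is statistically significant (p ≈ 7.9e-22), but it describes an association in this dataset rather than a causal effect. One possible explanation is the cooling effect of moisture in the air.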
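The uncertainty around this coefficient can be expressed as a 95% confidence interval. The sketch below rebuilds it from the estimate and standard error printed above, assuming the usual t-based interval with the 2529 residual degrees of freedom reported in the model summary; the variable names are illustrative.

```r
est <- -5.960139e-02   # humidity estimate printed above
se <- 6.150547e-03     # its standard error
df_resid <- 2529       # residual degrees of freedom from summary(model)

t_crit <- qt(0.975, df_resid)        # close to 1.96 for this many df
ci <- est + c(-1, 1) * t_crit * se   # 95% confidence interval
effect_10 <- 10 * est                # change for a 10-unit rise in humidity
print(round(c(lower = ci[1], upper = ci[2], per_10_units = effect_10), 4))
```

The interval stays below zero throughout, which is consistent with the very small p-value. With the fitted model in memory, `confint(model, "humidity")` should return the same interval directly.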