This analysis builds a logistic regression model on a weather dataset. The goal is to predict a binary outcome, whether an observation is rainy, from several explanatory variables.
Data Loading and Pre-processing
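# Loading the packages used below: caret for createDataPartition(), ggplot2 for plotting
library(caret)
library(ggplot2)
# Reading the dataset
weather_data <- read.csv("C:\\Users\\singh\\Documents\\StatsR\\dataset\\Final\\weather_repo.csv")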
# Displaying the first few rows of the dataset
head(weather_data)
## country location_name latitude longitude timezone last_updated_epoch
## 1 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693301400
## 2 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693364400
## 3 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693439100
## 4 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693525500
## 5 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693611000
## 6 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693698300
## last_updated temperature_celsius temperature_fahrenheit
## 1 8/29/2023 14:00 28.8 83.8
## 2 8/30/2023 7:30 21.3 70.3
## 3 8/31/2023 4:15 18.1 64.6
## 4 9/1/2023 4:15 19.2 66.6
## 5 9/2/2023 4:00 18.5 65.3
## 6 9/3/2023 4:15 17.0 62.6
## condition_text wind_mph wind_kph wind_degree wind_direction
## 1 Sunny 7.2 11.5 74 ENE
## 2 Sunny 2.2 3.6 199 SSW
## 3 Clear 2.2 3.6 256 WSW
## 4 Clear 2.2 3.6 282 WNW
## 5 Moderate rain at times 2.2 3.6 262 W
## 6 Clear 2.2 3.6 237 WSW
## pressure_mb pressure_in precip_mm precip_in humidity cloud feels_like_celsius
## 1 1004 29.64 0.0 0.00 19 0 26.7
## 2 1011 29.84 0.0 0.00 54 4 21.3
## 3 1010 29.83 0.0 0.00 40 0 18.1
## 4 1010 29.83 0.0 0.00 49 5 19.2
## 5 1010 29.82 0.5 0.02 40 87 18.6
## 6 1009 29.79 0.0 0.00 27 0 17.0
## feels_like_fahrenheit visibility_km visibility_miles uv_index gust_mph
## 1 80.1 10 6 7 8.3
## 2 70.3 10 6 6 2.5
## 3 64.6 10 6 1 3.4
## 4 66.6 10 6 1 3.1
## 5 65.5 10 6 1 2.7
## 6 62.6 10 6 1 2.9
## gust_kph air_quality_Carbon_Monoxide air_quality_Ozone
## 1 13.3 647.5 130.2
## 2 4.0 2964.0 57.2
## 3 5.4 754.4 46.5
## 4 5.0 1228.3 45.4
## 5 4.3 454.0 52.9
## 6 4.7 701.0 64.4
## air_quality_Nitrogen_dioxide air_quality_Sulphur_dioxide air_quality_PM2.5
## 1 1.2 0.4 7.9
## 2 20.9 0.8 31.7
## 3 6.4 0.4 7.7
## 4 12.7 0.7 20.9
## 5 4.7 0.4 10.8
## 6 6.8 0.6 12.2
## air_quality_PM10 air_quality_us_epa_index air_quality_gb_defra_index sunrise
## 1 11.1 1 1 5:24 AM
## 2 39.3 2 3 5:25 AM
## 3 12.8 1 1 5:25 AM
## 4 52.4 2 2 5:26 AM
## 5 24.3 1 1 5:26 AM
## 6 25.9 1 2 5:27 AM
## sunset moonrise moonset moon_phase moon_illumination
## 1 6:24 PM 5:39 PM 2:48 AM Waxing Gibbous 93
## 2 6:23 PM 6:18 PM 4:05 AM Full Moon 98
## 3 6:23 PM 6:18 PM 4:05 AM Full Moon 98
## 4 6:21 PM 6:52 PM 5:22 AM Waning Gibbous 100
## 5 6:20 PM 7:23 PM 6:36 AM Waning Gibbous 99
## 6 6:19 PM 7:53 PM 7:48 AM Waning Gibbous 94
I will use the condition_text column, which can be converted into a
binary variable by grouping weather conditions into ‘Rainy’ and ‘Not Rainy’.
An observation is labelled rainy when its condition text contains the word
“rain”, ignoring case.
# Creating a binary variable 'is_rainy'
weather_data$is_rainy <- ifelse(grepl("rain", weather_data$condition_text, ignore.case = TRUE), 1, 0)
# Checking the structure of the modified dataset
str(weather_data)
## 'data.frame': 2534 obs. of 42 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ location_name : chr "Kabul" "Kabul" "Kabul" "Kabul" ...
## $ latitude : num 34.5 34.5 34.5 34.5 34.5 ...
## $ longitude : num 69.2 69.2 69.2 69.2 69.2 ...
## $ timezone : chr "Asia/Kabul" "Asia/Kabul" "Asia/Kabul" "Asia/Kabul" ...
## $ last_updated_epoch : int 1693301400 1693364400 1693439100 1693525500 1693611000 1693698300 1693783800 1693870200 1693955700 1694041200 ...
## $ last_updated : chr "8/29/2023 14:00" "8/30/2023 7:30" "8/31/2023 4:15" "9/1/2023 4:15" ...
## $ temperature_celsius : num 28.8 21.3 18.1 19.2 18.5 17 15.6 13.8 15.3 16.6 ...
## $ temperature_fahrenheit : num 83.8 70.3 64.6 66.6 65.3 62.6 60.1 56.8 59.5 61.9 ...
## $ condition_text : chr "Sunny" "Sunny" "Clear" "Clear" ...
## $ wind_mph : num 7.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 ...
## $ wind_kph : num 11.5 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 ...
## $ wind_degree : int 74 199 256 282 262 237 330 332 270 336 ...
## $ wind_direction : chr "ENE" "SSW" "WSW" "WNW" ...
## $ pressure_mb : int 1004 1011 1010 1010 1010 1009 1009 1011 1011 1011 ...
## $ pressure_in : num 29.6 29.8 29.8 29.8 29.8 ...
## $ precip_mm : num 0 0 0 0 0.5 0 0.1 0 0 0 ...
## $ precip_in : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ humidity : int 19 54 40 49 40 27 23 27 25 32 ...
## $ cloud : int 0 4 0 5 87 0 86 0 0 6 ...
## $ feels_like_celsius : num 26.7 21.3 18.1 19.2 18.6 17 15.6 13.8 15.3 16.6 ...
## $ feels_like_fahrenheit : num 80.1 70.3 64.6 66.6 65.5 62.6 60.1 56.8 59.5 61.9 ...
## $ visibility_km : num 10 10 10 10 10 10 10 10 10 10 ...
## $ visibility_miles : int 6 6 6 6 6 6 6 6 6 6 ...
## $ uv_index : int 7 6 1 1 1 1 1 1 1 1 ...
## $ gust_mph : num 8.3 2.5 3.4 3.1 2.7 2.9 2.9 3.8 0.4 3.8 ...
## $ gust_kph : num 13.3 4 5.4 5 4.3 4.7 4.7 6.1 0.7 6.1 ...
## $ air_quality_Carbon_Monoxide : num 648 2964 754 1228 454 ...
## $ air_quality_Ozone : num 130.2 57.2 46.5 45.4 52.9 ...
## $ air_quality_Nitrogen_dioxide: num 1.2 20.9 6.4 12.7 4.7 6.8 5.8 6.8 9.9 4.3 ...
## $ air_quality_Sulphur_dioxide : num 0.4 0.8 0.4 0.7 0.4 0.6 0.5 0.5 0.7 0.5 ...
## $ air_quality_PM2.5 : num 7.9 31.7 7.7 20.9 10.8 12.2 6.4 8.6 7.2 11.7 ...
## $ air_quality_PM10 : num 11.1 39.3 12.8 52.4 24.3 25.9 14.3 18.8 14.9 22.3 ...
## $ air_quality_us_epa_index : int 1 2 1 2 1 1 1 1 1 1 ...
## $ air_quality_gb_defra_index : int 1 3 1 2 1 2 1 1 1 1 ...
## $ sunrise : chr "5:24 AM" "5:25 AM" "5:25 AM" "5:26 AM" ...
## $ sunset : chr "6:24 PM" "6:23 PM" "6:23 PM" "6:21 PM" ...
## $ moonrise : chr "5:39 PM" "6:18 PM" "6:18 PM" "6:52 PM" ...
## $ moonset : chr "2:48 AM" "4:05 AM" "4:05 AM" "5:22 AM" ...
## $ moon_phase : chr "Waxing Gibbous" "Full Moon" "Full Moon" "Waning Gibbous" ...
## $ moon_illumination : int 93 98 98 100 99 94 88 79 70 60 ...
## $ is_rainy : num 0 0 0 0 1 0 1 0 0 0 ...
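The labelling rule can be sanity-checked on a handful of condition strings. The sketch below reuses three values visible in the head() output and adds two made-up ones ("Light drizzle" and "Patchy RAIN possible") purely for illustration. It assumes nothing about which conditions actually occur in the full file.

```r
# Three condition strings from the head() output plus two hypothetical ones
conditions <- c("Sunny", "Clear", "Moderate rain at times",
                "Light drizzle", "Patchy RAIN possible")

# Same rule as above: 1 if the text contains "rain" (any case), else 0
is_rainy <- ifelse(grepl("rain", conditions, ignore.case = TRUE), 1, 0)

# Pair each string with its label to see what the rule catches
data.frame(condition_text = conditions, is_rainy = is_rainy)
```

Under this rule a string such as "Light drizzle" would be labelled not rainy. Tabulating the real column with table(weather_data$condition_text, weather_data$is_rainy) is a quick way to confirm that the labels match the intended definition.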
Using the binary variable is_rainy (derived from condition_text) as the
dependent variable, we will fit a logistic regression model with four
explanatory variables: temperature, humidity, wind speed, and
pressure.
# Selecting explanatory variables
explanatory_vars <- c("temperature_celsius", "humidity", "wind_kph", "pressure_mb")
# Splitting the data into training and testing sets
set.seed(42)
trainIndex <- createDataPartition(weather_data$is_rainy, p = .7, list = FALSE, times = 1)
trainData <- weather_data[trainIndex, ]
testData <- weather_data[-trainIndex, ]
# Fitting the logistic regression model
logitModel <- glm(is_rainy ~ temperature_celsius + humidity + wind_kph + pressure_mb,
                  data = trainData, family = "binomial")
# Summary of the model
summary(logitModel)
##
## Call:
## glm(formula = is_rainy ~ temperature_celsius + humidity + wind_kph +
## pressure_mb, family = "binomial", data = trainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.069630 16.331625 4.413 1.02e-05 ***
## temperature_celsius -0.006885 0.014321 -0.481 0.631
## humidity 0.055591 0.006706 8.290 < 2e-16 ***
## wind_kph 0.061875 0.009885 6.260 3.86e-10 ***
## pressure_mb -0.078089 0.015985 -4.885 1.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1257.9 on 1773 degrees of freedom
## Residual deviance: 1094.8 on 1769 degrees of freedom
## AIC: 1104.8
##
## Number of Fisher Scoring iterations: 6
Temperature (Celsius) (-0.0069): Holding the other predictors constant, each one-degree increase in temperature lowers the log odds of rain by 0.0069. This effect is not statistically significant (p-value = 0.631), so the model gives no evidence that temperature helps predict rain once the other variables are included.
Humidity (0.0556): Holding the other predictors constant, each one-unit increase in humidity raises the log odds of rain by 0.0556. This effect is statistically significant (p-value < 2e-16), suggesting humidity is a strong predictor of rain.
Wind Speed (kph) (0.0619): Holding the other predictors constant, each one-kph increase in wind speed raises the log odds of rain by 0.0619. This is also statistically significant (p-value = 3.86e-10), indicating that wind speed is an important factor in predicting rain.
Pressure (mb) (-0.0781): Holding the other predictors constant, each one-millibar increase in pressure lowers the log odds of rain by 0.0781. This is statistically significant (p-value = 1.03e-06), so lower pressure is associated with higher odds of rain.
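Log odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below copies the four estimates from the summary() output into a named vector (coefs, odds_ratios and or_humidity_10 are illustrative names, not objects from the original analysis).

```r
# Coefficient estimates copied from the summary() output above
coefs <- c(temperature_celsius = -0.006885,
           humidity            =  0.055591,
           wind_kph            =  0.061875,
           pressure_mb         = -0.078089)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# Odds ratio for a 10-unit increase in humidity
or_humidity_10 <- exp(10 * coefs[["humidity"]])
round(or_humidity_10, 2)
```

With these estimates, each additional unit of humidity multiplies the odds of rain by roughly 1.06, and a 10-unit increase by roughly 1.74, holding the other predictors fixed. On the fitted object the same quantities come from exp(coef(logitModel)).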
Interpreting the coefficients and computing confidence intervals for the model.
confint(logitModel)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 40.23000727 104.31740332
## temperature_celsius -0.03476764 0.02143936
## humidity 0.04297378 0.06927606
## wind_kph 0.04254998 0.08135457
## pressure_mb -0.10968704 -0.04696293
We now look more closely at the confidence interval for one of the
coefficients, choosing humidity for this purpose.
The 95% confidence interval for the coefficient of humidity is approximately [0.0430, 0.0693] on the log-odds scale. This means we are 95% confident that the true log-odds coefficient for humidity lies within this interval.
This confidence interval does not include zero, which indicates that the effect of humidity on the log odds of rain is statistically significant at the 5% level.
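The interval printed by confint() is a profile-likelihood interval, as the "Waiting for profiling to be done..." message indicates. As a cross-check, the sketch below computes the simpler Wald interval from the humidity estimate and standard error in the summary() output, and then moves the profile interval to the odds-ratio scale. The variable names are illustrative.

```r
# Humidity estimate and standard error copied from the summary() output
est <- 0.055591
se  <- 0.006706

# Wald 95% interval on the log-odds scale: estimate +/- z * SE
z <- qnorm(0.975)
wald_ci <- c(lower = est - z * se, upper = est + z * se)
round(wald_ci, 4)

# Profile-likelihood limits reported by confint(), moved to the odds-ratio scale
profile_ci <- c(lower = 0.04297378, upper = 0.06927606)
or_ci <- exp(profile_ci)
round(or_ci, 3)
```

The Wald interval comes out at about [0.0424, 0.0687], close to the profile interval, and the corresponding odds-ratio interval is about [1.044, 1.072]. Because the odds-ratio interval excludes 1, it leads to the same conclusion as the log-odds interval excluding 0.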
Exploring the need for transformations of the explanatory variables.
# Creating a new data frame for predictions
humidity_range <- seq(min(trainData$humidity), max(trainData$humidity), length.out = 100)
new_data <- data.frame(temperature_celsius = mean(trainData$temperature_celsius),
                       humidity = humidity_range,
                       wind_kph = mean(trainData$wind_kph),
                       pressure_mb = mean(trainData$pressure_mb))
# Adding predicted probabilities to the new data frame
new_data$predicted_prob <- predict(logitModel, newdata = new_data, type = "response")
# Plotting
ggplot(trainData, aes(x = humidity, y = is_rainy)) +
  geom_point(alpha = 0.2) +
  geom_line(data = new_data, aes(x = humidity, y = predicted_prob), color = "blue") +
  stat_smooth(method = "loess", color = "red", se = FALSE) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Predicted Probability of Rain vs. Humidity",
       x = "Humidity",
       y = "Predicted Probability of Rain")
## `geom_smooth()` using formula = 'y ~ x'
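A visual comparison can be complemented by a formal check. One common option, not used in the original analysis, is to add a squared humidity term and compare the two nested models with a likelihood-ratio test. The sketch below runs on simulated data so that it is self-contained. The objects toy, toy_linear and toy_quadratic are hypothetical stand-ins, and the simulated coefficients are arbitrary.

```r
# Simulated stand-in data; the real analysis would use trainData instead
set.seed(42)
toy <- data.frame(humidity = runif(500, 10, 100))
toy$is_rainy <- rbinom(500, 1, plogis(-4 + 0.05 * toy$humidity))

# Model with humidity entered linearly on the log-odds scale
toy_linear <- glm(is_rainy ~ humidity, data = toy, family = "binomial")

# Same model plus a squared term to allow curvature
toy_quadratic <- update(toy_linear, . ~ . + I(humidity^2))

# Likelihood-ratio test of the two nested models
lrt <- anova(toy_linear, toy_quadratic, test = "Chisq")
lrt
```

A large p-value in the second row gives no evidence that the squared term improves the fit, which would be consistent with leaving the variable untransformed. Applied to the model in this analysis, the analogous extended model would be update(logitModel, . ~ . + I(humidity^2)).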
The logistic regression model shows statistically significant relationships between the probability of rainy weather and humidity, wind speed, and pressure, while temperature does not show a significant effect. Comparing the LOESS smooth of the observed outcomes with the fitted logistic curve for humidity does not strongly indicate a need to transform the explanatory variables, although only humidity was examined this way.
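The 30% of rows held out in testData are not scored anywhere in the analysis. A minimal sketch of how a held-out set could be evaluated is given below. The helper evaluate_logit() and the 0.5 threshold are assumptions made for illustration, and the example runs on simulated data rather than the weather file.

```r
# Hypothetical helper: confusion matrix and accuracy for a fitted binomial glm
evaluate_logit <- function(model, data, outcome = "is_rainy", threshold = 0.5) {
  # Predicted probabilities on the supplied data
  prob <- predict(model, newdata = data, type = "response")
  # Classify as rainy (1) when the probability reaches the threshold
  pred <- factor(as.integer(prob >= threshold), levels = c(0, 1))
  obs  <- factor(data[[outcome]], levels = c(0, 1))
  cm   <- table(predicted = pred, observed = obs)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}

# Simulated stand-in for the real train/test split
set.seed(1)
toy <- data.frame(humidity = runif(200, 10, 100))
toy$is_rainy <- rbinom(200, 1, plogis(-4 + 0.05 * toy$humidity))
toy_model <- glm(is_rainy ~ humidity, data = toy[1:140, ], family = "binomial")

# Score the 60 rows that were not used for fitting
res <- evaluate_logit(toy_model, toy[141:200, ])
res$confusion
res$accuracy
```

With the objects from this analysis the call would be evaluate_logit(logitModel, testData). If rainy observations turn out to be comparatively rare, accuracy alone can be misleading, so the confusion matrix is worth reading alongside it.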