1. Select an interesting binary column of data, or one which can reasonably be converted into a binary variable.
  1. Build a logistic regression model for this variable, using between 1 and 4 explanatory variables.
  2. Consider a transformation for any explanatory variable, and illustrate why you need the transformation (or why you do not).

Weather Data Analysis

This analysis involves building a logistic regression model using a weather dataset. The primary goal is to predict a binary variable based on several explanatory variables.

Data Loading and Pre-processing

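# Loading the packages used below: caret for createDataPartition(), ggplot2 for the plot
library(caret)
library(ggplot2)

# Reading the dataset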
weather_data <- read.csv("C:\\Users\\singh\\Documents\\StatsR\\dataset\\Final\\weather_repo.csv")

# Displaying the first few rows of the dataset
head(weather_data)
##       country location_name latitude longitude   timezone last_updated_epoch
## 1 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693301400
## 2 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693364400
## 3 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693439100
## 4 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693525500
## 5 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693611000
## 6 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693698300
##      last_updated temperature_celsius temperature_fahrenheit
## 1 8/29/2023 14:00                28.8                   83.8
## 2  8/30/2023 7:30                21.3                   70.3
## 3  8/31/2023 4:15                18.1                   64.6
## 4   9/1/2023 4:15                19.2                   66.6
## 5   9/2/2023 4:00                18.5                   65.3
## 6   9/3/2023 4:15                17.0                   62.6
##           condition_text wind_mph wind_kph wind_degree wind_direction
## 1                  Sunny      7.2     11.5          74            ENE
## 2                  Sunny      2.2      3.6         199            SSW
## 3                  Clear      2.2      3.6         256            WSW
## 4                  Clear      2.2      3.6         282            WNW
## 5 Moderate rain at times      2.2      3.6         262              W
## 6                  Clear      2.2      3.6         237            WSW
##   pressure_mb pressure_in precip_mm precip_in humidity cloud feels_like_celsius
## 1        1004       29.64       0.0      0.00       19     0               26.7
## 2        1011       29.84       0.0      0.00       54     4               21.3
## 3        1010       29.83       0.0      0.00       40     0               18.1
## 4        1010       29.83       0.0      0.00       49     5               19.2
## 5        1010       29.82       0.5      0.02       40    87               18.6
## 6        1009       29.79       0.0      0.00       27     0               17.0
##   feels_like_fahrenheit visibility_km visibility_miles uv_index gust_mph
## 1                  80.1            10                6        7      8.3
## 2                  70.3            10                6        6      2.5
## 3                  64.6            10                6        1      3.4
## 4                  66.6            10                6        1      3.1
## 5                  65.5            10                6        1      2.7
## 6                  62.6            10                6        1      2.9
##   gust_kph air_quality_Carbon_Monoxide air_quality_Ozone
## 1     13.3                       647.5             130.2
## 2      4.0                      2964.0              57.2
## 3      5.4                       754.4              46.5
## 4      5.0                      1228.3              45.4
## 5      4.3                       454.0              52.9
## 6      4.7                       701.0              64.4
##   air_quality_Nitrogen_dioxide air_quality_Sulphur_dioxide air_quality_PM2.5
## 1                          1.2                         0.4               7.9
## 2                         20.9                         0.8              31.7
## 3                          6.4                         0.4               7.7
## 4                         12.7                         0.7              20.9
## 5                          4.7                         0.4              10.8
## 6                          6.8                         0.6              12.2
##   air_quality_PM10 air_quality_us_epa_index air_quality_gb_defra_index sunrise
## 1             11.1                        1                          1 5:24 AM
## 2             39.3                        2                          3 5:25 AM
## 3             12.8                        1                          1 5:25 AM
## 4             52.4                        2                          2 5:26 AM
## 5             24.3                        1                          1 5:26 AM
## 6             25.9                        1                          2 5:27 AM
##    sunset moonrise moonset     moon_phase moon_illumination
## 1 6:24 PM  5:39 PM 2:48 AM Waxing Gibbous                93
## 2 6:23 PM  6:18 PM 4:05 AM      Full Moon                98
## 3 6:23 PM  6:18 PM 4:05 AM      Full Moon                98
## 4 6:21 PM  6:52 PM 5:22 AM Waning Gibbous               100
## 5 6:20 PM  7:23 PM 6:36 AM Waning Gibbous                99
## 6 6:19 PM  7:53 PM 7:48 AM Waning Gibbous                94

Binary Variable Creation

I will use the condition_text column, which can be converted into a binary variable by classifying each weather condition as ‘Rainy’ or ‘Not Rainy’. Any condition whose text contains "rain" (case-insensitive) is coded 1; every other condition is coded 0.

# Creating a binary variable 'is_rainy'
weather_data$is_rainy <- ifelse(grepl("rain", weather_data$condition_text, ignore.case = TRUE), 1, 0)

# Checking the structure of the modified dataset
str(weather_data)
## 'data.frame':    2534 obs. of  42 variables:
##  $ country                     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ location_name               : chr  "Kabul" "Kabul" "Kabul" "Kabul" ...
##  $ latitude                    : num  34.5 34.5 34.5 34.5 34.5 ...
##  $ longitude                   : num  69.2 69.2 69.2 69.2 69.2 ...
##  $ timezone                    : chr  "Asia/Kabul" "Asia/Kabul" "Asia/Kabul" "Asia/Kabul" ...
##  $ last_updated_epoch          : int  1693301400 1693364400 1693439100 1693525500 1693611000 1693698300 1693783800 1693870200 1693955700 1694041200 ...
##  $ last_updated                : chr  "8/29/2023 14:00" "8/30/2023 7:30" "8/31/2023 4:15" "9/1/2023 4:15" ...
##  $ temperature_celsius         : num  28.8 21.3 18.1 19.2 18.5 17 15.6 13.8 15.3 16.6 ...
##  $ temperature_fahrenheit      : num  83.8 70.3 64.6 66.6 65.3 62.6 60.1 56.8 59.5 61.9 ...
##  $ condition_text              : chr  "Sunny" "Sunny" "Clear" "Clear" ...
##  $ wind_mph                    : num  7.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 ...
##  $ wind_kph                    : num  11.5 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 ...
##  $ wind_degree                 : int  74 199 256 282 262 237 330 332 270 336 ...
##  $ wind_direction              : chr  "ENE" "SSW" "WSW" "WNW" ...
##  $ pressure_mb                 : int  1004 1011 1010 1010 1010 1009 1009 1011 1011 1011 ...
##  $ pressure_in                 : num  29.6 29.8 29.8 29.8 29.8 ...
##  $ precip_mm                   : num  0 0 0 0 0.5 0 0.1 0 0 0 ...
##  $ precip_in                   : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ humidity                    : int  19 54 40 49 40 27 23 27 25 32 ...
##  $ cloud                       : int  0 4 0 5 87 0 86 0 0 6 ...
##  $ feels_like_celsius          : num  26.7 21.3 18.1 19.2 18.6 17 15.6 13.8 15.3 16.6 ...
##  $ feels_like_fahrenheit       : num  80.1 70.3 64.6 66.6 65.5 62.6 60.1 56.8 59.5 61.9 ...
##  $ visibility_km               : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ visibility_miles            : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ uv_index                    : int  7 6 1 1 1 1 1 1 1 1 ...
##  $ gust_mph                    : num  8.3 2.5 3.4 3.1 2.7 2.9 2.9 3.8 0.4 3.8 ...
##  $ gust_kph                    : num  13.3 4 5.4 5 4.3 4.7 4.7 6.1 0.7 6.1 ...
##  $ air_quality_Carbon_Monoxide : num  648 2964 754 1228 454 ...
##  $ air_quality_Ozone           : num  130.2 57.2 46.5 45.4 52.9 ...
##  $ air_quality_Nitrogen_dioxide: num  1.2 20.9 6.4 12.7 4.7 6.8 5.8 6.8 9.9 4.3 ...
##  $ air_quality_Sulphur_dioxide : num  0.4 0.8 0.4 0.7 0.4 0.6 0.5 0.5 0.7 0.5 ...
##  $ air_quality_PM2.5           : num  7.9 31.7 7.7 20.9 10.8 12.2 6.4 8.6 7.2 11.7 ...
##  $ air_quality_PM10            : num  11.1 39.3 12.8 52.4 24.3 25.9 14.3 18.8 14.9 22.3 ...
##  $ air_quality_us_epa_index    : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ air_quality_gb_defra_index  : int  1 3 1 2 1 2 1 1 1 1 ...
##  $ sunrise                     : chr  "5:24 AM" "5:25 AM" "5:25 AM" "5:26 AM" ...
##  $ sunset                      : chr  "6:24 PM" "6:23 PM" "6:23 PM" "6:21 PM" ...
##  $ moonrise                    : chr  "5:39 PM" "6:18 PM" "6:18 PM" "6:52 PM" ...
##  $ moonset                     : chr  "2:48 AM" "4:05 AM" "4:05 AM" "5:22 AM" ...
##  $ moon_phase                  : chr  "Waxing Gibbous" "Full Moon" "Full Moon" "Waning Gibbous" ...
##  $ moon_illumination           : int  93 98 98 100 99 94 88 79 70 60 ...
##  $ is_rainy                    : num  0 0 0 0 1 0 1 0 0 0 ...
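Before modelling, it can help to confirm how the pattern match behaves and how balanced the resulting classes are. The sketch below applies the same grepl() rule to a handful of condition strings. "Sunny", "Clear" and "Moderate rain at times" appear in the data above; the other three strings are hypothetical examples added for illustration.

```r
# Example condition strings; the last three are hypothetical, not taken from the dataset
conditions <- c("Sunny", "Clear", "Moderate rain at times",
                "Patchy rain possible", "Light drizzle", "Heavy Rain")

# Same rule as above: 1 if the text contains "rain" (any case), otherwise 0
is_rainy <- ifelse(grepl("rain", conditions, ignore.case = TRUE), 1, 0)

# Show which strings were flagged
print(data.frame(conditions, is_rainy))

# Class counts and proportions
class_counts <- table(is_rainy)
print(class_counts)
print(prop.table(class_counts))
```

Under this rule a string such as "Light drizzle" is coded 0, so whether drizzle or showers should count as rain is a modelling decision worth checking against unique(weather_data$condition_text).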

Logistic Regression Model

Using the binary variable is_rainy (derived from condition_text) as the dependent variable, we construct a logistic regression model with four explanatory variables: temperature, humidity, wind speed, and pressure.

# Explanatory variables (listed for reference; the model formula below names them explicitly)
explanatory_vars <- c("temperature_celsius", "humidity", "wind_kph", "pressure_mb")

# Splitting the data into training and testing sets
set.seed(42)
trainIndex <- createDataPartition(weather_data$is_rainy, p = .7, list = FALSE, times = 1)
trainData <- weather_data[trainIndex, ]
testData <- weather_data[-trainIndex, ]

# Fitting the logistic regression model
logitModel <- glm(is_rainy ~ temperature_celsius + humidity + wind_kph + pressure_mb, 
                  data = trainData, family = "binomial")

# Summary of the model
summary(logitModel)
## 
## Call:
## glm(formula = is_rainy ~ temperature_celsius + humidity + wind_kph + 
##     pressure_mb, family = "binomial", data = trainData)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         72.069630  16.331625   4.413 1.02e-05 ***
## temperature_celsius -0.006885   0.014321  -0.481    0.631    
## humidity             0.055591   0.006706   8.290  < 2e-16 ***
## wind_kph             0.061875   0.009885   6.260 3.86e-10 ***
## pressure_mb         -0.078089   0.015985  -4.885 1.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1257.9  on 1773  degrees of freedom
## Residual deviance: 1094.8  on 1769  degrees of freedom
## AIC: 1104.8
## 
## Number of Fisher Scoring iterations: 6
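The null and residual deviances in the summary can be turned into an overall likelihood-ratio test of the four predictors taken together. The sketch below is a worked calculation using only the numbers printed above. McFadden's pseudo R-squared is included as an extra summary that the original analysis does not compute.

```r
# Deviances and degrees of freedom copied from the summary output above
null_deviance  <- 1257.9
resid_deviance <- 1094.8
null_df        <- 1773
resid_df       <- 1769

# Likelihood-ratio statistic: drop in deviance from adding the four predictors
lr_stat <- null_deviance - resid_deviance
lr_df   <- null_df - resid_df

# Upper-tail chi-squared p-value for the overall test
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (an additional summary, not part of the original analysis)
pseudo_r2 <- 1 - resid_deviance / null_deviance

print(c(lr_stat = lr_stat, df = lr_df, p_value = p_value, pseudo_r2 = pseudo_r2))
```

A deviance drop of about 163 on 4 degrees of freedom indicates that the predictors jointly improve on the intercept-only model, although the pseudo R-squared of roughly 0.13 suggests much of the variation in raininess remains unexplained.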

Observations

  1. Temperature (Celsius) (-0.0069): For each one-degree increase in temperature, the log odds of rain decrease by 0.0069, holding the other variables constant. This effect is not statistically significant (p-value = 0.631), so the model gives no evidence that temperature predicts rain once humidity, wind speed, and pressure are accounted for.

  2. Humidity (0.0556): For each one-unit increase in humidity, the log odds of rain increase by 0.0556, holding the other variables constant. This effect is statistically significant (p-value < 2e-16), suggesting humidity is a strong predictor of rain.

  3. Wind Speed (kph) (0.0619): For each one-kph increase in wind speed, the log odds of rain increase by 0.0619, holding the other variables constant. This is also statistically significant (p-value = 3.86e-10), indicating that wind speed is an important factor in predicting rain.

  4. Pressure (mb) (-0.0781): For each one-millibar decrease in pressure, the log odds of rain increase by 0.0781, holding the other variables constant. This is statistically significant (p-value = 1.03e-06), suggesting that lower pressure is associated with a higher chance of rain.
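Log-odds coefficients are often easier to read as odds ratios. The sketch below exponentiates the estimates copied from the summary output. It is a worked conversion rather than part of the original analysis.

```r
# Coefficient estimates copied from the summary output above
coefs <- c(temperature_celsius = -0.006885,
           humidity            =  0.055591,
           wind_kph            =  0.061875,
           pressure_mb         = -0.078089)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)

# Percentage change in the odds per one-unit increase
pct_change <- 100 * (odds_ratios - 1)

print(round(cbind(odds_ratio = odds_ratios, pct_change = pct_change), 3))
```

With the fitted model in memory, exp(coef(logitModel)) gives the same quantities directly. For humidity the odds ratio is roughly 1.057, that is, about a 5.7% increase in the odds of rain per unit of humidity.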

Confidence Intervals

Interpreting the coefficients and computing confidence intervals for the model.

confint(logitModel)
## Waiting for profiling to be done...
##                           2.5 %       97.5 %
## (Intercept)         40.23000727 104.31740332
## temperature_celsius -0.03476764   0.02143936
## humidity             0.04297378   0.06927606
## wind_kph             0.04254998   0.08135457
## pressure_mb         -0.10968704  -0.04696293

Interpretation

  • confint() reports profile-likelihood confidence intervals for every coefficient. We focus on the coefficient of humidity.

  • The 95% confidence interval for the coefficient of humidity is approximately [0.0430, 0.0693] on the log-odds scale. This means we are 95% confident that the true log-odds coefficient for humidity lies within this interval.

  • This confidence interval does not include zero, which agrees with the significance test above: the effect of humidity on the likelihood of rain is statistically significant at the 5% level.
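As a cross-check, a Wald interval can be computed by hand from the humidity estimate and its standard error, and both intervals can be moved to the odds-ratio scale. The sketch below uses only numbers printed in the output above. The Wald interval is an approximation shown for comparison, not the method confint() used.

```r
# Humidity estimate and standard error from the summary output above
estimate  <- 0.055591
std_error <- 0.006706

# Wald 95% interval: estimate +/- z * SE
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)

# Profile-likelihood interval reported by confint() above
profile_ci <- c(lower = 0.04297378, upper = 0.06927606)

# Both intervals on the log-odds scale and on the odds-ratio scale
print(rbind(wald_log_odds      = wald_ci,
            profile_log_odds   = profile_ci,
            wald_odds_ratio    = exp(wald_ci),
            profile_odds_ratio = exp(profile_ci)))
```

The two intervals are close, which is expected with a sample of this size. On the odds-ratio scale both lie entirely above 1, consistent with a positive humidity effect.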

Exploratory Data Analysis for Transformations

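To check whether humidity needs a transformation, we plot the model's predicted probability of rain across the observed humidity range (blue line), with the other explanatory variables held at their training-set means, and compare it with a loess smooth of the observed outcomes (red line).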

# Creating a new data frame for predictions
humidity_range <- seq(min(trainData$humidity), max(trainData$humidity), length.out = 100)
new_data <- data.frame(temperature_celsius = mean(trainData$temperature_celsius),
                       humidity = humidity_range,
                       wind_kph = mean(trainData$wind_kph),
                       pressure_mb = mean(trainData$pressure_mb))

# Adding predicted probabilities to the new data frame
new_data$predicted_prob <- predict(logitModel, newdata = new_data, type = "response")

# Plotting
ggplot(trainData, aes(x = humidity, y = is_rainy)) +
  geom_point(alpha = 0.2) +
  geom_line(data = new_data, aes(x = humidity, y = predicted_prob), color = "blue") +
  stat_smooth(method = "loess", color = "red", se = FALSE) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Predicted Probability of Rain vs. Humidity",
       x = "Humidity",
       y = "Predicted Probability of Rain")
## `geom_smooth()` using formula = 'y ~ x'
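The predicted probabilities come from applying the inverse logit to the linear predictor. The sketch below reproduces that calculation by hand with the rounded coefficients from the summary output. The temperature, wind speed, and pressure values are hypothetical, chosen only for illustration, so the results will differ slightly from predict() on the fitted model.

```r
# Coefficients copied from the summary output above
b <- c(intercept = 72.069630, temperature_celsius = -0.006885,
       humidity = 0.055591, wind_kph = 0.061875, pressure_mb = -0.078089)

# Hypothetical conditions chosen for illustration (not taken from the dataset)
x <- c(temperature_celsius = 20, wind_kph = 10, pressure_mb = 1010)

# Predicted probability of rain at a given humidity, other predictors fixed at x
prob_rain <- function(humidity) {
  eta <- b[["intercept"]] +
    b[["temperature_celsius"]] * x[["temperature_celsius"]] +
    b[["humidity"]] * humidity +
    b[["wind_kph"]] * x[["wind_kph"]] +
    b[["pressure_mb"]] * x[["pressure_mb"]]
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

print(prob_rain(c(20, 50, 80)))
```

Because the humidity coefficient is positive, the predicted probability rises with humidity, tracing the S-shaped curve that a logistic model implies even when the log odds are exactly linear.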

Interpretation

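  • The relationship between humidity and ‘is_rainy’ appears to be adequately described by the untransformed humidity term (that is, approximately linear on the log-odds scale), indicating no need for a transformation.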
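A visual check can be complemented by a formal comparison of nested models. The sketch below is not the method used in this analysis. It runs on simulated data (the variable names mirror the real ones, but every value is generated) and tests whether adding a quadratic humidity term improves the fit.

```r
set.seed(1)

# Simulated data: the true relationship is linear on the log-odds scale
n <- 500
humidity <- runif(n, 10, 100)
is_rainy <- rbinom(n, 1, plogis(-4 + 0.05 * humidity))
sim <- data.frame(humidity, is_rainy)

# Model with humidity only, and model with an added quadratic term
fit_linear <- glm(is_rainy ~ humidity, data = sim, family = binomial)
fit_quad   <- glm(is_rainy ~ humidity + I(humidity^2), data = sim, family = binomial)

# Likelihood-ratio test of the quadratic term, plus AIC for both models
lrt <- anova(fit_linear, fit_quad, test = "Chisq")
print(lrt)
print(AIC(fit_linear, fit_quad))
```

A large p-value for the added term, or no improvement in AIC, would support keeping humidity untransformed. Applied to the real data, the same comparison would use trainData and the full set of explanatory variables.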

Conclusion

The logistic regression model indicates statistically significant relationships between the probability of rainy weather and humidity, wind speed, and pressure: higher humidity, higher wind speed, and lower pressure are each associated with higher odds of rain. Temperature does not show a significant effect. The exploratory check on humidity, the only explanatory variable examined, does not indicate a need for a transformation.