library(readr)
library(stats)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ stringr   1.5.0
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
my_data <- read_delim("C:/Users/user/Documents/Statistics/Telangana_2018_complete_weather_data.csv",delim=",")
## Rows: 230384 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): District, Mandal, Location,  Date
## dbl (6): row_id, temp_min, temp_max, humidity_min, humidity_max, wind_speed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable

I created a new column “humid_average” by doing average of “humidity_min” and “humidity_max”

my_new_data<- my_data %>%
  mutate(humid_average = (humidity_min + humidity_max) / 2)

converting “humid_average” column into a binary column”humid_day”

my_new_data$humid_day <- ifelse(my_new_data$humid_average > 61.20, 1, 0)

Build a logistic regression model for this variable, using explanatory variables

Explanatory variable = temp_max

model <- glm(humid_day ~ temp_max, data = my_new_data, family = binomial(link = "logit"))  
summary(model)
## 
## Call:
## glm(formula = humid_day ~ temp_max, family = binomial(link = "logit"), 
##     data = my_new_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 10.742469   0.050715   211.8   <2e-16 ***
## temp_max    -0.312727   0.001464  -213.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 318966  on 230383  degrees of freedom
## Residual deviance: 253419  on 230382  degrees of freedom
## AIC: 253423
## 
## Number of Fisher Scoring iterations: 4

Interpretation of the coefficients

Intercept :

Estimate: 10.742469,Std. Error: 0.050715,z value: 211.8,Pr(>|z|): <2e-16

The intercept represents the log-odds of the response variable humid_day being 1 when temp_max is equal to 0. In this context, the extremely positive intercept value suggests that when temp_max is zero, the log-odds of a day being “humid” are significantly positive. This implies that when the maximum temperature is very low, the model predicts a high likelihood of a “humid” day.

Coefficient for temp_max:

Estimate: -0.312727,Std. Error: 0.001464,z value: -213.6,Pr(>|z|): <2e-16

The coefficient for temp_max represents how the log-odds of the response variable humid_day being 1 change for a one-unit increase in temp_max.An estimate of -0.312727 means that for every one-unit increase in temp_max, the log-odds of the day being “humid” decrease by approximately 0.312727 units.The extremely high z value and very low p-value indicate that this coefficient is highly significant. This suggests that the variable temp_max has a strong impact on predicting whether a day is “humid” or not.

Calculating Confidence Interval

ci <- confint(model, parm = "temp_max", level = 0.95)
## Waiting for profiling to be done...
coef_temp_max <- coef(model)["temp_max"]
se_temp_max <- sqrt(vcov(model)["temp_max", "temp_max"])


lower_bound <- coef_temp_max - 1.96 * se_temp_max 
upper_bound <- coef_temp_max + 1.96 * se_temp_max

cat("Coefficient (temp_max):", coef_temp_max, "\n")
## Coefficient (temp_max): -0.3127266
cat("Standard Error (SE):", se_temp_max, "\n")
## Standard Error (SE): 0.001463881
cat("95% Confidence Interval (CI): [", lower_bound, ",", upper_bound, "]\n")
## 95% Confidence Interval (CI): [ -0.3155958 , -0.3098574 ]

In this context, The standard error is 0.001463881 which measures the variability and uncertainty in the coefficient estimate. A smaller standard error suggests a more precise estimate.

Confidence Interval (CI): The 95% confidence interval for the coefficient of temp_max is approximately [-0.3155958, -0.3098574]. This represents the range of values within which we can reasonably expect the true population coefficient to lie, with 95% confidence.

my_new_data %>%
  ggplot(mapping = aes(x =temp_max, y = humid_day)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Consider a transformation for any explanatory variable, and illustrate why you need the transformation

ggplot(my_new_data, aes(x = temp_max, y = humid_day)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Scatterplot of temp_max vs. humid_day",
       x = "temp_max",
       y = "humid_day") +
  theme_minimal()

From the above scatterplot, we can determine that the line is curved representing a non linear relationship between temp_max and humid_day. If the line appears to be curved or if the relationship is not adequately captured by a straight line, a transformation may be needed. I will be using “Log Transformation”

my_new_data <- my_new_data %>%
  mutate(log_temp_max = log(temp_max))

model <- lm(humid_day ~ log_temp_max, data = my_new_data)

rsquared <- summary(model)$r.squared

my_new_data %>%
  ggplot(mapping = aes(x = log_temp_max, y = humid_day)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  labs(
    title = "Relationship between log_temp_max and humid_day",
    subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'