Untitled

library(readr)
library(stats)
library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ stringr   1.5.0
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(glmnet)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8

my_data <- read_delim("C:/Users/user/Documents/Statistics/Telangana_2018_complete_weather_data.csv",delim=",")

## Rows: 230384 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): District, Mandal, Location,  Date
## dbl (6): row_id, temp_min, temp_max, humidity_min, humidity_max, wind_speed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable

I created a new column “humid_average” by doing average of “humidity_min” and “humidity_max”

my_new_data<- my_data %>%
  mutate(humid_average = (humidity_min + humidity_max) / 2)

converting “humid_average” column into a binary column”humid_day”

my_new_data$humid_day <- ifelse(my_new_data$humid_average > 61.20, 1, 0)

Build a logistic regression model for this variable, using explanatory variables

Interpret the coefficients, and explain what they mean in your notebook
(Bonus) Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and interpret its meaning

Explanatory variable = temp_max

model <- glm(humid_day ~ temp_max, data = my_new_data, family = binomial(link = "logit"))  
summary(model)

## 
## Call:
## glm(formula = humid_day ~ temp_max, family = binomial(link = "logit"), 
##     data = my_new_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 10.742469   0.050715   211.8   <2e-16 ***
## temp_max    -0.312727   0.001464  -213.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 318966  on 230383  degrees of freedom
## Residual deviance: 253419  on 230382  degrees of freedom
## AIC: 253423
## 
## Number of Fisher Scoring iterations: 4

Interpretation of the coefficients

Intercept :

Estimate: 10.742469,Std. Error: 0.050715,z value: 211.8,Pr(>|z|): <2e-16

The intercept represents the log-odds of the response variable humid_day being 1 when temp_max is equal to 0. In this context, the extremely positive intercept value suggests that when temp_max is zero, the log-odds of a day being “humid” are significantly positive. This implies that when the maximum temperature is very low, the model predicts a high likelihood of a “humid” day.

Coefficient for temp_max:

Estimate: -0.312727,Std. Error: 0.001464,z value: -213.6,Pr(>|z|): <2e-16

The coefficient for temp_max represents how the log-odds of the response variable humid_day being 1 change for a one-unit increase in temp_max.An estimate of -0.312727 means that for every one-unit increase in temp_max, the log-odds of the day being “humid” decrease by approximately 0.312727 units.The extremely high z value and very low p-value indicate that this coefficient is highly significant. This suggests that the variable temp_max has a strong impact on predicting whether a day is “humid” or not.

Calculating Confidence Interval

ci <- confint(model, parm = "temp_max", level = 0.95)

## Waiting for profiling to be done...

coef_temp_max <- coef(model)["temp_max"]
se_temp_max <- sqrt(vcov(model)["temp_max", "temp_max"])


lower_bound <- coef_temp_max - 1.96 * se_temp_max 
upper_bound <- coef_temp_max + 1.96 * se_temp_max

cat("Coefficient (temp_max):", coef_temp_max, "\n")

## Coefficient (temp_max): -0.3127266

cat("Standard Error (SE):", se_temp_max, "\n")

## Standard Error (SE): 0.001463881

cat("95% Confidence Interval (CI): [", lower_bound, ",", upper_bound, "]\n")

## 95% Confidence Interval (CI): [ -0.3155958 , -0.3098574 ]

In this context, The standard error is 0.001463881 which measures the variability and uncertainty in the coefficient estimate. A smaller standard error suggests a more precise estimate.

Confidence Interval (CI): The 95% confidence interval for the coefficient of temp_max is approximately [-0.3155958, -0.3098574]. This represents the range of values within which we can reasonably expect the true population coefficient to lie, with 95% confidence.

my_new_data %>%
  ggplot(mapping = aes(x =temp_max, y = humid_day)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Consider a transformation for any explanatory variable, and illustrate why you need the transformation

ggplot(my_new_data, aes(x = temp_max, y = humid_day)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Scatterplot of temp_max vs. humid_day",
       x = "temp_max",
       y = "humid_day") +
  theme_minimal()

From the above scatterplot, we can determine that the line is curved representing a non linear relationship between temp_max and humid_day. If the line appears to be curved or if the relationship is not adequately captured by a straight line, a transformation may be needed. I will be using “Log Transformation”

my_new_data <- my_new_data %>%
  mutate(log_temp_max = log(temp_max))

model <- lm(humid_day ~ log_temp_max, data = my_new_data)

rsquared <- summary(model)$r.squared

my_new_data %>%
  ggplot(mapping = aes(x = log_temp_max, y = humid_day)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  labs(
    title = "Relationship between log_temp_max and humid_day",
    subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))
  ) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'