library(readr)
library(stats)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ stringr 1.5.0
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
my_data <- read_delim("C:/Users/user/Documents/Statistics/Telangana_2018_complete_weather_data.csv",delim=",")
## Rows: 230384 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): District, Mandal, Location, Date
## dbl (6): row_id, temp_min, temp_max, humidity_min, humidity_max, wind_speed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I created a new column “humid_average” by doing average of “humidity_min” and “humidity_max”
my_new_data<- my_data %>%
mutate(humid_average = (humidity_min + humidity_max) / 2)
converting “humid_average” column into a binary column”humid_day”
my_new_data$humid_day <- ifelse(my_new_data$humid_average > 61.20, 1, 0)
Interpret the coefficients, and explain what they mean in your notebook
(Bonus) Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and interpret its meaning
Explanatory variable = temp_max
model <- glm(humid_day ~ temp_max, data = my_new_data, family = binomial(link = "logit"))
summary(model)
##
## Call:
## glm(formula = humid_day ~ temp_max, family = binomial(link = "logit"),
## data = my_new_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.742469 0.050715 211.8 <2e-16 ***
## temp_max -0.312727 0.001464 -213.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 318966 on 230383 degrees of freedom
## Residual deviance: 253419 on 230382 degrees of freedom
## AIC: 253423
##
## Number of Fisher Scoring iterations: 4
Intercept :
Estimate: 10.742469,Std. Error: 0.050715,z value: 211.8,Pr(>|z|): <2e-16
The intercept represents the log-odds of the response variable humid_day being 1 when temp_max is equal to 0. In this context, the extremely positive intercept value suggests that when temp_max is zero, the log-odds of a day being “humid” are significantly positive. This implies that when the maximum temperature is very low, the model predicts a high likelihood of a “humid” day.
Coefficient for temp_max:
Estimate: -0.312727,Std. Error: 0.001464,z value: -213.6,Pr(>|z|): <2e-16
The coefficient for temp_max represents how the log-odds of the response variable humid_day being 1 change for a one-unit increase in temp_max.An estimate of -0.312727 means that for every one-unit increase in temp_max, the log-odds of the day being “humid” decrease by approximately 0.312727 units.The extremely high z value and very low p-value indicate that this coefficient is highly significant. This suggests that the variable temp_max has a strong impact on predicting whether a day is “humid” or not.
ci <- confint(model, parm = "temp_max", level = 0.95)
## Waiting for profiling to be done...
coef_temp_max <- coef(model)["temp_max"]
se_temp_max <- sqrt(vcov(model)["temp_max", "temp_max"])
lower_bound <- coef_temp_max - 1.96 * se_temp_max
upper_bound <- coef_temp_max + 1.96 * se_temp_max
cat("Coefficient (temp_max):", coef_temp_max, "\n")
## Coefficient (temp_max): -0.3127266
cat("Standard Error (SE):", se_temp_max, "\n")
## Standard Error (SE): 0.001463881
cat("95% Confidence Interval (CI): [", lower_bound, ",", upper_bound, "]\n")
## 95% Confidence Interval (CI): [ -0.3155958 , -0.3098574 ]
In this context, The standard error is 0.001463881 which measures the variability and uncertainty in the coefficient estimate. A smaller standard error suggests a more precise estimate.
Confidence Interval (CI): The 95% confidence interval for the coefficient of temp_max is approximately [-0.3155958, -0.3098574]. This represents the range of values within which we can reasonably expect the true population coefficient to lie, with 95% confidence.
my_new_data %>%
ggplot(mapping = aes(x =temp_max, y = humid_day)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_smooth(method = 'lm', se = FALSE) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(my_new_data, aes(x = temp_max, y = humid_day)) +
geom_point() +
geom_smooth(method = "glm", formula = y ~ poly(x, 2), se = FALSE) +
labs(title = "Scatterplot of temp_max vs. humid_day",
x = "temp_max",
y = "humid_day") +
theme_minimal()
From the above scatterplot, we can determine that the line is curved representing a non linear relationship between temp_max and humid_day. If the line appears to be curved or if the relationship is not adequately captured by a straight line, a transformation may be needed. I will be using “Log Transformation”
my_new_data <- my_new_data %>%
mutate(log_temp_max = log(temp_max))
model <- lm(humid_day ~ log_temp_max, data = my_new_data)
rsquared <- summary(model)$r.squared
my_new_data %>%
ggplot(mapping = aes(x = log_temp_max, y = humid_day)) +
geom_point() +
geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
labs(
title = "Relationship between log_temp_max and humid_day",
subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))
) +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'