library(readr)
library(stats)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ stringr   1.5.0
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
my_data <- read_delim("C:/Users/user/Documents/Statistics/Telangana_2018_complete_weather_data.csv", delim = ",")
## Rows: 230384 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): District, Mandal, Location, Date
## dbl (6): row_id, temp_min, temp_max, humidity_min, humidity_max, wind_speed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I created a new column, “humid_average”, as the average of “humidity_min” and “humidity_max”.

my_new_data <- my_data %>%
  mutate(humid_average = (humidity_min + humidity_max) / 2)

Converting the “humid_average” column into a binary column, “humid_day” (1 if the average humidity is above 61.20, otherwise 0)

my_new_data$humid_day <- ifelse(my_new_data$humid_average > 61.20, 1, 0)
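A small check on toy values can make the two steps above concrete. The sketch below uses a hypothetical data frame `toy` (made-up numbers, not the Telangana data) and base R only; it also shows what happens when a humidity reading is missing.

```r
# Toy data (hypothetical values, not taken from the Telangana file)
toy <- data.frame(
  humidity_min = c(40, 55, 70, NA),
  humidity_max = c(60, 80, 90, 85)
)

# Same two steps as above: average, then threshold at 61.20
toy$humid_average <- (toy$humidity_min + toy$humidity_max) / 2
toy$humid_day <- ifelse(toy$humid_average > 61.20, 1, 0)

# Rows with a missing reading stay NA rather than being coded 0
print(toy)

# Class balance of the binary response, counting NAs explicitly
table(toy$humid_day, useNA = "ifany")
```

Looking at the class balance before modelling helps confirm that the threshold splits the data as intended.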

Build a linear (or generalized linear) model

Explanatory variable = temp_max

Response Variable = humid_day

model <- glm(humid_day ~ temp_max, data = my_new_data)
summary(model)
## 
## Call:
## glm(formula = humid_day ~ temp_max, data = my_new_data)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.5775911  0.0075896   339.6   <2e-16 ***
## temp_max    -0.0603846  0.0002168  -278.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1866987)
## 
##     Null deviance: 57493  on 230383  degrees of freedom
## Residual deviance: 43012  on 230382  degrees of freedom
## AIC: 267161
## 
## Number of Fisher Scoring iterations: 2
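The call above omits the `family` argument, so `glm()` uses its default gaussian family (the dispersion line in the output confirms this). For a 0/1 response, a binomial family is a common alternative. The following is only a sketch on simulated data: the sample size, temperature range, and true coefficients are all assumptions chosen for illustration, not estimates from the weather data.

```r
set.seed(1)

# Simulated stand-in for the real data (all values are assumptions)
n <- 2000
temp_max <- runif(n, 20, 45)
# True probability of a humid day falls as temp_max rises
p <- plogis(8 - 0.25 * temp_max)
humid_day <- rbinom(n, size = 1, prob = p)
sim <- data.frame(temp_max, humid_day)

# Default family is gaussian: a linear probability model
fit_lin <- glm(humid_day ~ temp_max, data = sim)

# Binomial family with logit link: logistic regression
fit_logit <- glm(humid_day ~ temp_max, data = sim, family = binomial)

# Coefficients of the logistic fit are on the log-odds scale
coef(fit_logit)

# Fitted probabilities from the logistic model always lie in [0, 1]
range(predict(fit_logit, type = "response"))

# The linear fit is not constrained to [0, 1]
range(predict(fit_lin))
```

With a binomial family, the temp_max coefficient would be read as a change in log-odds per unit of temp_max rather than as a change in the fitted value itself.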

Interpretation of the coefficients

Intercept :

Estimate: 2.5775911, Std. Error: 0.0075896, t value: 339.6, Pr(>|t|): <2e-16

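The intercept is the estimated mean of the response variable “humid_day” when the explanatory variable “temp_max” equals 0. Because humid_day is coded 0/1 and glm() was called without a family argument (so the default gaussian family was used, as the dispersion line in the output confirms), fitted values are read as predicted probabilities of a humid day. An intercept of about 2.58 is greater than 1, so it is not a valid probability; it is better treated as the anchor of the fitted line than as a meaningful prediction at temp_max = 0.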

Coefficient for temp_max:

Estimate: -0.0603846, Std. Error: 0.0002168, t value: -278.5, Pr(>|t|): <2e-16

The coefficient for temp_max is the change in the fitted value of “humid_day” for a one-unit increase in temp_max. An estimate of -0.0603846 means that each one-unit increase in temp_max lowers the predicted value of humid_day by about 0.060, that is, roughly 6 percentage points in the predicted probability of a humid day. The very large t value and very small p-value show that the coefficient is clearly different from zero, although with 230,384 rows even small effects would be statistically significant. The drop from the null deviance (57493) to the residual deviance (43012) shows that temp_max accounts for about 25% of the variation in humid_day.
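To see what these two estimates imply, the fitted line can be evaluated by hand. The sketch below copies the coefficients from the summary output; the temp_max values in `temps` are illustrative choices, not values taken from the data.

```r
# Coefficients copied from the summary output above
b0 <- 2.5775911
b1 <- -0.0603846

# Fitted value of humid_day at a few illustrative temp_max values
temps <- c(25, 30, 35, 40, 45)
fitted_vals <- b0 + b1 * temps
data.frame(temp_max = temps, fitted = round(fitted_vals, 3))

# temp_max at which the fitted line crosses 1 and 0
cross_one  <- (1 - b0) / b1
cross_zero <- (0 - b0) / b1
c(cross_one = cross_one, cross_zero = cross_zero)
```

The fitted line passes 1 at a temp_max of roughly 26 and 0 at roughly 43, so outside that interval the model returns values that cannot be read as probabilities.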

Diagnosing the model

plot(model, which = 1)

plot(model, which = 2)

ggplot(my_new_data, aes(x = temp_max, y = humid_day)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Scatterplot of temp_max vs. humid_day",
       x = "temp_max",
       y = "humid_day") +
  theme_minimal()

In the scatterplot above, the smoother fits a second-degree polynomial (poly(x, 2)), and the resulting line is visibly curved. This points to a non-linear relationship between temp_max and humid_day: as temp_max changes, humid_day does not change at a constant rate but follows a curved trajectory.

This non-linearity is the main issue with the model. When a linear regression model is fitted to data with a non-linear relationship, it may fail to capture the underlying pattern, which leads to a poor fit and inaccurate predictions.
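One way to check whether curvature matters is to compare a straight-line fit with a fit that adds a quadratic term. The sketch below does this on simulated data with a continuous response; the data-generating formula, noise level, and variable names are assumptions made for illustration, so the numbers say nothing about the weather data itself.

```r
set.seed(42)

# Simulated data with a curved relationship (values are assumptions)
n <- 1000
x <- runif(n, 20, 45)
y <- 1.2 - 0.002 * (x - 20)^2 + rnorm(n, sd = 0.1)
sim <- data.frame(x, y)

# Straight-line fit versus a fit with an added quadratic term
fit_linear <- glm(y ~ x, data = sim)
fit_quad   <- glm(y ~ poly(x, 2), data = sim)

# F-test for whether the quadratic term improves the fit
anova(fit_linear, fit_quad, test = "F")

# Lower AIC indicates the better trade-off between fit and complexity
AIC(fit_linear, fit_quad)
```

The same comparison could be run on my_new_data by replacing the simulated columns with temp_max and humid_day.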