library(readr)
library(stats)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ stringr 1.5.0
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
my_data <- read_delim("C:/Users/user/Documents/Statistics/Telangana_2018_complete_weather_data.csv",delim=",")
## Rows: 230384 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): District, Mandal, Location, Date
## dbl (6): row_id, temp_min, temp_max, humidity_min, humidity_max, wind_speed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I created a new column, “humid_average”, as the average of “humidity_min” and “humidity_max”.
my_new_data <- my_data %>%
  mutate(humid_average = (humidity_min + humidity_max) / 2)
Next, I converted the “humid_average” column into a binary column, “humid_day”, coded 1 when the average humidity exceeds 61.20 and 0 otherwise.
my_new_data$humid_day <- ifelse(my_new_data$humid_average > 61.20, 1, 0)
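As a quick sanity check on the coding step, the same rule can be tried on a few made-up values. The three humidity pairs below are hypothetical and are not taken from the Telangana data; only the 61.20 cutoff comes from the code above.

```r
# Hypothetical humidity readings (not from the dataset), for illustration only
toy <- data.frame(
  humidity_min = c(40, 55, 30),
  humidity_max = c(80, 90, 60)
)

# Same two steps as above: average the two readings, then threshold at 61.20
toy$humid_average <- (toy$humidity_min + toy$humidity_max) / 2
toy$humid_day <- ifelse(toy$humid_average > 61.20, 1, 0)

print(toy)

# Share of rows coded as humid days
mean(toy$humid_day)
```

On the real data, `table(my_new_data$humid_day)` would show how many rows fall into each class, which is worth knowing before fitting a model to the binary column.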
Explanatory variable = temp_max
Response Variable = humid_day
model <- glm(humid_day ~ temp_max, data = my_new_data)
summary(model)
##
## Call:
## glm(formula = humid_day ~ temp_max, data = my_new_data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5775911 0.0075896 339.6 <2e-16 ***
## temp_max -0.0603846 0.0002168 -278.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1866987)
##
## Null deviance: 57493 on 230383 degrees of freedom
## Residual deviance: 43012 on 230382 degrees of freedom
## AIC: 267161
##
## Number of Fisher Scoring iterations: 2
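Intercept:
Estimate: 2.5775911, Std. Error: 0.0075896, t value: 339.6, Pr(>|t|): <2e-16
The intercept is the estimated mean of the response variable “humid_day” when the explanatory variable “temp_max” equals 0. Because glm() was called without a family argument, it used the default gaussian family (as the dispersion line in the output confirms), so the fitted value is a straight-line estimate of the proportion of humid days. An intercept of about 2.58 lies well outside the 0 to 1 range that a binary variable can take, so it should be read as an extrapolation of the fitted line to temp_max = 0 rather than as a meaningful prediction.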
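To make the intercept concrete, the fitted line can be evaluated by hand. This is a small sketch that uses only the two coefficients reported in the summary output; the temp_max values plugged in are arbitrary choices for illustration, not observations from the data.

```r
# Coefficients copied from the summary() output above
b0 <- 2.5775911   # intercept
b1 <- -0.0603846  # slope for temp_max

# Arbitrary temp_max values, chosen only to illustrate the fitted line
temp_max <- c(0, 20, 30, 40, 45)
fitted_value <- b0 + b1 * temp_max
print(data.frame(temp_max, fitted_value))

# temp_max values where the fitted line crosses 1 and 0
cross_one  <- (1 - b0) / b1
cross_zero <- (0 - b0) / b1
print(c(cross_one = cross_one, cross_zero = cross_zero))
```

The fitted line only stays between 0 and 1 for temp_max between roughly 26.1 and 42.7. Outside that interval the model returns values that cannot be interpreted as a proportion of humid days.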
Coefficient for temp_max:
Estimate: -0.0603846, Std. Error: 0.0002168, t value: -278.5, Pr(>|t|): <2e-16
The coefficient for temp_max is the change in the fitted value of “humid_day” for a one-unit increase in temp_max. An estimate of -0.0603846 means that each one-unit increase in temp_max lowers the fitted value by about 0.06; because humid_day is coded 0/1, that is roughly a 6 percentage point drop in the estimated share of humid days. The very large t value and very small p-value show that the coefficient is clearly distinguishable from zero, which is unsurprising with 230,384 rows. Significance alone does not measure the strength of the effect: the drop from a null deviance of 57493 to a residual deviance of 43012 indicates that temp_max accounts for about 25% of the variation in humid_day.
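Since humid_day is binary, a common alternative to the gaussian fit above is logistic regression, obtained by passing family = binomial to glm(). The sketch below is not the model fitted in this report: it runs on simulated data, and the sample size, temperature range, and the values 12 and -0.35 used to generate it are arbitrary assumptions made so the example is self-contained.

```r
set.seed(42)

# Simulated stand-in for the real data; all generating values are arbitrary
n <- 500
sim <- data.frame(temp_max = runif(n, min = 20, max = 45))
sim$humid_day <- rbinom(n, size = 1, prob = plogis(12 - 0.35 * sim$temp_max))

# Logistic regression: family = binomial models the log-odds of humid_day = 1
logit_model <- glm(humid_day ~ temp_max, family = binomial, data = sim)
print(summary(logit_model))

# Multiplicative change in the odds for a one-unit increase in temp_max
odds_ratio <- exp(coef(logit_model)[["temp_max"]])
print(odds_ratio)

# Predicted probabilities are always strictly between 0 and 1
new_temps <- data.frame(temp_max = c(20, 30, 40))
pred <- predict(logit_model, newdata = new_temps, type = "response")
print(pred)
```

With this family the coefficient is read on the log-odds scale, so the intercept no longer implies an impossible value for a 0/1 outcome. Applying it to the report's data would only require swapping `sim` for `my_new_data`.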
plot(model, which = 1)
plot(model, which = 2)
ggplot(my_new_data, aes(x = temp_max, y = humid_day)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(title = "Scatterplot of temp_max vs. humid_day",
       x = "temp_max",
       y = "humid_day") +
  theme_minimal()
In the scatterplot above, the fitted quadratic curve bends noticeably, which points to a non-linear relationship between temp_max and humid_day: as temp_max changes, humid_day does not change at a constant rate but follows a curved trajectory.
This non-linearity is the main issue with the model. When a straight-line model is fitted to data with a non-linear relationship, it may fail to capture the underlying pattern, leading to poor fit and inaccurate predictions.
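One way to check whether the curvature matters, beyond reading the plot, is to fit the straight-line and quadratic models side by side and compare them. The sketch below uses simulated data rather than the Telangana file, so the generating values (sample size, temperature range, 12 and -0.35) are assumptions and the printed numbers say nothing about the real data.

```r
set.seed(1)

# Simulated stand-in for the real data; all generating values are arbitrary
n <- 500
sim <- data.frame(temp_max = runif(n, min = 20, max = 45))
sim$humid_day <- rbinom(n, size = 1, prob = plogis(12 - 0.35 * sim$temp_max))

# Straight-line fit (same form as the model above) and a quadratic alternative
linear_fit <- glm(humid_day ~ temp_max, data = sim)
quad_fit   <- glm(humid_day ~ poly(temp_max, 2), data = sim)

# F test for the added quadratic term, and AIC for both models
print(anova(linear_fit, quad_fit, test = "F"))
aic_table <- AIC(linear_fit, quad_fit)
print(aic_table)
```

A small p-value in the F test, or a clearly lower AIC for the quadratic model, would support the reading of the scatterplot. The residual deviance of the quadratic fit can never exceed that of the linear fit, so deviance alone is not evidence of curvature.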