Load Necessary Libraries

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.5.3
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
setwd("C:/Users/Aniruddha/Downloads")
Climate<-read.csv("Indian_Climate_Dataset_2024_2025.csv")
View(Climate)

Change column names to match the new format

colnames(Climate)<-c("Date","City","State","Max_temp_in_C","Min_Temp_in_C","Avg_temp_in_C","Humidity","Rainfall","Wind_Speed","AQI","AQI_Category","Pressure","Cloud_Cover")
colnames(Climate)
##  [1] "Date"          "City"          "State"         "Max_temp_in_C"
##  [5] "Min_Temp_in_C" "Avg_temp_in_C" "Humidity"      "Rainfall"     
##  [9] "Wind_Speed"    "AQI"           "AQI_Category"  "Pressure"     
## [13] "Cloud_Cover"

Level 1: Understanding the Data (Basic Exploration)

Question 1.1: What is the structure of the dataset(number of rows, columns, and data types)?

str(Climate)
## 'data.frame':    7310 obs. of  13 variables:
##  $ Date         : chr  "2024-01-01" "2024-01-01" "2024-01-01" "2024-01-01" ...
##  $ City         : chr  "Mumbai" "Delhi" "Bengaluru" "Chennai" ...
##  $ State        : chr  "Maharashtra" "Delhi" "Karnataka" "Tamil Nadu" ...
##  $ Max_temp_in_C: num  32.5 25.4 37.2 37.2 27.4 44.4 32.8 40.4 42.3 27.4 ...
##  $ Min_Temp_in_C: num  18 10.7 30.8 30.4 17.5 31.6 25.1 33.5 31 15.3 ...
##  $ Avg_temp_in_C: num  25.2 18.1 34 33.8 22.5 38 28.9 37 36.6 21.3 ...
##  $ Humidity     : num  77.6 84.1 49 34.2 32.2 91.1 83.9 30.4 51.5 79.5 ...
##  $ Rainfall     : num  0 0 3.7 9.5 9.1 51.9 0 8.2 0 0 ...
##  $ Wind_Speed   : num  3.3 9 6.6 9 9.2 4 5.2 19.7 18.8 14 ...
##  $ AQI          : int  259 130 54 176 97 140 345 75 186 73 ...
##  $ AQI_Category : chr  "Poor" "Moderate" "Satisfactory" "Moderate" ...
##  $ Pressure     : num  1020 1008 1008 993 1008 ...
##  $ Cloud_Cover  : num  62.1 46 61.3 70 56.9 9.3 12.1 39.1 89.3 7.4 ...

The dataset has 7,310 observations of 13 variables. Date, City, State, and AQI_Category are stored as character; AQI is an integer; the remaining eight columns (temperatures, humidity, rainfall, wind speed, pressure, cloud cover) are numeric. Date is read in as character, so it must be converted with as.Date() before any time-based grouping.

Question 1.2: Are there any missing values in temperature, CO₂, or region columns?

colSums(is.na(Climate))
##          Date          City         State Max_temp_in_C Min_Temp_in_C 
##             0             0             0             0             0 
## Avg_temp_in_C      Humidity      Rainfall    Wind_Speed           AQI 
##             0             0             0             0             0 
##  AQI_Category      Pressure   Cloud_Cover 
##             0             0             0

Every column reports zero missing values, so no imputation or row removal is needed before analysis. Had gaps been present, they could bias the group means computed later and would need handling first.
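As a minimal sketch of what the same check reports when gaps do exist, the following uses a small made-up data frame. The object `toy` and all of its values are hypothetical and are not taken from the climate dataset; dropping incomplete rows is only one of several possible handling strategies.

```r
# Hypothetical toy data with two gaps; not taken from the climate dataset
toy <- data.frame(
  City = c("A", "B", "C", "D"),
  Avg_temp_in_C = c(25.2, NA, 34.0, 33.8),
  AQI = c(259L, 130L, NA, 176L)
)

# Count missing values per column, using the same call as the check above
na_counts <- colSums(is.na(toy))
print(na_counts)

# One simple handling option: keep only rows with no missing values
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Here two of the four rows contain a gap, so only two complete rows remain.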

Question 1.3: What is the average temperature for each year across all regions?

Climate$Date <- as.Date(Climate$Date)
Climate$Year <- year(Climate$Date)
Climate %>% group_by(Year) %>% 
  summarise(avg_temp=mean(Avg_temp_in_C,na.rm=TRUE))

The average temperature per year shows the overall climate trend. A higher average indicates warmer conditions, while fluctuations suggest variability in climate patterns over time.

Level 2: Data Extraction & Filtering

Question 2.1: Which are the top 10 regions with the highest average temperature increase?

avg_city_temp <- aggregate(Avg_temp_in_C ~ City, data = Climate, mean, na.rm = TRUE)
top_10_cities_temp <- avg_city_temp[order(-avg_city_temp$Avg_temp_in_C), ][1:10, ]
top_10_cities_temp

The top 10 cities with the highest average temperature represent regions experiencing consistently warmer conditions. These cities are more likely to face heat stress and may require better climate adaptation strategies.

Question 2.2: Which 5 regions recorded the highest temperature in any given year?

highest_city_temp<-aggregate(Avg_temp_in_C ~ City,data=Climate,max,na.rm=TRUE)
top_5_cities_maxtemp<-highest_city_temp[order(-highest_city_temp$Avg_temp_in_C),][1:5,]
top_5_cities_maxtemp

The cities with the highest recorded temperature indicate extreme weather events. These extreme values highlight regions vulnerable to heatwaves and climate risks.

Question 2.3: Find the top 10 Cities with the highest CO₂ emissions.

# AQI is used as a stand-in for emissions; the dataset has no CO₂ column
emissions<-aggregate(AQI~City,data=Climate,mean,na.rm=TRUE)
top_10_city_emissions<-emissions[order(-emissions$AQI),][1:10,]
top_10_city_emissions

Cities with the highest AQI levels indicate poor air quality. This suggests environmental stress and potential health risks, which contribute to overall climate acclimatization challenges.

Level 3: Grouping & Summarization

Question 3.1: Determine the month with the highest average temperature rise.

Climate$Month<-month(Climate$Date,label=TRUE)
Climate %>% group_by(Month) %>% 
  summarise(avg_temp_month=mean(Avg_temp_in_C,na.rm=TRUE)) %>% 
  arrange(desc(avg_temp_month))

The analysis of monthly temperature identifies the hottest month. This helps in understanding seasonal climate patterns and planning for extreme weather conditions.

Question 3.2: Identify regions with the most extreme climate fluctuations (highest variance in temperature).

# na.rm must be passed to var() itself, not to summarise()
Climate %>% group_by(City) %>% 
  summarise(temp_variance=var(Avg_temp_in_C,na.rm=TRUE)) %>% 
  arrange(desc(temp_variance))

Cities with high temperature variability experience frequent climate fluctuations. This indicates unstable weather conditions, making acclimatization more difficult for residents.
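The question asks for variance, while standard deviation is often easier to read because it is in °C. A short sketch on made-up data (the `toy` frame and its values are hypothetical) shows that both statistics rank groups identically, since the standard deviation is the square root of the variance. Base R `tapply()` is used here only to keep the example free of package dependencies.

```r
# Hypothetical readings for two cities; values are made up for illustration
toy <- data.frame(
  City = rep(c("A", "B"), each = 4),
  Avg_temp_in_C = c(20, 30, 20, 30, 24, 26, 24, 26)
)

# Per-city variance and standard deviation of temperature
temp_var <- tapply(toy$Avg_temp_in_C, toy$City, var, na.rm = TRUE)
temp_sd <- tapply(toy$Avg_temp_in_C, toy$City, sd, na.rm = TRUE)

# Sorting by either statistic gives the same order of cities
print(sort(temp_var, decreasing = TRUE))
print(sort(temp_sd, decreasing = TRUE))
```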

Level 4: Sorting & Ranking Data

Question 4.1: Rank cities based on average temperature increase (highest to lowest).

# Group by City only, so the rank compares cities with each other
# (grouping by City and Year would rank years within each city instead)
Climate %>% group_by(City) %>% 
  summarise(avg_temp=mean(Avg_temp_in_C,na.rm=TRUE)) %>% 
  mutate(rank=row_number(desc(avg_temp))) %>% 
  arrange(rank)

Ranking cities based on average temperature helps identify the hottest regions. Higher-ranked cities face greater heat exposure and may require more adaptive measures.

Question 4.2: Find the hottest year and rank them.

Climate %>% group_by(Year) %>%
  summarise(max_temp_year = mean(Avg_temp_in_C, na.rm = TRUE)) %>%
  mutate(rank = row_number(desc(max_temp_year))) %>%
  arrange(rank)

Ranking years by mean temperature identifies the hottest year in the dataset. Because the data appear to cover only 2024 and 2025, this is a comparison of two years and cannot establish a long-term warming pattern.

Question 4.3: Identify the year with the highest CO₂ emission.

# AQI is used as a stand-in for emissions; the dataset has no CO₂ column
Climate %>% group_by(Year) %>% 
  summarise(highest_aqi=max(AQI,na.rm=TRUE)) %>% 
  mutate(rank=dense_rank(desc(highest_aqi))) %>% 
  arrange(rank)

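Ranking years by their peak AQI identifies the year with the worst recorded air quality. AQI stands in for CO₂ emissions here because the dataset has no emissions column, and a single maximum reflects one extreme day rather than the year as a whole.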

Level 5: Feature Engineering (Creating New Insights)

Question 5.1: Create a new column for “Temperature Anomaly” (difference from global mean).

Climate %>% 
  mutate(temp_anomaly=Avg_temp_in_C-mean(Avg_temp_in_C,na.rm=TRUE))

The temperature anomaly shows how much a city’s temperature deviates from the overall average. Positive values indicate hotter-than-average conditions, while negative values indicate cooler conditions.
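The following sketch works the anomaly through on made-up numbers and adds one possible variant: an anomaly measured against each city's own mean rather than the overall mean. The `toy` frame, its values, and the `city_anomaly` column are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical readings; values are made up for illustration
toy <- data.frame(
  City = c("A", "A", "B", "B"),
  Avg_temp_in_C = c(30, 34, 20, 24)
)

# Anomaly relative to the overall mean (27 for these values)
toy$temp_anomaly <- toy$Avg_temp_in_C - mean(toy$Avg_temp_in_C, na.rm = TRUE)

# Variant: anomaly relative to each city's own mean (32 for A, 22 for B)
toy$city_anomaly <- toy$Avg_temp_in_C - ave(toy$Avg_temp_in_C, toy$City)

print(toy)
```

The overall anomaly separates warm cities from cool ones, while the per-city variant highlights unusually warm or cool days within a city.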

Question 5.2: Calculate “Acclimatization Index” (Temperature deviation over time per region).

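# scale() returns a one-column matrix, so convert to plain vectors before adding
Climate %>% 
  mutate(acclimatization_index=as.numeric(scale(Avg_temp_in_C))+as.numeric(scale(AQI)))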

The acclimatization index combines temperature and AQI to measure environmental stress. Higher values indicate regions where both heat and pollution levels are high, making adaptation more challenging.
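To make the index concrete, here is a small worked sketch on three made-up observations (the `toy` frame and its values are hypothetical). Each variable is converted to a z-score, that is, its distance from the mean in units of standard deviation, and the two z-scores are added.

```r
# Hypothetical observations; values are made up for illustration
toy <- data.frame(
  Avg_temp_in_C = c(20, 25, 30),
  AQI = c(300, 100, 200)
)

# scale() returns a one-column matrix, so convert to plain numeric vectors
z_temp <- as.numeric(scale(toy$Avg_temp_in_C))  # -1, 0, 1
z_aqi <- as.numeric(scale(toy$AQI))             #  1, -1, 0

# Equal-weight sum of the two z-scores
toy$index <- z_temp + z_aqi
print(toy)
```

The first row is cool but polluted, so the two z-scores cancel to an index of 0; equal weighting of heat and pollution is a design choice, not a standard definition.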

Question 5.3: Compute the percentage of years with extreme temperature (> threshold) per region.

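# Threshold: overall mean temperature; days above it are counted as "extreme"
threshold<-mean(Climate$Avg_temp_in_C,na.rm=TRUE)
# Group by City so the percentage is reported per region, as the question asks
Climate %>% group_by(City) %>% 
  summarise(
    extreme_days=sum(Avg_temp_in_C>threshold,na.rm=TRUE),
    total_days=n(),
    percentage=(extreme_days/total_days)*100
  )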

The percentage of extreme temperature days shows how frequently a region experiences unusually high temperatures. A higher percentage indicates increased climate risk.
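Using the overall mean as the threshold flags roughly half of all days as extreme. A stricter alternative, sketched below on simulated data, is to take a high percentile as the cut-off. The 90th percentile, the `toy` frame, and its simulated values are all assumptions chosen for illustration.

```r
# Simulated readings for two hypothetical cities; not the climate dataset
set.seed(1)
toy <- data.frame(
  City = rep(c("A", "B"), each = 50),
  Avg_temp_in_C = c(rnorm(50, mean = 32, sd = 3), rnorm(50, mean = 26, sd = 3))
)

# Threshold: 90th percentile of all readings (an assumed choice)
threshold <- unname(quantile(toy$Avg_temp_in_C, probs = 0.9, na.rm = TRUE))

# Share of days above the threshold, per city, as a percentage
pct_extreme <- tapply(toy$Avg_temp_in_C > threshold, toy$City, mean) * 100
print(round(pct_extreme, 1))
```

With a percentile threshold, about 10% of all days are flagged overall, and the per-city percentages show where those days concentrate.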

Data Visualization

V1 Bar Chart: Compare average temperature across different continents.

city_avg <- aggregate(Avg_temp_in_C ~ City, data = Climate, mean, na.rm = TRUE)
ggplot(city_avg, aes(x = City, y = Avg_temp_in_C, fill = City)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Average Temperature by City",
       x = "City",
       y = "Temperature (°C)")

The bar chart compares average temperatures across cities. It clearly shows which cities are warmer and helps identify regional climate differences.

V2 Histogram: Visualize distribution of temperature anomalies.

ggplot(Climate, aes(x = Avg_temp_in_C)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 30) +
  labs(title = "Temperature Distribution", x = "Temperature (°C)")

The histogram shows how temperature values are distributed. It helps identify whether temperatures are concentrated around a range or widely spread.

V3 Pie Chart: Show proportion of global CO₂ emissions by region.

aqi_table <- table(Climate$AQI_Category)

pie(aqi_table,
    col = rainbow(length(aqi_table)),
    main = "AQI Category Distribution")

The pie chart represents the proportion of AQI categories. It shows the dominance of certain air quality levels across the dataset.

V4 Scatter Plot / Pair Plot: Explore relationship between CO₂ emissions and temperature.

ggplot(Climate, aes(x = AQI, y = Avg_temp_in_C, color = City)) +
  geom_point(size = 3) +
  labs(title = "AQI vs Temperature")

The scatter plot shows the relationship between temperature and AQI. A visible trend indicates whether higher temperatures are associated with poor air quality.

V5 Line Chart: Is there a noticeable increase in global temperature over the years (e.g., 1900–2020)?

# linewidth replaces the size argument for lines (deprecated since ggplot2 3.4.0)
ggplot(Climate, aes(x = Date, y = Avg_temp_in_C, color = City)) +
  geom_line(linewidth = 1) +
  labs(title = "Temperature Trend Over Time")

The line chart shows how temperature changes over time. It helps identify trends such as increasing or decreasing temperature patterns.

V6 Boxplot: Visualize CO₂ emission distribution across countries.

# Set3 offers up to 12 colours; Pastel1 has only 9, too few for the 10 cities
ggplot(Climate, aes(x = City, y = Avg_temp_in_C, fill = City)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Temperature Variation by City") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set3")

The boxplot shows variation and spread of temperature in each city. It highlights median values, outliers, and variability.

Advanced Analysis (Statistical & ML)

1.1 ANOVA 1 Analyze the effect of region on temperature variation.

anova_model <- aov(Avg_temp_in_C ~ City, data = Climate)
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)
## City           9    305    33.9   0.952  0.478
## Residuals   7300 259906    35.6

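ANOVA tests whether mean temperature differs between cities. If the p-value were below 0.05, at least one city would differ significantly from the others. Here F = 0.952 and p = 0.478, so the differences between city means are not statistically significant, and this dataset provides no evidence that location affects average temperature.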
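The F and p values can also be read from the fitted object rather than from the printed table. The sketch below does this on simulated data in which all groups share the same true mean; the `toy` frame and its values are hypothetical.

```r
# Simulated data: three hypothetical cities drawn from the same distribution
set.seed(42)
toy <- data.frame(
  City = rep(c("A", "B", "C"), each = 30),
  Avg_temp_in_C = rnorm(90, mean = 30, sd = 6)
)

fit <- aov(Avg_temp_in_C ~ City, data = toy)

# summary() returns a list; its first element is the ANOVA table
tab <- summary(fit)[[1]]
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]

cat("F =", round(f_value, 3), " p =", round(p_value, 3), "\n")
if (p_value < 0.05) {
  cat("At least one city mean differs significantly.\n")
} else {
  cat("No significant difference between city means.\n")
}
```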

1.2 ANOVA 2 Analyze the effect of year on global temperature.

anova_year <- aov(Avg_temp_in_C ~ Year, data = Climate)
summary(anova_year)
##               Df Sum Sq Mean Sq F value Pr(>F)
## Year           1     20   19.59    0.55  0.458
## Residuals   7308 260192   35.60

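This analysis checks whether mean temperature differs across years. With Year entered as a numeric predictor, the test is equivalent to the simple linear regression below. The result (F = 0.55, p = 0.458) shows no significant difference in mean temperature between years.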

1.3 Simple Linear Regression Study relationship between Year and Temperature increase.

model1 <- lm(Avg_temp_in_C ~ Year, data = Climate)
summary(model1)
## 
## Call:
## lm(formula = Avg_temp_in_C ~ Year, data = Climate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4320  -4.9320  -0.0285   4.9715  12.3715 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -179.6165   282.5770  -0.636    0.525
## Year           0.1035     0.1396   0.742    0.458
## 
## Residual standard error: 5.967 on 7308 degrees of freedom
## Multiple R-squared:  7.528e-05,  Adjusted R-squared:  -6.155e-05 
## F-statistic: 0.5502 on 1 and 7308 DF,  p-value: 0.4583

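The fitted slope is 0.1035 °C per year, but it is not statistically significant (p = 0.458) and R² is effectively zero (7.5e-05), so the model gives no evidence of a temperature trend. Since the data appear to span only 2024 and 2025, the slope is simply the difference between two yearly means.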
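For reporting, the slope, its p-value, and R² can be pulled from the model object directly. The sketch below uses five made-up yearly means (the `toy` frame and its values are hypothetical) and shows what a clearly significant trend would look like, for contrast.

```r
# Hypothetical yearly mean temperatures; values are made up for illustration
toy <- data.frame(
  Year = 2016:2020,
  Avg_temp_in_C = c(25.0, 25.3, 25.4, 25.8, 26.0)
)

fit <- lm(Avg_temp_in_C ~ Year, data = toy)

# Coefficient table: one row per term, columns for estimate, SE, t, and p
coefs <- coef(summary(fit))
slope <- coefs["Year", "Estimate"]
slope_p <- coefs["Year", "Pr(>|t|)"]
r2 <- summary(fit)$r.squared

cat("slope =", round(slope, 3), " p =", signif(slope_p, 3),
    " R-squared =", round(r2, 3), "\n")
```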

1.4 Multiple Linear Regression 1 Predict temperature using: CO₂ emissions Population Industrial activity

model2 <- lm(Avg_temp_in_C ~ AQI + Humidity + Wind_Speed, data = Climate)
summary(model2)
## 
## Call:
## lm(formula = Avg_temp_in_C ~ AQI + Humidity + Wind_Speed, data = Climate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3974  -4.9276   0.0079   4.9830  12.4393 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.9843297  0.3236923  92.632   <2e-16 ***
## AQI          0.0007748  0.0007826   0.990    0.322    
## Humidity     0.0012128  0.0037362   0.325    0.745    
## Wind_Speed  -0.0170267  0.0106349  -1.601    0.109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.966 on 7306 degrees of freedom
## Multiple R-squared:  0.000504,   Adjusted R-squared:  9.358e-05 
## F-statistic: 1.228 on 3 and 7306 DF,  p-value: 0.2977

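None of the three predictors is significant at the 0.05 level (AQI p = 0.322, Humidity p = 0.745, Wind_Speed p = 0.109), and the model explains about 0.05% of the variance in temperature (R² = 0.000504; overall F-test p = 0.2977). In this dataset, AQI, humidity, and wind speed do not help predict average temperature.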

1.5 Correlation Find correlation between: Temperature & CO₂ Temperature & sea level rise

cor(Climate$Avg_temp_in_C, Climate$AQI, use = "complete.obs")
## [1] 0.01175204

Correlation measures the strength of the linear relationship between two variables, on a scale from -1 to 1. The coefficient of 0.0118 is very close to zero, indicating essentially no linear relationship between average temperature and AQI in this dataset.
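`cor()` returns only the coefficient. When a p-value and confidence interval are also wanted, base R's `cor.test()` provides them. The sketch below runs it on simulated, independent variables; the `toy` frame and its values are hypothetical.

```r
# Simulated data: temperature and AQI generated independently of each other
set.seed(7)
toy <- data.frame(AQI = runif(200, min = 50, max = 350))
toy$Avg_temp_in_C <- rnorm(200, mean = 30, sd = 6)

# Pearson correlation test: coefficient, p-value, and 95% confidence interval
ct <- cor.test(toy$Avg_temp_in_C, toy$AQI)
r <- unname(ct$estimate)
p <- ct$p.value
ci <- as.numeric(ct$conf.int)

cat("r =", round(r, 3), " p =", round(p, 3),
    " 95% CI = [", round(ci[1], 3), ",", round(ci[2], 3), "]\n")
```

A confidence interval that includes zero is consistent with no linear association.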

1.6 Correlation + Regression How does AQI impact temperature?

library(ggplot2)

ggplot(Climate, aes(x = AQI, y = Avg_temp_in_C)) +
  geom_point(color = "blue", size = 2) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  theme_minimal() +
  labs(title = "Relationship between AQI and Temperature",
       x = "AQI",
       y = "Temperature (°C)")
## `geom_smooth()` using formula = 'y ~ x'

The correlation of 0.012 computed above is positive but negligible, which is why the fitted line in the plot is nearly flat. The AQI coefficient in the multiple regression above is also not significant (p = 0.322), so these data do not support a measurable association between AQI and temperature. Even a significant slope would show association only, not that AQI causes temperature to change.

Machine Learning Techniques

K-Means Clustering Identify clusters of countries based on: Temperature trends CO₂ emissions Climate risk

set.seed(123)
cluster_data <- Climate %>%
  select(Avg_temp_in_C, AQI)
kmeans_result <- kmeans(scale(cluster_data), centers = 3)
Climate$Cluster <- kmeans_result$cluster

library(ggplot2)

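# Treat the cluster label as a factor so it gets a discrete colour scale
ggplot(Climate, aes(x = Avg_temp_in_C, y = AQI, color = factor(Cluster))) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "K-Means Clusters of Daily Observations",
       x = "Temperature (°C)",
       y = "AQI",
       color = "Cluster")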

Clustering here groups daily observations, not cities, by similarity in scaled temperature and AQI. Each of the three clusters collects city-days with comparable conditions, and a single city can appear in more than one cluster.
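The choice of three clusters can be checked with the elbow heuristic: fit k-means for several values of k and see where the total within-cluster sum of squares stops dropping sharply. The sketch below uses simulated data with three well-separated groups; the `toy` frame, its values, and the range of k are assumptions for illustration.

```r
# Simulated data with three distinct groups; not the climate dataset
set.seed(123)
toy <- data.frame(
  Avg_temp_in_C = c(rnorm(50, 22, 1.5), rnorm(50, 30, 1.5), rnorm(50, 38, 1.5)),
  AQI = c(rnorm(50, 80, 15), rnorm(50, 180, 15), rnorm(50, 300, 15))
)
scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6
# nstart = 10 reruns each fit from several random starts and keeps the best
wss <- sapply(1:6, function(k) {
  kmeans(scaled, centers = k, nstart = 10)$tot.withinss
})
print(round(wss, 1))
```

Plotting `wss` against k and looking for the bend gives a rough guide; on data without clear groups, such as two nearly uncorrelated variables, the curve may show no obvious elbow.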

KNN Classification Predict whether a region is: Low risk Moderate risk High climate risk based on climate indicators

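library(class)
# Label each observation by AQI: above 150 is "High" risk, otherwise "Low"
Climate$Category <- factor(ifelse(Climate$AQI > 150, "High", "Low"))
# Scale the features so AQI's larger range does not dominate the distance
features <- scale(Climate[, c("Avg_temp_in_C", "AQI")])
# Random 70/30 split; the first 150 rows would cover only the earliest dates
set.seed(123)
train_idx <- sample(nrow(Climate), size = floor(0.7 * nrow(Climate)))
train <- Climate[train_idx, ]
test <- Climate[-train_idx, ]

knn_pred <- knn(
  train = features[train_idx, ],
  test = features[-train_idx, ],
  cl = train$Category,
  k = 3
)
test$Predicted <- knn_pred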

ggplot(test, aes(x = Avg_temp_in_C, y = AQI, color = Predicted)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "KNN Classification (Climate Risk)",
       x = "Temperature (°C)",
       y = "AQI")

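KNN predicts the High/Low category from temperature and AQI. Because the category is itself defined by an AQI threshold of 150, the classifier largely recovers that threshold rather than learning an independent risk pattern, and only two of the three risk levels named in the question are modelled.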
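A fairer evaluation would predict the category from features other than AQI and report a confusion matrix on held-out data. The sketch below does this on simulated data; the `toy` frame, the use of Humidity as a second feature, and the 70/30 split are assumptions for illustration, and the `class` package is assumed to be installed as in the analysis above.

```r
library(class)

# Simulated observations; not the climate dataset
set.seed(123)
n <- 200
toy <- data.frame(
  Avg_temp_in_C = rnorm(n, mean = 30, sd = 6),
  Humidity = runif(n, min = 30, max = 90),
  AQI = runif(n, min = 50, max = 350)
)
toy$Category <- factor(ifelse(toy$AQI > 150, "High", "Low"))

# Features exclude AQI, since the label is derived from it
features <- scale(toy[, c("Avg_temp_in_C", "Humidity")])

# Random 70/30 train/test split
train_idx <- sample(n, size = 140)
pred <- knn(
  train = features[train_idx, ],
  test = features[-train_idx, ],
  cl = toy$Category[train_idx],
  k = 3
)

# Confusion matrix and accuracy on the held-out rows
conf_mat <- table(Predicted = pred, Actual = toy$Category[-train_idx])
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
cat("Accuracy:", round(accuracy, 3), "\n")
```

Since the simulated features carry no information about AQI, accuracy here should sit near the share of the majority class; the same pattern on the real data would indicate that temperature and humidity do not predict the AQI category.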

The analysis finds little evidence of systematic differences in this dataset. Mean temperature does not differ significantly across the ten cities (ANOVA, p = 0.478) or between years (p = 0.458), and temperature is essentially uncorrelated with AQI (r ≈ 0.012). The rankings, clustering, and KNN results are therefore best read as descriptive summaries of the 2024-2025 records rather than as evidence of climate acclimatization trends. A longer time span and additional predictors would be needed before these findings could guide regional adaptation planning.