| Matric | Full_Name |
|---|---|
| S2110638 | Karam Aljanadi |
| 23094825 | Siti Hairuneesha Binti Zainuddin |
| u2005486 | Afrina Rosa binti Mohamad Sattar |
| 23085637 | Chia Kai Swan |
| S2119226 | Muhammad Taufiq Bin Ismail |
Weather forecasting plays a pivotal role in understanding climatic patterns, preparing for extreme weather events, and optimizing resources across various sectors. This study focuses on Seoul’s weather over a 15-year period (2009–2023), utilizing historical weather data to uncover significant trends and patterns. By analyzing key attributes such as temperature, precipitation, wind speed, and cloud cover, we aim to gain insights into Seoul’s evolving climate and predict future weather conditions.
The dataset, titled “Seoul Historical Weather Data (1994–2024),” is available on Kaggle.
Research Objective:
Research Questions:
What weather trends can be identified in Seoul from 2009 to 2023?
How accurate are forecasting models for predicting future weather?
How can visualizations help urban planners use weather data for decision-making?
CRISP-DM
1. Understand: Define the problem, which is to analyze Seoul’s weather data for insights and predictions.
2. Explore: Load and examine the dataset, identifying key attributes.
3. Prepare: Clean and preprocess data for analysis.
4. Model: Build predictive models using statistical/ML methods.
5. Evaluate: Assess model performance using relevant metrics.
6. Deploy: Present findings, forecasts, and actionable recommendations.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(ggplot2)
library(rpart)
library(corrplot)
## corrplot 0.95 loaded
library(Metrics)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(e1071)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:Metrics':
##
## precision, recall
##
## The following object is masked from 'package:purrr':
##
## lift
library(rsconnect)
library(knitr)
df<-read_csv("dataset123.csv")
## Rows: 10958 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, datetime, preciptype, conditions, description, icon, stations
## dbl (25): ROW_KEY, tempmax, tempmin, temp, feelslikemax, feelslikemin, feel...
## dttm (2): sunrise, sunset
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show the number of rows in a more descriptive way
cat("The dataset contains", nrow(df), "rows.")
## The dataset contains 10958 rows.
str(df)
## spc_tbl_ [10,958 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ROW_KEY : num [1:10958] 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr [1:10958] "seoul" "seoul" "seoul" "seoul" ...
## $ datetime : chr [1:10958] "1/1/1994" "2/1/1994" "3/1/1994" "4/1/1994" ...
## $ tempmax : num [1:10958] 35.2 43 47.9 38.8 40 42.7 37 44.3 48.8 48 ...
## $ tempmin : num [1:10958] 16.4 31.5 30.9 22.1 24 26.1 17.7 NA 33.6 27.9 ...
## $ temp : num [1:10958] 26.3 36.2 38 30.1 33.1 35.5 27.5 34.2 41 39.7 ...
## $ feelslikemax : num [1:10958] 33.4 39.4 44.7 32 40 39.1 37 37.6 46.1 48 ...
## $ feelslikemin : num [1:10958] 13 26.7 24.5 18.4 18.5 15.7 13.7 12.6 29.2 27.9 ...
## $ feelslike : num [1:10958] 24.3 32.6 35.4 26.3 31 30.1 24.3 28.2 39.1 38.9 ...
## $ dew : num [1:10958] 15.5 27.9 27.3 13.6 21.7 25 11.3 24.3 34.3 32.4 ...
## $ humidity : num [1:10958] 65.9 72.1 68.1 51.2 63.9 66.7 52.1 67.7 77.4 75.8 ...
## $ precip : num [1:10958] 0 0 0 0 0.01 0.1 0 0 0 0 ...
## $ precipprob : num [1:10958] 0 0 0 0 100 100 0 0 0 0 ...
## $ precipcover : num [1:10958] 0 0 0 0 4.17 4.17 0 0 0 0 ...
## $ preciptype : chr [1:10958] NA NA NA NA ...
## $ snow : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ snowdepth : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ windgust : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ windspeed : num [1:10958] 5.5 9.1 12.8 11.7 6.8 11.8 9.1 14.7 8.9 4.5 ...
## $ winddir : num [1:10958] 115 182 290 302 134 ...
## $ sealevelpressure: num [1:10958] 1025 1022 1020 1025 1024 ...
## $ cloudcover : num [1:10958] 63 88.8 57.4 16.3 90.4 57.2 30.1 46.6 NA 81 ...
## $ visibility : num [1:10958] 6.6 6.9 5.3 7.3 6.2 5.1 7.4 7.6 3.7 2.2 ...
## $ solarradiation : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ solarenergy : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ uvindex : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ severerisk : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ sunrise : POSIXct[1:10958], format: "1994-01-01 07:46:54" "1994-01-02 07:47:03" ...
## $ sunset : POSIXct[1:10958], format: "1994-01-01 17:23:56" "1994-01-02 17:24:44" ...
## $ moonphase : num [1:10958] 0.61 0.65 0.68 0.72 0.75 0.79 0.83 0.86 0.9 0.93 ...
## $ conditions : chr [1:10958] "Partially cloudy" "Partially cloudy" "Partially cloudy" "Clear" ...
## $ description : chr [1:10958] "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Clear conditions throughout the day." ...
## $ icon : chr [1:10958] "partly-cloudy-day" "partly-cloudy-day" "partly-cloudy-day" "clear-day" ...
## $ stations : chr [1:10958] "4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000" "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,"| __truncated__ ...
## - attr(*, "spec")=
## .. cols(
## .. ROW_KEY = col_double(),
## .. name = col_character(),
## .. datetime = col_character(),
## .. tempmax = col_double(),
## .. tempmin = col_double(),
## .. temp = col_double(),
## .. feelslikemax = col_double(),
## .. feelslikemin = col_double(),
## .. feelslike = col_double(),
## .. dew = col_double(),
## .. humidity = col_double(),
## .. precip = col_double(),
## .. precipprob = col_double(),
## .. precipcover = col_double(),
## .. preciptype = col_character(),
## .. snow = col_double(),
## .. snowdepth = col_double(),
## .. windgust = col_double(),
## .. windspeed = col_double(),
## .. winddir = col_double(),
## .. sealevelpressure = col_double(),
## .. cloudcover = col_double(),
## .. visibility = col_double(),
## .. solarradiation = col_double(),
## .. solarenergy = col_double(),
## .. uvindex = col_double(),
## .. severerisk = col_double(),
## .. sunrise = col_datetime(format = ""),
## .. sunset = col_datetime(format = ""),
## .. moonphase = col_double(),
## .. conditions = col_character(),
## .. description = col_character(),
## .. icon = col_character(),
## .. stations = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
kable(head(df))
| ROW_KEY | name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | preciptype | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | seoul | 1/1/1994 | 35.2 | 16.4 | 26.3 | 33.4 | 13.0 | 24.3 | 15.5 | 65.9 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 5.5 | 115.4 | 1025.4 | 63.0 | 6.6 | NA | NA | NA | NA | 1994-01-01 07:46:54 | 1994-01-01 17:23:56 | 0.61 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 2 | seoul | 2/1/1994 | 43.0 | 31.5 | 36.2 | 39.4 | 26.7 | 32.6 | 27.9 | 72.1 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 9.1 | 181.7 | 1022.2 | 88.8 | 6.9 | NA | NA | NA | NA | 1994-01-02 07:47:03 | 1994-01-02 17:24:44 | 0.65 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 3 | seoul | 3/1/1994 | 47.9 | 30.9 | 38.0 | 44.7 | 24.5 | 35.4 | 27.3 | 68.1 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 12.8 | 289.5 | 1020.0 | 57.4 | 5.3 | NA | NA | NA | NA | 1994-01-03 07:47:11 | 1994-01-03 17:25:33 | 0.68 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 4 | seoul | 4/1/1994 | 38.8 | 22.1 | 30.1 | 32.0 | 18.4 | 26.3 | 13.6 | 51.2 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 11.7 | 301.9 | 1025.1 | 16.3 | 7.3 | NA | NA | NA | NA | 1994-01-04 07:47:16 | 1994-01-04 17:26:23 | 0.72 | Clear | Clear conditions throughout the day. | clear-day | 47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5 | seoul | 5/1/1994 | 40.0 | 24.0 | 33.1 | 40.0 | 18.5 | 31.0 | 21.7 | 63.9 | 0.01 | 100 | 4.17 | rain,snow | NA | NA | NA | 6.8 | 134.1 | 1023.9 | 90.4 | 6.2 | NA | NA | NA | NA | 1994-01-05 07:47:19 | 1994-01-05 17:27:14 | 0.75 | Snow, Rain, Overcast | Cloudy skies throughout the day with late afternoon rain or snow. | rain | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 6 | seoul | 6/1/1994 | 42.7 | 26.1 | 35.5 | 39.1 | 15.7 | 30.1 | 25.0 | 66.7 | 0.10 | 100 | 4.17 | rain | NA | NA | NA | 11.8 | 290.2 | 1021.9 | 57.2 | 5.1 | NA | NA | NA | NA | 1994-01-06 07:47:21 | 1994-01-06 17:28:07 | 0.79 | Rain, Partially cloudy | Partly cloudy throughout the day with morning rain. | rain | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
The data is filtered to the 2009–2023 period so that the analysis focuses on recent, relevant records and leaves out older data.
# Convert `datetime` column to Date
df$datetime <- as.Date(df$datetime, format = "%d/%m/%Y")
# Filter the data for the date range 2009 to 2023
filtered_data <- df %>% filter(datetime >= "2009-01-01" & datetime <= "2023-12-31")
| ROW_KEY | name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | preciptype | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5480 | seoul | 2009-01-01 | 28.5 | 14.5 | 20.9 | 23.2 | 6.4 | 15.3 | 5.9 | 52.7 | 0 | 0 | 0 | NA | 0 | 0 | NA | 9.5 | 308.6 | 1026.5 | 0.2 | 7.3 | NA | NA | NA | NA | 2009-01-01 07:46:55 | 2009-01-01 17:24:11 | 0.16 | Clear | Clear conditions throughout the day. | clear-day | 471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5481 | seoul | 2009-01-02 | 35.0 | 13.3 | 24.2 | 30.8 | 12.7 | 22.3 | 12.2 | 63.0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 5.6 | 326.3 | 1029.3 | 0.0 | 7.0 | NA | NA | NA | NA | 2009-01-02 07:47:04 | 2009-01-02 17:24:59 | 0.19 | Clear | Clear conditions throughout the day. | clear-day | 4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5482 | seoul | 2009-01-03 | 39.4 | 15.1 | 27.2 | 39.4 | 12.7 | 25.6 | 14.2 | 62.7 | 0 | 0 | 0 | NA | 0 | 0 | NA | 5.6 | 329.4 | 1029.4 | 11.2 | 6.7 | NA | NA | NA | NA | 2009-01-03 07:47:11 | 2009-01-03 17:25:48 | 0.23 | Clear | Clear conditions throughout the day. | clear-day | 471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5483 | seoul | 2009-01-04 | 40.4 | 22.4 | 30.5 | 36.3 | 22.4 | 28.6 | 15.8 | 58.6 | 0 | 0 | 0 | NA | 0 | 0 | NA | 7.8 | 306.2 | 1026.9 | 27.5 | 7.1 | NA | NA | NA | NA | 2009-01-04 07:47:15 | 2009-01-04 17:26:39 | 0.25 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5484 | seoul | 2009-01-05 | 34.0 | 18.7 | 27.4 | 32.2 | 17.9 | 24.4 | 17.6 | 67.8 | 0 | 0 | 0 | NA | 0 | 0 | NA | 9.0 | 287.0 | 1026.9 | 34.0 | 6.6 | NA | NA | NA | NA | 2009-01-05 07:47:18 | 2009-01-05 17:27:31 | 0.29 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5485 | seoul | 2009-01-06 | 34.9 | 15.8 | 26.2 | 34.8 | 15.8 | 24.4 | 14.2 | 63.1 | 0 | 0 | 0 | NA | 0 | 0 | NA | 5.6 | 326.1 | 1028.9 | 38.8 | 4.6 | NA | NA | NA | NA | 2009-01-06 07:47:19 | 2009-01-06 17:28:24 | 0.33 | Partially cloudy | Becoming cloudy in the afternoon. | partly-cloudy-day | 4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
# Select the relevant columns
filtered_data <- filtered_data %>%
select(datetime, tempmax, tempmin, temp, precip, windspeed, cloudcover)
# Check the filtered dataset
str(filtered_data)
## tibble [5,478 × 7] (S3: tbl_df/tbl/data.frame)
## $ datetime : Date[1:5478], format: "2009-01-01" "2009-01-02" ...
## $ tempmax : num [1:5478] 28.5 35 39.4 40.4 34 34.9 39.5 37.9 32.1 24.9 ...
## $ tempmin : num [1:5478] 14.5 13.3 15.1 22.4 18.7 15.8 15.6 17.6 18.5 14.6 ...
## $ temp : num [1:5478] 20.9 24.2 27.2 30.5 27.4 26.2 27.5 27.9 25.7 19.7 ...
## $ precip : num [1:5478] 0 0 0 0 0 0 0 0 0 0 ...
## $ windspeed : num [1:5478] 9.5 5.6 5.6 7.8 9 5.6 6.7 7.4 7.9 9.1 ...
## $ cloudcover: num [1:5478] 0.2 0 11.2 27.5 34 38.8 10.8 36.1 17.2 8.2 ...
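Select relevant columns
Key Attributes:
tempmax: Maximum daily temperature (°C)
tempmin: Minimum daily temperature (°C)
temp: Mean daily temperature (°C)
precip: Daily precipitation total (mm)
windspeed: Average daily wind speed (km/h)
cloudcover: Average cloud cover (%)
The temperature columns are expressed in °C only after the unit conversion performed later in the preparation step.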
These attributes will help analyze trends such as heatwaves, rainfall patterns, and wind speed variability, which are important for urban planning and sustainable resource management.
# Show number of missing values before handling
missing_before <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values before handling:")
## [1] "Missing values before handling:"
print(missing_before)
## datetime tempmax tempmin temp precip windspeed cloudcover
## 0 0 0 1 1 0 2
# Fill missing values in numeric columns with the mean
numeric_columns <- names(filtered_data)[sapply(filtered_data, is.numeric)]
filtered_data <- filtered_data %>%
mutate(across(all_of(numeric_columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Show number of missing values after handling
missing_after <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values after handling:")
## [1] "Missing values after handling:"
print(missing_after)
## datetime tempmax tempmin temp precip windspeed cloudcover
## 0 0 0 0 0 0 0
# Remove duplicates
filtered_data <- distinct(filtered_data)
The precip column is renamed to precipitation to make the column name more descriptive.
# Rename precip to precipitation
filtered_data <- filtered_data %>%
rename(precipitation = precip)
# Check the updated column names
colnames(filtered_data)
## [1] "datetime" "tempmax" "tempmin" "temp"
## [5] "precipitation" "windspeed" "cloudcover"
# Time series plot of daily maximum temperature
ggplot(filtered_data, aes(x = datetime, y = tempmax)) +
  geom_line(color = "blue") + # Line plot
  labs(title = "Time Series of Maximum Temperature",
       x = "Date",
       y = "Maximum Temperature") +
  theme_minimal()
The maximum temperature series shows a clear cyclical pattern, most likely due to seasonal variation. There appears to be a slight upward trend over time, and some anomalies are visible, suggesting the influence of individual weather events or other factors.
# Summary statistics for temp columns
summary(filtered_data$tempmax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.10 39.10 61.00 57.61 79.20 102.30
summary(filtered_data$tempmin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -17.10 23.70 41.70 41.68 62.50 84.40
# Check for unusually high or low values
boxplot(filtered_data$tempmax, main = "Boxplot of tempmax", ylab = "Temperature")
boxplot(filtered_data$tempmin, main = "Boxplot of tempmin", ylab = "Temperature")
Both tempmax and tempmin exhibit a wide range and significant variability. Tempmax shows a higher median and a wider range compared to tempmin. The gap between the two medians (61.00 versus 41.70) points to a substantial difference between daily highs and lows.
Considerations:
Units: The plots do not state the temperature unit, which is essential for accurate interpretation. A maximum of 102.30 is not plausible in Celsius, so at least part of the data appears to be recorded in Fahrenheit.
Unit Consistency: Using a single unit (Celsius) ensures accurate comparisons and calculations throughout the analysis. The code below converts the 2009–2021 records from Fahrenheit to Celsius and leaves the 2022–2023 records unchanged.
# Define conversion functions
celsius_to_fahrenheit <- function(celsius) {
celsius * 9 / 5 + 32
}
fahrenheit_to_celsius <- function(fahrenheit) {
(fahrenheit - 32) * 5 / 9
}
# Define the date range for conversion
start_date <- as.Date('2009-01-01')
end_date <- as.Date('2021-12-31')
# Filter the data for the dates that need conversion (2009-2021)
data_to_convert <- filtered_data[filtered_data$datetime >= start_date & filtered_data$datetime <= end_date, ]
# Define the columns to convert
temp_columns <- c("tempmax", "tempmin", "temp")
# Apply temperature conversion for the filtered rows (2009-2021)
for (col in temp_columns) {
data_to_convert[[col]] <- round(fahrenheit_to_celsius(data_to_convert[[col]]), 1)
}
# Now, merge the converted data back with the original data
data_non_converted <- filtered_data[!(filtered_data$datetime >= start_date & filtered_data$datetime <= end_date), ]
# Combine the data
final_data <- rbind(data_non_converted, data_to_convert)
# Ensure that the combined data is sorted by date
final_data <- final_data[order(final_data$datetime), ]
| datetime | tempmax | tempmin | temp | precipitation | windspeed | cloudcover |
|---|---|---|---|---|---|---|
| 2009-01-01 | -1.9 | -9.7 | -6.2 | 0 | 9.5 | 0.2 |
| 2009-01-02 | 1.7 | -10.4 | -4.3 | 0 | 5.6 | 0.0 |
| 2009-01-03 | 4.1 | -9.4 | -2.7 | 0 | 5.6 | 11.2 |
| 2009-01-04 | 4.7 | -5.3 | -0.8 | 0 | 7.8 | 27.5 |
| 2009-01-05 | 1.1 | -7.4 | -2.6 | 0 | 9.0 | 34.0 |
| 2009-01-06 | 1.6 | -9.0 | -3.2 | 0 | 5.6 | 38.8 |
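As a spot check of the conversion, the first converted row can be recomputed by hand. The sketch below reuses the same formula as fahrenheit_to_celsius() and takes the raw 2009-01-01 values (28.5, 14.5, 20.9) from the filtered table shown earlier; the names raw_f and converted_c are introduced here only for illustration.

```r
# Same formula as fahrenheit_to_celsius() in the report
fahrenheit_to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}

# Raw 2009-01-01 values (tempmax, tempmin, temp) before conversion
raw_f <- c(tempmax = 28.5, tempmin = 14.5, temp = 20.9)

# Convert and round to one decimal place, as in the conversion loop
converted_c <- round(fahrenheit_to_celsius(raw_f), 1)
print(converted_c)
```

The result (-1.9, -9.7, -6.2) agrees with the first row of the converted table.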
Exploratory Data Analysis (EDA) is an essential step in data analysis that helps us understand the dataset, uncover underlying patterns, detect anomalies, and test assumptions before applying any predictive models. In this section, we aim to explore the Seoul Historical Weather data to extract meaningful insights regarding weather patterns over a 15-year period (2009–2023). By visualizing the data, calculating summary statistics, and performing various checks, we will prepare the dataset for further analysis and modeling.
Summary statistics (minimum, quartiles, median, mean, and maximum) for key attributes such as temperature, precipitation, wind speed, and cloud cover are shown below.
## datetime tempmax tempmin temp
## Min. :2009-01-01 Min. :-11.30 Min. :-19.000 Min. :-14.40
## 1st Qu.:2012-10-01 1st Qu.: 8.40 1st Qu.: -1.000 1st Qu.: 3.80
## Median :2016-07-01 Median : 19.55 Median : 9.000 Median : 14.10
## Mean :2016-07-01 Mean : 17.68 Mean : 8.331 Mean : 12.89
## 3rd Qu.:2020-03-31 3rd Qu.: 27.10 3rd Qu.: 18.275 3rd Qu.: 22.60
## Max. :2023-12-31 Max. : 39.10 Max. : 29.100 Max. : 49.47
## precipitation windspeed cloudcover
## Min. : 0.0000 Min. : 2.60 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 7.90 1st Qu.: 25.40
## Median : 0.0000 Median :10.00 Median : 50.50
## Mean : 0.7568 Mean :10.61 Mean : 50.19
## 3rd Qu.: 0.0400 3rd Qu.:12.60 3rd Qu.: 74.80
## Max. :268.6200 Max. :44.90 Max. :100.00
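The summary reports a maximum temp of 49.47 even though the maximum tempmax is 39.10. A daily mean above the highest daily maximum is not plausible, so at least one row looks inconsistent. One possible cause, stated here as an assumption rather than a verified finding, is a missing value that was filled with the column mean before the Celsius conversion and fell outside the converted date range. A minimal sketch of a consistency check, using a toy data frame (weather, a hypothetical stand-in for final_data):

```r
# Toy rows standing in for final_data; the third row mimics a mean value
# imputed on the Fahrenheit scale and never converted (illustrative only)
weather <- data.frame(
  datetime = as.Date(c("2022-07-01", "2022-07-02", "2022-07-03")),
  tempmin  = c(21.0, 22.5, 20.1),
  temp     = c(25.3, 26.0, 49.47),
  tempmax  = c(30.2, 31.1, 29.8)
)

# A daily mean should lie between the daily minimum and the daily maximum
inconsistent <- weather[weather$temp < weather$tempmin |
                          weather$temp > weather$tempmax, ]
print(inconsistent)
```

Running the same filter on final_data would list any rows that need to be re-imputed or converted.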
The following plot shows how the daily maximum, minimum, and mean temperatures vary over time.
ggplot(final_data, aes(x = datetime)) +
geom_line(aes(y = tempmax), color = "red") +
geom_line(aes(y = tempmin), color = "blue") +
geom_line(aes(y = temp), color = "green") +
labs(title = "Temperature Over Time", x = "Date", y = "Temperature (°C)") +
theme_minimal()
Color Interpretation:
Red: Daily maximum temperature (tempmax).
Blue: Daily minimum temperature (tempmin).
Green: Daily mean temperature (temp).
Visual Interpretation:
Cyclical Pattern: The most prominent feature is the clear cyclical pattern in the temperature data. There are distinct peaks and troughs that repeat at regular intervals. This suggests a strong seasonal component in the temperature variations.
Amplitude: The amplitude of the fluctuations seems to vary somewhat over time. There are periods with higher peaks and deeper troughs, and others with more moderate variations. This could be due to factors like El Niño or other climate phenomena.
Trend: There appears to be a slight upward trend in the overall temperature levels from 2010 to 2020. This could be attributed to long-term climate change or other regional factors.
Anomalies: There are a few notable anomalies, such as a sharp drop in temperature around 2016. These could be due to specific weather events or other localized factors.
In Summary:
The plot depicts temperature fluctuations over time, highlighting the seasonal variations, potential trends, and notable anomalies. The three colors distinguish the daily maximum, minimum, and mean series.
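The upward trend described above is read off the plot by eye. One way to quantify it is to average the series by year, which removes the seasonal cycle, and fit a straight line to the annual means. The sketch below runs on a synthetic series (toy), not on the Seoul records, so its output says nothing about the real data; all object names are illustrative.

```r
set.seed(1)

# Synthetic daily series: seasonal cycle + small drift + noise.
# An illustrative stand-in for final_data, not the real Seoul records.
dates <- seq(as.Date("2009-01-01"), as.Date("2023-12-31"), by = "day")
day_index <- as.numeric(dates - dates[1])
toy <- data.frame(
  datetime = dates,
  temp = 12 + 12 * sin(2 * pi * (day_index - 110) / 365.25) +
    0.05 * day_index / 365.25 + rnorm(length(dates), sd = 3)
)

# Average by calendar year to remove the seasonal cycle
toy$year <- as.integer(format(toy$datetime, "%Y"))
annual <- aggregate(temp ~ year, data = toy, FUN = mean)

# The slope of the fitted line is the estimated change in degrees per year
trend_fit <- lm(temp ~ year, data = annual)
slope <- coef(trend_fit)[["year"]]
print(annual)
cat("Estimated trend (degrees C per year):", round(slope, 3), "\n")
```

Applying the same steps to final_data, together with summary(trend_fit), would show whether the apparent trend is distinguishable from year-to-year noise.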
# Reshape the dataset to long format
final_data_long <- pivot_longer(final_data, cols = c(precipitation, windspeed, cloudcover),
names_to = "variable", values_to = "value")
# Plotting with facets
ggplot(final_data_long, aes(x = datetime, y = value)) +
geom_line() +
facet_wrap(~ variable, scales = "free_y") + # Use free y-axis scale for each variable
labs(title = "Time Series of Precipitation, Wind Speed, and Cloud Cover", x = "Date", y = "Value") +
theme_minimal()
Precipitation: The precipitation time series shows distinct peaks and troughs, suggesting a seasonal rainfall pattern. There appear to be periods with higher precipitation and others with lower precipitation.
Wind Speed: The wind speed time series also shows a cyclical pattern, but with less pronounced peaks and troughs compared to precipitation. The overall variation in wind speed seems to be less than that of precipitation.
Cloud Cover: The cloud cover time series exhibits a high degree of variability with frequent fluctuations. There are periods with consistently high cloud cover and others with lower cloud cover.
The most likely explanation for the cyclical patterns is seasonal variation in the weather. In many regions, precipitation, wind speed, and cloud cover are influenced by factors such as monsoon seasons, changes in atmospheric circulation, and solar radiation.
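A daily line plot makes seasonality hard to judge for a variable that is zero on most days. Averaging by calendar month across all years gives a more direct view. The following sketch uses synthetic precipitation (toy) that is deliberately wetter in June to August, so it illustrates the aggregation only and is not a result for Seoul.

```r
set.seed(2)

dates <- seq(as.Date("2009-01-01"), as.Date("2023-12-31"), by = "day")
month_num <- as.integer(format(dates, "%m"))

# Synthetic precipitation, wetter in June-August (illustrative only)
toy <- data.frame(
  datetime = dates,
  month = month_num,
  precipitation = rexp(length(dates),
                       rate = ifelse(month_num %in% 6:8, 0.2, 2))
)

# Mean daily precipitation for each calendar month across all years
monthly_mean <- aggregate(precipitation ~ month, data = toy, FUN = mean)
print(monthly_mean)
```

The same aggregate() call on final_data would give one value per month for precipitation, and can be repeated for windspeed and cloudcover.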
A heatwave threshold of 35 °C is defined, and days whose maximum temperature exceeds the threshold are marked as heatwave days.
# Define heatwave threshold
heatwave_threshold <- 35 # Example threshold for heatwaves
# Mark heatwave days
final_data <- final_data %>%
mutate(heatwave = ifelse(tempmax > heatwave_threshold, 1, 0))
# Assign a run ID that changes whenever the heatwave flag switches,
# so each block of consecutive heatwave days gets its own ID
final_data <- final_data %>%
  mutate(heatwave_id = cumsum(heatwave != lag(heatwave, default = 0)))
# Count the number of heatwave days
num_heatwave_days <- sum(final_data$heatwave == 1)
# Create a message (this counts heatwave days over the whole 2009-2023 period)
heatwave_message <- paste("Number of heatwave days in Seoul (2009-2023):", num_heatwave_days)
# Print the message
print(heatwave_message)
## [1] "Number of heatwave days in Seoul (2009-2023): 45"
# Scatter plot for tempmax with heatwave days highlighted
ggplot(final_data, aes(x = datetime, y = tempmax)) +
# Scatter plot for all data points
geom_point(aes(color = factor(heatwave), shape = factor(heatwave)), size = 3) +
# Customize color scale to distinguish heatwave days
scale_color_manual(values = c("black", "red"), labels = c("Non-Heatwave", "Heatwave")) +
scale_shape_manual(values = c(16, 17), labels = c("Non-Heatwave", "Heatwave")) +
# Add annotation for heatwave days count
annotate("text", x = as.Date("2022-01-01"), y = max(final_data$tempmax),
label = paste("Heatwave Days:", num_heatwave_days),
hjust = 0, color = "blue", size = 2.5) +
labs(title = "Scatter Plot of Max Temperature with Heatwave Days Highlighted",
x = "Date",
y = "Max Temperature (°C)") +
theme_minimal() +
theme(legend.title = element_blank())
Interpretation:
The scatter plot provides a visual representation of heatwave events in relation to maximum temperature. It allows us to identify:
Days exceeding the heatwave threshold: The red triangles clearly highlight the days where the maximum temperature surpassed the defined threshold.
Seasonal patterns: The clustering of red triangles suggests that heatwaves are more likely to occur during specific seasons or periods.
Temperature distribution: The scatter plot shows the overall distribution of maximum temperatures, including both heatwave and non-heatwave days.
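The count above refers to individual heatwave days. To count heatwave episodes, meaning runs of consecutive days above the threshold, base R's rle() can be applied to the daily flag. This is a minimal sketch on a short vector of made-up maxima; the variable names other than heatwave_threshold and num_heatwave_days are introduced here for illustration.

```r
# Toy daily maxima in degrees C; illustrative values, not from the dataset
tempmax <- c(33, 36, 37, 34, 36, 31, 30, 35.5, 36, 38, 29)
heatwave_threshold <- 35

# TRUE on days above the threshold
is_hot <- tempmax > heatwave_threshold

# rle() collapses the logical vector into runs of identical values
runs <- rle(is_hot)
hot_run_lengths <- runs$lengths[runs$values]

num_heatwave_days <- sum(is_hot)          # 6 days above the threshold
num_episodes <- length(hot_run_lengths)   # 3 separate episodes

print(hot_run_lengths)                    # lengths of the episodes: 2 1 3
```

A minimum episode length, for example at least two consecutive days, could be enforced by filtering hot_run_lengths before counting.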
The correlation analysis is visualized through a circle heatmap, which highlights the strength and direction of relationships between different variables in the dataset.
cor_matrix <- cor(final_data[, c("tempmax", "tempmin", "temp", "precipitation", "windspeed", "cloudcover")])
corrplot(cor_matrix, method = "circle")
Interpretation of Correlation Plot
The following interpretation is based on the correlation plot created for the dataset. The size and color of the circles represent the strength and direction of the correlation between each pair of variables.
Strong Positive Correlations:
tempmax and tempmin have a very strong positive correlation. This makes sense, as the maximum temperature of a day is closely related to its minimum temperature, with both typically increasing or decreasing together.
temp and tempmax have a strong positive correlation. As the mean temperature of the day increases, the maximum temperature also tends to increase.
temp and tempmin also show a strong positive correlation. Similarly, as the mean temperature of the day increases, the minimum temperature increases as well.
Moderate to Weak Correlations:
The correlation between precipitation and windspeed is weak, suggesting that wind speed and precipitation levels are not strongly associated.
The correlation between cloudcover and temp is also weak to moderate, meaning that cloud cover does not always correlate strongly with the temperature of the day.
The seasonal analysis explores the variations in key variables over different seasons, aiming to identify trends, patterns, and anomalies that may influence the overall dataset.
To categorize the data by season, a new ‘season’ column is created based on the month of the year, with the following assumptions:
Winter: December, January, February
Spring: March, April, May
Summer: June, July, August
Fall: September, October, November
# Create a 'season' column from the month, using the assumed season definitions
final_data <- final_data %>%
mutate(season = case_when(
format(datetime, "%m") %in% c("12", "01", "02") ~ "Winter",
format(datetime, "%m") %in% c("03", "04", "05") ~ "Spring",
format(datetime, "%m") %in% c("06", "07", "08") ~ "Summer",
format(datetime, "%m") %in% c("09", "10", "11") ~ "Fall"
))
# Aggregate by season
seasonal_data <- final_data %>%
group_by(season) %>%
summarise(
tempmax_avg = mean(tempmax, na.rm = TRUE),
tempmin_avg = mean(tempmin, na.rm = TRUE),
precip_avg = mean(precipitation, na.rm = TRUE),
windspeed_avg = mean(windspeed, na.rm = TRUE),
cloudcover_avg = mean(cloudcover, na.rm = TRUE)
)
# Plot the seasonal data
ggplot(seasonal_data, aes(x = season)) +
geom_bar(aes(y = tempmax_avg), stat = "identity", fill = "red") +
geom_bar(aes(y = tempmin_avg), stat = "identity", fill = "blue", alpha = 0.5) +
labs(title = "Average Temperature by Season",
x = "Season",
y = "Temperature (°C)") +
theme_minimal()
Summer: The highest average temperatures are observed in Summer. The red bar shows the average daily maximum and the overlaid semi-transparent blue bar shows the average daily minimum; the bars are overlaid, not stacked.
Fall and Spring: Average temperatures in Fall and Spring are lower than in Summer, and the two seasons show similar average maxima and minima.
Winter: The lowest average temperatures occur in Winter, for both the daily maximum and the daily minimum.
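Because season is stored as text, ggplot2 places the bars in alphabetical order (Fall, Spring, Summer, Winter). Converting the column to a factor with explicit levels puts the seasons in calendar order. The sketch below uses placeholder averages, not the values computed from the dataset.

```r
# Placeholder seasonal averages for illustration only
seasonal_data <- data.frame(
  season = c("Fall", "Spring", "Summer", "Winter"),
  tempmax_avg = c(20, 18, 29, 3)
)

# Explicit factor levels define the order used on a discrete plot axis
season_levels <- c("Winter", "Spring", "Summer", "Fall")
seasonal_data$season <- factor(seasonal_data$season, levels = season_levels)

# Sorting by the factor follows the level order, not the alphabet
seasonal_data <- seasonal_data[order(seasonal_data$season), ]
print(seasonal_data)
```

With the factor in place, the existing ggplot() call would draw the bars from Winter to Fall without further changes.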
ggplot(final_data, aes(x = tempmax)) +
geom_boxplot() +
labs(title = "Outliers in Max Temperature", x = "Temperature (°C)") +
theme_minimal()
ggplot(final_data, aes(x = tempmin)) +
geom_boxplot() +
labs(title = "Outliers in Min Temperature", x = "Temperature (°C)") +
theme_minimal()
ggplot(final_data, aes(x = temp)) +
geom_boxplot() +
labs(title = "Outliers in Temperature", x = "Temperature (°C)") +
theme_minimal()
The boxplots provide a visual representation of the temperature data of tempmax, tempmin and temp, highlighting the following characteristics:
No extreme values
Symmetrical distribution
Consistent variations across different temperature metrics (max, min, average)
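The "no extreme values" reading can be checked numerically with the same 1.5 × IQR rule that boxplot whiskers use by default. For tempmax, the quartiles reported earlier (8.40 and 27.10) give fences of about -19.65 and 55.15, and the reported minimum (-11.30) and maximum (39.10) both fall inside them. A minimal sketch of the rule, where iqr_outliers() is a hypothetical helper and the input is a toy vector:

```r
# Return the values lying outside the 1.5 * IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x[!is.na(x) & (x < q[[1]] - fence | x > q[[2]] + fence)]
}

# Toy maxima with one implausible value (illustrative only)
toy_tempmax <- c(-5, 2, 8, 15, 19, 24, 27, 31, 60)
flagged <- iqr_outliers(toy_tempmax)
print(flagged)  # 60 is the only value outside the fences
```

Calling iqr_outliers(final_data$temp) and the equivalents for tempmax and tempmin would return the flagged values, if any.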
Feature engineering involves transforming raw data into meaningful features that enhance the performance of predictive models. In this analysis, a new categorical feature, ‘weather’, is created based on the maximum temperature (tempmax) to categorize days as either ‘Hot’ or ‘Cool’.
This transformation helps in simplifying the analysis and enables models to better capture temperature-related patterns in the dataset.
The logic for creating this feature is as follows:
Hot: Days where the maximum temperature exceeds 30°C.
Cool: Days where the maximum temperature is 30°C or lower.
# Example: Creating a 'weather' column based on temperature (tempmax)
final_data$weather <- as.factor(ifelse(final_data$tempmax > 30, "Hot", "Cool"))
# Set seed for reproducibility
set.seed(123)
# Create training and testing data (80-20 random split); floor() makes the sample size an explicit integer
train_indices <- sample(seq_len(nrow(final_data)), size = floor(0.8 * nrow(final_data)))
train_data <- final_data[train_indices, ]
test_data <- final_data[-train_indices, ]
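The split above is random, so training and test rows are interleaved in time. For a forecasting question, a common alternative is a chronological split, where the model is trained on the earlier part of the series and tested on the later part. The sketch below shows that alternative on a placeholder data frame (toy); it is not the split used in this report, and train_chrono and test_chrono are illustrative names.

```r
# Placeholder daily data frame covering the study period (values are dummies)
dates <- seq(as.Date("2009-01-01"), as.Date("2023-12-31"), by = "day")
toy <- data.frame(datetime = dates, temp = seq_along(dates))

# Sort by date, then cut at 80% of the rows
toy <- toy[order(toy$datetime), ]
cutoff <- floor(0.8 * nrow(toy))

train_chrono <- toy[seq_len(cutoff), ]
test_chrono  <- toy[(cutoff + 1):nrow(toy), ]

cat("Train:", format(min(train_chrono$datetime)), "to",
    format(max(train_chrono$datetime)), "\n")
cat("Test: ", format(min(test_chrono$datetime)), "to",
    format(max(test_chrono$datetime)), "\n")
```

With this layout, every test date is later than every training date, which is closer to how a forecast would be used.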
Linear Regression Model
# Build linear regression model
lm_temp <- lm(temp ~ tempmax + tempmin + precipitation + windspeed + cloudcover, data = train_data)
# Predict on test data
pred_temp_lm <- predict(lm_temp, test_data)
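# Evaluate the model: RMSE is computed on the test set
rmse_temp_lm <- rmse(test_data$temp, pred_temp_lm)
# Note: summary()$r.squared is the R-squared on the training data, not the test set
r2_temp_lm <- summary(lm_temp)$r.squared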
r2_temp_lm
## [1] 0.9947049
Random Forest Model
# Build random forest model
rf_temp <- randomForest(temp ~ tempmax + tempmin + precipitation + windspeed + cloudcover, data = train_data)
# Predict on test data
pred_temp_rf <- predict(rf_temp, test_data)
# Evaluate the model
rmse_temp_rf <- rmse(test_data$temp, pred_temp_rf)
rmse_temp_rf
## [1] 0.8442027
Support Vector Regression (SVR)
# Build Support Vector Regression model
svr_temp <- svm(temp ~ tempmax + tempmin + precipitation + windspeed + cloudcover, data = train_data)
# Predict on test data
pred_temp_svr <- predict(svr_temp, test_data)
# Evaluate the model
rmse_temp_svr <- sqrt(mean((pred_temp_svr - test_data$temp)^2))
rmse_temp_svr
## [1] 0.6346145
# Evaluate the models
rmse_temp_rf <- sqrt(mean((pred_temp_rf - test_data$temp)^2)) # Random Forest RMSE
rmse_temp_lm <- sqrt(mean((pred_temp_lm - test_data$temp)^2)) # Linear Regression RMSE
rmse_temp_svr <- sqrt(mean((pred_temp_svr - test_data$temp)^2)) # Support Vector Regression RMSE
# R-squared for Linear Regression, taken from the model summary
# (note: this is computed on the training data, unlike the test-set values below)
r2_temp_lm <- summary(lm_temp)$r.squared
# Manually calculate R-squared for Random Forest and SVR
ss_total <- sum((test_data$temp - mean(test_data$temp))^2) # Total Sum of Squares
ss_residual_rf <- sum((test_data$temp - pred_temp_rf)^2) # Residual Sum of Squares (RF)
ss_residual_svr <- sum((test_data$temp - pred_temp_svr)^2) # Residual Sum of Squares (SVR)
r2_temp_rf <- 1 - (ss_residual_rf / ss_total) # R-squared for Random Forest
r2_temp_svr <- 1 - (ss_residual_svr / ss_total) # R-squared for SVR
# Calculate MAE for all models
mae_temp_rf <- mean(abs(pred_temp_rf - test_data$temp))
mae_temp_lm <- mean(abs(pred_temp_lm - test_data$temp))
mae_temp_svr <- mean(abs(pred_temp_svr - test_data$temp))
# Display RMSE, R-squared, and MAE for all three models
model_comparison <- data.frame(
Model = c("Random Forest", "Linear Regression", "Support Vector Regression (SVR)"),
RMSE = c(rmse_temp_rf, rmse_temp_lm, rmse_temp_svr),
R_squared = c(r2_temp_rf, r2_temp_lm, r2_temp_svr), # Include R-squared for all models
MAE = c(mae_temp_rf, mae_temp_lm, mae_temp_svr)
)
# Print the model comparison
print(model_comparison)
## Model RMSE R_squared MAE
## 1 Random Forest 0.8442027 0.9937795 0.6105777
## 2 Linear Regression 0.4955837 0.9947049 0.3816977
## 3 Support Vector Regression (SVR) 0.6346145 0.9964848 0.4463752
# Reshape the model_comparison data frame into long format
model_comparison_long <- model_comparison %>%
pivot_longer(cols = c("RMSE", "MAE"),
names_to = "Metric",
values_to = "Value")
# Plot RMSE and MAE for all models
ggplot(model_comparison_long, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Comparison of RMSE and MAE for Temperature Prediction Models",
x = "Model",
y = "Value") +
scale_fill_manual(values = c("RMSE" = "skyblue", "MAE" = "orange")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Model Comparison
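Overall, based on the RMSE and MAE values, Linear Regression is the most suitable of the three models for this dataset: it has the lowest RMSE (0.496) and MAE (0.382), ahead of SVR (0.635 and 0.446) and Random Forest (0.844 and 0.611). The R-squared values are less directly comparable, because the Linear Regression figure comes from the training data while the other two are computed on the test set.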
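To put RMSE, MAE, and R-squared on the same held-out footing for every model, the three metrics can be computed from the actual and predicted vectors alone. The following is a minimal sketch; the helper name eval_regression and the toy numbers are assumptions introduced for illustration.

```r
# Hypothetical helper: held-out regression metrics from two numeric vectors
eval_regression <- function(actual, predicted) {
  resid <- actual - predicted
  c(RMSE      = sqrt(mean(resid^2)),
    MAE       = mean(abs(resid)),
    R_squared = 1 - sum(resid^2) / sum((actual - mean(actual))^2))
}

# Toy illustration (made-up numbers, not from the Seoul dataset)
actual    <- c(10, 12, 14, 16)
predicted <- c(11, 12, 13, 17)
toy_metrics <- eval_regression(actual, predicted)
toy_metrics

# In the report's workflow this could be called as, for example:
# eval_regression(test_data$temp, pred_temp_lm)
```

Using one function for all three models would also give a test-set R-squared for Linear Regression, which the summary-based value does not provide.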
Decision Tree Model
dt_model <- rpart(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data, method = "class")
dt_pred <- predict(dt_model, test_data, type = "class")
dt_conf <- confusionMatrix(dt_pred, test_data$weather)
dt_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cool Hot
## Cool 969 0
## Hot 0 127
##
## Accuracy : 1
## 95% CI : (0.9966, 1)
## No Information Rate : 0.8841
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.8841
## Detection Rate : 0.8841
## Detection Prevalence : 0.8841
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Cool
##
Random Forest
rf_model <- randomForest(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data)
rf_pred <- predict(rf_model, test_data)
rf_conf <- confusionMatrix(rf_pred, test_data$weather)
rf_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cool Hot
## Cool 969 0
## Hot 0 127
##
## Accuracy : 1
## 95% CI : (0.9966, 1)
## No Information Rate : 0.8841
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.8841
## Detection Rate : 0.8841
## Detection Prevalence : 0.8841
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Cool
##
k-NN Model
knn_model <- train(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data, method = "knn", tuneLength = 5)
knn_pred <- predict(knn_model, test_data)
knn_conf <- confusionMatrix(knn_pred, test_data$weather)
knn_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cool Hot
## Cool 962 25
## Hot 7 102
##
## Accuracy : 0.9708
## 95% CI : (0.959, 0.9799)
## No Information Rate : 0.8841
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8482
##
## Mcnemar's Test P-Value : 0.002654
##
## Sensitivity : 0.9928
## Specificity : 0.8031
## Pos Pred Value : 0.9747
## Neg Pred Value : 0.9358
## Prevalence : 0.8841
## Detection Rate : 0.8777
## Detection Prevalence : 0.9005
## Balanced Accuracy : 0.8980
##
## 'Positive' Class : Cool
##
Naive Bayes Model
nb_model <- naiveBayes(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data)
nb_pred <- predict(nb_model, test_data)
nb_conf <- confusionMatrix(nb_pred, test_data$weather)
nb_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cool Hot
## Cool 910 4
## Hot 59 123
##
## Accuracy : 0.9425
## 95% CI : (0.9271, 0.9556)
## No Information Rate : 0.8841
## P-Value [Acc > NIR] : 2.937e-11
##
## Kappa : 0.7639
##
## Mcnemar's Test P-Value : 1.022e-11
##
## Sensitivity : 0.9391
## Specificity : 0.9685
## Pos Pred Value : 0.9956
## Neg Pred Value : 0.6758
## Prevalence : 0.8841
## Detection Rate : 0.8303
## Detection Prevalence : 0.8339
## Balanced Accuracy : 0.9538
##
## 'Positive' Class : Cool
##
SVM Model
svm_model <- svm(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data)
svm_pred <- predict(svm_model, test_data)
svm_conf <- confusionMatrix(svm_pred, test_data$weather)
svm_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cool Hot
## Cool 966 14
## Hot 3 113
##
## Accuracy : 0.9845
## 95% CI : (0.9753, 0.9909)
## No Information Rate : 0.8841
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9213
##
## Mcnemar's Test P-Value : 0.01529
##
## Sensitivity : 0.9969
## Specificity : 0.8898
## Pos Pred Value : 0.9857
## Neg Pred Value : 0.9741
## Prevalence : 0.8841
## Detection Rate : 0.8814
## Detection Prevalence : 0.8942
## Balanced Accuracy : 0.9433
##
## 'Positive' Class : Cool
##
# Collect results for all models with additional metrics
model_metrics <- data.frame(
Model = c("Decision Tree", "Random Forest", "k-NN", "Naive Bayes", "SVM"),
Accuracy = c(dt_conf$overall["Accuracy"],
rf_conf$overall["Accuracy"],
knn_conf$overall["Accuracy"],
nb_conf$overall["Accuracy"],
svm_conf$overall["Accuracy"]),
Precision = c(dt_conf$byClass["Pos Pred Value"],
rf_conf$byClass["Pos Pred Value"],
knn_conf$byClass["Pos Pred Value"],
nb_conf$byClass["Pos Pred Value"],
svm_conf$byClass["Pos Pred Value"]),
Recall = c(dt_conf$byClass["Sensitivity"],
rf_conf$byClass["Sensitivity"],
knn_conf$byClass["Sensitivity"],
nb_conf$byClass["Sensitivity"],
svm_conf$byClass["Sensitivity"]),
F1_Score = c(dt_conf$byClass["F1"],
rf_conf$byClass["F1"],
knn_conf$byClass["F1"],
nb_conf$byClass["F1"],
svm_conf$byClass["F1"])
)
print(model_metrics)
## Model Accuracy Precision Recall F1_Score
## 1 Decision Tree 1.0000000 1.0000000 1.0000000 1.0000000
## 2 Random Forest 1.0000000 1.0000000 1.0000000 1.0000000
## 3 k-NN 0.9708029 0.9746707 0.9927761 0.9836401
## 4 Naive Bayes 0.9425182 0.9956236 0.9391125 0.9665428
## 5 SVM 0.9844891 0.9857143 0.9969040 0.9912776
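The Precision, Recall, and F1 values in this table are reported for the positive class ‘Cool’, which is the majority class. Since ‘Hot’ days are the rarer case, it can help to look at the same metrics from the ‘Hot’ side. The sketch below does this by hand for the k-NN confusion matrix reported earlier; the variable names are illustrative assumptions.

```r
# k-NN confusion matrix counts copied from the reported output
# (rows = prediction, columns = reference; matrix() fills column by column)
knn_counts <- matrix(c(962, 7, 25, 102), nrow = 2,
                     dimnames = list(Prediction = c("Cool", "Hot"),
                                     Reference  = c("Cool", "Hot")))

# Metrics with "Hot" treated as the positive class
hot_precision <- knn_counts["Hot", "Hot"] / sum(knn_counts["Hot", ])  # 102 / 109
hot_recall    <- knn_counts["Hot", "Hot"] / sum(knn_counts[, "Hot"])  # 102 / 127
hot_f1        <- 2 * hot_precision * hot_recall / (hot_precision + hot_recall)

round(c(Precision = hot_precision, Recall = hot_recall, F1 = hot_f1), 4)
```

The ‘Hot’ recall equals the Specificity (0.8031) and the ‘Hot’ precision equals the Neg Pred Value (0.9358) in the k-NN output, as expected when the positive class is swapped. With caret, the same view should be obtainable by passing positive = "Hot" to confusionMatrix().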
# Plot the accuracy of the models using a bar plot
ggplot(model_metrics, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
labs(title = "Model Accuracy Comparison",
x = "Model",
y = "Accuracy") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Decision Tree & Random Forest: Both models achieve perfect scores across all metrics (Accuracy, Precision, Recall, and F1-Score) on the test set. This is expected rather than surprising: the ‘weather’ label is defined by the 30°C threshold on tempmax, and tempmax is also one of the predictors, so tree-based splits on tempmax can reproduce the label exactly.
SVM: Performs well, with the highest scores among the remaining models (Accuracy 0.984, F1-Score 0.991).
k-NN: Shows high scores across all metrics, with Accuracy (0.971) slightly below SVM.
Naive Bayes: Has the lowest Accuracy (0.943) and Recall (0.939) among the models, although its Precision (0.996) is the highest of the non-perfect models.
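The perfect tree-based scores can be sanity-checked against the way the label was built. Because ‘weather’ is defined as tempmax > 30 and tempmax is also a predictor, a one-line threshold rule on that predictor already matches the label on every row. The sketch below illustrates this on toy data; the data frame toy and its values are made up for illustration.

```r
# Toy data (made-up values, not from the Seoul dataset)
toy <- data.frame(tempmax = c(25.0, 29.9, 30.0, 30.1, 33.5))

# Label built the same way as in the report
toy$weather <- factor(ifelse(toy$tempmax > 30, "Hot", "Cool"),
                      levels = c("Cool", "Hot"))

# A threshold rule that uses only the predictor tempmax
rule_pred <- factor(ifelse(toy$tempmax > 30, "Hot", "Cool"),
                    levels = c("Cool", "Hot"))

# Share of rows where the rule agrees with the label
rule_accuracy <- mean(rule_pred == toy$weather)
rule_accuracy
```

One possible way to make the classification task more informative, offered here only as a suggestion, would be to fit the classifiers without tempmax among the predictors and compare the resulting scores.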
This deployment provides an interactive experience where users can explore temperature trends over time through an interactive line plot and view detailed data through a sortable and searchable table. The visualizations allow for dynamic exploration of temperature patterns, while the data table gives users easy access to the raw data, making it a powerful tool for analysis and decision-making.
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Conclusion
Key Findings:
Linear Regression emerges as the most accurate model for predicting daily temperature (temp), with the lowest RMSE (0.4955837).
SVR (Support Vector Regression) and Random Forest showed higher RMSE values (0.6346145 and 0.8442027, respectively), with Random Forest being the least accurate.
Based on the RMSE evaluation, Linear Regression is the preferred model for predicting temperature, making it the most suitable choice for this dataset.
Decision Tree and Random Forest achieved perfect scores across all metrics (Accuracy, Precision, Recall, F1-Score), indicating that they were able to perfectly classify days in the test set as “Hot” or “Cool.”
SVM (Accuracy 0.984) and k-NN (Accuracy 0.971) also performed well, though with a slight drop in accuracy compared to the top models. Naive Bayes showed the lowest performance, particularly in Accuracy (0.943) and Recall (0.939).
For temperature prediction, Linear Regression is the most reliable, whereas for temperature classification, Decision Tree and Random Forest are the best-performing models.
The interactive nature of the deployment allows users to explore temperature trends dynamically over time. The interactive line plot and sortable data table make it a powerful tool for understanding and analyzing temperature data.
The creation of the ‘weather’ feature (categorizing days as “Hot” or “Cool” based on maximum temperature) helped simplify the classification task and improved model interpretability.
Conclusion:
This project successfully utilized machine learning techniques to analyze and model temperature data, achieving valuable insights into daily temperature variations and heatwave patterns. The Linear Regression model was found to be the most effective for predicting daily temperatures, while Decision Tree and Random Forest were the best for classifying days into temperature categories. The interactive deployment provided an accessible tool for exploring and analyzing the data, making it a useful resource for understanding temperature trends.
Overall, the project demonstrates the power of machine learning in climate data analysis and highlights the importance of proper model selection, feature engineering, and interactive visualization for effective data-driven decision-making.