GROUP 10 MEMBERS
Matric	Full_Name
S2110638	Karam Aljanadi
23094825	Siti Hairuneesha Binti Zainuddin
u2005486	Afrina Rosa binti Mohamad Sattar
23085637	Chia Kai Swan
S2119226	Muhammad Taufiq Bin Ismail

Weather Forecasting in Seoul: Insights from 15 Years of Data and Future Trends (2009–2023)

Weather forecasting plays a pivotal role in understanding climatic patterns, preparing for extreme weather events, and optimizing resources across various sectors. This study focuses on Seoul’s weather over a 15-year period (2009–2023), utilizing historical weather data to uncover significant trends and patterns. By analyzing key attributes such as temperature, precipitation, wind speed, and cloud cover, we aim to gain insights into Seoul’s evolving climate and predict future weather conditions.

Dataset Source

The dataset, titled “Seoul Historical Weather Data (1994–2024),” is available on Kaggle.

Research Objectives:

To analyze weather trends by identifying patterns and changes in Seoul’s weather from 2009 to 2023.
To predict future conditions by building models that forecast future temperature patterns in Seoul.
To provide actionable insights by offering recommendations for urban planning, sustainability, and resource management.

Research Questions:

What weather trends can be identified in Seoul from 2009 to 2023?
How accurate are forecasting models for predicting future weather?
How can visualizations help urban planners use weather data for decision-making?

Methodology

CRISP-DM

1. Understand: Define the problem: Analyze Seoul’s weather data for insights and predictions.

2. Explore: Load and examine the dataset, identifying key attributes.

3. Prepare: Clean and preprocess data for analysis.

4. Model: Build predictive models using statistical/ML methods.

5. Evaluate: Assess model performance using relevant metrics.

6. Deploy: Present findings, forecasts, and actionable recommendations.

Import Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
library(ggplot2)
library(rpart)
library(corrplot)
## corrplot 0.95 loaded
library(Metrics)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:Metrics':
## 
##     precision, recall
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(rsconnect)
library(knitr)

Import Dataset

df<-read_csv("dataset123.csv")

## Rows: 10958 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): name, datetime, preciptype, conditions, description, icon, stations
## dbl  (25): ROW_KEY, tempmax, tempmin, temp, feelslikemax, feelslikemin, feel...
## dttm  (2): sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Show the number of rows in a more descriptive way
cat("The dataset contains", nrow(df), "rows.")

## The dataset contains 10958 rows.

str(df)

## spc_tbl_ [10,958 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ROW_KEY         : num [1:10958] 1 2 3 4 5 6 7 8 9 10 ...
##  $ name            : chr [1:10958] "seoul" "seoul" "seoul" "seoul" ...
##  $ datetime        : chr [1:10958] "1/1/1994" "2/1/1994" "3/1/1994" "4/1/1994" ...
##  $ tempmax         : num [1:10958] 35.2 43 47.9 38.8 40 42.7 37 44.3 48.8 48 ...
##  $ tempmin         : num [1:10958] 16.4 31.5 30.9 22.1 24 26.1 17.7 NA 33.6 27.9 ...
##  $ temp            : num [1:10958] 26.3 36.2 38 30.1 33.1 35.5 27.5 34.2 41 39.7 ...
##  $ feelslikemax    : num [1:10958] 33.4 39.4 44.7 32 40 39.1 37 37.6 46.1 48 ...
##  $ feelslikemin    : num [1:10958] 13 26.7 24.5 18.4 18.5 15.7 13.7 12.6 29.2 27.9 ...
##  $ feelslike       : num [1:10958] 24.3 32.6 35.4 26.3 31 30.1 24.3 28.2 39.1 38.9 ...
##  $ dew             : num [1:10958] 15.5 27.9 27.3 13.6 21.7 25 11.3 24.3 34.3 32.4 ...
##  $ humidity        : num [1:10958] 65.9 72.1 68.1 51.2 63.9 66.7 52.1 67.7 77.4 75.8 ...
##  $ precip          : num [1:10958] 0 0 0 0 0.01 0.1 0 0 0 0 ...
##  $ precipprob      : num [1:10958] 0 0 0 0 100 100 0 0 0 0 ...
##  $ precipcover     : num [1:10958] 0 0 0 0 4.17 4.17 0 0 0 0 ...
##  $ preciptype      : chr [1:10958] NA NA NA NA ...
##  $ snow            : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ snowdepth       : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ windgust        : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ windspeed       : num [1:10958] 5.5 9.1 12.8 11.7 6.8 11.8 9.1 14.7 8.9 4.5 ...
##  $ winddir         : num [1:10958] 115 182 290 302 134 ...
##  $ sealevelpressure: num [1:10958] 1025 1022 1020 1025 1024 ...
##  $ cloudcover      : num [1:10958] 63 88.8 57.4 16.3 90.4 57.2 30.1 46.6 NA 81 ...
##  $ visibility      : num [1:10958] 6.6 6.9 5.3 7.3 6.2 5.1 7.4 7.6 3.7 2.2 ...
##  $ solarradiation  : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ solarenergy     : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ uvindex         : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ severerisk      : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ sunrise         : POSIXct[1:10958], format: "1994-01-01 07:46:54" "1994-01-02 07:47:03" ...
##  $ sunset          : POSIXct[1:10958], format: "1994-01-01 17:23:56" "1994-01-02 17:24:44" ...
##  $ moonphase       : num [1:10958] 0.61 0.65 0.68 0.72 0.75 0.79 0.83 0.86 0.9 0.93 ...
##  $ conditions      : chr [1:10958] "Partially cloudy" "Partially cloudy" "Partially cloudy" "Clear" ...
##  $ description     : chr [1:10958] "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Clear conditions throughout the day." ...
##  $ icon            : chr [1:10958] "partly-cloudy-day" "partly-cloudy-day" "partly-cloudy-day" "clear-day" ...
##  $ stations        : chr [1:10958] "4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000" "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,"| __truncated__ ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ROW_KEY = col_double(),
##   ..   name = col_character(),
##   ..   datetime = col_character(),
##   ..   tempmax = col_double(),
##   ..   tempmin = col_double(),
##   ..   temp = col_double(),
##   ..   feelslikemax = col_double(),
##   ..   feelslikemin = col_double(),
##   ..   feelslike = col_double(),
##   ..   dew = col_double(),
##   ..   humidity = col_double(),
##   ..   precip = col_double(),
##   ..   precipprob = col_double(),
##   ..   precipcover = col_double(),
##   ..   preciptype = col_character(),
##   ..   snow = col_double(),
##   ..   snowdepth = col_double(),
##   ..   windgust = col_double(),
##   ..   windspeed = col_double(),
##   ..   winddir = col_double(),
##   ..   sealevelpressure = col_double(),
##   ..   cloudcover = col_double(),
##   ..   visibility = col_double(),
##   ..   solarradiation = col_double(),
##   ..   solarenergy = col_double(),
##   ..   uvindex = col_double(),
##   ..   severerisk = col_double(),
##   ..   sunrise = col_datetime(format = ""),
##   ..   sunset = col_datetime(format = ""),
##   ..   moonphase = col_double(),
##   ..   conditions = col_character(),
##   ..   description = col_character(),
##   ..   icon = col_character(),
##   ..   stations = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

kable(head(df))

ROW_KEY	name	datetime	tempmax	tempmin	temp	feelslikemax	feelslikemin	feelslike	dew	humidity	precip	precipprob	precipcover	preciptype	snow	snowdepth	windgust	windspeed	winddir	sealevelpressure	cloudcover	visibility	solarradiation	solarenergy	uvindex	severerisk	sunrise	sunset	moonphase	conditions	description	icon	stations
1	seoul	1/1/1994	35.2	16.4	26.3	33.4	13.0	24.3	15.5	65.9	0.00	0	0.00	NA	NA	NA	NA	5.5	115.4	1025.4	63.0	6.6	NA	NA	NA	NA	1994-01-01 07:46:54	1994-01-01 17:23:56	0.61	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
2	seoul	2/1/1994	43.0	31.5	36.2	39.4	26.7	32.6	27.9	72.1	0.00	0	0.00	NA	NA	NA	NA	9.1	181.7	1022.2	88.8	6.9	NA	NA	NA	NA	1994-01-02 07:47:03	1994-01-02 17:24:44	0.65	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
3	seoul	3/1/1994	47.9	30.9	38.0	44.7	24.5	35.4	27.3	68.1	0.00	0	0.00	NA	NA	NA	NA	12.8	289.5	1020.0	57.4	5.3	NA	NA	NA	NA	1994-01-03 07:47:11	1994-01-03 17:25:33	0.68	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
4	seoul	4/1/1994	38.8	22.1	30.1	32.0	18.4	26.3	13.6	51.2	0.00	0	0.00	NA	NA	NA	NA	11.7	301.9	1025.1	16.3	7.3	NA	NA	NA	NA	1994-01-04 07:47:16	1994-01-04 17:26:23	0.72	Clear	Clear conditions throughout the day.	clear-day	47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5	seoul	5/1/1994	40.0	24.0	33.1	40.0	18.5	31.0	21.7	63.9	0.01	100	4.17	rain,snow	NA	NA	NA	6.8	134.1	1023.9	90.4	6.2	NA	NA	NA	NA	1994-01-05 07:47:19	1994-01-05 17:27:14	0.75	Snow, Rain, Overcast	Cloudy skies throughout the day with late afternoon rain or snow.	rain	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
6	seoul	6/1/1994	42.7	26.1	35.5	39.1	15.7	30.1	25.0	66.7	0.10	100	4.17	rain	NA	NA	NA	11.8	290.2	1021.9	57.2	5.1	NA	NA	NA	NA	1994-01-06 07:47:21	1994-01-06 17:28:07	0.79	Rain, Partially cloudy	Partly cloudy throughout the day with morning rain.	rain	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

Data Preparation

Filtering Data

Filter the data to the 2009–2023 period. This keeps the analysis focused on the most recent 15 years and leaves out older records that are less relevant to current conditions.

Filter year of data

# Convert `datetime` column to Date 
df$datetime <- as.Date(df$datetime, format = "%d/%m/%Y")
# Filter the data for the date range 2009 to 2023
filtered_data <- df %>%  filter(datetime >= "2009-01-01" & datetime <= "2023-12-31")

ROW_KEY	name	datetime	tempmax	tempmin	temp	feelslikemax	feelslikemin	feelslike	dew	humidity	preciptype	windgust	windspeed	winddir	sealevelpressure	cloudcover	visibility	solarradiation	solarenergy	uvindex	severerisk	sunrise	sunset	moonphase	conditions	description	icon	stations
5480	seoul	2009-01-01	28.5	14.5	20.9	23.2	6.4	15.3	5.9	52.7	NA	NA	9.5	308.6	1026.5	0.2	7.3	NA	NA	NA	NA	2009-01-01 07:46:55	2009-01-01 17:24:11	0.16	Clear	Clear conditions throughout the day.	clear-day	471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5481	seoul	2009-01-02	35.0	13.3	24.2	30.8	12.7	22.3	12.2	63.0	NA	NA	5.6	326.3	1029.3	0.0	7.0	NA	NA	NA	NA	2009-01-02 07:47:04	2009-01-02 17:24:59	0.19	Clear	Clear conditions throughout the day.	clear-day	4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5482	seoul	2009-01-03	39.4	15.1	27.2	39.4	12.7	25.6	14.2	62.7	NA	NA	5.6	329.4	1029.4	11.2	6.7	NA	NA	NA	NA	2009-01-03 07:47:11	2009-01-03 17:25:48	0.23	Clear	Clear conditions throughout the day.	clear-day	471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5483	seoul	2009-01-04	40.4	22.4	30.5	36.3	22.4	28.6	15.8	58.6	NA	NA	7.8	306.2	1026.9	27.5	7.1	NA	NA	NA	NA	2009-01-04 07:47:15	2009-01-04 17:26:39	0.25	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5484	seoul	2009-01-05	34.0	18.7	27.4	32.2	17.9	24.4	17.6	67.8	NA	NA	9.0	287.0	1026.9	34.0	6.6	NA	NA	NA	NA	2009-01-05 07:47:18	2009-01-05 17:27:31	0.29	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5485	seoul	2009-01-06	34.9	15.8	26.2	34.8	15.8	24.4	14.2	63.1	NA	NA	5.6	326.1	1028.9	38.8	4.6	NA	NA	NA	NA	2009-01-06 07:47:19	2009-01-06 17:28:24	0.33	Partially cloudy	Becoming cloudy in the afternoon.	partly-cloudy-day	4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
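The date handling above can be sketched on a small toy table. This is a minimal illustration, not the report's own code: the data frame `raw` and its values are made up, and only base R is used. It shows why the explicit `%d/%m/%Y` format matters (strings such as "2/1/1994" are day-first) and how to check that no date failed to parse before filtering.

```r
# Toy stand-in for the raw data (hypothetical values, day/month/year strings).
raw <- data.frame(
  datetime = c("31/12/2008", "1/1/2009", "15/6/2015", "31/12/2023", "1/1/2024"),
  tempmax  = c(30.1, 28.5, 80.2, 1.4, 2.0)
)

# Parse with an explicit format so "1/1/2009" is read day-first.
raw$datetime <- as.Date(raw$datetime, format = "%d/%m/%Y")

# as.Date() returns NA for strings that do not match the format,
# so stop early if anything failed to parse.
stopifnot(!anyNA(raw$datetime))

# Keep the 2009-2023 window, comparing Date against Date.
in_window <- raw$datetime >= as.Date("2009-01-01") &
  raw$datetime <= as.Date("2023-12-31")
filtered <- raw[in_window, ]
print(filtered)
```

Comparing against `as.Date()` values rather than plain strings makes the intended type explicit, although both forms give the same result here.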

Filter Column

# Filter the relevant columns
filtered_data <- filtered_data %>%
  select(datetime, tempmax, tempmin, temp, precip, windspeed, cloudcover)

# Check the filtered dataset
str(filtered_data)

## tibble [5,478 × 7] (S3: tbl_df/tbl/data.frame)
##  $ datetime  : Date[1:5478], format: "2009-01-01" "2009-01-02" ...
##  $ tempmax   : num [1:5478] 28.5 35 39.4 40.4 34 34.9 39.5 37.9 32.1 24.9 ...
##  $ tempmin   : num [1:5478] 14.5 13.3 15.1 22.4 18.7 15.8 15.6 17.6 18.5 14.6 ...
##  $ temp      : num [1:5478] 20.9 24.2 27.2 30.5 27.4 26.2 27.5 27.9 25.7 19.7 ...
##  $ precip    : num [1:5478] 0 0 0 0 0 0 0 0 0 0 ...
##  $ windspeed : num [1:5478] 9.5 5.6 5.6 7.8 9 5.6 6.7 7.4 7.9 9.1 ...
##  $ cloudcover: num [1:5478] 0.2 0 11.2 27.5 34 38.8 10.8 36.1 17.2 8.2 ...

Key Attributes:

tempmax: Maximum daily temperature
tempmin: Minimum daily temperature
temp: Mean daily temperature
precip: Daily precipitation total (mm)
windspeed: Average daily wind speed (km/h)
cloudcover: Average cloud cover (%)

Temperatures are analyzed in °C; the raw values are inspected and converted to a single unit later in this section.

These attributes will help analyze trends such as heatwaves, rainfall patterns, and wind speed variability, which are important for urban planning and sustainable resource management.

Missing Values & Duplicate Handling

Handle missing values by replacing them with mean values.

Handle missing value

# Show number of missing values before handling
missing_before <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values before handling:")

## [1] "Missing values before handling:"

print(missing_before)

##   datetime    tempmax    tempmin       temp     precip  windspeed cloudcover 
##          0          0          0          1          1          0          2

# Fill missing values in numeric columns with the mean
numeric_columns <- names(filtered_data)[sapply(filtered_data, is.numeric)]
filtered_data <- filtered_data %>%
  mutate(across(all_of(numeric_columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Show number of missing values after handling
missing_after <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values after handling:")

## [1] "Missing values after handling:"

print(missing_after)

##   datetime    tempmax    tempmin       temp     precip  windspeed cloudcover 
##          0          0          0          0          0          0          0
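As a small sketch of the mean-imputation step, the base R version below fills gaps in every numeric column of a toy data frame. The helper name `impute_mean` and the toy values are assumptions made for illustration. Mean imputation only makes sense when a column is recorded in one consistent unit, so it is worth confirming units before filling gaps.

```r
# Toy data with one gap per column (hypothetical values).
toy <- data.frame(temp = c(10, NA, 14), cloudcover = c(20, 40, NA))

# Hypothetical helper: replace NA entries with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to numeric columns only.
num_cols <- names(toy)[sapply(toy, is.numeric)]
toy[num_cols] <- lapply(toy[num_cols], impute_mean)
print(toy)
```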

Remove duplicated rows so that only unique records remain for analysis.

Remove Duplicate

# Remove duplicates
filtered_data <- distinct(filtered_data)

Data Transformation

This transformation renames the column so that its name is more descriptive.

precip to precipitation

#transform precip to precipitation

# Rename the column
filtered_data <- filtered_data %>%
  rename(precipitation = precip)

# Check the updated column names
colnames(filtered_data)

## [1] "datetime"      "tempmax"       "tempmin"       "temp"         
## [5] "precipitation" "windspeed"     "cloudcover"

Data Visualization

Time-series

# Time series plot
ggplot(filtered_data, aes(x = datetime, y = tempmax)) +
  geom_line(color = "blue") +  # Line plot
  labs(title = "Time Series of Maximum Temperature",
       x = "Date",
       y = "Maximum Temperature") +
  theme_minimal()

The maximum temperature series shows a clear cyclical pattern, consistent with seasonal variation. There appears to be a slight upward trend over time, and some anomalies are visible. Because the temperature units have not yet been verified at this stage, the apparent trend and anomalies should be read with caution.

Inspection & Verification of temperature data

Summary of tempmax & tempmin

# Summary statistics for temp columns
summary(filtered_data$tempmax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -9.10   39.10   61.00   57.61   79.20  102.30

summary(filtered_data$tempmin)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -17.10   23.70   41.70   41.68   62.50   84.40

Boxplot of tempmax & tempmin

# Check for unusually high or low values
boxplot(filtered_data$tempmax, main = "Boxplot of tempmax", ylab = "Temperature")

boxplot(filtered_data$tempmin, main = "Boxplot of tempmin", ylab = "Temperature")

Both tempmax and tempmin exhibit a wide range and significant variability.
Tempmax shows a higher median and a wider range compared to tempmin.

This suggests substantial daily temperature fluctuations.

Considerations:

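Units: The plots do not state the temperature unit. A maximum of 102.3 and a median of 61 for tempmax are implausible for Seoul in Celsius, which indicates that at least part of the data is recorded in Fahrenheit. The unit has to be settled before the values can be interpreted.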

Temperature conversion Fahrenheit to Celsius

Unit Consistency: Using a single unit (Celsius) ensures accurate comparisons and calculations throughout the analysis.

Conversion Logic

# Define conversion functions
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

fahrenheit_to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}


# Define the date range for conversion
start_date <- as.Date('2009-01-01')
end_date <- as.Date('2021-12-31')

# Filter the data for the dates that need conversion (2009-2021); later rows are left unchanged
data_to_convert <- filtered_data[filtered_data$datetime >= start_date & filtered_data$datetime <= end_date, ]

# Define the columns to convert
temp_columns <- c("tempmax", "tempmin", "temp")

# Apply temperature conversion for the filtered rows (2009-2021)
for (col in temp_columns) {
  data_to_convert[[col]] <- round(fahrenheit_to_celsius(data_to_convert[[col]]), 1)
}

# Now, merge the converted data back with the original data
data_non_converted <- filtered_data[!(filtered_data$datetime >= start_date & filtered_data$datetime <= end_date), ]

# Combine the data
final_data <- rbind(data_non_converted, data_to_convert)

# Ensure that the combined data is sorted by date
final_data <- final_data[order(final_data$datetime), ]

datetime	tempmax	tempmin	temp	windspeed	cloudcover
2009-01-01	-1.9	-9.7	-6.2	9.5	0.2
2009-01-02	1.7	-10.4	-4.3	5.6	0.0
2009-01-03	4.1	-9.4	-2.7	5.6	11.2
2009-01-04	4.7	-5.3	-0.8	7.8	27.5
2009-01-05	1.1	-7.4	-2.6	9.0	34.0
2009-01-06	1.6	-9.0	-3.2	5.6	38.8
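The split, convert, and recombine steps above can also be written as a single vectorized operation. The sketch below is a hedged alternative on toy data, not the code used in this report. It assumes, as the conversion above does, that rows up to 2021-12-31 are in Fahrenheit and later rows are already in Celsius.

```r
fahrenheit_to_celsius <- function(fahrenheit) (fahrenheit - 32) * 5 / 9

# Toy data: the first two rows are assumed to be in Fahrenheit,
# the last row is assumed to be in Celsius already.
toy <- data.frame(
  datetime = as.Date(c("2021-12-30", "2021-12-31", "2022-01-01")),
  tempmax  = c(32, 50, 5)
)

cutoff <- as.Date("2021-12-31")            # same cutoff as in the report
needs_conversion <- toy$datetime <= cutoff

# Convert only the flagged rows; row order is preserved, so no re-sorting is needed.
toy$tempmax <- ifelse(needs_conversion,
                      round(fahrenheit_to_celsius(toy$tempmax), 1),
                      toy$tempmax)
print(toy)
```

Because nothing is split or recombined, there is no risk of rows being dropped or reordered along the way.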

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential step in data analysis that helps us understand the dataset, uncover underlying patterns, detect anomalies, and test assumptions before applying any predictive models. In this section, we aim to explore the Seoul Historical Weather data to extract meaningful insights regarding weather patterns over a 15-year period (2009–2023). By visualizing the data, calculating summary statistics, and performing various checks, we will prepare the dataset for further analysis and modeling.

Descriptive Statistics

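Calculate summary statistics (minimum, quartiles, median, mean, maximum) for the key attributes: temperature, precipitation, wind speed, and cloud cover. Note that the maximum of temp (49.47) exceeds the maximum of tempmax (39.10), which cannot happen for a genuine daily mean; it is most likely the value filled in by mean imputation, which was applied before the unit conversion.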

##     datetime             tempmax          tempmin             temp       
##  Min.   :2009-01-01   Min.   :-11.30   Min.   :-19.000   Min.   :-14.40  
##  1st Qu.:2012-10-01   1st Qu.:  8.40   1st Qu.: -1.000   1st Qu.:  3.80  
##  Median :2016-07-01   Median : 19.55   Median :  9.000   Median : 14.10  
##  Mean   :2016-07-01   Mean   : 17.68   Mean   :  8.331   Mean   : 12.89  
##  3rd Qu.:2020-03-31   3rd Qu.: 27.10   3rd Qu.: 18.275   3rd Qu.: 22.60  
##  Max.   :2023-12-31   Max.   : 39.10   Max.   : 29.100   Max.   : 49.47  
##  precipitation        windspeed       cloudcover    
##  Min.   :  0.0000   Min.   : 2.60   Min.   :  0.00  
##  1st Qu.:  0.0000   1st Qu.: 7.90   1st Qu.: 25.40  
##  Median :  0.0000   Median :10.00   Median : 50.50  
##  Mean   :  0.7568   Mean   :10.61   Mean   : 50.19  
##  3rd Qu.:  0.0400   3rd Qu.:12.60   3rd Qu.: 74.80  
##  Max.   :268.6200   Max.   :44.90   Max.   :100.00

Time Series Plots

Temperature

To observe temperature variations over time

ggplot(final_data, aes(x = datetime)) +
  geom_line(aes(y = tempmax), color = "red") +
  geom_line(aes(y = tempmin), color = "blue") +
  geom_line(aes(y = temp), color = "green") +
  labs(title = "Temperature Over Time", x = "Date", y = "Temperature (°C)") +
  theme_minimal()

Color Interpretation:

Red: Daily maximum temperature (tempmax).
Blue: Daily minimum temperature (tempmin).
Green: Daily mean temperature (temp).

Visual Interpretation:

Cyclical Pattern: The most prominent feature is the clear cyclical pattern in the temperature data. There are distinct peaks and troughs that repeat at regular intervals. This suggests a strong seasonal component in the temperature variations.
Amplitude: The amplitude of the fluctuations varies somewhat over time. Some periods have higher peaks and deeper troughs, and others have more moderate variations. Possible drivers include large-scale climate variability such as El Niño, although this analysis does not test for them.
Trend: There appears to be a slight upward trend in the overall temperature levels from 2010 to 2020. This could be attributed to long-term climate change or other regional factors.
Anomalies: There are a few notable anomalies, such as a sharp drop in temperature around 2016. These could be due to specific weather events or other localized factors.

In Summary:

The plot depicts the temperature fluctuations over time, highlighting the seasonal variations, potential trends, and notable anomalies. The colors distinguish the maximum, minimum, and mean temperature series.
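A visual impression of a trend can be checked numerically by averaging out the seasonal cycle and fitting a straight line to the annual means. The sketch below does this on synthetic data, since the real series is not reproduced here: the seasonal shape, the 0.05 degrees per year warming, and the noise level are all assumptions chosen for illustration.

```r
set.seed(1)

# Synthetic daily series for 2009-2023 (all parameters are illustrative assumptions).
dates <- seq(as.Date("2009-01-01"), as.Date("2023-12-31"), by = "day")
year  <- as.integer(format(dates, "%Y"))
doy   <- as.integer(format(dates, "%j"))
temp  <- 12 + 12 * sin(2 * pi * (doy - 110) / 365.25) +  # seasonal cycle
  0.05 * (year - 2009) +                                  # assumed warming trend
  rnorm(length(dates), sd = 3)                            # day-to-day noise

# Average within each year to remove the seasonal cycle.
daily  <- data.frame(year = year, temp = temp)
annual <- aggregate(temp ~ year, data = daily, FUN = mean)

# Fit a linear trend to the annual means.
trend_fit <- lm(temp ~ year, data = annual)
slope_per_year <- unname(coef(trend_fit)["year"])
cat("Estimated trend:", round(slope_per_year, 3), "degrees C per year\n")
```

With only 15 annual points the slope estimate is uncertain, so `summary(trend_fit)` is worth inspecting before describing a trend as real.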

Precipitation, Cloud Cover & Wind Speed

# Reshape the dataset to long format
final_data_long <- pivot_longer(final_data, cols = c(precipitation, windspeed, cloudcover), 
                                names_to = "variable", values_to = "value")

# Plotting with facets
ggplot(final_data_long, aes(x = datetime, y = value)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free_y") +  # Use free y-axis scale for each variable
  labs(title = "Time Series of Precipitation, Wind Speed, and Cloud Cover", x = "Date", y = "Value") +
  theme_minimal()

Precipitation: The precipitation time series shows distinct peaks and troughs, suggesting a seasonal rainfall pattern. There appear to be periods with higher precipitation and others with lower precipitation.
Wind Speed: The wind speed time series also shows a cyclical pattern, but with less pronounced peaks and troughs compared to precipitation. The overall variation in wind speed seems to be less than that of precipitation.
Cloud Cover: The cloud cover time series exhibits a high degree of variability with frequent fluctuations. There are periods with consistently high cloud cover and others with lower cloud cover.

The most likely explanation for the cyclical patterns is seasonal variation in the weather. In many regions, precipitation, wind speed, and cloud cover are influenced by factors like monsoon seasons, changes in atmospheric circulation, and solar radiation.

Heat Wave Analysis

A heatwave threshold is defined (in this case, 35 °C). Days with a maximum temperature exceeding the threshold are marked as heatwave days.

Number of Heatwave Days

Heatwave Logic

# Define heatwave threshold
heatwave_threshold <- 35  # Example threshold for heatwaves

# Mark heatwave days
final_data <- final_data %>%
  mutate(heatwave = ifelse(tempmax > heatwave_threshold, 1, 0))

# Label consecutive heatwave periods: the counter increases on the first day of each run
final_data <- final_data %>%
  mutate(heatwave_id = cumsum(heatwave == 1 & lag(heatwave, default = 0) == 0))

# Count the number of heatwave days
num_heatwave_days <- sum(final_data$heatwave == 1)

# Create a message
heatwave_message <- paste("The number of heatwave days in Seoul from 2009 to 2023 is", num_heatwave_days)

# Print the message
print(heatwave_message)

## [1] "The number of heatwave days in Seoul from 2009 to 2023 is 45"
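Counting days above the threshold is different from counting heatwave events, where an event is a run of consecutive hot days. A minimal base R sketch of the distinction, using run-length encoding on made-up temperatures (the vector and variable names are illustrative only):

```r
# Toy daily maximum temperatures in degrees C (hypothetical values).
tempmax   <- c(33, 36, 37, 34, 36, 30, 35.5, 36, 36.2)
threshold <- 35

is_hot <- tempmax > threshold

# rle() collapses the logical vector into runs of identical values.
runs <- rle(is_hot)

hot_days     <- sum(is_hot)                     # individual days above the threshold
hot_spells   <- sum(runs$values)                # number of runs of consecutive hot days
longest_spell <- max(runs$lengths[runs$values]) # length of the longest run, in days

cat("Hot days:", hot_days, "| spells:", hot_spells,
    "| longest spell:", longest_spell, "days\n")
```

Many heatwave definitions also require a minimum run length, which could be applied by filtering `runs$lengths` before counting.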

Scatter Plot of Heatwave Analysis

# Scatter plot for tempmax with heatwave days highlighted
ggplot(final_data, aes(x = datetime, y = tempmax)) +
  # Scatter plot for all data points
  geom_point(aes(color = factor(heatwave), shape = factor(heatwave)), size = 3) +
  
  # Customize color scale to distinguish heatwave days
  scale_color_manual(values = c("black", "red"), labels = c("Non-Heatwave", "Heatwave")) +
  scale_shape_manual(values = c(16, 17), labels = c("Non-Heatwave", "Heatwave")) +
  
  # Add annotation for heatwave days count
  annotate("text", x = as.Date("2022-01-01"), y = max(final_data$tempmax),
           label = paste("Heatwave Days:", num_heatwave_days), 
           hjust = 0, color = "blue", size = 2.5) +
  
  labs(title = "Scatter Plot of Max Temperature with Heatwave Days Highlighted", 
       x = "Date", 
       y = "Max Temperature (°C)") +
  theme_minimal() +
  theme(legend.title = element_blank())

Interpretation:

The scatter plot provides a visual representation of heatwave events in relation to maximum temperature. It allows us to identify:

Days exceeding the heatwave threshold: The red triangles clearly highlight the days where the maximum temperature surpassed the defined threshold.
Seasonal patterns: The clustering of red triangles suggests that heatwaves are more likely to occur during specific seasons or periods.
Temperature distribution: The scatter plot shows the overall distribution of maximum temperatures, including both heatwave and non-heatwave days.

Correlation Analysis

The correlation analysis is visualized through a circle heatmap, which highlights the strength and direction of relationships between different variables in the dataset.

cor_matrix <- cor(final_data[, c("tempmax", "tempmin", "temp", "precipitation", "windspeed", "cloudcover")])
corrplot(cor_matrix, method = "circle")

Interpretation of Correlation Plot

The following interpretation is based on the correlation plot created for the dataset. The size and color of the circles represent the strength and direction of the correlation between each pair of variables.

Strong Positive Correlations:

tempmax and tempmin have a very strong positive correlation. This makes sense as the maximum temperature of a day is closely related to its minimum temperature, with both typically increasing or decreasing together.
temp and tempmax have a strong positive correlation. As the temperature of the day increases, the maximum temperature also tends to increase.
temp and tempmin also show a strong positive correlation. Similarly, as the temperature of the day increases, the minimum temperature increases as well.

Moderate to Weak Correlations:

Some correlations appear to be moderate or weak:
- The correlation between precipitation and windspeed is weak, suggesting that wind speed and precipitation levels are not strongly associated on a daily basis.
- The correlation between cloudcover and temp is also weak to moderate, meaning that cloud cover is only loosely associated with the temperature of the day.
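Circle sizes are hard to read precisely, so it can help to print the rounded coefficients next to the plot. The sketch below uses synthetic data in which two variables are linked by construction and a third is independent; the numbers are not taken from the Seoul dataset.

```r
set.seed(42)
n <- 200

# Synthetic variables (illustrative assumptions only).
tempmin   <- rnorm(n, mean = 8, sd = 10)
tempmax   <- tempmin + rnorm(n, mean = 9, sd = 2)  # tightly linked to tempmin
windspeed <- rnorm(n, mean = 10, sd = 3)           # independent by construction

toy <- data.frame(tempmax, tempmin, windspeed)

# Pearson correlations, rounded for reading alongside a plot.
cor_matrix <- round(cor(toy, use = "complete.obs"), 2)
print(cor_matrix)
```

A correlation coefficient describes linear association only, so a weak value does not rule out a nonlinear or seasonal relationship.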

Seasonal Analysis

The seasonal analysis explores the variations in key variables over different seasons, aiming to identify trends, patterns, and anomalies that may influence the overall dataset.

To categorize the data by season, a new ‘season’ column is created based on the month of the year, with the following assumptions:

Winter: December, January, February
Spring: March, April, May
Summer: June, July, August
Fall: September, October, November

# Create a 'season' column from the month, using the mapping above
final_data <- final_data %>%
  mutate(season = case_when(
    format(datetime, "%m") %in% c("12", "01", "02") ~ "Winter",
    format(datetime, "%m") %in% c("03", "04", "05") ~ "Spring",
    format(datetime, "%m") %in% c("06", "07", "08") ~ "Summer",
    format(datetime, "%m") %in% c("09", "10", "11") ~ "Fall"
  ))

# Aggregate by season
seasonal_data <- final_data %>%
  group_by(season) %>%
  summarise(
    tempmax_avg = mean(tempmax, na.rm = TRUE),
    tempmin_avg = mean(tempmin, na.rm = TRUE),
    precip_avg = mean(precipitation, na.rm = TRUE),
    windspeed_avg = mean(windspeed, na.rm = TRUE),
    cloudcover_avg = mean(cloudcover, na.rm = TRUE)
  )

# Plot the seasonal data
ggplot(seasonal_data, aes(x = season)) +
  geom_bar(aes(y = tempmax_avg), stat = "identity", fill = "red") +
  geom_bar(aes(y = tempmin_avg), stat = "identity", fill = "blue", alpha = 0.5) +
  labs(title = "Average Temperature by Season", 
       x = "Season", 
       y = "Temperature (°C)") +
  theme_minimal()

Summer: The highest average temperatures are observed in Summer. The bars are overlaid rather than stacked: the red bar shows the average maximum temperature and the semi-transparent blue bar shows the average minimum temperature.
Fall and Spring: Average temperatures in Fall and Spring are lower than in Summer, and the two seasons show similar average maximum and minimum values.
Winter: The lowest average temperatures occur in Winter, for both the maximum and the minimum.
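The month-to-season mapping described above can also be expressed as a lookup vector, and storing the result as a factor with explicit levels makes plots follow calendar order instead of alphabetical order. This is a sketch on a few made-up dates; `season_by_month` is a hypothetical name.

```r
# A few example dates, one per season plus December (hypothetical).
dates <- as.Date(c("2020-01-15", "2020-04-15", "2020-07-15",
                   "2020-10-15", "2020-12-15"))
month_num <- as.integer(format(dates, "%m"))

# Lookup vector: position = month number, value = season.
season_by_month <- c("Winter", "Winter",
                     "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Fall", "Fall", "Fall",
                     "Winter")

# Explicit levels fix the display order used by plotting functions.
season <- factor(season_by_month[month_num],
                 levels = c("Winter", "Spring", "Summer", "Fall"))
print(season)
```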

Outlier Detection

ggplot(final_data, aes(x = tempmax)) +
  geom_boxplot() +
  labs(title = "Outliers in Max Temperature", x = "Temperature (°C)") +
  theme_minimal()

ggplot(final_data, aes(x = tempmin)) +
  geom_boxplot() +
  labs(title = "Outliers in Min Temperature", x = "Temperature (°C)") +
  theme_minimal()

ggplot(final_data, aes(x = temp)) +
  geom_boxplot() +
  labs(title = "Outliers in Temperature", x = "Temperature (°C)") +
  theme_minimal()

The boxplots provide a visual representation of the temperature data of tempmax, tempmin and temp, highlighting the following characteristics:

No extreme values
Symmetrical distribution
Consistent variations across different temperature metrics (max, min, average)
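Boxplots flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied directly to list the flagged values instead of reading them off a plot. The sketch below uses a made-up vector with one deliberately extreme value.

```r
# Toy temperatures with one deliberately extreme value (hypothetical).
x <- c(-5, 2, 8, 14, 19, 23, 27, 30, 75)

# Quartiles and interquartile range.
q   <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)

# Base R's boxplot.stats() applies the same 1.5 * IQR rule (using hinges).
print(boxplot.stats(x)$out)
```

A value can sit inside the fences and still be physically implausible, so this check complements a sanity check against known limits rather than replacing it.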

Feature Engineering

Feature engineering involves transforming raw data into meaningful features that enhance the performance of predictive models. In this analysis, a new categorical feature, ‘weather’, is created based on the maximum temperature (tempmax) to categorize days as either ‘Hot’ or ‘Cool’.

This transformation helps in simplifying the analysis and enables models to better capture temperature-related patterns in the dataset.

The logic for creating this feature is as follows:

Hot: Days where the maximum temperature exceeds 30°C.
Cool: Days where the maximum temperature is 30°C or lower.

# Example: Creating a 'weather' column based on temperature (tempmax)
final_data$weather <- as.factor(ifelse(final_data$tempmax > 30, "Hot", "Cool"))
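
A quick check of the new feature can be sketched as follows. This is an illustration only, assuming a hypothetical data frame `toy` in place of `final_data`; it counts the two classes and confirms that the label is a deterministic function of tempmax, which matters later because tempmax is also used as a predictor.

```r
# Hypothetical toy data standing in for final_data
toy <- data.frame(tempmax = c(12.4, 25.0, 30.0, 30.1, 33.8, 18.9))

# Same rule as the report: strictly above 30 °C is "Hot", otherwise "Cool"
toy$weather <- factor(ifelse(toy$tempmax > 30, "Hot", "Cool"),
                      levels = c("Cool", "Hot"))

# Class counts and shares, to see how imbalanced the two classes are
class_counts <- table(toy$weather)
class_share <- prop.table(class_counts)
print(class_counts)
print(round(class_share, 3))

# The label can be recovered exactly from tempmax alone
label_is_threshold <- all((toy$tempmax > 30) == (toy$weather == "Hot"))
print(label_is_threshold)
```

Note that a day with a maximum of exactly 30°C falls in the ‘Cool’ class under this rule.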

Data Modeling

Data Splitting

# Set seed for reproducibility
set.seed(123)

# Create training and testing data (80-20 split)
train_indices <- sample(seq_len(nrow(final_data)), size = 0.8 * nrow(final_data))
train_data <- final_data[train_indices, ]
test_data <- final_data[-train_indices, ]
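
The split can be sanity-checked with a few lines of base R. The sketch below is a minimal example on a hypothetical data frame `toy`; the explicit `floor()` is an addition for illustration, making the training size an integer by construction. Because rows are sampled at random, neighbouring days can land in different sets, so this checks sizes and overlap only, not time ordering.

```r
set.seed(123)

# Hypothetical toy data standing in for final_data
toy <- data.frame(id = 1:50, temp = rnorm(50, mean = 12, sd = 10))

# 80-20 split; floor() keeps the training size a whole number
n_train <- floor(0.8 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)
train_toy <- toy[train_idx, ]
test_toy <- toy[-train_idx, ]

# The two sets should be disjoint and together cover every row
no_overlap <- length(intersect(train_toy$id, test_toy$id)) == 0
covers_all <- nrow(train_toy) + nrow(test_toy) == nrow(toy)
print(c(train = nrow(train_toy), test = nrow(test_toy)))
```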

Modelling Temperature Prediction

1.Linear Regression Model

# Build linear regression model
lm_temp <- lm(temp ~ tempmax + tempmin + precipitation + windspeed + cloudcover, data = train_data)

# Predict on test data
pred_temp_lm <- predict(lm_temp, test_data)

# Evaluate the model
rmse_temp_lm <- rmse(test_data$temp, pred_temp_lm)
r2_temp_lm <- summary(lm_temp)$r.squared
r2_temp_lm

## [1] 0.9947049

2.Random Forest Model

# Build random forest model
rf_temp <- randomForest(temp ~ tempmax + tempmin + precipitation + windspeed + cloudcover, data = train_data)

# Predict on test data
pred_temp_rf <- predict(rf_temp, test_data)

# Evaluate the model
rmse_temp_rf <- rmse(test_data$temp, pred_temp_rf)
rmse_temp_rf

## [1] 0.8442027

3.Support Vector Regression (SVR)

# Build Support Vector Regression model
svr_temp <- svm(temp ~ tempmax + tempmin + precipitation + windspeed + cloudcover, data = train_data)

# Predict on test data
pred_temp_svr <- predict(svr_temp, test_data)

# Evaluate the model
rmse_temp_svr <- sqrt(mean((pred_temp_svr - test_data$temp)^2))
rmse_temp_svr

## [1] 0.6346145

Evaluation

Model Comparison For Temperature Prediction

# Evaluate the models
rmse_temp_rf <- sqrt(mean((pred_temp_rf - test_data$temp)^2))  # Random Forest RMSE
rmse_temp_lm <- sqrt(mean((pred_temp_lm - test_data$temp)^2))  # Linear Regression RMSE
rmse_temp_svr <- sqrt(mean((pred_temp_svr - test_data$temp)^2))  # Support Vector Regression RMSE

# R-squared for Linear Regression, taken from the model summary
# (note: this value is computed on the training data, not the test data)
r2_temp_lm <- summary(lm_temp)$r.squared

# Manually calculate R-squared for Random Forest and SVR on the test data
ss_total <- sum((test_data$temp - mean(test_data$temp))^2)  # Total Sum of Squares
ss_residual_rf <- sum((test_data$temp - pred_temp_rf)^2)    # Residual Sum of Squares (RF)
ss_residual_svr <- sum((test_data$temp - pred_temp_svr)^2)  # Residual Sum of Squares (SVR)

r2_temp_rf <- 1 - (ss_residual_rf / ss_total)  # R-squared for Random Forest
r2_temp_svr <- 1 - (ss_residual_svr / ss_total)  # R-squared for SVR

# Calculate MAE for all models
mae_temp_rf <- mean(abs(pred_temp_rf - test_data$temp))
mae_temp_lm <- mean(abs(pred_temp_lm - test_data$temp))
mae_temp_svr <- mean(abs(pred_temp_svr - test_data$temp))

# Display RMSE, R-squared, and MAE for all three models
model_comparison <- data.frame(
  Model = c("Random Forest", "Linear Regression", "Support Vector Regression (SVR)"),
  RMSE = c(rmse_temp_rf, rmse_temp_lm, rmse_temp_svr),
  R_squared = c(r2_temp_rf, r2_temp_lm, r2_temp_svr),  # Include R-squared for all models
  MAE = c(mae_temp_rf, mae_temp_lm, mae_temp_svr)
)

# Print the model comparison
print(model_comparison)

##                             Model      RMSE R_squared       MAE
## 1                   Random Forest 0.8442027 0.9937795 0.6105777
## 2               Linear Regression 0.4955837 0.9947049 0.3816977
## 3 Support Vector Regression (SVR) 0.6346145 0.9964848 0.4463752

# Reshape the model_comparison data frame into long format
model_comparison_long <- model_comparison %>%
  pivot_longer(cols = c("RMSE", "MAE"), 
               names_to = "Metric", 
               values_to = "Value")

# Plot RMSE and MAE for all models
ggplot(model_comparison_long, aes(x = Model, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Comparison of RMSE and MAE for Temperature Prediction Models",
       x = "Model",
       y = "Value") +
  scale_fill_manual(values = c("RMSE" = "skyblue", "MAE" = "orange")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model Comparison

Linear Regression: Achieves the lowest RMSE (0.4955837) and the lowest MAE (0.3816977), making it the most accurate of the three models on the test set.
SVR: Has a higher RMSE (0.6346145) and MAE (0.4463752) than Linear Regression, indicating somewhat less accurate predictions.
Random Forest: Shows the highest RMSE (0.8442027) and MAE (0.6105777), making it the least accurate of the models evaluated.

Overall, based on the RMSE and MAE values, Linear Regression appears to be the most suitable model for this dataset. The R-squared values are not directly comparable, because the Linear Regression figure comes from the training data while the Random Forest and SVR figures are computed on the test data.
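
To put all three metrics on the same footing, each can be computed from the test-set predictions with the same helper functions. The sketch below is one possible way to do this; the function names `rmse_manual`, `mae_manual` and `r2_manual` and the toy vectors are hypothetical and not part of the original analysis.

```r
# Root mean squared error
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute error
mae_manual <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# R-squared as 1 - residual sum of squares / total sum of squares
r2_manual <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hypothetical toy values standing in for test_data$temp and a model's predictions
actual <- c(10, 12, 15, 20)
predicted <- c(11, 12, 14, 22)

toy_metrics <- c(
  RMSE = rmse_manual(actual, predicted),
  MAE = mae_manual(actual, predicted),
  R_squared = r2_manual(actual, predicted)
)
print(round(toy_metrics, 4))
```

Applied to the real analysis, calling `r2_manual(test_data$temp, pred_temp_lm)` would give a test-set R-squared for Linear Regression that is comparable with the other two models.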

Temperature Range Classification Approach

1.Decision Tree Model

dt_model <- rpart(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data, method = "class")
dt_pred <- predict(dt_model, test_data, type = "class")
dt_conf <- confusionMatrix(dt_pred, test_data$weather)
dt_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Cool Hot
##       Cool  969   0
##       Hot     0 127
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9966, 1)
##     No Information Rate : 0.8841     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.8841     
##          Detection Rate : 0.8841     
##    Detection Prevalence : 0.8841     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Cool       
##

2.Random Forest

rf_model <- randomForest(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data)
rf_pred <- predict(rf_model, test_data)
rf_conf <- confusionMatrix(rf_pred, test_data$weather)
rf_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Cool Hot
##       Cool  969   0
##       Hot     0 127
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9966, 1)
##     No Information Rate : 0.8841     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.8841     
##          Detection Rate : 0.8841     
##    Detection Prevalence : 0.8841     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Cool       
##

3.k-NN Model

knn_model <- train(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data, method = "knn", tuneLength = 5)
knn_pred <- predict(knn_model, test_data)
knn_conf <- confusionMatrix(knn_pred, test_data$weather)
knn_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Cool Hot
##       Cool  962  25
##       Hot     7 102
##                                          
##                Accuracy : 0.9708         
##                  95% CI : (0.959, 0.9799)
##     No Information Rate : 0.8841         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8482         
##                                          
##  Mcnemar's Test P-Value : 0.002654       
##                                          
##             Sensitivity : 0.9928         
##             Specificity : 0.8031         
##          Pos Pred Value : 0.9747         
##          Neg Pred Value : 0.9358         
##              Prevalence : 0.8841         
##          Detection Rate : 0.8777         
##    Detection Prevalence : 0.9005         
##       Balanced Accuracy : 0.8980         
##                                          
##        'Positive' Class : Cool           
##

4.Naive Bayes Model

nb_model <- naiveBayes(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data)
nb_pred <- predict(nb_model, test_data)
nb_conf <- confusionMatrix(nb_pred, test_data$weather)
nb_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Cool Hot
##       Cool  910   4
##       Hot    59 123
##                                           
##                Accuracy : 0.9425          
##                  95% CI : (0.9271, 0.9556)
##     No Information Rate : 0.8841          
##     P-Value [Acc > NIR] : 2.937e-11       
##                                           
##                   Kappa : 0.7639          
##                                           
##  Mcnemar's Test P-Value : 1.022e-11       
##                                           
##             Sensitivity : 0.9391          
##             Specificity : 0.9685          
##          Pos Pred Value : 0.9956          
##          Neg Pred Value : 0.6758          
##              Prevalence : 0.8841          
##          Detection Rate : 0.8303          
##    Detection Prevalence : 0.8339          
##       Balanced Accuracy : 0.9538          
##                                           
##        'Positive' Class : Cool            
##

5.SVM Model

svm_model <- svm(weather ~ tempmax + tempmin + windspeed + cloudcover, data = train_data)
svm_pred <- predict(svm_model, test_data)
svm_conf <- confusionMatrix(svm_pred, test_data$weather)
svm_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Cool Hot
##       Cool  966  14
##       Hot     3 113
##                                           
##                Accuracy : 0.9845          
##                  95% CI : (0.9753, 0.9909)
##     No Information Rate : 0.8841          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9213          
##                                           
##  Mcnemar's Test P-Value : 0.01529         
##                                           
##             Sensitivity : 0.9969          
##             Specificity : 0.8898          
##          Pos Pred Value : 0.9857          
##          Neg Pred Value : 0.9741          
##              Prevalence : 0.8841          
##          Detection Rate : 0.8814          
##    Detection Prevalence : 0.8942          
##       Balanced Accuracy : 0.9433          
##                                           
##        'Positive' Class : Cool            
##

Evaluation

# Collect results for all models with additional metrics
model_metrics <- data.frame(
  Model = c("Decision Tree", "Random Forest", "k-NN", "Naive Bayes", "SVM"),
  Accuracy = c(dt_conf$overall["Accuracy"],
               rf_conf$overall["Accuracy"],
               knn_conf$overall["Accuracy"],
               nb_conf$overall["Accuracy"],
               svm_conf$overall["Accuracy"]),
  Precision = c(dt_conf$byClass["Pos Pred Value"],
                rf_conf$byClass["Pos Pred Value"],
                knn_conf$byClass["Pos Pred Value"],
                nb_conf$byClass["Pos Pred Value"],
                svm_conf$byClass["Pos Pred Value"]),
  Recall = c(dt_conf$byClass["Sensitivity"],
             rf_conf$byClass["Sensitivity"],
             knn_conf$byClass["Sensitivity"],
             nb_conf$byClass["Sensitivity"],
             svm_conf$byClass["Sensitivity"]),
  F1_Score = c(dt_conf$byClass["F1"],
               rf_conf$byClass["F1"],
               knn_conf$byClass["F1"],
               nb_conf$byClass["F1"],
               svm_conf$byClass["F1"])
)

print(model_metrics)
##           Model  Accuracy Precision    Recall  F1_Score
## 1 Decision Tree 1.0000000 1.0000000 1.0000000 1.0000000
## 2 Random Forest 1.0000000 1.0000000 1.0000000 1.0000000
## 3          k-NN 0.9708029 0.9746707 0.9927761 0.9836401
## 4   Naive Bayes 0.9425182 0.9956236 0.9391125 0.9665428
## 5           SVM 0.9844891 0.9857143 0.9969040 0.9912776

# Plot the accuracy of the models using a bar plot
ggplot(model_metrics, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Model Accuracy Comparison",
       x = "Model",
       y = "Accuracy") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Decision Tree & Random Forest: Both models achieve perfect scores across all metrics (Accuracy, Precision, Recall, and F1-Score) on the test set. This is expected rather than surprising: the ‘weather’ label is defined directly from tempmax, which is also one of the predictors, so a tree-based model can recover the 30°C threshold exactly.
SVM: Performs well, with the highest scores among the remaining models (Accuracy 0.9845, F1-Score 0.9913).
k-NN: Shows high scores across all metrics, with a slightly lower Accuracy (0.9708) than SVM.
Naive Bayes: Has the lowest Accuracy (0.9425) and Recall (0.9391) among the models, although its Precision (0.9956) is high.

Precision, Recall, and F1-Score are reported for the ‘Cool’ class, which confusionMatrix treats as the positive class.
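
The summary metrics follow directly from the confusion matrix counts. As a worked example, the sketch below recomputes the k-NN row of the table from the four counts printed in the k-NN confusion matrix, treating ‘Cool’ as the positive class; the variable names are chosen for illustration.

```r
# Counts copied from the k-NN confusion matrix ('Cool' is the positive class)
tp <- 962  # predicted Cool, actually Cool
fp <- 25   # predicted Cool, actually Hot
fn <- 7    # predicted Hot, actually Cool
tn <- 102  # predicted Hot, actually Hot

accuracy <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)   # reported by caret as "Pos Pred Value"
recall <- tp / (tp + fn)      # reported by caret as "Sensitivity"
f1 <- 2 * precision * recall / (precision + recall)

knn_check <- c(Accuracy = accuracy, Precision = precision,
               Recall = recall, F1_Score = f1)
print(round(knn_check, 7))
```

The same arithmetic applied to the ‘Hot’ class gives a recall of 102 / 127, which is the Specificity of 0.8031 in the k-NN output, so the headline figures look stronger than the model's performance on the rarer class.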

Deployment

This deployment provides an interactive experience where users can explore temperature trends over time through an interactive line plot and view detailed data through a sortable and searchable table. The visualizations allow for dynamic exploration of temperature patterns, while the data table gives users easy access to the raw data, making it a powerful tool for analysis and decision-making.

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout
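
The deployment code itself is not shown in the report. The sketch below is one way such a view could be built, not the authors' implementation: plotly is confirmed by the package messages above, while the use of the DT package for the table, the column names `datetime` and `temp`, and the toy data are all assumptions made for illustration.

```r
library(plotly)
library(DT)  # assumption: the report does not say which table package was used

# Hypothetical toy data; column names `datetime` and `temp` are assumed
toy <- data.frame(
  datetime = seq(as.Date("2023-01-01"), by = "day", length.out = 30),
  temp = round(5 + 10 * sin(seq(0, pi, length.out = 30)), 1)
)

# Interactive line plot of temperature over time
temp_plot <- plot_ly(toy, x = ~datetime, y = ~temp,
                     type = "scatter", mode = "lines")
temp_plot <- layout(temp_plot,
                    title = "Daily Temperature",
                    xaxis = list(title = "Date"),
                    yaxis = list(title = "Temperature (°C)"))

# Sortable, searchable table of the same data
temp_table <- datatable(toy, options = list(pageLength = 10))
```

In a knitted R Markdown document, printing `temp_plot` and `temp_table` in a chunk renders both as interactive widgets.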

Conclusion

Key Findings:

Temperature Prediction Models:

Linear Regression emerges as the most accurate model for predicting daily temperature (temp), with the lowest RMSE (0.4955837).
SVR (Support Vector Regression) and Random Forest showed higher RMSE values (0.6346145 and 0.8442027 respectively), with Random Forest being the least accurate.
Based on the RMSE evaluation, Linear Regression is the preferred model for predicting temperature on this dataset.

Temperature Range Classification Models:

Decision Tree and Random Forest achieved perfect scores across all metrics (Accuracy, Precision, Recall, F1-Score), indicating that they classified every test day correctly as “Hot” or “Cool.”
SVM and k-NN also performed well, with accuracies of 0.9845 and 0.9708 respectively. Naive Bayes showed the lowest performance, particularly in Accuracy and Recall.

Model Performance:

For temperature prediction, Linear Regression is the most reliable, whereas for temperature classification, Decision Tree and Random Forest are the best-performing models.

Deployment Insights:

The interactive deployment allows users to explore temperature trends dynamically over time. The interactive line plot and the sortable, searchable data table make it a useful tool for understanding and analyzing temperature data.

Feature Engineering:

The creation of the ‘weather’ feature (categorizing days as “Hot” or “Cool” based on maximum temperature) helped simplify the classification task and improved model interpretability.

Conclusion:

This project successfully utilized machine learning techniques to analyze and model temperature data, achieving valuable insights into daily temperature variations and heatwave patterns. The Linear Regression model was found to be the most effective for predicting daily temperatures, while Decision Tree and Random Forest were the best for classifying days into temperature categories. The interactive deployment provided an accessible tool for exploring and analyzing the data, making it a useful resource for understanding temperature trends.

Overall, the project demonstrates the power of machine learning in climate data analysis and highlights the importance of proper model selection, feature engineering, and interactive visualization for effective data-driven decision-making.

WQD7004 SEOUL HISTORICAL WEATHER 2009 TO 2023

AFRINA ROSA, TAUFIQ ISMAIL, KARAM ALJANADI, CHIA KAI SWAN, SITI HAIRUNEESHA

2025-01-06
