GROUP 10 MEMBERS
Matric	Full_Name
S2110638	Karam Aljanadi
23094825	Siti Hairuneesha Binti Zainuddin
u2005486	Afrina Rosa binti Mohamad Sattar
23085637	Chia Kai Swan
S2119226	Muhammad Taufiq Bin Ismail

Weather Forecasting in Seoul: Insights from 15 Years of Data and Future Trends (2009–2023)

Weather forecasting plays a pivotal role in understanding climatic patterns, preparing for extreme weather events, and optimizing resources across various sectors. This study focuses on Seoul’s weather over a 15-year period (2009–2023), utilizing historical weather data to uncover significant trends and patterns. By analyzing key attributes such as temperature, precipitation, wind speed, and cloud cover, we aim to gain insights into Seoul’s evolving climate and predict future weather conditions.

Dataset Source

The dataset, titled “Seoul Historical Weather Data (1994–2024),” is available on Kaggle.

Research Objectives:

To analyze weather trends by identifying patterns and changes in Seoul’s weather from 2009 to 2023.
To predict future conditions by building models that forecast future temperature patterns in Seoul.
To provide actionable insights by offering recommendations for urban planning, sustainability, and resource management.

Research Questions:

What weather trends can be identified in Seoul from 2009 to 2023?
How accurate are forecasting models for predicting future weather?
How can visualizations help urban planners use weather data for decision-making?

Methodology

CRISP-DM

1. Understand: Define the problem, which is to analyze Seoul’s weather data for insights and predictions.

2. Explore: Load and examine the dataset, identifying key attributes.

3. Prepare: Clean and preprocess data for analysis.

4. Model: Build predictive models using statistical/ML methods.

5. Evaluate: Assess model performance using relevant metrics.

6. Deploy: Present findings, forecasts, and actionable recommendations.

Import Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
library(ggplot2)
library(rpart)
library(corrplot)
## corrplot 0.95 loaded
library(Metrics)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:Metrics':
## 
##     precision, recall
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(rsconnect)
library(knitr)

Import Dataset

df<-read_csv("dataset123.csv")

## Rows: 10958 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): name, datetime, preciptype, conditions, description, icon, stations
## dbl  (25): ROW_KEY, tempmax, tempmin, temp, feelslikemax, feelslikemin, feel...
## dttm  (2): sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Show the number of rows in a more descriptive way
cat("The dataset contains", nrow(df), "rows.")

## The dataset contains 10958 rows.

str(df)

## spc_tbl_ [10,958 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ROW_KEY         : num [1:10958] 1 2 3 4 5 6 7 8 9 10 ...
##  $ name            : chr [1:10958] "seoul" "seoul" "seoul" "seoul" ...
##  $ datetime        : chr [1:10958] "1/1/1994" "2/1/1994" "3/1/1994" "4/1/1994" ...
##  $ tempmax         : num [1:10958] 35.2 43 47.9 38.8 40 42.7 37 44.3 48.8 48 ...
##  $ tempmin         : num [1:10958] 16.4 31.5 30.9 22.1 24 26.1 17.7 NA 33.6 27.9 ...
##  $ temp            : num [1:10958] 26.3 36.2 38 30.1 33.1 35.5 27.5 34.2 41 39.7 ...
##  $ feelslikemax    : num [1:10958] 33.4 39.4 44.7 32 40 39.1 37 37.6 46.1 48 ...
##  $ feelslikemin    : num [1:10958] 13 26.7 24.5 18.4 18.5 15.7 13.7 12.6 29.2 27.9 ...
##  $ feelslike       : num [1:10958] 24.3 32.6 35.4 26.3 31 30.1 24.3 28.2 39.1 38.9 ...
##  $ dew             : num [1:10958] 15.5 27.9 27.3 13.6 21.7 25 11.3 24.3 34.3 32.4 ...
##  $ humidity        : num [1:10958] 65.9 72.1 68.1 51.2 63.9 66.7 52.1 67.7 77.4 75.8 ...
##  $ precip          : num [1:10958] 0 0 0 0 0.01 0.1 0 0 0 0 ...
##  $ precipprob      : num [1:10958] 0 0 0 0 100 100 0 0 0 0 ...
##  $ precipcover     : num [1:10958] 0 0 0 0 4.17 4.17 0 0 0 0 ...
##  $ preciptype      : chr [1:10958] NA NA NA NA ...
##  $ snow            : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ snowdepth       : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ windgust        : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ windspeed       : num [1:10958] 5.5 9.1 12.8 11.7 6.8 11.8 9.1 14.7 8.9 4.5 ...
##  $ winddir         : num [1:10958] 115 182 290 302 134 ...
##  $ sealevelpressure: num [1:10958] 1025 1022 1020 1025 1024 ...
##  $ cloudcover      : num [1:10958] 63 88.8 57.4 16.3 90.4 57.2 30.1 46.6 NA 81 ...
##  $ visibility      : num [1:10958] 6.6 6.9 5.3 7.3 6.2 5.1 7.4 7.6 3.7 2.2 ...
##  $ solarradiation  : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ solarenergy     : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ uvindex         : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ severerisk      : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
##  $ sunrise         : POSIXct[1:10958], format: "1994-01-01 07:46:54" "1994-01-02 07:47:03" ...
##  $ sunset          : POSIXct[1:10958], format: "1994-01-01 17:23:56" "1994-01-02 17:24:44" ...
##  $ moonphase       : num [1:10958] 0.61 0.65 0.68 0.72 0.75 0.79 0.83 0.86 0.9 0.93 ...
##  $ conditions      : chr [1:10958] "Partially cloudy" "Partially cloudy" "Partially cloudy" "Clear" ...
##  $ description     : chr [1:10958] "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Clear conditions throughout the day." ...
##  $ icon            : chr [1:10958] "partly-cloudy-day" "partly-cloudy-day" "partly-cloudy-day" "clear-day" ...
##  $ stations        : chr [1:10958] "4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000" "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,"| __truncated__ ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ROW_KEY = col_double(),
##   ..   name = col_character(),
##   ..   datetime = col_character(),
##   ..   tempmax = col_double(),
##   ..   tempmin = col_double(),
##   ..   temp = col_double(),
##   ..   feelslikemax = col_double(),
##   ..   feelslikemin = col_double(),
##   ..   feelslike = col_double(),
##   ..   dew = col_double(),
##   ..   humidity = col_double(),
##   ..   precip = col_double(),
##   ..   precipprob = col_double(),
##   ..   precipcover = col_double(),
##   ..   preciptype = col_character(),
##   ..   snow = col_double(),
##   ..   snowdepth = col_double(),
##   ..   windgust = col_double(),
##   ..   windspeed = col_double(),
##   ..   winddir = col_double(),
##   ..   sealevelpressure = col_double(),
##   ..   cloudcover = col_double(),
##   ..   visibility = col_double(),
##   ..   solarradiation = col_double(),
##   ..   solarenergy = col_double(),
##   ..   uvindex = col_double(),
##   ..   severerisk = col_double(),
##   ..   sunrise = col_datetime(format = ""),
##   ..   sunset = col_datetime(format = ""),
##   ..   moonphase = col_double(),
##   ..   conditions = col_character(),
##   ..   description = col_character(),
##   ..   icon = col_character(),
##   ..   stations = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

kable(head(df))

ROW_KEY	name	datetime	tempmax	tempmin	temp	feelslikemax	feelslikemin	feelslike	dew	humidity	precip	precipprob	precipcover	preciptype	snow	snowdepth	windgust	windspeed	winddir	sealevelpressure	cloudcover	visibility	solarradiation	solarenergy	uvindex	severerisk	sunrise	sunset	moonphase	conditions	description	icon	stations
1	seoul	1/1/1994	35.2	16.4	26.3	33.4	13.0	24.3	15.5	65.9	0.00	0	0.00	NA	NA	NA	NA	5.5	115.4	1025.4	63.0	6.6	NA	NA	NA	NA	1994-01-01 07:46:54	1994-01-01 17:23:56	0.61	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
2	seoul	2/1/1994	43.0	31.5	36.2	39.4	26.7	32.6	27.9	72.1	0.00	0	0.00	NA	NA	NA	NA	9.1	181.7	1022.2	88.8	6.9	NA	NA	NA	NA	1994-01-02 07:47:03	1994-01-02 17:24:44	0.65	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
3	seoul	3/1/1994	47.9	30.9	38.0	44.7	24.5	35.4	27.3	68.1	0.00	0	0.00	NA	NA	NA	NA	12.8	289.5	1020.0	57.4	5.3	NA	NA	NA	NA	1994-01-03 07:47:11	1994-01-03 17:25:33	0.68	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
4	seoul	4/1/1994	38.8	22.1	30.1	32.0	18.4	26.3	13.6	51.2	0.00	0	0.00	NA	NA	NA	NA	11.7	301.9	1025.1	16.3	7.3	NA	NA	NA	NA	1994-01-04 07:47:16	1994-01-04 17:26:23	0.72	Clear	Clear conditions throughout the day.	clear-day	47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5	seoul	5/1/1994	40.0	24.0	33.1	40.0	18.5	31.0	21.7	63.9	0.01	100	4.17	rain,snow	NA	NA	NA	6.8	134.1	1023.9	90.4	6.2	NA	NA	NA	NA	1994-01-05 07:47:19	1994-01-05 17:27:14	0.75	Snow, Rain, Overcast	Cloudy skies throughout the day with late afternoon rain or snow.	rain	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
6	seoul	6/1/1994	42.7	26.1	35.5	39.1	15.7	30.1	25.0	66.7	0.10	100	4.17	rain	NA	NA	NA	11.8	290.2	1021.9	57.2	5.1	NA	NA	NA	NA	1994-01-06 07:47:21	1994-01-06 17:28:07	0.79	Rain, Partially cloudy	Partly cloudy throughout the day with morning rain.	rain	471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

Data Preparation

Filtering Data

Filter the data to the 2009–2023 period. This keeps the analysis focused on recent, relevant records and leaves out older data that may be outdated.

Filter year of data

# Convert `datetime` column to Date 
df$datetime <- as.Date(df$datetime, format = "%d/%m/%Y")
# Filter the data for the date range 2009 to 2023
filtered_data <- df %>% filter(datetime >= as.Date("2009-01-01") & datetime <= as.Date("2023-12-31"))

ROW_KEY	name	datetime	tempmax	tempmin	temp	feelslikemax	feelslikemin	feelslike	dew	humidity	preciptype	windgust	windspeed	winddir	sealevelpressure	cloudcover	visibility	solarradiation	solarenergy	uvindex	severerisk	sunrise	sunset	moonphase	conditions	description	icon	stations
5480	seoul	2009-01-01	28.5	14.5	20.9	23.2	6.4	15.3	5.9	52.7	NA	NA	9.5	308.6	1026.5	0.2	7.3	NA	NA	NA	NA	2009-01-01 07:46:55	2009-01-01 17:24:11	0.16	Clear	Clear conditions throughout the day.	clear-day	471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5481	seoul	2009-01-02	35.0	13.3	24.2	30.8	12.7	22.3	12.2	63.0	NA	NA	5.6	326.3	1029.3	0.0	7.0	NA	NA	NA	NA	2009-01-02 07:47:04	2009-01-02 17:24:59	0.19	Clear	Clear conditions throughout the day.	clear-day	4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5482	seoul	2009-01-03	39.4	15.1	27.2	39.4	12.7	25.6	14.2	62.7	NA	NA	5.6	329.4	1029.4	11.2	6.7	NA	NA	NA	NA	2009-01-03 07:47:11	2009-01-03 17:25:48	0.23	Clear	Clear conditions throughout the day.	clear-day	471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5483	seoul	2009-01-04	40.4	22.4	30.5	36.3	22.4	28.6	15.8	58.6	NA	NA	7.8	306.2	1026.9	27.5	7.1	NA	NA	NA	NA	2009-01-04 07:47:15	2009-01-04 17:26:39	0.25	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5484	seoul	2009-01-05	34.0	18.7	27.4	32.2	17.9	24.4	17.6	67.8	NA	NA	9.0	287.0	1026.9	34.0	6.6	NA	NA	NA	NA	2009-01-05 07:47:18	2009-01-05 17:27:31	0.29	Partially cloudy	Partly cloudy throughout the day.	partly-cloudy-day	471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
5485	seoul	2009-01-06	34.9	15.8	26.2	34.8	15.8	24.4	14.2	63.1	NA	NA	5.6	326.1	1028.9	38.8	4.6	NA	NA	NA	NA	2009-01-06 07:47:19	2009-01-06 17:28:24	0.33	Partially cloudy	Becoming cloudy in the afternoon.	partly-cloudy-day	4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
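
Because the raw `datetime` strings are day-first ("2/1/1994" is 2 January 1994), a wrong format string would either swap day and month or silently produce `NA`. The following is a minimal sketch of a parsing check on a toy vector; `raw_dates` and its values are made up for illustration and are not read from the dataset file.

```r
# Toy day-first date strings (illustrative only)
raw_dates <- c("1/1/1994", "2/1/1994", "31/12/2008", "1/1/2009", "31/12/2023")

# Parse as day/month/year; strings that do not match the format become NA
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Fail early if any string could not be parsed
stopifnot(!anyNA(parsed))

# Keep only the study period, comparing Date with Date
in_period <- parsed >= as.Date("2009-01-01") & parsed <= as.Date("2023-12-31")
study_dates <- parsed[in_period]

print(range(study_dates))
```

Running the same `anyNA()` check on the real column right after `as.Date()` would flag any rows that use a different date layout.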

Filter Column

# Filter the relevant columns
filtered_data <- filtered_data %>%
  select(datetime, tempmax, tempmin, temp, precip, windspeed, cloudcover,winddir, visibility, moonphase)

# Check the filtered dataset
str(filtered_data)

## tibble [5,478 × 10] (S3: tbl_df/tbl/data.frame)
##  $ datetime  : Date[1:5478], format: "2009-01-01" "2009-01-02" ...
##  $ tempmax   : num [1:5478] 28.5 35 39.4 40.4 34 34.9 39.5 37.9 32.1 24.9 ...
##  $ tempmin   : num [1:5478] 14.5 13.3 15.1 22.4 18.7 15.8 15.6 17.6 18.5 14.6 ...
##  $ temp      : num [1:5478] 20.9 24.2 27.2 30.5 27.4 26.2 27.5 27.9 25.7 19.7 ...
##  $ precip    : num [1:5478] 0 0 0 0 0 0 0 0 0 0 ...
##  $ windspeed : num [1:5478] 9.5 5.6 5.6 7.8 9 5.6 6.7 7.4 7.9 9.1 ...
##  $ cloudcover: num [1:5478] 0.2 0 11.2 27.5 34 38.8 10.8 36.1 17.2 8.2 ...
##  $ winddir   : num [1:5478] 309 326 329 306 287 ...
##  $ visibility: num [1:5478] 7.3 7 6.7 7.1 6.6 4.6 5.2 6.1 6.6 7.4 ...
##  $ moonphase : num [1:5478] 0.16 0.19 0.23 0.25 0.29 0.33 0.36 0.4 0.43 0.47 ...

Key Attributes:

datetime: Records the specific date for each observation, formatted as “YYYY-MM-DD.”
tempmax: Represents the maximum temperature recorded each day, useful for identifying heatwaves and hot days. Part of the raw data is in degrees Fahrenheit; all temperature columns are brought to degrees Celsius (°C) in the conversion step later in this report.
tempmin: Represents the minimum temperature recorded each day (°C after conversion), helpful for detecting cold days and frost conditions.
temp: Provides the average temperature recorded each day (°C after conversion), offering a balanced view of daily temperature conditions.
precip: Indicates the total daily precipitation in millimeters (mm), capturing the amount of rainfall or snowfall.
windspeed: Measures the average daily wind speed in kilometers per hour (km/h), which is essential for understanding wind variability and weather patterns.
cloudcover: Shows the average percentage of cloud cover during the day, providing insights into daily sky conditions and solar exposure.
winddir: Represents the average wind direction in degrees, measured clockwise from true north, which helps in studying prevailing wind patterns.
visibility: Measures the average visibility distance per day in kilometers, useful for assessing air quality and weather clarity.
moonphase: Indicates the moon phase as a numeric value (e.g., 0 for new moon, 0.5 for full moon), providing information on the lunar cycle that might correlate with certain natural phenomena.

These attributes collectively help analyze a wide range of environmental and climatic trends, such as temperature extremes, precipitation patterns, wind dynamics, and atmospheric conditions, all of which are valuable for urban planning, agriculture, and environmental studies.

Missing Values & Duplicate Handling

Handle missing values by replacing them with mean values.

Handle missing value

# Show number of missing values before handling
missing_before <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values before handling:")

## [1] "Missing values before handling:"

print(missing_before)

##   datetime    tempmax    tempmin       temp     precip  windspeed cloudcover 
##          0          0          0          1          1          0          2 
##    winddir visibility  moonphase 
##          0          0          0

# Fill missing values in numeric columns with the mean
numeric_columns <- names(filtered_data)[sapply(filtered_data, is.numeric)]
filtered_data <- filtered_data %>%
  mutate(across(all_of(numeric_columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Show number of missing values after handling
missing_after <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values after handling:")

## [1] "Missing values after handling:"

print(missing_after)

##   datetime    tempmax    tempmin       temp     precip  windspeed cloudcover 
##          0          0          0          0          0          0          0 
##    winddir visibility  moonphase 
##          0          0          0
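
As a point of comparison, the sketch below fills a gap in a toy series in two ways: with the global mean, as done above, and with linear interpolation between neighbouring days using base R's `approx()`. The data frame `temps` and its values are illustrative assumptions. One thing to watch in this report is that the temperature columns are converted to Celsius only in a later step, so a mean computed at this point may mix units.

```r
# Toy temperature series with one gap (values are illustrative)
temps <- data.frame(
  datetime = as.Date("2023-07-01") + 0:4,
  temp = c(24.0, 26.0, NA, 27.0, 25.0)
)

# Option 1: global mean imputation, as in the text
mean_filled <- ifelse(is.na(temps$temp), mean(temps$temp, na.rm = TRUE), temps$temp)

# Option 2: linear interpolation between the neighbouring known days
known <- !is.na(temps$temp)
interp_filled <- approx(
  x = as.numeric(temps$datetime[known]),
  y = temps$temp[known],
  xout = as.numeric(temps$datetime)
)$y

print(data.frame(temps, mean_filled, interp_filled))
```

For a strongly seasonal series, interpolation tends to stay closer to the surrounding days than a global mean does, although with only a handful of missing values either choice has little effect on aggregate results.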

Remove duplicated rows so that only unique records remain for analysis.

Remove Duplicate

# Remove duplicates
filtered_data <- distinct(filtered_data)
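
Note that `distinct()` with no column arguments removes only rows that are identical in every column. The sketch below, using base R and a made-up data frame `obs`, shows how a date can still appear twice after exact de-duplication when its values differ.

```r
# Toy data: row 2 duplicates row 1 exactly; row 4 repeats a date with a different value
obs <- data.frame(
  datetime = as.Date(c("2009-01-01", "2009-01-01", "2009-01-02", "2009-01-02")),
  tempmax = c(28.5, 28.5, 35.0, 36.0)
)

# unique() drops fully identical rows, like dplyr::distinct() without column arguments
deduped <- unique(obs)

# Dates that still appear more than once after exact de-duplication
repeated_dates <- deduped$datetime[duplicated(deduped$datetime)]

cat("Rows removed:", nrow(obs) - nrow(deduped), "\n")
cat("Dates still repeated:", format(repeated_dates), "\n")
```

For daily data, checking `duplicated()` on the date column is a quick way to confirm there is exactly one record per day.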

Data Transformation

This transformation is done to make the column name more descriptive or clearer.

precip to precipitation

#transform precip to precipitation

# Rename the column
filtered_data <- filtered_data %>%
  rename(precipitation = precip)

# Check the updated column names
colnames(filtered_data)

##  [1] "datetime"      "tempmax"       "tempmin"       "temp"         
##  [5] "precipitation" "windspeed"     "cloudcover"    "winddir"      
##  [9] "visibility"    "moonphase"

Data Visualization

Time-series

# Time series plot
ggplot(filtered_data, aes(x = datetime, y = tempmax)) +
  geom_line(color = "blue") +  # Line plot
  labs(title = "Max Temperature",
       x = "Date and Time",
       y = "Temperature Maximum") +
  theme_minimal()

The maximum temperature data shows a clear cyclical pattern, likely due to seasonal variations. There appears to be a slight upward trend over time. Some anomalies are observed, suggesting the influence of weather events or other factors.

Inspection & Verification of temperature data

Summary of tempmax & tempmin

# Summary statistics for temp columns
summary(filtered_data$tempmax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -9.10   39.10   61.00   57.61   79.20  102.30

summary(filtered_data$tempmin)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -17.10   23.70   41.70   41.68   62.50   84.40

Boxplot of tempmax & tempmin

# Check for unusually high or low values
boxplot(filtered_data$tempmax, main = "Boxplot of tempmax", ylab = "Temperature")

boxplot(filtered_data$tempmin, main = "Boxplot of tempmin", ylab = "Temperature")

Both tempmax and tempmin exhibit a wide range and significant variability.
Tempmax shows a higher median and a wider range compared to tempmin.

This suggests substantial daily temperature fluctuations.

Considerations:

Units: The units for temperature are not specified in the plots. Knowing the units (e.g., Celsius, Fahrenheit) would be essential for accurate interpretation.
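
One way to investigate the units is to compare yearly statistics: if the unit changes part-way through the record, the yearly mean and maximum shift abruptly. The sketch below uses a made-up data frame `toy` in which 2021 is assumed to be in Fahrenheit and 2022 in Celsius; it is an illustration of the check, not output from the Seoul data.

```r
# Toy daily maxima: 2021 in Fahrenheit, 2022 in Celsius (values are illustrative)
toy <- data.frame(
  datetime = as.Date(c("2021-01-15", "2021-07-15", "2022-01-15", "2022-07-15")),
  tempmax = c(35.0, 86.0, 1.5, 30.0)
)
toy$year <- as.integer(format(toy$datetime, "%Y"))

# Mean and maximum per year; an abrupt change between years hints at a unit switch
yearly_mean <- tapply(toy$tempmax, toy$year, mean)
yearly_max <- tapply(toy$tempmax, toy$year, max)

print(rbind(mean = yearly_mean, max = yearly_max))
```

Applied to the real data, a table like this would show which years need conversion before any unit is assumed.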

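Temperature conversion Fahrenheit to Celsius

Unit consistency: Using a single unit (Celsius) ensures accurate comparisons and calculations throughout the analysis. The code below converts only records dated 2009 to 2021, which it treats as Fahrenheit, and leaves later records unchanged.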

Conversion Logic

# Define conversion functions
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

fahrenheit_to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}


# Define the date range for conversion
start_date <- as.Date('2009-01-01')
end_date <- as.Date('2021-12-31')

# Filter the data for the dates that need conversion (2009-2021)
data_to_convert <- filtered_data[filtered_data$datetime >= start_date & filtered_data$datetime <= end_date, ]

# Define the columns to convert
temp_columns <- c("tempmax", "tempmin", "temp")

# Apply temperature conversion for the filtered rows (2009-2021)
for (col in temp_columns) {
  data_to_convert[[col]] <- round(fahrenheit_to_celsius(data_to_convert[[col]]), 1)
}

# Now, merge the converted data back with the original data
data_non_converted <- filtered_data[!(filtered_data$datetime >= start_date & filtered_data$datetime <= end_date), ]

# Combine the data
final_data <- rbind(data_non_converted, data_to_convert)

# Ensure that the combined data is sorted by date
final_data <- final_data[order(final_data$datetime), ]

datetime	tempmax	tempmin	temp	windspeed	cloudcover	winddir	visibility	moonphase
2009-01-01	-1.9	-9.7	-6.2	9.5	0.2	308.6	7.3	0.16
2009-01-02	1.7	-10.4	-4.3	5.6	0.0	326.3	7.0	0.19
2009-01-03	4.1	-9.4	-2.7	5.6	11.2	329.4	6.7	0.23
2009-01-04	4.7	-5.3	-0.8	7.8	27.5	306.2	7.1	0.25
2009-01-05	1.1	-7.4	-2.6	9.0	34.0	287.0	6.6	0.29
2009-01-06	1.6	-9.0	-3.2	5.6	38.8	326.1	4.6	0.33
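
The same conversion can also be written with a logical row mask, which avoids splitting, re-binding, and re-sorting the data. This is a sketch on a made-up data frame `toy`, assuming (as the code above does) that rows dated 2009 to 2021 are in Fahrenheit and later rows are already in Celsius.

```r
fahrenheit_to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}

# Toy rows: the first two are assumed Fahrenheit, the last is assumed Celsius
toy <- data.frame(
  datetime = as.Date(c("2009-01-01", "2021-12-31", "2022-01-01")),
  tempmax = c(28.5, 32.0, 1.0),
  tempmin = c(14.5, 23.0, -6.0),
  temp = c(20.9, 27.5, -2.5)
)

# Logical mask for the rows that need conversion (same 2009-2021 window as the text)
needs_conversion <- toy$datetime >= as.Date("2009-01-01") &
  toy$datetime <= as.Date("2021-12-31")

temp_columns <- c("tempmax", "tempmin", "temp")

# Convert in place; row order is preserved, so no re-sorting is needed
toy[needs_conversion, temp_columns] <- round(
  fahrenheit_to_celsius(toy[needs_conversion, temp_columns]), 1
)

print(toy)
```

Because the rows never leave the data frame, there is no risk of the converted and unconverted parts being recombined in the wrong order.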

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an essential step in data analysis that helps us understand the dataset, uncover underlying patterns, detect anomalies, and test assumptions before applying any predictive models. In this section, we aim to explore the Seoul Historical Weather data to extract meaningful insights regarding weather patterns over a 15-year period (2009–2023). By visualizing the data, calculating summary statistics, and performing various checks, we will prepare the dataset for further analysis and modeling.

Descriptive Statistics

Calculate summary statistics (mean, median, variance) for key attributes like temperature, precipitation, wind speed, etc.

##     datetime             tempmax          tempmin             temp       
##  Min.   :2009-01-01   Min.   :-11.30   Min.   :-19.000   Min.   :-14.40  
##  1st Qu.:2012-10-01   1st Qu.:  8.40   1st Qu.: -1.000   1st Qu.:  3.80  
##  Median :2016-07-01   Median : 19.55   Median :  9.000   Median : 14.10  
##  Mean   :2016-07-01   Mean   : 17.68   Mean   :  8.331   Mean   : 12.89  
##  3rd Qu.:2020-03-31   3rd Qu.: 27.10   3rd Qu.: 18.275   3rd Qu.: 22.60  
##  Max.   :2023-12-31   Max.   : 39.10   Max.   : 29.100   Max.   : 49.47  
##  precipitation        windspeed       cloudcover        winddir     
##  Min.   :  0.0000   Min.   : 2.60   Min.   :  0.00   Min.   :  0.2  
##  1st Qu.:  0.0000   1st Qu.: 7.90   1st Qu.: 25.40   1st Qu.:167.6  
##  Median :  0.0000   Median :10.00   Median : 50.50   Median :262.7  
##  Mean   :  0.7568   Mean   :10.61   Mean   : 50.19   Mean   :225.0  
##  3rd Qu.:  0.0400   3rd Qu.:12.60   3rd Qu.: 74.80   3rd Qu.:299.7  
##  Max.   :268.6200   Max.   :44.90   Max.   :100.00   Max.   :359.9  
##    visibility       moonphase     
##  Min.   : 0.700   Min.   :0.0000  
##  1st Qu.: 5.600   1st Qu.:0.2500  
##  Median : 7.100   Median :0.5000  
##  Mean   : 7.986   Mean   :0.4833  
##  3rd Qu.: 8.200   3rd Qu.:0.7500  
##  Max.   :46.200   Max.   :0.9800
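
The table above has the layout of R's `summary()` output, which reports quartiles and the mean but not the variance. A minimal sketch of how the variance and standard deviation could be added is shown below; the data frame `toy` is a small illustrative stand-in for the cleaned data.

```r
# Toy data frame standing in for the cleaned weather data (values are illustrative)
toy <- data.frame(
  temp = c(-6.2, -4.3, -2.7, -0.8, -2.6),
  windspeed = c(9.5, 5.6, 5.6, 7.8, 9.0),
  cloudcover = c(0.2, 0.0, 11.2, 27.5, 34.0)
)

# Mean, median, variance and standard deviation for every numeric column
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), variance = var(x), sd = sd(x))
})

print(round(stats, 2))
```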

Time Series Plots

Temperature

The following plot shows how the daily maximum, minimum, and average temperatures vary over time.

ggplot(final_data, aes(x = datetime)) +
  geom_line(aes(y = tempmax), color = "red") +
  geom_line(aes(y = tempmin), color = "blue") +
  geom_line(aes(y = temp), color = "green") +
  labs(title = "Temperature Over Time", x = "Date", y = "Temperature (°C)") +
  theme_minimal()

Color Interpretation:

Red: Daily maximum temperature (tempmax).
Blue: Daily minimum temperature (tempmin).
Green: Daily average temperature (temp).

Visual Interpretation:

Cyclical Pattern: The most prominent feature is the clear cyclical pattern in the temperature data. There are distinct peaks and troughs that repeat at regular intervals. This suggests a strong seasonal component in the temperature variations.
Amplitude: The amplitude of the fluctuations seems to vary somewhat over time. There are periods with higher peaks and deeper troughs, and others with more moderate variations. This could be due to factors like El Niño or other climate phenomena.
Trend: There appears to be a slight upward trend in the overall temperature levels from 2010 to 2020. This could be attributed to long-term climate change or other regional factors.
Anomalies: There are a few notable anomalies, such as a sharp drop in temperature around 2016. These could be due to specific weather events or other localized factors.

In Summary:

The plot visually depicts the temperature fluctuations over time, highlighting the seasonal variations, potential trends, and notable anomalies. The three colors distinguish the daily maximum, minimum, and average series.
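
A trend that is only visible by eye can be quantified by averaging each year and fitting a straight line to the annual means. The sketch below does this on synthetic data with a built-in warming of 0.1 °C per year; the series, the seed, and the trend size are assumptions for illustration and say nothing about the actual Seoul trend.

```r
set.seed(42)  # reproducible toy data

# Synthetic daily temperatures: seasonal cycle + small warming trend + noise
dates <- seq(as.Date("2009-01-01"), as.Date("2023-12-31"), by = "day")
years_elapsed <- as.numeric(dates - dates[1]) / 365.25
day_of_year <- as.integer(format(dates, "%j"))
temp <- 12 + 0.1 * years_elapsed -
  13 * cos(2 * pi * (day_of_year - 15) / 365.25) +
  rnorm(length(dates), sd = 3)

# Annual means remove most of the seasonal cycle
annual <- aggregate(
  temp ~ year,
  data = data.frame(year = as.integer(format(dates, "%Y")), temp = temp),
  FUN = mean
)

# Slope of the fitted line = estimated change in degrees Celsius per year
fit <- lm(temp ~ year, data = annual)
print(coef(fit)["year"])
print(confint(fit)["year", ])
```

If the confidence interval for the slope includes zero, the apparent trend cannot be distinguished from year-to-year variability.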

Precipitation, Cloud Cover & Wind Speed

# Reshape the dataset to long format
final_data_long <- pivot_longer(
  final_data,
  cols = -datetime,  # Exclude 'datetime' since it's the identifier
  names_to = "variable",
  values_to = "value"
)

# Plotting with facets
ggplot(final_data_long, aes(x = datetime, y = value)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free_y") +  # Free y-axis scale for better visualization
  labs(
    title = "Time Series of Weather Variables",
    x = "Date",
    y = "Value"
  ) +
  theme_minimal()

Precipitation: The precipitation time series shows distinct peaks and troughs, suggesting a seasonal rainfall pattern. There appear to be periods with higher precipitation and others with lower precipitation.
Wind Speed: The wind speed time series also shows a cyclical pattern, but with less pronounced peaks and troughs compared to precipitation. The overall variation in wind speed seems to be less than that of precipitation.
Cloud Cover: The cloud cover time series exhibits a high degree of variability with frequent fluctuations. There are periods with consistently high cloud cover and others with lower cloud cover.

The most likely explanation for the cyclical patterns is seasonal variation in the weather. In many regions, precipitation, wind speed, and cloud cover are influenced by factors such as monsoon seasons, changes in atmospheric circulation, and solar radiation.

Heat Wave Analysis

A heatwave threshold is defined (in this case, 35 degrees Celsius). Days with a maximum temperature exceeding the threshold are marked as heatwave days.

Number of Heatwave Days

Heatwave Logic

# Define heatwave threshold
heatwave_threshold <- 35  # Example threshold for heatwaves

# Mark heatwave days
final_data <- final_data %>%
  mutate(heatwave = ifelse(tempmax > heatwave_threshold, 1, 0))

# Use cumsum to give each run of consecutive heatwave days its own ID (NA on other days)
final_data <- final_data %>%
  mutate(heatwave_id = ifelse(heatwave == 1,
                              cumsum(heatwave == 1 & lag(heatwave, default = 0) == 0),
                              NA))

# Count the number of heatwave days
num_heatwave_days <- sum(final_data$heatwave == 1)

# Create a message
heatwave_message <- paste("The number of heatwave days in Seoul from 2009 to 2023 is", num_heatwave_days)

# Print the message
print(heatwave_message)

## [1] "The number of heatwave days in Seoul from 2009 to 2023 is 45"
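
The count above is a number of days, not a number of separate heatwave episodes. As a sketch of how consecutive hot days could be grouped into spells, the example below applies base R's run-length encoding (`rle()`) to a made-up vector of daily maxima; the values are illustrative only.

```r
# Toy daily maxima in degrees Celsius (illustrative values)
tempmax <- c(33.0, 35.5, 36.1, 34.0, 35.2, 36.0, 37.3, 31.0)
heatwave_threshold <- 35

is_hot <- tempmax > heatwave_threshold

# Run-length encoding groups consecutive equal values
runs <- rle(is_hot)
hot_spell_lengths <- runs$lengths[runs$values]

cat("Hot days:", sum(is_hot), "\n")
cat("Hot spells:", length(hot_spell_lengths), "\n")
cat("Longest spell (days):", max(hot_spell_lengths), "\n")
```

Many heatwave definitions require a minimum number of consecutive hot days, which could be applied here by filtering `hot_spell_lengths`.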

Scatter Plot of Heatwave Analysis

# Scatter plot for tempmax with heatwave days highlighted
ggplot(final_data, aes(x = datetime, y = tempmax)) +
  # Scatter plot for all data points
  geom_point(aes(color = factor(heatwave), shape = factor(heatwave)), size = 3) +
  
  # Customize color scale to distinguish heatwave days
  scale_color_manual(values = c("black", "red"), labels = c("Non-Heatwave", "Heatwave")) +
  scale_shape_manual(values = c(16, 17), labels = c("Non-Heatwave", "Heatwave")) +
  
  # Add annotation for heatwave days count
  annotate("text", x = as.Date("2022-01-01"), y = max(final_data$tempmax),
           label = paste("Heatwave Days:", num_heatwave_days), 
           hjust = 0, color = "blue", size = 2.5) +
  
  labs(title = "Scatter Plot of Max Temperature with Heatwave Days Highlighted", 
       x = "Date", 
       y = "Max Temperature (°C)") +
  theme_minimal() +
  theme(legend.title = element_blank())

Interpretation:

The scatter plot provides a visual representation of heatwave events in relation to maximum temperature. It allows us to identify:

Days exceeding the heatwave threshold: The red triangles clearly highlight the days where the maximum temperature surpassed the defined threshold.
Seasonal patterns: The clustering of red triangles suggests that heatwaves are more likely to occur during specific seasons or periods.
Temperature distribution: The scatter plot shows the overall distribution of maximum temperatures, including both heatwave and non-heatwave days.

Correlation Analysis

The correlation analysis is visualized through a circle heatmap, which highlights the strength and direction of relationships between different variables in the dataset.

cor_matrix <- cor(final_data[, c("tempmax", "tempmin", "temp", "precipitation", "windspeed", "cloudcover","winddir", "visibility", "moonphase")])
corrplot(cor_matrix, method = "circle")

Interpretation of Correlation Plot

The following interpretation is based on the correlation plot created for the dataset. The size and color of the circles represent the strength and direction of the correlation between each pair of variables.

Strong Positive Correlations:

tempmax and tempmin have a very strong positive correlation. This makes sense because a day's maximum and minimum temperatures typically rise and fall together.
temp and tempmax have a strong positive correlation. As the daily average temperature increases, the maximum temperature also tends to increase.
temp and tempmin also show a strong positive correlation. Similarly, as the daily average temperature increases, the minimum temperature increases as well.

Moderate to Weak Correlations:

Some correlations appear to be moderate or weak:
- The correlation between precipitation and windspeed is weak, suggesting that wind speed and precipitation levels are not strongly linearly related.
- The correlation between cloudcover and temp is weak to moderate, meaning that cloud cover is only loosely associated with the daily average temperature.
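
Reading correlation strength from circle sizes is approximate, so it can help to list the coefficients as numbers. The sketch below builds a ranked table of variable pairs from a correlation matrix; the data are synthetic (`tempmax` and `tempmin` are constructed from `temp`, `windspeed` is independent), so the values are illustrative and not the report's results.

```r
set.seed(1)  # reproducible toy data

# Synthetic variables: tempmax and tempmin are built from temp, windspeed is independent
n <- 200
temp <- rnorm(n, mean = 13, sd = 10)
toy <- data.frame(
  temp = temp,
  tempmax = temp + 5 + rnorm(n, sd = 1.5),
  tempmin = temp - 5 + rnorm(n, sd = 1.5),
  windspeed = runif(n, min = 2, max = 20)
)

cor_matrix <- cor(toy)

# Flatten the upper triangle into a table of variable pairs, strongest first
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r = round(cor_matrix[idx], 2)
)

print(pair_table[order(-abs(pair_table$r)), ])
```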

Seasonal Analysis

The seasonal analysis explores the variations in key variables over different seasons, aiming to identify trends, patterns, and anomalies that may influence the overall dataset.

To categorize the data by season, a new ‘season’ column is created based on the month of the year, with the following assumptions:

Winter: December, January, February
Spring: March, April, May
Summer: June, July, August
Fall: September, October, November

# Create a 'season' column from the month, using the assumptions above
final_data <- final_data %>%
  mutate(season = case_when(
    format(datetime, "%m") %in% c("12", "01", "02") ~ "Winter",
    format(datetime, "%m") %in% c("03", "04", "05") ~ "Spring",
    format(datetime, "%m") %in% c("06", "07", "08") ~ "Summer",
    format(datetime, "%m") %in% c("09", "10", "11") ~ "Fall"
  ))

# Aggregate by season
seasonal_data <- final_data %>%
  group_by(season) %>%
  summarise(
    tempmax_avg = mean(tempmax, na.rm = TRUE),
    tempmin_avg = mean(tempmin, na.rm = TRUE),
    precip_avg = mean(precipitation, na.rm = TRUE),
    windspeed_avg = mean(windspeed, na.rm = TRUE),
    cloudcover_avg = mean(cloudcover, na.rm = TRUE)
  )

# Plot the seasonal data
ggplot(seasonal_data, aes(x = season)) +
  geom_bar(aes(y = tempmax_avg), stat = "identity", fill = "red") +
  geom_bar(aes(y = tempmin_avg), stat = "identity", fill = "blue", alpha = 0.5) +
  labs(title = "Average Temperature by Season", 
       x = "Season", 
       y = "Temperature (°C)") +
  theme_minimal()

Summer: The highest average temperatures occur in Summer. Note that the bars are overlaid rather than stacked: the red bar is the average daily maximum (tempmax_avg) and the semi-transparent blue bar drawn over it is the average daily minimum (tempmin_avg), both measured from zero.
Fall and Spring: Average temperatures in Fall and Spring are lower than in Summer, and the gap between the average maximum and the average minimum is similar in the two seasons.
Winter: The lowest average temperatures occur in Winter, where both the average maximum and the average minimum are at their lowest.

Outlier Detection

ggplot(final_data, aes(x = tempmax)) +
  geom_boxplot() +
  labs(title = "Outliers in Max Temperature", x = "Temperature (°C)") +
  theme_minimal()

ggplot(final_data, aes(x = tempmin)) +
  geom_boxplot() +
  labs(title = "Outliers in Min Temperature", x = "Temperature (°C)") +
  theme_minimal()

ggplot(final_data, aes(x = temp)) +
  geom_boxplot() +
  labs(title = "Outliers in Temperature", x = "Temperature (°C)") +
  theme_minimal()

The boxplots provide a visual representation of the temperature data of tempmax, tempmin and temp, highlighting the following characteristics:

No extreme values
Symmetrical distribution
Consistent variations across different temperature metrics (max, min, average)
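
The "no extreme values" reading can also be checked numerically, since a boxplot flags points beyond 1.5 times the interquartile range from the hinges. The sketch below counts such points with base R's `boxplot.stats()`; the helper `count_outliers` and the data frame `toy_temps` are hypothetical names, and the data is synthetic with one extreme value added on purpose.

```r
# Count points outside the boxplot whiskers (1.5 * IQR rule) for one column
count_outliers <- function(x) {
  length(boxplot.stats(x[!is.na(x)])$out)
}

# Synthetic example: tempmax gets one deliberately extreme value (80)
toy_temps <- data.frame(
  tempmax = c(seq(-5, 35, length.out = 50), 80),
  tempmin = seq(-15, 25, length.out = 51)
)

outlier_counts <- sapply(toy_temps, count_outliers)
print(outlier_counts)  # tempmax: 1, tempmin: 0
```

A count of zero for tempmax, tempmin and temp on the real data would confirm what the boxplots show.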

# Example: Creating a 'weather' column based on temperature (tempmax)
final_data$weather <- as.factor(ifelse(final_data$tempmax > 30, "Hot", "Cool"))

Modeling and Evaluation

From the EDA, we now have an analytical understanding of the dataset and its structure. Based on that, we implement four supervised machine learning algorithms: two for classification (Random Forest and KNN) and two for regression (Linear Regression and Random Forest Regression). Implementing two models per problem gives a general view of model performance and of how the features and data types affect the evaluation and output.

Feature Engineering

The first step is feature engineering, which improves performance by preventing bias toward features with larger scales and by giving each attribute an appropriate data type. Since each model has different feature requirements, we apply the feature engineering suitable for each model.

# Identify numerical columns
numerical_columns <- c("tempmax", "tempmin", "tempavg", "precip", "windspeed", 
                       "winddir", "cloudcover", "visibility", "moonphase")

# Normalize numerical columns
data[numerical_columns] <- scale(data[numerical_columns])


# Encode categorical column (e.g., 'conditions')
data$conditions <- as.numeric(factor(data$conditions))

# Verify encoding
print(unique(data$conditions))

##  [1]  3  1 11  6  5  9 12  2  4  7 10  8
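
To illustrate what the two transformations above do, here is a small sketch on a hypothetical three-row data frame (`toy` and its values are invented for illustration). `scale()` converts a column to z-scores, and `as.numeric(factor())` assigns integer codes in alphabetical order of the labels.

```r
toy <- data.frame(
  tempavg    = c(10, 20, 30),
  conditions = c("Rain", "Clear", "Rain"),
  stringsAsFactors = FALSE
)

# scale() centers the column to mean 0 and rescales it to standard deviation 1
toy$tempavg <- as.numeric(scale(toy$tempavg))
print(toy$tempavg)  # -1  0  1

# factor() sorts the distinct labels alphabetically: "Clear" -> 1, "Rain" -> 2
codes <- as.numeric(factor(toy$conditions))
print(codes)        # 2 1 2
```

The integer codes carry no natural order, which is presumably why the Random Forest call later wraps the target in `as.factor()`.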

Data Splitting

We split the dataset into training and holdout (testing) sets in a 70:30 proportion.

# Step 1: Split the dataset into training (70%) and holdout (30%) sets
set.seed(123)
index <- createDataPartition(data$conditions, p = 0.7, list = FALSE)
train_set <- data[index, ]
holdout_set <- data[-index, ]

# Step 2: Check the distribution of classes in training and holdout sets

print(table(train_set$conditions))

## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 1267  109 3513   52  801 1480   42    1  101   18   49  240

print(table(holdout_set$conditions))

## 
##    1    2    3    4    5    6    7    9   10   11   12 
##  524   48 1522   19  348  632   17   38    4   20  113

# Prepare features and target variables
X_train <- train_set %>% select(-conditions)
y_train <- train_set$conditions
X_holdout <- holdout_set %>% select(-conditions)
y_holdout <- holdout_set$conditions
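
The class tables above show a strong imbalance, so it matters that each class is represented in both sets. As a rough illustration of class-wise (stratified) sampling, the following base R sketch draws 70% of the rows within each class. It is not the caret implementation used above, and the label vector is synthetic.

```r
set.seed(123)

# Synthetic, imbalanced labels: 100 "a", 30 "b", 10 "c"
labels <- factor(rep(c("a", "b", "c"), times = c(100, 30, 10)))

# Sample 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))

train_labels   <- labels[train_idx]
holdout_labels <- labels[-train_idx]

print(table(train_labels))    # a: 70, b: 21, c: 7
print(table(holdout_labels))  # a: 30, b: 9,  c: 3
```

A class with a single row, such as class 8 above, can only ever appear in one of the two sets, whichever splitting method is used.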

Random Forest

Random Forest builds multiple decision trees during training and then combines their outputs to make a prediction.

Random Forest is used for the following reasons:

Handles complex and Nonlinear Relationships effectively
Robustness to Overfitting
Suitability for Multi-Class Classification
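
For classification, "combining the outputs" of the trees amounts to a vote. The sketch below shows only that aggregation step on a hand-written matrix of hypothetical per-tree predictions. It does not train any trees, and ties simply go to the alphabetically first class, which is a simplification.

```r
# Each column holds one (hypothetical) tree's predicted class for four observations
tree_votes <- cbind(
  tree1 = c("Rain",  "Clear", "Clear", "Snow"),
  tree2 = c("Rain",  "Rain",  "Clear", "Snow"),
  tree3 = c("Clear", "Rain",  "Clear", "Rain")
)

# The forest's prediction for an observation is the most frequent class among its trees
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

forest_pred <- apply(tree_votes, 1, majority_vote)
print(forest_pred)  # "Rain" "Rain" "Clear" "Snow"
```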

# Train the Random Forest model
set.seed(123)
rf_model <- randomForest(as.factor(conditions) ~ ., data = train_set, ntree = 500, mtry = 3)

# Predict on the holdout set
rf_pred <- predict(rf_model, holdout_set)

# Evaluate the model
rf_accuracy <- mean(rf_pred == holdout_set$conditions)
rf_conf_matrix <- table(Predicted = rf_pred, Actual = holdout_set$conditions)

cat("Random Forest Metrics:\n")

## Random Forest Metrics:

cat("Accuracy:", rf_accuracy, "\n")

## Accuracy: 0.9817352

cat("Confusion Matrix:\n")

## Confusion Matrix:

print(rf_conf_matrix)

##          Actual
## Predicted    1    2    3    4    5    6    7    9   10   11   12
##        1   524    0    0    0    0    0    0    0    0    0    0
##        2     0   48    0    0    0    0    0    0    0    0    0
##        3     0    0 1522    0    0    0    0    0    0    0    0
##        4     0    0    0   15    0    0    0    0    0    0    0
##        5     0    0    0    0  344    0    0    0    0    4    0
##        6     0    0    0    0    0  620    0    0    0    0   15
##        7     0    0    0    3    0    0   17    0    1    0    0
##        8     0    0    0    0    0    0    0    0    0    0    0
##        9     0    0    0    0    0    0    0   26    0    0    8
##        10    0    0    0    1    0    0    0    0    3    0    0
##        11    0    0    0    0    4    0    0    0    0   16    0
##        12    0    0    0    0    0   12    0   12    0    0   90

KNN

K-Nearest Neighbors (KNN) is a simple, widely used supervised machine learning algorithm for both classification and regression. We use it here because it handles multi-class classification directly, without the ‘One-vs-Rest’ or ‘One-vs-One’ schemes that binary classifiers need in order to predict multiple classes. First, we find the optimal number of neighbors (k) using an elbow graph.

library(class)
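# Scale training and holdout features using the training set's center and scale
X_train_scaled <- scale(X_train)
X_holdout_scaled <- scale(X_holdout, center = attr(X_train_scaled, "scaled:center"), 
                          scale = attr(X_train_scaled, "scaled:scale"))
# Note: the knn() calls below use X_train and X_holdout rather than these rescaled
# copies; the columns in numerical_columns were already standardized during feature engineering


# Set up a range of k values to evaluate
k_values <- seq(1, 20, by = 2)  # Odd values of k from 1 to 19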
error_rates <- numeric(length(k_values))  # To store error rates for each k

# Loop through each k value and calculate the error rate
for (i in seq_along(k_values)) {
  k <- k_values[i]
  
  # Train and predict using KNN
  knn_pred <- knn(train = X_train, test = X_holdout, cl = y_train, k = k)
  
  # Calculate error rate
  error_rates[i] <- mean(knn_pred != y_holdout)
}

# Find the best k (lowest error rate)
best_k <- k_values[which.min(error_rates)]

The elbow graph below is used to identify the number of neighbors with the lowest error rate.

# Plot the elbow graph
plot(k_values, error_rates, type = "b", pch = 19, col = "blue",
     xlab = "Number of Neighbors (k)", ylab = "Error Rate",
     main = "Elbow Method for Choosing k")
abline(v = best_k, col = "red", lty = 2)
text(best_k, min(error_rates), labels = paste("Best k =", best_k), pos = 4, col = "red")
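
The search above measures the error of each k on the holdout set, so the holdout accuracy reported for the chosen k may be slightly optimistic. One possible alternative, sketched here as an assumption rather than as the method used in this report, is to choose k by leave-one-out cross-validation on the training data only, using `knn.cv()` from the class package. The built-in iris data stands in for the weather features.

```r
library(class)

set.seed(123)  # knn.cv() may break ties at random

# iris ships with R and is used purely as a stand-in for the scaled weather features
features <- scale(iris[, 1:4])
labels   <- iris$Species

k_values <- seq(1, 19, by = 2)

# knn.cv() performs leave-one-out cross-validation on the supplied training data
cv_error <- sapply(k_values, function(k) {
  mean(knn.cv(train = features, cl = labels, k = k) != labels)
})

best_k_cv <- k_values[which.min(cv_error)]
print(data.frame(k = k_values, cv_error = cv_error))
print(best_k_cv)
```

With this approach the holdout set is touched only once, for the final evaluation.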

# Use the best k value
k <- 9

# Train and predict using KNN
knn_pred <- knn(train = X_train, test = X_holdout, cl = y_train, k = k)

# Evaluate KNN Model
knn_accuracy <- mean(knn_pred == y_holdout)
knn_conf_matrix <- table(Predicted = knn_pred, Actual = y_holdout)

# Display metrics
cat("\nKNN Metrics with k =", k, ":\n")

## 
## KNN Metrics with k = 9 :

cat("Accuracy:", knn_accuracy, "\n")

## Accuracy: 0.7038052

cat("Confusion Matrix:\n")

## Confusion Matrix:

print(knn_conf_matrix)

##          Actual
## Predicted    1    2    3    4    5    6    7    9   10   11   12
##        1   473    0   48   14    0    4   17    0    4    0    1
##        2     0    1    0    0    2    0    0    0    0    0    0
##        3    50    9 1298    4   17  347    0   28    0    5   79
##        4     0    0    0    0    0    0    0    0    0    0    0
##        5     0   28    7    0  281   43    0    0    0    6    1
##        6     1    9  149    1   45  230    0    0    0    2    8
##        7     0    0    0    0    0    0    0    0    0    0    0
##        8     0    0    0    0    0    0    0    0    0    0    0
##        9     0    0    7    0    0    0    0    5    0    0    2
##        10    0    0    0    0    0    0    0    0    0    0    0
##        11    0    1    0    0    2    1    0    0    0    2    0
##        12    0    0   13    0    1    7    0    5    0    5   22
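
As a reference for how KNN arrives at a single prediction, the following from-scratch sketch classifies a query point by majority vote among its k nearest training points under Euclidean distance. The function `knn_predict_one` and the two-feature data are hypothetical. The report itself uses `class::knn()`.

```r
# Classify one query point by majority vote among its k nearest training points
knn_predict_one <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Synthetic two-feature example with two well-separated groups
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("Cool", "Cool", "Cool", "Hot", "Hot", "Hot")

pred_low  <- knn_predict_one(train_x, train_y, query = c(0.5, 0.5), k = 3)
pred_high <- knn_predict_one(train_x, train_y, query = c(5.5, 5.5), k = 3)
print(c(pred_low, pred_high))  # "Cool" "Hot"
```

Because the vote depends on raw distances, a feature on a larger scale dominates the result, and rare classes are easily outvoted by frequent ones.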

Linear Regression

Linear Regression is a simple yet powerful statistical and machine learning technique that models the relationship between a target (dependent) variable and one or more predictors (independent variables).

We implement Linear Regression for the following reasons:

Simple and Easy to Implement
Interpretable Results
Efficient for Small and Medium-Sized Datasets
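
To show why the results are interpretable, the sketch below fits a linear model on synthetic data and recovers the same coefficients from the normal equations. The data-generating numbers are invented for illustration and do not come from the Seoul dataset.

```r
set.seed(42)
n <- 100

# Synthetic data in which tempavg is roughly the mean of tempmax and tempmin
toy <- data.frame(tempmax = rnorm(n, mean = 18, sd = 10))
toy$tempmin <- toy$tempmax - 9 + rnorm(n, mean = 0, sd = 1)
toy$tempavg <- 0.5 * toy$tempmax + 0.5 * toy$tempmin + rnorm(n, mean = 0, sd = 0.2)

fit <- lm(tempavg ~ tempmax + tempmin, data = toy)

# Same coefficients from the normal equations: beta = (X'X)^{-1} X'y
X <- cbind(Intercept = 1, as.matrix(toy[, c("tempmax", "tempmin")]))
beta <- solve(t(X) %*% X, t(X) %*% toy$tempavg)

print(coef(fit))
print(drop(beta))
```

Each coefficient is the expected change in the target for a one-unit change in that predictor with the others held fixed.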

# Train the Linear Regression Model
linear_model <- lm(tempavg ~ ., data = train_set)

# Predict on the holdout set
linear_pred <- predict(linear_model, newdata = holdout_set)

# Evaluate the Model
# Mean Squared Error (MSE)
mse <- mean((linear_pred - holdout_set$tempavg)^2)

# Root Mean Squared Error (RMSE)
rmse <- sqrt(mse)

# R-Squared
ss_total <- sum((holdout_set$tempavg - mean(holdout_set$tempavg))^2)
ss_residual <- sum((holdout_set$tempavg - linear_pred)^2)
r_squared <- 1 - (ss_residual / ss_total)

# Print Metrics
cat("Linear Regression Metrics:\n")

## Linear Regression Metrics:

cat("MSE:", mse, "\n")

## MSE: 0.002129343

cat("RMSE:", rmse, "\n")

## RMSE: 0.04614481

cat("R-Squared:", r_squared, "\n")

## R-Squared: 0.9979184

Random Forest Regression

Random Forest Regression is an ensemble learning method that combines multiple decision trees to predict a continuous target variable.

We implement Random Forest Regression for the following reasons:

Handles Nonlinear Relationships
Feature Importance and Interpretability
Handles High-Dimensional Data

#Load necessary library
library(randomForest)

# Train the Random Forest Regression Model
set.seed(123)  # For reproducibility
rf_reg_model <- randomForest(tempavg ~ ., data = train_set, ntree = 500, mtry = 3)

# Predict on the holdout set
rf_reg_pred <- predict(rf_reg_model, newdata = holdout_set)

# Evaluate the Model
# Mean Squared Error (MSE)
rf_mse <- mean((rf_reg_pred - holdout_set$tempavg)^2)

# Root Mean Squared Error (RMSE)
rf_rmse <- sqrt(rf_mse)

# R-Squared
rf_ss_total <- sum((holdout_set$tempavg - mean(holdout_set$tempavg))^2)
rf_ss_residual <- sum((holdout_set$tempavg - rf_reg_pred)^2)
rf_r_squared <- 1 - (rf_ss_residual / rf_ss_total)

# Print Metrics
cat("Random Forest Regression Metrics:\n")

## Random Forest Regression Metrics:

cat("MSE:", rf_mse, "\n")

## MSE: 0.0022477

cat("RMSE:", rf_rmse, "\n")

## RMSE: 0.04740992

cat("R-Squared:", rf_r_squared, "\n")

## R-Squared: 0.9978027

Evaluation

Evaluation Metrics for Classification Model

We start the evaluation by visualizing the confusion matrices of the Random Forest and KNN classifiers, which compare the predicted class labels with the actual labels.

# Random Forest Confusion Matrix Heatmap
rf_conf_matrix_melt <- melt(as.matrix(rf_conf_matrix))

ggplot(data = rf_conf_matrix_melt, aes(x = Actual, y = Predicted, fill = value)) +
  geom_tile(color = "black") +
  geom_text(aes(label = value), color = "black", size = 4) +  # Numbers in black
  scale_fill_gradient(low = "lightgray", high = "blue") +
  labs(title = "Random Forest Confusion Matrix Heatmap",
       x = "Actual Class", y = "Predicted Class") +
  theme_minimal()

# KNN Confusion Matrix Heatmap
knn_conf_matrix_melt <- melt(as.matrix(knn_conf_matrix))

ggplot(data = knn_conf_matrix_melt, aes(x = Actual, y = Predicted, fill = value)) +
  geom_tile(color = "black") +
  geom_text(aes(label = value), color = "black", size = 4) +  # Numbers in black
  scale_fill_gradient(low = "lightgray", high = "green") +
  labs(title = "KNN Confusion Matrix Heatmap",
       x = "Actual Class", y = "Predicted Class") +
  theme_minimal()

Comparison of Classification Models Metrics
Metric	Random_Forest	KNN
Accuracy	0.9817352	0.7038052
Macro Precision	0.8158052	0.2811662
Macro Recall	0.8232156	0.3561035
Macro F1 Score	0.8174549	0.2924496

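We observe from the evaluation metrics table that the Random Forest classifier performed better across all metrics. The gap between its accuracy (0.98) and its macro-averaged scores (about 0.82) likely reflects the class imbalance: rare classes are predicted less reliably, yet they count as much as frequent classes in a macro average.

We implement additional performance visualizations below: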

#------------------------------------------------------------------------------
# Prepare data for visualization
# Predict class probabilities for the holdout set
rf_prob <- predict(rf_model, holdout_set, type = "prob")

metrics <- data.frame(
  Metric = c("Accuracy", "Macro Precision", "Macro Recall", "Macro F1 Score"),
  Random_Forest = c(rf_accuracy, rf_metrics$precision_macro, rf_metrics$recall_macro, rf_metrics$f1_score_macro),
  KNN = c(knn_accuracy, knn_metrics$precision_macro, knn_metrics$recall_macro, knn_metrics$f1_score_macro)
)

# Reshape data for ggplot
library(reshape2)
metrics_melt <- melt(metrics, id.vars = "Metric", variable.name = "Model", value.name = "Score")

# Plot the comparison
ggplot(metrics_melt, aes(x = Metric, y = Score, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Comparison of Classification Models",
       x = "Evaluation Metric", y = "Score") +
  scale_fill_manual(values = c("Random_Forest" = "blue", "KNN" = "green")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Macro-averaged metrics give each class equal weight regardless of its size, which provides a holistic view when comparing two multi-class classification algorithms.
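
The code that builds rf_metrics and knn_metrics is not shown in this report. As an illustration only, macro-averaged precision, recall and F1 can be derived from a confusion matrix as sketched below. The function name `macro_metrics` is hypothetical, and treating an undefined precision or recall (a class that is never predicted, or never occurs) as 0 is an assumption that may differ from the convention behind the table above.

```r
# Macro-averaged precision, recall and F1 from a confusion matrix whose
# rows are predicted classes and columns are actual classes
macro_metrics <- function(cm) {
  classes <- union(rownames(cm), colnames(cm))
  per_class <- sapply(classes, function(cl) {
    tp        <- if (cl %in% rownames(cm) && cl %in% colnames(cm)) cm[cl, cl] else 0
    predicted <- if (cl %in% rownames(cm)) sum(cm[cl, ]) else 0
    actual    <- if (cl %in% colnames(cm)) sum(cm[, cl]) else 0
    # Assumption: undefined ratios are counted as 0
    precision <- if (predicted > 0) tp / predicted else 0
    recall    <- if (actual > 0) tp / actual else 0
    f1 <- if (precision + recall > 0) 2 * precision * recall / (precision + recall) else 0
    c(precision = precision, recall = recall, f1 = f1)
  })
  rowMeans(per_class)  # unweighted mean over classes
}

# Synthetic 3-class example
cm <- matrix(c(8, 1, 0,
               2, 7, 1,
               0, 2, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("1", "2", "3"), Actual = c("1", "2", "3")))

toy_macro <- macro_metrics(cm)
print(round(toy_macro, 4))
```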

#---------------------------------------------------------------------------------------------------
library(PRROC)

# Create PR curve for each class
pr_curves <- list()
for (class in colnames(rf_prob)) {
  pr_curves[[class]] <- pr.curve(scores.class0 = rf_prob[, class],
                                 weights.class0 = holdout_set$conditions == class, curve = TRUE)
}

# Plot all PR curves, using one colour vector for both the curves and the legend
curve_cols <- seq_along(pr_curves)
plot(pr_curves[[1]]$curve, type = "l", col = curve_cols[1], xlab = "Recall", ylab = "Precision", main = "Precision-Recall Curves")
for (i in 2:length(pr_curves)) {
  lines(pr_curves[[i]]$curve, col = curve_cols[i])
}
legend("bottomleft", legend = colnames(rf_prob), col = curve_cols, lty = 1)

The Precision-Recall curve is a visualization tool for evaluating a classification model. It plots precision against recall across different decision thresholds, showing the trade-off between the two. It is especially useful for multi-class problems when the goal is to analyze performance for each class individually, and it remains informative on imbalanced datasets.
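
To make the threshold trade-off concrete, the sketch below computes precision and recall by hand for a single class (one-vs-rest) at three thresholds. The function `pr_points`, the scores and the labels are all invented for illustration. The curves above are produced by the PRROC package.

```r
# Precision and recall for one class (one-vs-rest) at a few probability thresholds
pr_points <- function(scores, is_positive, thresholds) {
  t(sapply(thresholds, function(th) {
    predicted_pos <- scores >= th
    tp <- sum(predicted_pos & is_positive)
    c(threshold = th,
      precision = if (any(predicted_pos)) tp / sum(predicted_pos) else NA,
      recall    = tp / sum(is_positive))
  }))
}

# Synthetic scores: higher values should indicate the positive class
scores      <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

pr_table <- pr_points(scores, is_positive, thresholds = c(0.2, 0.5, 0.85))
print(pr_table)
```

Raising the threshold from 0.2 to 0.85 lifts precision from 0.6 to 1 while recall falls from 1 to 1/3.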

Evaluation of Regression models

For the regression models, we evaluate using the most commonly used regression metrics:

Mean Squared Error (MSE), which measures the average squared error; this metric penalizes large errors.
Root Mean Squared Error (RMSE), the square root of MSE, which is in the same units as the target variable, making it easier to interpret.
R-Squared, which shows the proportion of variance in the target captured by the model.
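
Written out, the three metrics as computed in the modeling code are shown below. The symbols are introduced here for illustration: y_i is the actual holdout value, ŷ_i the prediction, ȳ the holdout mean, and n the number of holdout rows.

```latex
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]
```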

# Visualize Predictions vs Actual Values
ggplot(data = holdout_set, aes(x = tempavg, y = rf_reg_pred)) +
  geom_point(color = "darkgreen", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Random Forest Regression: Predictions vs Actual",
       x = "Actual Average Temperature",
       y = "Predicted Average Temperature") +
  theme_minimal()

The scatter plot compares the predicted values with the actual values. The dashed red line is the identity line (predicted = actual), not a fitted line. The points lie close to this line, indicating accurate predictions, and the strongly linear pattern suggests that a simple regression model will suffice.

# Create a data frame for the metrics
rf_metrics_table <- data.frame(
  Metric = c("Mean Squared Error (MSE)", "Root Mean Squared Error (RMSE)", "R-Squared"),
  Value = c(rf_mse, rf_rmse, rf_r_squared)
)

# Display the table
knitr::kable(rf_metrics_table, caption = "Random Forest Regression Metrics")

Random Forest Regression Metrics
Metric	Value
Mean Squared Error (MSE)	0.0022477
Root Mean Squared Error (RMSE)	0.0474099
R-Squared	0.9978027

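The values indicate minimal deviation between the actual and predicted values, even though MSE penalizes large errors more heavily by squaring them. Because tempavg was standardized during feature engineering, MSE and RMSE are expressed in standard-deviation units rather than °C.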

# Visualize Predictions vs Actual Values
library(ggplot2)

ggplot(data = holdout_set, aes(x = tempavg, y = linear_pred)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Linear Regression: Predictions vs Actual",
       x = "Actual Average Temperature",
       y = "Predicted Average Temperature") +
  theme_minimal()

This scatter plot closely resembles the one for the Random Forest regression model. The linear model's metrics are marginally better (MSE 0.0021 vs. 0.0022), but the difference is negligible.

# Create a data frame for the metrics
linear_metrics_table <- data.frame(
  Metric = c("Mean Squared Error (MSE)", "Root Mean Squared Error (RMSE)", "R-Squared"),
  Value = c(mse, rmse, r_squared)
)

# Display the table
knitr::kable(linear_metrics_table, caption = "Linear Regression Metrics")

Linear Regression Metrics
Metric	Value
Mean Squared Error (MSE)	0.0021293
Root Mean Squared Error (RMSE)	0.0461448
R-Squared	0.9979184

Deployment

This deployment provides an interactive experience where users can explore temperature trends over time through an interactive line plot and view detailed data through a sortable and searchable table. The visualizations allow for dynamic exploration of temperature patterns, while the data table gives users easy access to the raw data, making it a powerful tool for analysis and decision-making.
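
The deployment code itself is not shown in this report. A minimal sketch of such an interactive view follows, assuming plotly for the line plot and the DT package for the table. The use of DT, the data frame `toy` and its column names are assumptions, and the values are synthetic.

```r
library(plotly)
library(DT)

set.seed(1)

# Synthetic daily temperatures standing in for the Seoul data
toy <- data.frame(
  datetime = seq(as.Date("2023-01-01"), by = "day", length.out = 30),
  temp     = round(rnorm(30, mean = 0, sd = 4), 1)
)

# Interactive line plot: hover, zoom and pan are provided by plotly
temp_plot <- plot_ly(toy, x = ~datetime, y = ~temp, type = "scatter", mode = "lines") %>%
  layout(title = "Daily Average Temperature",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Temperature (°C)"))

# Sortable, searchable table of the underlying rows
temp_table <- datatable(toy, options = list(pageLength = 10))

temp_plot
temp_table
```

In a knitted HTML report, both objects render as interactive widgets.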

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Conclusion

Key Findings:

Temperature Prediction Models:

The Linear Regression model provides a straightforward evaluation of the relationship between dependent and independent variables; however, we expect it to show limitations if the data has a complex non-linear structure.
The Random Forest Regression model achieved nearly identical results. Linear Regression was marginally better on the holdout set (MSE 0.0021 vs. 0.0022; R-Squared 0.9979 vs. 0.9978), but the difference is insignificant.

Condition multi-class Classification Models:

The Random Forest Classifier showed superior performance across all evaluation metrics, including macro precision, macro recall, and macro F1-score, which suggests its strength in handling multi-class problems.
The KNN Classifier, although simpler to implement and understand, showed much lower accuracy and precision, likely due to its sensitivity to data scaling and class distribution.

Deployment Insights:

The interactive nature of the deployment allows users to explore temperature trends dynamically over time. This interactive line plot and sortable data table make it a powerful tool for understanding and analyzing temperature data.

Conclusion:

Overall, the results show the importance of selecting models that align with the structure of the data and the complexity of the problem. Simpler models such as Linear Regression and KNN are good for baseline evaluations. For classification, Random Forest showed a clear advantage in capturing patterns and delivering more reliable predictions; for temperature regression, the linear baseline performed on par with Random Forest.

WQD7004 SEOUL HISTORICAL WEATHER 2009 TO 2023

AFRINA ROSA, TAUFIQ ISMAIL, KARAM ALJANADI, CHIA KAI SWAN, SITI HAIRUNEESHA

2025-01-06