| Matric | Full_Name |
|---|---|
| S2110638 | Karam Aljanadi |
| 23094825 | Siti Hairuneesha Binti Zainuddin |
| u2005486 | Afrina Rosa binti Mohamad Sattar |
| 23085637 | Chia Kai Swan |
| S2119226 | Muhammad Taufiq Bin Ismail |
Weather forecasting plays a pivotal role in understanding climatic patterns, preparing for extreme weather events, and optimizing resources across various sectors. This study focuses on Seoul’s weather over a 15-year period (2009–2023), utilizing historical weather data to uncover significant trends and patterns. By analyzing key attributes such as temperature, precipitation, wind speed, and cloud cover, we aim to gain insights into Seoul’s evolving climate and predict future weather conditions.
The dataset, titled “Seoul Historical Weather Data (1994–2024),” is available on Kaggle.
Research Objective:
Research Questions:
What weather trends can be identified in Seoul from 2009 to 2023?
How accurate are forecasting models for predicting future weather?
How can visualizations help urban planners use weather data for decision-making?
CRISP-DM
1. Understand: Define the problem: analyze Seoul’s weather data for insights and predictions.
2. Explore: Load and examine the dataset, identifying key attributes.
3. Prepare: Clean and preprocess data for analysis.
4. Model: Build predictive models using statistical/ML methods.
5. Evaluate: Assess model performance using relevant metrics.
6. Deploy: Present findings, forecasts, and actionable recommendations.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(ggplot2)
library(rpart)
library(corrplot)
## corrplot 0.95 loaded
library(Metrics)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(e1071)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:Metrics':
##
## precision, recall
##
## The following object is masked from 'package:purrr':
##
## lift
library(rsconnect)
library(knitr)
df <- read_csv("dataset123.csv")
## Rows: 10958 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, datetime, preciptype, conditions, description, icon, stations
## dbl (25): ROW_KEY, tempmax, tempmin, temp, feelslikemax, feelslikemin, feel...
## dttm (2): sunrise, sunset
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show the number of rows in a more descriptive way
cat("The dataset contains", nrow(df), "rows.")
## The dataset contains 10958 rows.
str(df)
## spc_tbl_ [10,958 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ROW_KEY : num [1:10958] 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr [1:10958] "seoul" "seoul" "seoul" "seoul" ...
## $ datetime : chr [1:10958] "1/1/1994" "2/1/1994" "3/1/1994" "4/1/1994" ...
## $ tempmax : num [1:10958] 35.2 43 47.9 38.8 40 42.7 37 44.3 48.8 48 ...
## $ tempmin : num [1:10958] 16.4 31.5 30.9 22.1 24 26.1 17.7 NA 33.6 27.9 ...
## $ temp : num [1:10958] 26.3 36.2 38 30.1 33.1 35.5 27.5 34.2 41 39.7 ...
## $ feelslikemax : num [1:10958] 33.4 39.4 44.7 32 40 39.1 37 37.6 46.1 48 ...
## $ feelslikemin : num [1:10958] 13 26.7 24.5 18.4 18.5 15.7 13.7 12.6 29.2 27.9 ...
## $ feelslike : num [1:10958] 24.3 32.6 35.4 26.3 31 30.1 24.3 28.2 39.1 38.9 ...
## $ dew : num [1:10958] 15.5 27.9 27.3 13.6 21.7 25 11.3 24.3 34.3 32.4 ...
## $ humidity : num [1:10958] 65.9 72.1 68.1 51.2 63.9 66.7 52.1 67.7 77.4 75.8 ...
## $ precip : num [1:10958] 0 0 0 0 0.01 0.1 0 0 0 0 ...
## $ precipprob : num [1:10958] 0 0 0 0 100 100 0 0 0 0 ...
## $ precipcover : num [1:10958] 0 0 0 0 4.17 4.17 0 0 0 0 ...
## $ preciptype : chr [1:10958] NA NA NA NA ...
## $ snow : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ snowdepth : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ windgust : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ windspeed : num [1:10958] 5.5 9.1 12.8 11.7 6.8 11.8 9.1 14.7 8.9 4.5 ...
## $ winddir : num [1:10958] 115 182 290 302 134 ...
## $ sealevelpressure: num [1:10958] 1025 1022 1020 1025 1024 ...
## $ cloudcover : num [1:10958] 63 88.8 57.4 16.3 90.4 57.2 30.1 46.6 NA 81 ...
## $ visibility : num [1:10958] 6.6 6.9 5.3 7.3 6.2 5.1 7.4 7.6 3.7 2.2 ...
## $ solarradiation : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ solarenergy : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ uvindex : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ severerisk : num [1:10958] NA NA NA NA NA NA NA NA NA NA ...
## $ sunrise : POSIXct[1:10958], format: "1994-01-01 07:46:54" "1994-01-02 07:47:03" ...
## $ sunset : POSIXct[1:10958], format: "1994-01-01 17:23:56" "1994-01-02 17:24:44" ...
## $ moonphase : num [1:10958] 0.61 0.65 0.68 0.72 0.75 0.79 0.83 0.86 0.9 0.93 ...
## $ conditions : chr [1:10958] "Partially cloudy" "Partially cloudy" "Partially cloudy" "Clear" ...
## $ description : chr [1:10958] "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Clear conditions throughout the day." ...
## $ icon : chr [1:10958] "partly-cloudy-day" "partly-cloudy-day" "partly-cloudy-day" "clear-day" ...
## $ stations : chr [1:10958] "4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000" "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000"| __truncated__ "47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,"| __truncated__ ...
## - attr(*, "spec")=
## .. cols(
## .. ROW_KEY = col_double(),
## .. name = col_character(),
## .. datetime = col_character(),
## .. tempmax = col_double(),
## .. tempmin = col_double(),
## .. temp = col_double(),
## .. feelslikemax = col_double(),
## .. feelslikemin = col_double(),
## .. feelslike = col_double(),
## .. dew = col_double(),
## .. humidity = col_double(),
## .. precip = col_double(),
## .. precipprob = col_double(),
## .. precipcover = col_double(),
## .. preciptype = col_character(),
## .. snow = col_double(),
## .. snowdepth = col_double(),
## .. windgust = col_double(),
## .. windspeed = col_double(),
## .. winddir = col_double(),
## .. sealevelpressure = col_double(),
## .. cloudcover = col_double(),
## .. visibility = col_double(),
## .. solarradiation = col_double(),
## .. solarenergy = col_double(),
## .. uvindex = col_double(),
## .. severerisk = col_double(),
## .. sunrise = col_datetime(format = ""),
## .. sunset = col_datetime(format = ""),
## .. moonphase = col_double(),
## .. conditions = col_character(),
## .. description = col_character(),
## .. icon = col_character(),
## .. stations = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
kable(head(df))
| ROW_KEY | name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | preciptype | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | seoul | 1/1/1994 | 35.2 | 16.4 | 26.3 | 33.4 | 13.0 | 24.3 | 15.5 | 65.9 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 5.5 | 115.4 | 1025.4 | 63.0 | 6.6 | NA | NA | NA | NA | 1994-01-01 07:46:54 | 1994-01-01 17:23:56 | 0.61 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 4,711,109,999,947,110,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 2 | seoul | 2/1/1994 | 43.0 | 31.5 | 36.2 | 39.4 | 26.7 | 32.6 | 27.9 | 72.1 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 9.1 | 181.7 | 1022.2 | 88.8 | 6.9 | NA | NA | NA | NA | 1994-01-02 07:47:03 | 1994-01-02 17:24:44 | 0.65 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 3 | seoul | 3/1/1994 | 47.9 | 30.9 | 38.0 | 44.7 | 24.5 | 35.4 | 27.3 | 68.1 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 12.8 | 289.5 | 1020.0 | 57.4 | 5.3 | NA | NA | NA | NA | 1994-01-03 07:47:11 | 1994-01-03 17:25:33 | 0.68 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 4 | seoul | 4/1/1994 | 38.8 | 22.1 | 30.1 | 32.0 | 18.4 | 26.3 | 13.6 | 51.2 | 0.00 | 0 | 0.00 | NA | NA | NA | NA | 11.7 | 301.9 | 1025.1 | 16.3 | 7.3 | NA | NA | NA | NA | 1994-01-04 07:47:16 | 1994-01-04 17:26:23 | 0.72 | Clear | Clear conditions throughout the day. | clear-day | 47,111,099,999,471,100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5 | seoul | 5/1/1994 | 40.0 | 24.0 | 33.1 | 40.0 | 18.5 | 31.0 | 21.7 | 63.9 | 0.01 | 100 | 4.17 | rain,snow | NA | NA | NA | 6.8 | 134.1 | 1023.9 | 90.4 | 6.2 | NA | NA | NA | NA | 1994-01-05 07:47:19 | 1994-01-05 17:27:14 | 0.75 | Snow, Rain, Overcast | Cloudy skies throughout the day with late afternoon rain or snow. | rain | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 6 | seoul | 6/1/1994 | 42.7 | 26.1 | 35.5 | 39.1 | 15.7 | 30.1 | 25.0 | 66.7 | 0.10 | 100 | 4.17 | rain | NA | NA | NA | 11.8 | 290.2 | 1021.9 | 57.2 | 5.1 | NA | NA | NA | NA | 1994-01-06 07:47:21 | 1994-01-06 17:28:07 | 0.79 | Rain, Partially cloudy | Partly cloudy throughout the day with morning rain. | rain | 471,110,999,994,711,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
Filter the data to the 2009–2023 period so that the analysis covers the most recent 15 years and leaves out older records that may be less relevant.
# Convert `datetime` column to Date
df$datetime <- as.Date(df$datetime, format = "%d/%m/%Y")
# Filter the data for the date range 2009 to 2023
filtered_data <- df %>%
  filter(datetime >= as.Date("2009-01-01"), datetime <= as.Date("2023-12-31"))
| ROW_KEY | name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | preciptype | snow | snowdepth | windgust | windspeed | winddir | sealevelpressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5480 | seoul | 2009-01-01 | 28.5 | 14.5 | 20.9 | 23.2 | 6.4 | 15.3 | 5.9 | 52.7 | 0 | 0 | 0 | NA | 0 | 0 | NA | 9.5 | 308.6 | 1026.5 | 0.2 | 7.3 | NA | NA | NA | NA | 2009-01-01 07:46:55 | 2009-01-01 17:24:11 | 0.16 | Clear | Clear conditions throughout the day. | clear-day | 471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5481 | seoul | 2009-01-02 | 35.0 | 13.3 | 24.2 | 30.8 | 12.7 | 22.3 | 12.2 | 63.0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 5.6 | 326.3 | 1029.3 | 0.0 | 7.0 | NA | NA | NA | NA | 2009-01-02 07:47:04 | 2009-01-02 17:24:59 | 0.19 | Clear | Clear conditions throughout the day. | clear-day | 4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5482 | seoul | 2009-01-03 | 39.4 | 15.1 | 27.2 | 39.4 | 12.7 | 25.6 | 14.2 | 62.7 | 0 | 0 | 0 | NA | 0 | 0 | NA | 5.6 | 329.4 | 1029.4 | 11.2 | 6.7 | NA | NA | NA | NA | 2009-01-03 07:47:11 | 2009-01-03 17:25:48 | 0.23 | Clear | Clear conditions throughout the day. | clear-day | 471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5483 | seoul | 2009-01-04 | 40.4 | 22.4 | 30.5 | 36.3 | 22.4 | 28.6 | 15.8 | 58.6 | 0 | 0 | 0 | NA | 0 | 0 | NA | 7.8 | 306.2 | 1026.9 | 27.5 | 7.1 | NA | NA | NA | NA | 2009-01-04 07:47:15 | 2009-01-04 17:26:39 | 0.25 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5484 | seoul | 2009-01-05 | 34.0 | 18.7 | 27.4 | 32.2 | 17.9 | 24.4 | 17.6 | 67.8 | 0 | 0 | 0 | NA | 0 | 0 | NA | 9.0 | 287.0 | 1026.9 | 34.0 | 6.6 | NA | NA | NA | NA | 2009-01-05 07:47:18 | 2009-01-05 17:27:31 | 0.29 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 471,110,999,994,709,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
| 5485 | seoul | 2009-01-06 | 34.9 | 15.8 | 26.2 | 34.8 | 15.8 | 24.4 | 14.2 | 63.1 | 0 | 0 | 0 | NA | 0 | 0 | NA | 5.6 | 326.1 | 1028.9 | 38.8 | 4.6 | NA | NA | NA | NA | 2009-01-06 07:47:19 | 2009-01-06 17:28:24 | 0.33 | Partially cloudy | Becoming cloudy in the afternoon. | partly-cloudy-day | 4,711,109,999,947,090,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 |
# Filter the relevant columns
filtered_data <- filtered_data %>%
  select(datetime, tempmax, tempmin, temp, precip, windspeed, cloudcover, winddir, visibility, moonphase)
# Check the filtered dataset
str(filtered_data)
## tibble [5,478 × 10] (S3: tbl_df/tbl/data.frame)
## $ datetime : Date[1:5478], format: "2009-01-01" "2009-01-02" ...
## $ tempmax : num [1:5478] 28.5 35 39.4 40.4 34 34.9 39.5 37.9 32.1 24.9 ...
## $ tempmin : num [1:5478] 14.5 13.3 15.1 22.4 18.7 15.8 15.6 17.6 18.5 14.6 ...
## $ temp : num [1:5478] 20.9 24.2 27.2 30.5 27.4 26.2 27.5 27.9 25.7 19.7 ...
## $ precip : num [1:5478] 0 0 0 0 0 0 0 0 0 0 ...
## $ windspeed : num [1:5478] 9.5 5.6 5.6 7.8 9 5.6 6.7 7.4 7.9 9.1 ...
## $ cloudcover: num [1:5478] 0.2 0 11.2 27.5 34 38.8 10.8 36.1 17.2 8.2 ...
## $ winddir : num [1:5478] 309 326 329 306 287 ...
## $ visibility: num [1:5478] 7.3 7 6.7 7.1 6.6 4.6 5.2 6.1 6.6 7.4 ...
## $ moonphase : num [1:5478] 0.16 0.19 0.23 0.25 0.29 0.33 0.36 0.4 0.43 0.47 ...
Key attributes of the selected columns:
datetime: Records the specific date for each observation, formatted as “YYYY-MM-DD.”
tempmax: Represents the maximum temperature recorded each day, useful for identifying heatwaves and hot days. Values are in degrees Celsius (°C) only after the unit conversion applied later in this report; the raw values for 2009–2021 are treated as Fahrenheit.
tempmin: Represents the minimum temperature recorded each day (same units as tempmax), helpful for detecting cold days and frost conditions.
temp: Provides the average temperature recorded each day (same units as tempmax), offering a balanced view of daily temperature conditions.
precip: Indicates the total daily precipitation in millimeters (mm), capturing the amount of rainfall or snowfall.
windspeed: Measures the average daily wind speed in kilometers per hour (km/h), which is essential for understanding wind variability and weather patterns.
cloudcover: Shows the average percentage of cloud cover during the day, providing insights into daily sky conditions and solar exposure.
winddir: Represents the average wind direction in degrees, measured clockwise from true north, which helps in studying prevailing wind patterns.
visibility: Measures the average visibility distance per day in kilometers, useful for assessing air quality and weather clarity.
moonphase: Indicates the moon phase as a numeric value (e.g., 0 for new moon, 0.5 for full moon), providing information on the lunar cycle that might correlate with certain natural phenomena.
These attributes collectively help analyze a wide range of environmental and climatic trends, such as temperature extremes, precipitation patterns, wind dynamics, and atmospheric conditions, all of which are valuable for urban planning, agriculture, and environmental studies.
# Show number of missing values before handling
missing_before <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values before handling:")
## [1] "Missing values before handling:"
print(missing_before)
## datetime tempmax tempmin temp precip windspeed cloudcover
## 0 0 0 1 1 0 2
## winddir visibility moonphase
## 0 0 0
# Fill missing values in numeric columns with the mean
numeric_columns <- names(filtered_data)[sapply(filtered_data, is.numeric)]
filtered_data <- filtered_data %>%
  mutate(across(all_of(numeric_columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Show number of missing values after handling
missing_after <- sapply(filtered_data, function(x) sum(is.na(x)))
print("Missing values after handling:")
## [1] "Missing values after handling:"
print(missing_after)
## datetime tempmax tempmin temp precip windspeed cloudcover
## 0 0 0 0 0 0 0
## winddir visibility moonphase
## 0 0 0
# Remove duplicates
filtered_data <- distinct(filtered_data)
The precip column is renamed to precipitation to make its name more descriptive.
# Rename precip to precipitation
filtered_data <- filtered_data %>%
  rename(precipitation = precip)
# Check the updated column names
colnames(filtered_data)
## [1] "datetime" "tempmax" "tempmin" "temp"
## [5] "precipitation" "windspeed" "cloudcover" "winddir"
## [9] "visibility" "moonphase"
# Time series plot
ggplot(filtered_data, aes(x = datetime, y = tempmax)) +
  geom_line(color = "blue") + # Line plot
  labs(title = "Max Temperature",
       x = "Date",
       y = "Maximum Temperature") +
  theme_minimal()
The maximum temperature data shows a clear cyclical pattern, likely due to seasonal variations. There appears to be a slight upward trend over time. Some anomalies are observed, suggesting the influence of weather events or other factors.
# Summary statistics for temp columns
summary(filtered_data$tempmax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.10 39.10 61.00 57.61 79.20 102.30
summary(filtered_data$tempmin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -17.10 23.70 41.70 41.68 62.50 84.40
# Check for unusually high or low values
boxplot(filtered_data$tempmax, main = "Boxplot of tempmax", ylab = "Temperature")
boxplot(filtered_data$tempmin, main = "Boxplot of tempmin", ylab = "Temperature")
Both tempmax and tempmin exhibit a wide range and significant variability. Tempmax shows a higher median and a wider range compared to tempmin. This suggests substantial daily temperature fluctuations.
Considerations:
Units: The units for temperature are not specified in the plots. A maximum of 102.3 is not plausible for Seoul in degrees Celsius, so at least part of the data must be recorded in Fahrenheit, and knowing the unit of each record is essential for accurate interpretation.
Unit Consistency: Using a single unit (Celsius) ensures accurate comparisons and calculations throughout the analysis.
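One way to find out where the units change is to compare yearly averages: a sudden drop in the mean maximum temperature between two consecutive years points to a switch from Fahrenheit to Celsius. The sketch below runs that check on a small synthetic data frame; `toy` and its values are made up for illustration and are not taken from the dataset.

```r
# Synthetic example: two years stored in Fahrenheit, one year in Celsius
toy <- data.frame(
  datetime = as.Date(c("2020-01-15", "2020-07-15",
                       "2021-01-15", "2021-07-15",
                       "2022-01-15", "2022-07-15")),
  tempmax  = c(35, 86, 34, 88, 2, 30)
)

# Mean of tempmax per calendar year (a plain named vector)
toy$year <- as.integer(format(toy$datetime, "%Y"))
yearly_mean <- vapply(split(toy$tempmax, toy$year), mean, numeric(1))
print(yearly_mean)

# Year-over-year change; the largest negative jump flags a likely unit switch
jump <- diff(yearly_mean)
suspect_year <- names(jump)[which.min(jump)]
print(suspect_year)
```

On real data, the flagged year is only a candidate: it should be confirmed against the data provider's documentation before any conversion is applied.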
# Define conversion functions
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}
fahrenheit_to_celsius <- function(fahrenheit) {
  (fahrenheit - 32) * 5 / 9
}
# Define the date range for conversion
start_date <- as.Date('2009-01-01')
end_date <- as.Date('2021-12-31')
# Filter the data for the dates that need conversion (2009-2021)
data_to_convert <- filtered_data[filtered_data$datetime >= start_date & filtered_data$datetime <= end_date, ]
# Define the columns to convert
temp_columns <- c("tempmax", "tempmin", "temp")
# Apply temperature conversion for the filtered rows (2009-2021)
for (col in temp_columns) {
  data_to_convert[[col]] <- round(fahrenheit_to_celsius(data_to_convert[[col]]), 1)
}
# Now, merge the converted data back with the original data
data_non_converted <- filtered_data[!(filtered_data$datetime >= start_date & filtered_data$datetime <= end_date), ]
# Combine the data
final_data <- rbind(data_non_converted, data_to_convert)
# Ensure that the combined data is sorted by date
final_data <- final_data[order(final_data$datetime), ]
| datetime | tempmax | tempmin | temp | precipitation | windspeed | cloudcover | winddir | visibility | moonphase |
|---|---|---|---|---|---|---|---|---|---|
| 2009-01-01 | -1.9 | -9.7 | -6.2 | 0 | 9.5 | 0.2 | 308.6 | 7.3 | 0.16 |
| 2009-01-02 | 1.7 | -10.4 | -4.3 | 0 | 5.6 | 0.0 | 326.3 | 7.0 | 0.19 |
| 2009-01-03 | 4.1 | -9.4 | -2.7 | 0 | 5.6 | 11.2 | 329.4 | 6.7 | 0.23 |
| 2009-01-04 | 4.7 | -5.3 | -0.8 | 0 | 7.8 | 27.5 | 306.2 | 7.1 | 0.25 |
| 2009-01-05 | 1.1 | -7.4 | -2.6 | 0 | 9.0 | 34.0 | 287.0 | 6.6 | 0.29 |
| 2009-01-06 | 1.6 | -9.0 | -3.2 | 0 | 5.6 | 38.8 | 326.1 | 4.6 | 0.33 |
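The split, convert, and recombine steps can also be written as a single in-place update with a logical row mask, which keeps the row order and removes the need to re-sort. The following is a minimal sketch on a synthetic data frame (`toy` is hypothetical), assuming, as the code above does, that rows dated 2009–2021 are in Fahrenheit and later rows are already in Celsius.

```r
fahrenheit_to_celsius <- function(fahrenheit) (fahrenheit - 32) * 5 / 9

# Synthetic rows around the assumed unit boundary
toy <- data.frame(
  datetime = as.Date(c("2021-12-30", "2021-12-31", "2022-01-01")),
  tempmax  = c(32, 50, 5),
  tempmin  = c(14, 41, -3),
  temp     = c(23, 45.5, 1)
)

temp_columns <- c("tempmax", "tempmin", "temp")

# TRUE for rows assumed to be stored in Fahrenheit
in_fahrenheit <- toy$datetime >= as.Date("2009-01-01") &
  toy$datetime <= as.Date("2021-12-31")

# Convert only the masked rows, leaving all other rows untouched
converted <- toy
converted[in_fahrenheit, temp_columns] <- round(
  fahrenheit_to_celsius(toy[in_fahrenheit, temp_columns]), 1
)
print(converted)
```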
Exploratory Data Analysis (EDA) is an essential step in data analysis that helps us understand the dataset, uncover underlying patterns, detect anomalies, and test assumptions before applying any predictive models. In this section, we aim to explore the Seoul Historical Weather data to extract meaningful insights regarding weather patterns over a 15-year period (2009–2023). By visualizing the data, calculating summary statistics, and performing various checks, we will prepare the dataset for further analysis and modeling.
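Calculate summary statistics (minimum, quartiles, median, mean, maximum) for key attributes like temperature, precipitation, wind speed, etc. In the output below, the maximum of temp (49.47) exceeds the maximum of tempmax (39.10), which cannot be a genuine daily average. It is most likely the value imputed for a missing temp entry: the mean was computed before the unit conversion, while the column still mixed Fahrenheit and Celsius readings, and the affected row fell outside the converted 2009–2021 range.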
## datetime tempmax tempmin temp
## Min. :2009-01-01 Min. :-11.30 Min. :-19.000 Min. :-14.40
## 1st Qu.:2012-10-01 1st Qu.: 8.40 1st Qu.: -1.000 1st Qu.: 3.80
## Median :2016-07-01 Median : 19.55 Median : 9.000 Median : 14.10
## Mean :2016-07-01 Mean : 17.68 Mean : 8.331 Mean : 12.89
## 3rd Qu.:2020-03-31 3rd Qu.: 27.10 3rd Qu.: 18.275 3rd Qu.: 22.60
## Max. :2023-12-31 Max. : 39.10 Max. : 29.100 Max. : 49.47
## precipitation windspeed cloudcover winddir
## Min. : 0.0000 Min. : 2.60 Min. : 0.00 Min. : 0.2
## 1st Qu.: 0.0000 1st Qu.: 7.90 1st Qu.: 25.40 1st Qu.:167.6
## Median : 0.0000 Median :10.00 Median : 50.50 Median :262.7
## Mean : 0.7568 Mean :10.61 Mean : 50.19 Mean :225.0
## 3rd Qu.: 0.0400 3rd Qu.:12.60 3rd Qu.: 74.80 3rd Qu.:299.7
## Max. :268.6200 Max. :44.90 Max. :100.00 Max. :359.9
## visibility moonphase
## Min. : 0.700 Min. :0.0000
## 1st Qu.: 5.600 1st Qu.:0.2500
## Median : 7.100 Median :0.5000
## Mean : 7.986 Mean :0.4833
## 3rd Qu.: 8.200 3rd Qu.:0.7500
## Max. :46.200 Max. :0.9800
Plot the maximum, minimum, and average temperature series to observe temperature variations over time.
ggplot(final_data, aes(x = datetime)) +
  geom_line(aes(y = tempmax), color = "red") +
  geom_line(aes(y = tempmin), color = "blue") +
  geom_line(aes(y = temp), color = "green") +
  labs(title = "Temperature Over Time", x = "Date", y = "Temperature (°C)") +
  theme_minimal()
Color Interpretation:
Red: Represents the daily maximum temperature (tempmax).
Blue: Represents the daily minimum temperature (tempmin).
Green: Represents the daily average temperature (temp).
Visual Interpretation:
Cyclical Pattern: The most prominent feature is the clear cyclical pattern in the temperature data. There are distinct peaks and troughs that repeat at regular intervals. This suggests a strong seasonal component in the temperature variations.
Amplitude: The amplitude of the fluctuations seems to vary somewhat over time. There are periods with higher peaks and deeper troughs, and others with more moderate variations. This could be due to factors like El Niño or other climate phenomena.
Trend: There appears to be a slight upward trend in the overall temperature levels from 2010 to 2020. This could be attributed to long-term climate change or other regional factors.
Anomalies: There are a few notable anomalies, such as a sharp drop in temperature around 2016. These could be due to specific weather events or other localized factors.
In Summary:
The plot visually depicts the temperature fluctuations over time, highlighting the seasonal variations, potential trends, and notable anomalies. The use of color effectively differentiates between temperature ranges, providing a clear visual representation of the temperature variations.
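A trend read off a strongly seasonal plot is hard to judge by eye. One common way to quantify it is to average the series by year, which removes most of the seasonal cycle, and fit a linear regression of the annual mean on the year. The sketch below does this on synthetic data with a built-in trend of 0.05 °C per year; `toy` and all its values are simulated, so the printed slope says nothing about Seoul.

```r
set.seed(1)

# Simulated daily series: seasonal cycle + small linear trend + noise
dates <- seq(as.Date("2009-01-01"), as.Date("2023-12-31"), by = "day")
day_of_year <- as.integer(format(dates, "%j"))
years_elapsed <- as.numeric(dates - dates[1]) / 365.25
toy <- data.frame(
  datetime = dates,
  temp = 12 - 14 * cos(2 * pi * day_of_year / 365.25) +
    0.05 * years_elapsed + rnorm(length(dates), sd = 3)
)

# Annual means remove most of the seasonal component
toy$year <- as.integer(format(toy$datetime, "%Y"))
annual <- aggregate(temp ~ year, data = toy, FUN = mean)

# Slope of the fitted line = estimated change in °C per year
fit <- lm(temp ~ year, data = annual)
slope <- unname(coef(fit)["year"])
print(slope)
print(confint(fit)["year", ])
```

If the confidence interval for the slope includes zero, the data do not support calling the trend upward.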
# Reshape the dataset to long format
final_data_long <- pivot_longer(
  final_data,
  cols = -datetime, # Exclude 'datetime' since it's the identifier
  names_to = "variable",
  values_to = "value"
)
# Plotting with facets
ggplot(final_data_long, aes(x = datetime, y = value)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free_y") + # Free y-axis scale for better visualization
  labs(
    title = "Time Series of Weather Variables",
    x = "Date",
    y = "Value"
  ) +
  theme_minimal()
Precipitation: The precipitation time series shows distinct peaks and troughs, suggesting a seasonal rainfall pattern. There appear to be periods with higher precipitation and others with lower precipitation.
Wind Speed: The wind speed time series also shows a cyclical pattern, but with less pronounced peaks and troughs compared to precipitation. The overall variation in wind speed seems to be less than that of precipitation.
Cloud Cover: The cloud cover time series exhibits a high degree of variability with frequent fluctuations. There are periods with consistently high cloud cover and others with lower cloud cover.
The most likely explanation for the cyclical patterns is seasonality. In many regions, precipitation, wind speed, and cloud cover are influenced by factors like monsoon seasons, changes in atmospheric circulation, and solar radiation.
A heatwave threshold of 35 °C is defined, and every day whose maximum temperature exceeds it is marked as a heatwave day.
# Define heatwave threshold
heatwave_threshold <- 35 # Example threshold for heatwaves (°C)
# Mark heatwave days
final_data <- final_data %>%
  mutate(heatwave = ifelse(tempmax > heatwave_threshold, 1, 0))
# Give each run of consecutive identical flags its own id:
# the id increases whenever the flag differs from the previous day
final_data <- final_data %>%
  mutate(heatwave_id = cumsum(heatwave != lag(heatwave, default = 0)))
# Count the number of heatwave days
num_heatwave_days <- sum(final_data$heatwave == 1)
# Create a message
heatwave_message <- paste("Number of heatwave days in Seoul, 2009-2023:", num_heatwave_days)
# Print the message
print(heatwave_message)
## [1] "Number of heatwave days in Seoul, 2009-2023: 45"
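The figure above counts individual days over the threshold, not separate heatwave events. To count events, consecutive hot days can be grouped into runs. The sketch below uses base R's `rle()` on made-up values; the minimum run length `min_days` is an assumption chosen for illustration, not a definition used in this report.

```r
heatwave_threshold <- 35
min_days <- 2  # assumed minimum number of consecutive hot days (hypothetical)

# Made-up daily maximum temperatures in °C
tempmax <- c(33, 36, 37, 34, 36, 33, 35.5, 36, 38, 30)
is_hot <- tempmax > heatwave_threshold

# Run-length encoding: lengths of consecutive TRUE/FALSE stretches
runs <- rle(is_hot)
spell_lengths <- runs$lengths[runs$values]  # lengths of the hot runs only

num_hot_days <- sum(is_hot)                       # days above the threshold
num_heatwaves <- sum(spell_lengths >= min_days)   # runs long enough to count

print(spell_lengths)
print(num_hot_days)
print(num_heatwaves)
```

With these values there are six hot days but only two runs of at least two days, which shows how far the two counts can diverge.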
# Scatter plot for tempmax with heatwave days highlighted
ggplot(final_data, aes(x = datetime, y = tempmax)) +
  # Scatter plot for all data points
  geom_point(aes(color = factor(heatwave), shape = factor(heatwave)), size = 3) +
  # Customize color scale to distinguish heatwave days
  scale_color_manual(values = c("black", "red"), labels = c("Non-Heatwave", "Heatwave")) +
  scale_shape_manual(values = c(16, 17), labels = c("Non-Heatwave", "Heatwave")) +
  # Add annotation for heatwave days count
  annotate("text", x = as.Date("2022-01-01"), y = max(final_data$tempmax),
           label = paste("Heatwave Days:", num_heatwave_days),
           hjust = 0, color = "blue", size = 2.5) +
  labs(title = "Scatter Plot of Max Temperature with Heatwave Days Highlighted",
       x = "Date",
       y = "Max Temperature (°C)") +
  theme_minimal() +
  theme(legend.title = element_blank())
Interpretation:
The scatter plot provides a visual representation of heatwave events in relation to maximum temperature. It allows us to identify:
Days exceeding the heatwave threshold: The red triangles clearly highlight the days where the maximum temperature surpassed the defined threshold.
Seasonal patterns: The clustering of red triangles suggests that heatwaves are more likely to occur during specific seasons or periods.
Temperature distribution: The scatter plot shows the overall distribution of maximum temperatures, including both heatwave and non-heatwave days.
The correlation analysis is visualized through a circle heatmap, which highlights the strength and direction of relationships between different variables in the dataset.
cor_matrix <- cor(final_data[, c("tempmax", "tempmin", "temp", "precipitation", "windspeed",
                                 "cloudcover", "winddir", "visibility", "moonphase")])
corrplot(cor_matrix, method = "circle")
Interpretation of Correlation Plot
The following interpretation is based on the correlation plot created for the dataset. The size and color of the circles represent the strength and direction of the correlation between each pair of variables.
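Exact values are hard to read off circle sizes, so it can help to list the coefficients as a ranked table alongside the plot. The sketch below shows one way to do that on synthetic data; `toy` is simulated and `cor_pairs` is a hypothetical name, so the printed numbers are not results from the Seoul dataset.

```r
set.seed(42)
n <- 200

# Simulated variables: tempmax and tempmin track temp, windspeed is independent
temp <- rnorm(n, mean = 13, sd = 10)
toy <- data.frame(
  temp      = temp,
  tempmax   = temp + rnorm(n, mean = 5, sd = 1.5),
  tempmin   = temp - rnorm(n, mean = 5, sd = 1.5),
  windspeed = rnorm(n, mean = 10, sd = 3)
)

cor_matrix <- cor(toy)

# Keep each pair once (upper triangle), then sort by absolute correlation
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```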
Strong Positive Correlations:
tempmax and tempmin have a very strong positive correlation. This makes sense, as the maximum temperature of a day is closely related to its minimum temperature, with both typically increasing or decreasing together.
temp and tempmax have a strong positive correlation. As the average temperature of the day increases, the maximum temperature also tends to increase.
temp and tempmin also show a strong positive correlation. Similarly, as the average temperature of the day increases, the minimum temperature increases as well.
Moderate to Weak Correlations:
The correlation between precipitation and windspeed is weak, suggesting that wind speed and precipitation levels do not move closely together.
The correlation between cloudcover and temp is also weak to moderate, meaning that cloud cover does not always correlate strongly with the temperature of the day.
The seasonal analysis explores the variations in key variables over different seasons, aiming to identify trends, patterns, and anomalies that may influence the overall dataset.
To categorize the data by season, a new ‘season’ column is created based on the month of the year, with the following assumptions:
Winter: December, January, February
Spring: March, April, May
Summer: June, July, August
Fall: September, October, November
# Create a 'season' column from the month, using the mapping above
final_data <- final_data %>%
  mutate(season = case_when(
    format(datetime, "%m") %in% c("12", "01", "02") ~ "Winter",
    format(datetime, "%m") %in% c("03", "04", "05") ~ "Spring",
    format(datetime, "%m") %in% c("06", "07", "08") ~ "Summer",
    format(datetime, "%m") %in% c("09", "10", "11") ~ "Fall"
  ))
# Aggregate by season
seasonal_data <- final_data %>%
  group_by(season) %>%
  summarise(
    tempmax_avg = mean(tempmax, na.rm = TRUE),
    tempmin_avg = mean(tempmin, na.rm = TRUE),
    precip_avg = mean(precipitation, na.rm = TRUE),
    windspeed_avg = mean(windspeed, na.rm = TRUE),
    cloudcover_avg = mean(cloudcover, na.rm = TRUE)
  )
# Plot the seasonal data
ggplot(seasonal_data, aes(x = season)) +
  geom_bar(aes(y = tempmax_avg), stat = "identity", fill = "red") +
  geom_bar(aes(y = tempmin_avg), stat = "identity", fill = "blue", alpha = 0.5) +
  labs(title = "Average Temperature by Season",
       x = "Season",
       y = "Temperature (°C)") +
  theme_minimal()
In this chart the bars are overlaid rather than stacked: the red bar shows the average daily maximum temperature and the semi-transparent blue bar shows the average daily minimum temperature for each season.
Summer: The highest average temperatures are observed in the Summer season, for both the daily maximum and the daily minimum.
Fall and Spring: The average temperatures in Fall and Spring are lower than in Summer, and the two seasons show a similar gap between average maximum and average minimum.
Winter: The lowest average temperatures occur in Winter, for both the daily maximum and the daily minimum.
ggplot(final_data, aes(x = tempmax)) +
geom_boxplot() +
labs(title = "Outliers in Max Temperature", x = "Temperature (°C)") +
theme_minimal()
ggplot(final_data, aes(x = tempmin)) +
geom_boxplot() +
labs(title = "Outliers in Min Temperature", x = "Temperature (°C)") +
theme_minimal()
ggplot(final_data, aes(x = temp)) +
geom_boxplot() +
labs(title = "Outliers in Temperature", x = "Temperature (°C)") +
theme_minimal()
The boxplots of tempmax, tempmin, and temp provide a visual summary of the temperature data, highlighting the following characteristics:
No extreme values
Symmetrical distribution
Consistent variations across different temperature metrics (max, min, average)
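The "no extreme values" reading can also be checked numerically. The following is a minimal sketch, assuming the usual 1.5 x IQR whisker rule that boxplots apply by default; `count_outliers` is a hypothetical helper and the values are made up rather than taken from the Seoul dataset.

```r
# Hypothetical helper: count points outside the boxplot whiskers (1.5 x IQR rule).
count_outliers <- function(x) {
  x <- x[!is.na(x)]
  length(boxplot.stats(x)$out)
}

# Made-up illustrative values, not taken from the Seoul dataset.
temps <- c(-5, 0, 4, 9, 15, 21, 26, 30, 33)
temps_with_spike <- c(temps, 80)

count_outliers(temps)             # no point lies beyond the whiskers
count_outliers(temps_with_spike)  # the value 80 is flagged
```

Applied to a column such as `final_data$tempmax`, a result of zero would agree with the boxplots above. Note that `boxplot.stats()` computes hinges slightly differently from `geom_boxplot()`, so borderline points may differ.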
# Example: Creating a 'weather' column based on temperature (tempmax)
final_data$weather <- as.factor(ifelse(final_data$tempmax > 30, "Hot", "Cool"))
From the EDA, we now have an analytical understanding of our dataset and its structure. Based on that, we decided to implement four supervised machine learning models: two for regression and two for classification. Implementing two models per problem gives a general understanding of model performance and of how the features and data types affect the evaluation and output.
The first step is feature engineering, which improves performance by preventing feature bias and ensuring appropriate attribute data types. Since each model has different feature requirements, we apply the feature engineering suitable for each model.
# Identify numerical columns
numerical_columns <- c("tempmax", "tempmin", "tempavg", "precip", "windspeed",
"winddir", "cloudcover", "visibility", "moonphase")
# Normalize numerical columns
data[numerical_columns] <- scale(data[numerical_columns])
# Encode categorical column (e.g., 'conditions')
data$conditions <- as.numeric(factor(data$conditions))
# Verify encoding
print(unique(data$conditions))
## [1] 3 1 11 6 5 9 12 2 4 7 10 8
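Because `as.numeric(factor(...))` assigns integer codes in sorted level order, it helps to keep a lookup from code to label so that the class numbers in later tables can be read back. A minimal sketch, using placeholder labels that are assumptions and not the actual values of `conditions`:

```r
# Placeholder labels for illustration; the real 'conditions' values may differ.
conditions_raw <- c("Rain", "Clear", "Snow", "Clear", "Rain")

cond_factor <- factor(conditions_raw)
codes <- as.integer(cond_factor)  # codes follow the sorted level order

# Lookup table from integer code back to the original label.
code_lookup <- data.frame(code = seq_along(levels(cond_factor)),
                          label = levels(cond_factor))
print(code_lookup)

# Decode the integer codes back to labels.
decoded <- levels(cond_factor)[codes]
```

Building the lookup before overwriting the original column keeps the confusion matrices interpretable in terms of weather conditions rather than bare integers.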
We split our dataset into training and testing (holdout) sets in a 70:30 proportion.
# Step 1: Split the dataset into training (70%) and holdout (30%) sets
set.seed(123)
index <- createDataPartition(data$conditions, p = 0.7, list = FALSE)
train_set <- data[index, ]
holdout_set <- data[-index, ]
# Step 2: Check the distribution of classes in training and holdout sets
print(table(train_set$conditions))
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1267 109 3513 52 801 1480 42 1 101 18 49 240
print(table(holdout_set$conditions))
##
## 1 2 3 4 5 6 7 9 10 11 12
## 524 48 1522 19 348 632 17 38 4 20 113
# Prepare features and target variables
X_train <- train_set %>% select(-conditions)
y_train <- train_set$conditions
X_holdout <- holdout_set %>% select(-conditions)
y_holdout <- holdout_set$conditions
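The split above samples within each class so that class proportions are roughly preserved in both sets. The idea can be sketched in base R as follows; `stratified_index` is a hypothetical helper written for illustration and is not caret's implementation, and the labels are made up.

```r
# Hypothetical helper: sample roughly a fraction p of the rows within each class.
stratified_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- max(1, round(p * length(idx)))  # keep at least one row per class
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
y <- rep(c("a", "b", "c"), times = c(100, 30, 10))  # made-up imbalanced labels
train_idx <- stratified_index(y, p = 0.7)

table(y[train_idx])   # class counts in the training part
table(y[-train_idx])  # class counts in the holdout part
```

With this rule, a class that has a single row stays entirely in the training part, which is the situation of class 8 in the tables above (one training row, no holdout rows).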
Random Forest builds multiple decision trees during training and then combines their outputs to make a prediction.
Handles complex and nonlinear relationships effectively
Robustness to Overfitting
Suitability for Multi-Class Classification
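For classification, combining the trees' outputs amounts to a majority vote over the individual tree predictions. A minimal sketch of that aggregation step only, with made-up votes; `majority_vote` is a hypothetical helper and this is not how the `randomForest` package is implemented internally.

```r
# Hypothetical helper: return the most frequent label among the trees' votes.
# Ties are broken in favor of the label that sorts first.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Made-up votes: rows are observations, columns are individual trees.
tree_votes <- rbind(
  c("Rain", "Rain", "Clear", "Rain", "Snow"),
  c("Clear", "Clear", "Clear", "Rain", "Clear"),
  c("Snow", "Rain", "Snow", "Snow", "Clear")
)

forest_pred <- apply(tree_votes, 1, majority_vote)
forest_pred
```

For regression, the same aggregation idea applies with the mean of the trees' numeric predictions in place of the vote.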
# Train the Random Forest model
set.seed(123)
rf_model <- randomForest(as.factor(conditions) ~ ., data = train_set, ntree = 500, mtry = 3)
# Predict on the holdout set
rf_pred <- predict(rf_model, holdout_set)
# Evaluate the model
rf_accuracy <- mean(rf_pred == holdout_set$conditions)
rf_conf_matrix <- table(Predicted = rf_pred, Actual = holdout_set$conditions)
cat("Random Forest Metrics:\n")
## Random Forest Metrics:
cat("Accuracy:", rf_accuracy, "\n")
## Accuracy: 0.9817352
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(rf_conf_matrix)
## Actual
## Predicted 1 2 3 4 5 6 7 9 10 11 12
## 1 524 0 0 0 0 0 0 0 0 0 0
## 2 0 48 0 0 0 0 0 0 0 0 0
## 3 0 0 1522 0 0 0 0 0 0 0 0
## 4 0 0 0 15 0 0 0 0 0 0 0
## 5 0 0 0 0 344 0 0 0 0 4 0
## 6 0 0 0 0 0 620 0 0 0 0 15
## 7 0 0 0 3 0 0 17 0 1 0 0
## 8 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 26 0 0 8
## 10 0 0 0 1 0 0 0 0 3 0 0
## 11 0 0 0 0 4 0 0 0 0 16 0
## 12 0 0 0 0 0 12 0 12 0 0 90
K-Nearest Neighbors (KNN) is a simple, widely used supervised machine learning algorithm for both classification and regression problems. We use it here because it handles multi-class classification directly, without the ‘One-vs-Rest’ or ‘One-vs-One’ approaches that binary classification algorithms need in order to predict multiple classes. First, we need to find the optimal number of neighbors (k) using the elbow graph.
library(class)
# Prepare the KNN inputs
# Scale training and holdout features using the training set's statistics
X_train_scaled <- scale(X_train)
X_holdout_scaled <- scale(X_holdout, center = attr(X_train_scaled, "scaled:center"),
                          scale = attr(X_train_scaled, "scaled:scale"))
# Set up a range of k values to evaluate
k_values <- seq(1, 20, by = 2)  # Odd values of k from 1 to 19
error_rates <- numeric(length(k_values))  # To store error rates for each k
# Loop through each k value and calculate the error rate
for (i in seq_along(k_values)) {
  k <- k_values[i]
  # Train and predict using KNN
  # (this call uses X_train and X_holdout as prepared earlier, not the *_scaled matrices above)
  knn_pred <- knn(train = X_train, test = X_holdout, cl = y_train, k = k)
  # Calculate error rate
  error_rates[i] <- mean(knn_pred != y_holdout)
}
# Find the best k (lowest error rate)
best_k <- k_values[which.min(error_rates)]
The elbow graph below is used to identify the number of neighbors with the lowest error rate.
# Plot the elbow graph
plot(k_values, error_rates, type = "b", pch = 19, col = "blue",
xlab = "Number of Neighbors (k)", ylab = "Error Rate",
main = "Elbow Method for Choosing k")
abline(v = best_k, col = "red", lty = 2)
text(best_k, min(error_rates), labels = paste("Best k =", best_k), pos = 4, col = "red")
# Use the best k value
k <- 9
# Train and predict using KNN
knn_pred <- knn(train = X_train, test = X_holdout, cl = y_train, k = k)
# Evaluate KNN Model
knn_accuracy <- mean(knn_pred == y_holdout)
knn_conf_matrix <- table(Predicted = knn_pred, Actual = y_holdout)
# Display metrics
cat("\nKNN Metrics with k =", k, ":\n")
##
## KNN Metrics with k = 9 :
cat("Accuracy:", knn_accuracy, "\n")
## Accuracy: 0.7038052
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(knn_conf_matrix)
## Actual
## Predicted 1 2 3 4 5 6 7 9 10 11 12
## 1 473 0 48 14 0 4 17 0 4 0 1
## 2 0 1 0 0 2 0 0 0 0 0 0
## 3 50 9 1298 4 17 347 0 28 0 5 79
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 28 7 0 281 43 0 0 0 6 1
## 6 1 9 149 1 45 230 0 0 0 2 8
## 7 0 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 7 0 0 0 0 5 0 0 2
## 10 0 0 0 0 0 0 0 0 0 0 0
## 11 0 1 0 0 2 1 0 0 0 2 0
## 12 0 0 13 0 1 7 0 5 0 5 22
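KNN assigns a class from the nearest training rows by distance, so a feature measured on a large scale can dominate the result. The following is a minimal sketch of that effect on made-up data; `knn_predict_one` and all values are hypothetical, and this is not the `class::knn` implementation.

```r
# Hypothetical helper: classify one query point by majority vote of its k nearest
# training rows, using Euclidean distance.
knn_predict_one <- function(train_x, train_y, query, k = 3) {
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up data: feature f1 is on a small scale, feature f2 on a large scale.
train_x <- cbind(f1 = c(0.1, 0.2, 0.9, 1.0),
                 f2 = c(100, 900, 150, 950))
train_y <- c("A", "A", "B", "B")
query <- c(f1 = 0.15, f2 = 940)

# Unscaled: the large-scale feature f2 dominates the distance.
pred_unscaled <- knn_predict_one(train_x, train_y, query, k = 1)

# Scaled with the training statistics: both features contribute comparably.
scaled_train <- scale(train_x)
scaled_query <- (query - attr(scaled_train, "scaled:center")) /
  attr(scaled_train, "scaled:scale")
pred_scaled <- knn_predict_one(scaled_train, train_y, scaled_query, k = 1)

c(unscaled = pred_unscaled, scaled = pred_scaled)
```

In this toy case the predicted class changes once both features are standardized, which illustrates why the choice of inputs passed to `knn()` matters for the reported accuracy.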
Linear Regression is a simple yet powerful statistical and machine learning technique that models the relationship between the target feature (dependent variable) and one or more predictors (independent variables).
Simple and Easy to Implement
Interpretable Results
Efficient for Small and Medium-Sized Datasets
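The interpretability point can be illustrated with a toy fit. This is a minimal sketch on made-up data that follows a known line exactly, so the recovered coefficients are easy to check; the names `toy` and `toy_fit` are assumptions for illustration only.

```r
# Made-up data following y = 1 + 2 * x exactly, so the fit is easy to check.
toy <- data.frame(x = 1:10)
toy$y <- 1 + 2 * toy$x

toy_fit <- lm(y ~ x, data = toy)
coef(toy_fit)  # intercept and slope recovered by least squares

# Predict for a new, unseen value of x.
toy_pred <- predict(toy_fit, newdata = data.frame(x = 12))
toy_pred
```

Each coefficient reads directly as the expected change in the target per unit change in that predictor, which is what makes the model easy to interpret.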
# Train the Linear Regression Model
linear_model <- lm(tempavg ~ ., data = train_set)
# Predict on the holdout set
linear_pred <- predict(linear_model, newdata = holdout_set)
# Evaluate the Model
# Mean Squared Error (MSE)
mse <- mean((linear_pred - holdout_set$tempavg)^2)
# Root Mean Squared Error (RMSE)
rmse <- sqrt(mse)
# R-Squared
ss_total <- sum((holdout_set$tempavg - mean(holdout_set$tempavg))^2)
ss_residual <- sum((holdout_set$tempavg - linear_pred)^2)
r_squared <- 1 - (ss_residual / ss_total)
# Print Metrics
cat("Linear Regression Metrics:\n")
## Linear Regression Metrics:
cat("MSE:", mse, "\n")
## MSE: 0.002129343
cat("RMSE:", rmse, "\n")
## RMSE: 0.04614481
cat("R-Squared:", r_squared, "\n")
## R-Squared: 0.9979184
Random Forest Regression is an ensemble learning method that combines multiple decision trees to predict continuous target variables.
Handles Nonlinear Relationships
Feature Importance and Interpretability
Handles High-Dimensional Data
#Load necessary library
library(randomForest)
# Train the Random Forest Regression Model
set.seed(123) # For reproducibility
rf_reg_model <- randomForest(tempavg ~ ., data = train_set, ntree = 500, mtry = 3)
# Predict on the holdout set
rf_reg_pred <- predict(rf_reg_model, newdata = holdout_set)
# Evaluate the Model
# Mean Squared Error (MSE)
rf_mse <- mean((rf_reg_pred - holdout_set$tempavg)^2)
# Root Mean Squared Error (RMSE)
rf_rmse <- sqrt(rf_mse)
# R-Squared
rf_ss_total <- sum((holdout_set$tempavg - mean(holdout_set$tempavg))^2)
rf_ss_residual <- sum((holdout_set$tempavg - rf_reg_pred)^2)
rf_r_squared <- 1 - (rf_ss_residual / rf_ss_total)
# Print Metrics
cat("Random Forest Regression Metrics:\n")
## Random Forest Regression Metrics:
cat("MSE:", rf_mse, "\n")
## MSE: 0.0022477
cat("RMSE:", rf_rmse, "\n")
## RMSE: 0.04740992
cat("R-Squared:", rf_r_squared, "\n")
## R-Squared: 0.9978027
We start our evaluation by visualizing the confusion matrices of both the Random Forest and KNN classifiers, comparing the predicted class labels with the actual labels.
# Random Forest Confusion Matrix Heatmap
rf_conf_matrix_melt <- melt(as.matrix(rf_conf_matrix))
ggplot(data = rf_conf_matrix_melt, aes(x = Actual, y = Predicted, fill = value)) +
geom_tile(color = "black") +
geom_text(aes(label = value), color = "black", size = 4) + # Numbers in black
scale_fill_gradient(low = "lightgray", high = "blue") +
labs(title = "Random Forest Confusion Matrix Heatmap",
x = "Actual Class", y = "Predicted Class") +
theme_minimal()
# KNN Confusion Matrix Heatmap
knn_conf_matrix_melt <- melt(as.matrix(knn_conf_matrix))
ggplot(data = knn_conf_matrix_melt, aes(x = Actual, y = Predicted, fill = value)) +
geom_tile(color = "black") +
geom_text(aes(label = value), color = "black", size = 4) + # Numbers in black
scale_fill_gradient(low = "lightgray", high = "green") +
labs(title = "KNN Confusion Matrix Heatmap",
x = "Actual Class", y = "Predicted Class") +
theme_minimal()
| Metric | Random_Forest | KNN |
|---|---|---|
| Accuracy | 0.9817352 | 0.7038052 |
| Macro Precision | 0.8158052 | 0.2811662 |
| Macro Recall | 0.8232156 | 0.3561035 |
| Macro F1 Score | 0.8174549 | 0.2924496 |
We observe from the evaluation metrics table that the Random Forest classifier performed better across all metrics.
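The code that produced the macro-averaged values in the table is not shown here. As a reference, macro precision, recall, and F1 can be derived from a confusion matrix roughly as follows; `macro_metrics` is a hypothetical helper and the matrix is made up.

```r
# Hypothetical helper: macro-averaged precision, recall and F1 from a square
# confusion matrix with predicted classes in rows and actual classes in columns.
macro_metrics <- function(cm) {
  tp <- diag(cm)
  precision <- tp / rowSums(cm)   # per class: TP / all predicted as that class
  recall <- tp / colSums(cm)      # per class: TP / all actually in that class
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = mean(precision, na.rm = TRUE),
    recall = mean(recall, na.rm = TRUE),
    f1 = mean(f1, na.rm = TRUE))
}

# Made-up 3-class confusion matrix for illustration.
cm <- matrix(c(8, 1, 1,
               2, 6, 0,
               0, 3, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("a", "b", "c"),
                             Actual = c("a", "b", "c")))
m <- macro_metrics(cm)
m
```

Macro averaging weights every class equally, so rare classes pull the score down more than they affect accuracy. The helper assumes a square matrix; the confusion matrices above have a row for class 8 but no matching column, so the factor levels would need to be aligned before applying it.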
We implement additional performance visualizations below:
#------------------------------------------------------------------------------
# Prepare data for visualization
# Predict class probabilities for the holdout set
rf_prob <- predict(rf_model, holdout_set, type = "prob")
metrics <- data.frame(
Metric = c("Accuracy", "Macro Precision", "Macro Recall", "Macro F1 Score"),
Random_Forest = c(rf_accuracy, rf_metrics$precision_macro, rf_metrics$recall_macro, rf_metrics$f1_score_macro),
KNN = c(knn_accuracy, knn_metrics$precision_macro, knn_metrics$recall_macro, knn_metrics$f1_score_macro)
)
# Reshape data for ggplot
library(reshape2)
metrics_melt <- melt(metrics, id.vars = "Metric", variable.name = "Model", value.name = "Score")
# Plot the comparison
ggplot(metrics_melt, aes(x = Metric, y = Score, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Comparison of Classification Models",
x = "Evaluation Metric", y = "Score") +
scale_fill_manual(values = c("Random_Forest" = "blue", "KNN" = "green")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#---------------------------------------------------------------------------------------------------
library(PRROC)
# Create PR curve for each class
pr_curves <- list()
for (class in colnames(rf_prob)) {
pr_curves[[class]] <- pr.curve(scores.class0 = rf_prob[, class],
weights.class0 = holdout_set$conditions == class, curve = TRUE)
}
# Plot all PR curves, using one color per class so the legend matches the lines
curve_cols <- seq_along(pr_curves)
plot(pr_curves[[1]]$curve, type = "l", col = curve_cols[1], xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Recall", ylab = "Precision", main = "Precision-Recall Curves")
for (i in seq_along(pr_curves)[-1]) {
  lines(pr_curves[[i]]$curve, col = curve_cols[i])
}
legend("bottomleft", legend = colnames(rf_prob), col = curve_cols, lty = 1)
For the regression models, we evaluate using the most commonly used regression metrics:
Mean Squared Error (MSE), which measures the average squared error and therefore penalizes large errors.
Root Mean Squared Error (RMSE), the square root of MSE, which is in the same units as the target variable and is therefore easier to interpret.
R-Squared, which shows the proportion of variance in the target captured by the model.
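The three metrics can be wrapped in one small function so that both regression models are scored the same way. A minimal sketch, mirroring the formulas used in the modeling code; `regression_metrics` is a hypothetical helper and the values are made up.

```r
# Hypothetical helper: MSE, RMSE and R-squared for numeric vectors of equal length.
regression_metrics <- function(actual, predicted) {
  mse <- mean((predicted - actual)^2)
  ss_total <- sum((actual - mean(actual))^2)
  ss_residual <- sum((actual - predicted)^2)
  c(mse = mse, rmse = sqrt(mse), r_squared = 1 - ss_residual / ss_total)
}

# Made-up values for illustration: every prediction is off by 0.5.
actual <- c(2, 4, 6, 8)
predicted <- c(2.5, 3.5, 6.5, 7.5)
toy_scores <- regression_metrics(actual, predicted)
toy_scores
```

With the objects from the modeling section, a call such as `regression_metrics(holdout_set$tempavg, linear_pred)` would be expected to reproduce the reported values.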
# Visualize Predictions vs Actual Values
ggplot(data = holdout_set, aes(x = tempavg, y = rf_reg_pred)) +
geom_point(color = "darkgreen", alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Random Forest Regression: Predictions vs Actual",
x = "Actual Average Temperature",
y = "Predicted Average Temperature") +
theme_minimal()
The scatter plot compares the predicted values with the actual values; the dashed red line marks perfect agreement (predicted equals actual). The points follow this line closely, and the relationship is linear, indicating that a simple regression model will suffice.
# Create a data frame for the metrics
rf_metrics_table <- data.frame(
Metric = c("Mean Squared Error (MSE)", "Root Mean Squared Error (RMSE)", "R-Squared"),
Value = c(rf_mse, rf_rmse, rf_r_squared)
)
# Display the table
knitr::kable(rf_metrics_table, caption = "Random Forest Regression Metrics")
| Metric | Value |
|---|---|
| Mean Squared Error (MSE) | 0.0022477 |
| Root Mean Squared Error (RMSE) | 0.0474099 |
| R-Squared | 0.9978027 |
The values indicate minimal deviation between the actual and predicted values, even though MSE penalizes large errors more heavily. Because tempavg was standardized during feature engineering, these errors are expressed in standard-deviation units rather than °C.
# Visualize Predictions vs Actual Values
library(ggplot2)
ggplot(data = holdout_set, aes(x = tempavg, y = linear_pred)) +
geom_point(color = "blue", alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Linear Regression: Predictions vs Actual",
x = "Actual Average Temperature",
y = "Predicted Average Temperature") +
theme_minimal()
The scatter plot is similar to that of the Random Forest regression model. The linear model's metrics are slightly better, but the difference is negligible.
# Create a data frame for the metrics
linear_metrics_table <- data.frame(
Metric = c("Mean Squared Error (MSE)", "Root Mean Squared Error (RMSE)", "R-Squared"),
Value = c(mse, rmse, r_squared)
)
# Display the table
knitr::kable(linear_metrics_table, caption = "Linear Regression Metrics")
| Metric | Value |
|---|---|
| Mean Squared Error (MSE) | 0.0021293 |
| Root Mean Squared Error (RMSE) | 0.0461448 |
| R-Squared | 0.9979184 |
This deployment provides an interactive experience where users can explore temperature trends over time through an interactive line plot and view detailed data through a sortable and searchable table. The visualizations allow for dynamic exploration of temperature patterns, while the data table gives users easy access to the raw data, making it a powerful tool for analysis and decision-making.
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Conclusion
Key Findings:
The Linear Regression model provides a straightforward evaluation of the relationship between dependent and independent variables; however, we expect it to show limitations when faced with complex non-linear structure.
The Random Forest Regression model achieved nearly identical results, with a slightly higher MSE (0.00225 versus 0.00213) and a slightly lower R-Squared (0.9978 versus 0.9979); the difference is negligible.
The Random Forest Classifier showed superior performance across all evaluation metrics, including macro precision, macro recall, and macro F1-score, which suggests its strength in handling multi-class problems.
The KNN Classifier showed much lower accuracy and precision, although it is simpler to implement and understand; we attribute this to its sensitivity to data scaling and class distribution.
Conclusion:
Overall, the results show the importance of selecting models that align with the structure of the data and the complexity of the problem. Simpler models such as Linear Regression and KNN are good for baseline evaluations, while Random Forest showed a clear advantage on the classification task, capturing patterns and delivering more reliable predictions.