INTRODUCTION
This project uses a dataset collected by the NOAA weather station at John F. Kennedy (JFK) International Airport in New York. The original dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature, wind speed, humidity, dew point, and pressure) collected at JFK airport. A 5% sample of the full dataset is used for this project.
AIM
Develop a model capable of predicting precipitation using key climatological factors.
QUESTIONS
Which features have the strongest correlation with precipitation?
How many models can be trained with the features in the dataset?
Which of the models fits best?
How does the best model perform on unseen data?
OBJECTIVES
Perform data preprocessing.
Develop linear regression models.
Evaluate model performance.
Deploy the best model.
Step 1: Load the Required Packages
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ recipes 1.3.1
## ✔ dials 1.4.2 ✔ rsample 1.3.1
## ✔ dplyr 1.1.4 ✔ tailor 0.1.0
## ✔ ggplot2 4.0.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ purrr 1.2.0 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(rlang)
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
library(corrplot)
## corrplot 0.95 loaded
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ rlang::flatten() masks purrr::flatten()
## ✖ rlang::flatten_chr() masks purrr::flatten_chr()
## ✖ rlang::flatten_dbl() masks purrr::flatten_dbl()
## ✖ rlang::flatten_int() masks purrr::flatten_int()
## ✖ rlang::flatten_lgl() masks purrr::flatten_lgl()
## ✖ rlang::flatten_raw() masks purrr::flatten_raw()
## ✖ rlang::invoke() masks purrr::invoke()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ✖ rlang::splice() masks purrr::splice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Step 2: Read in the Dataset
pal <- read.csv("/Users/mac/Documents/jfk_weather_sample.csv")
glimpse(pal) #to see the different data types
## Rows: 5,727
## Columns: 9
## $ DATE <chr> "2015-07-25T13:51:00Z", "2016-11-18T23:51:00Z",…
## $ HOURLYDewPointTempF <chr> "60", "34", "33", "18", "27", "35", "4", "14", …
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYWETBULBTEMPF <int> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
## $ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
## $ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
Step 3: Select a Subset of Columns
The columns needed for this project are:
HOURLYRelativeHumidity: the relative humidity, given to the nearest whole percentage.
HOURLYDRYBULBTEMPF: the dry-bulb temperature, commonly used as the standard reported air temperature. It is given here in whole degrees Fahrenheit.
HOURLYPrecip: the amount of precipitation, in inches to hundredths, over the past hour. For certain automated stations, precipitation is reported at sub-hourly intervals (e.g. every 15 or 20 minutes) as an accumulated amount of all precipitation within the preceding hour. A “T” indicates a trace amount of precipitation.
HOURLYWindSpeed: the speed of the wind at the time of observation, given in miles per hour (mph).
HOURLYStationPressure: the atmospheric pressure observed at the station at the time of observation, given in inches of mercury (in Hg).
Useful<-pal %>%select(HOURLYRelativeHumidity,HOURLYDRYBULBTEMPF,HOURLYPrecip,HOURLYWindSpeed,HOURLYStationPressure)#to select columns
head(Useful)#to show the first six rows of the dataframe
## HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1 46 83 0.00 13
## 2 48 53 0.00 6
## 3 89 36 0.00 13
## 4 48 36 0.00 14
## 5 61 39 T 11
## 6 79 41 0.00 6
## HOURLYStationPressure
## 1 29.99
## 2 30.03
## 3 30.12
## 4 29.80
## 5 30.50
## 6 29.92
tail(Useful)#to show the last six rows of the dataframe
## HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 5722 64 72 0.00 9
## 5723 67 61 0.00 6
## 5724 NA NA <NA> NA
## 5725 48 53 0.00 3
## 5726 75 32 0.00 7
## 5727 52 38 0.00 0
## HOURLYStationPressure
## 5722 30.21
## 5723 29.72
## 5724 NA
## 5725 30.26
## 5726 29.69
## 5727 29.97
glimpse(Useful)#to show the number of rows, columns and the data type
## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
Step 4: Check the Unique Values in the HOURLYPrecip Column
unique(Useful$HOURLYPrecip) #to list each distinct value in the column once
## [1] "0.00" "T" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07"
## [10] "0.16" "0.09" "0.22" "0.02s" "0.24" "0.18" "0.05" "0.04" "0.09s"
## [19] "0.11" "0.14" "0.25" "0.10" "0.01s" "0.58" "0.12" "0.13" "0.46"
## [28] "1.07" "1.19" "0.34" "0.20" "0.36s" "0.42" "0.17" "0.27" "0.35"
## [37] "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19" "0.36" "0.03s"
## [46] "0.07s" "0.54" "0.59" "0.21"
From the data frame (Useful), we can see that the column HOURLYPrecip, which is the hourly measure of precipitation levels, contains both NA and T values. T specifies trace amounts of precipitation (meaning essentially no precipitation), while NA means not available and denotes missing values. Additionally, some values have “s” at the end, indicating that the precipitation was snow.
Having characters in values (like the “T” and “s” seen in the unique values) will cause problems; values for precipitation should be numerical.
Step 5: Replace “T” with 0.0 and remove “s”, since HOURLYPrecip is not specified to be snow, hail or rainfall.
library(tidyverse)
Useful <- Useful %>% mutate(HOURLYPrecip = HOURLYPrecip %>% #update the existing column
gsub("T", "0.0", .) %>% #replace all the T values with "0.0"
str_remove(pattern = "s$")) #remove a trailing "s" from values using a regex pattern
unique(Useful$HOURLYPrecip) #check that the unique values of HOURLYPrecip no longer contain any T or s
## [1] "0.00" "0.0" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07" "0.16"
## [11] "0.09" "0.22" "0.24" "0.18" "0.05" "0.04" "0.11" "0.14" "0.25" "0.10"
## [21] "0.58" "0.12" "0.13" "0.46" "1.07" "1.19" "0.34" "0.20" "0.36" "0.42"
## [31] "0.17" "0.27" "0.35" "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19"
## [41] "0.54" "0.59" "0.21"
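One thing to watch in the step above: `gsub("T", ...)` replaces a capital T anywhere in a value, which is safe here only because "T" always appears on its own. A minimal sketch of a stricter version, assuming the column holds only the three kinds of values seen in Step 4 (plain numbers, "T", and numbers with a trailing "s"). The helper name `clean_precip` is hypothetical and not part of the original analysis:

```r
# Hypothetical helper: clean raw HOURLYPrecip strings and return numbers.
clean_precip <- function(x) {
  x <- sub("^T$", "0.00", x)  # replace only values that are exactly "T"
  x <- sub("s$", "", x)       # drop a trailing "s" flag, if present
  as.numeric(x)               # missing values stay NA
}

raw <- c("0.00", "T", "0.02s", NA, "1.07")
clean_precip(raw)
```

Because the helper returns a numeric vector directly, it would cover both the replacement in this step and the type conversion in the next one.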
Step 6: Convert the HOURLYPrecip Column to a Numeric Type
Useful <- Useful %>% mutate(HOURLYPrecip = as.numeric(HOURLYPrecip)) #to convert the column to numeric
glimpse(Useful$HOURLYPrecip)
## num [1:5727] 0 0 0 0 0 0 0 0 0.06 NA ...
Step 7: Rename Columns
Useful<-Useful %>% rename(relative_humidity=HOURLYRelativeHumidity,
dry_bulb_temp_f=HOURLYDRYBULBTEMPF,
precip=HOURLYPrecip,
wind_speed=HOURLYWindSpeed,
station_pressure=HOURLYStationPressure)# to rename all the columns
head(Useful)
## relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
## 1 46 83 0 13 29.99
## 2 48 53 0 6 30.03
## 3 89 36 0 13 30.12
## 4 48 36 0 14 29.80
## 5 61 39 0 11 30.50
## 6 79 41 0 6 29.92
Step 8: Check the Total Number of Missing Values for Each Column
colSums(is.na(Useful))#to check the sums of each column's missing values
## relative_humidity dry_bulb_temp_f precip wind_speed
## 151 151 1660 156
## station_pressure
## 158
Step 9: Replace Missing Values with Statistical Measures
Useful$relative_humidity[is.na(Useful$relative_humidity)] <- mean(Useful$relative_humidity, na.rm = TRUE)
sum(is.na(Useful$relative_humidity))
## [1] 0
Useful$dry_bulb_temp_f[is.na(Useful$dry_bulb_temp_f)]<- median(Useful$dry_bulb_temp_f,na.rm=TRUE)
sum(is.na(Useful$dry_bulb_temp_f))
## [1] 0
Useful$precip[is.na(Useful$precip)] <- mean(Useful$precip, na.rm = TRUE)
sum(is.na(Useful$precip))
## [1] 0
Useful$wind_speed[is.na(Useful$wind_speed)] <- median(Useful$wind_speed, na.rm = TRUE)
sum(is.na(Useful$wind_speed))
## [1] 0
Useful$station_pressure[is.na(Useful$station_pressure)] <- mean(Useful$station_pressure, na.rm = TRUE)
sum(is.na(Useful$station_pressure))
## [1] 0
colSums(is.na(Useful))
## relative_humidity dry_bulb_temp_f precip wind_speed
## 0 0 0 0
## station_pressure
## 0
# We no longer have any missing values
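The five replacements above all follow the same pattern, so they can be written once as a small function. This is only a sketch: `impute_with` is a hypothetical helper name, and the toy data frame stands in for Useful. The last lines use the counts from Step 8 to show how much of the target column is affected, since 1660 of the 5,727 precip values were missing.

```r
# Hypothetical helper: fill NA values in a numeric vector with a summary
# statistic (mean by default) computed from the observed values.
impute_with <- function(x, stat = mean) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

toy <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
toy$a <- impute_with(toy$a)          # mean imputation
toy$b <- impute_with(toy$b, median)  # median imputation
colSums(is.na(toy))

# Share of precip values that were missing, using the counts from Step 8
missing_share <- 1660 / 5727
round(missing_share, 2)
```

With this share of the target replaced by a single value, the imputed mean is likely to show up in later summaries of precip, which is worth keeping in mind when reading the model results.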
Step 10: Check the summary of the dataset
summary(Useful)
## relative_humidity dry_bulb_temp_f precip wind_speed
## Min. : 13.00 Min. : 5.00 Min. :0.000000 Min. : 0.00
## 1st Qu.: 51.00 1st Qu.:43.00 1st Qu.:0.000000 1st Qu.: 7.00
## Median : 67.00 Median :56.00 Median :0.000000 Median :10.00
## Mean : 66.89 Mean :55.61 Mean :0.006548 Mean :11.16
## 3rd Qu.: 84.00 3rd Qu.:70.00 3rd Qu.:0.006548 3rd Qu.:15.00
## Max. :100.00 Max. :99.00 Max. :1.190000 Max. :52.00
## station_pressure
## Min. :28.69
## 1st Qu.:29.85
## Median :29.99
## Mean :29.99
## 3rd Qu.:30.14
## Max. :30.82
Step 11: Check the strength of the linear relationship between the features and the target variable using a correlation matrix
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Useful <- subset(Useful, select = c(relative_humidity, wind_speed, dry_bulb_temp_f, station_pressure, precip))
# 1. Compute the correlation matrix
cor_matrix <- cor(Useful, use = "pairwise.complete.obs") # Handles NAs if any remain
# 2. Melt the correlation matrix into a long format data frame for ggplot2
melted_cor_matrix <- melt(cor_matrix)
# 3. Create the heatmap using ggplot2
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") + # Adds white borders to tiles
scale_fill_gradient2(low = "lightgreen", high = "red", mid = "purple",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
labs(title = "Correlation matrix of Climatological Features") +
coord_fixed() + # Ensures squares are square
# Add the correlation coefficients as text labels
geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
Step 12: Visualize the correlations with the target by plotting that single column of the correlation matrix as a heatmap.
library(dplyr)
library(corrr)
library(ggplot2)
target_correlations <- Useful %>%
correlate() %>%
focus(precip) %>% # Focus on the target variable column
rename(Feature = term, Correlation = precip) %>% # Rename columns for ggplot
filter(Feature != "precip") # Remove the 1.0 correlation with itself
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
ggplot(target_correlations, aes(x = "Correlation with Target", y = reorder(Feature, Correlation), fill = Correlation)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "deeppink", midpoint = 0, limit = c(-1, 1)) +
geom_text(aes(label = round(Correlation, 3)), color = "blue", size = 4) +
theme_minimal() +
labs(title = "Feature Correlation with Target (precip)", y = NULL, x = NULL) +
# Adjust the theme for a cleaner look
theme(legend.position = "none", # Hides the legend since numbers are on the plot
axis.text.x = element_blank()) # Hides the x-axis tick label
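The same "one column of the correlation matrix" view can also be obtained as plain numbers in base R, which is handy for ranking features without a plot. The sketch below uses simulated data with made-up column names (`x1`, `x2`, `y`), not the weather data:

```r
set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 0.5 * toy$x1 + rnorm(100)  # y depends on x1 only

cors <- cor(toy)[, "y"]             # correlation of every column with y
cors <- cors[names(cors) != "y"]    # drop the self-correlation of 1
cors[order(abs(cors), decreasing = TRUE)]  # strongest relationship first
```

Sorting by absolute value keeps strong negative correlations near the top, which matters for a feature such as station_pressure whose relationship with precip is negative.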
Step 13: Split the data into a training and testing set
library(rsample)
set.seed(1234) #sets the starting number used to generate a sequence of random numbers,
#so that the same split is produced each time the code is run with that seed.
weather_data <- initial_split(Useful, prop = 0.8) #to split the data (80% training, 20% testing)
train_data <- training(weather_data) #the training portion of the split
test_data <- testing(weather_data) #the testing portion of the split
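As a sanity check on the split, 80% of 5,727 rows should give 4,581 training rows and 1,146 testing rows. The sketch below reproduces that arithmetic with a plain index-based split in base R. It illustrates the idea only: it is not how rsample is implemented, and it will not select the same rows as `initial_split()` even with the same seed.

```r
n <- 5727                                    # rows in the sample dataset
set.seed(1234)
train_idx <- sample(n, size = floor(0.8 * n))  # row numbers for training
test_idx <- setdiff(seq_len(n), train_idx)     # everything else is testing

c(train = length(train_idx), test = length(test_idx))
```

Calling `nrow(train_data)` and `nrow(test_data)` on the real split is a quick way to confirm the sizes.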
Step 14: Train a Linear Regression Model Using Each Feature
The first model will be trained using relative_humidity as the predictor.
linear_model1 <- lm(precip~relative_humidity, data=train_data)
summary(linear_model1)
##
## Call:
## lm(formula = precip ~ relative_humidity, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01616 -0.00802 -0.00366 0.00069 1.17500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.290e-02 1.877e-03 -6.87 7.26e-12 ***
## relative_humidity 2.906e-04 2.691e-05 10.80 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.036 on 4579 degrees of freedom
## Multiple R-squared: 0.02483, Adjusted R-squared: 0.02462
## F-statistic: 116.6 on 1 and 4579 DF, p-value: < 2.2e-16
Visualize
ggplot(train_data, aes(x = relative_humidity, y = precip)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
labs(
title = "Linear Model 1",
x = "Relative Humidity",
y = "Precip"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The second model will be trained using dry_bulb_temp_f as the predictor
linear_model2<-lm(precip~dry_bulb_temp_f,data=train_data)
summary(linear_model2)
##
## Call:
## lm(formula = precip ~ dry_bulb_temp_f, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.00758 -0.00672 -0.00617 -0.00012 1.18316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.216e-03 1.860e-03 2.804 0.00507 **
## dry_bulb_temp_f 2.388e-05 3.204e-05 0.745 0.45626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03646 on 4579 degrees of freedom
## Multiple R-squared: 0.0001212, Adjusted R-squared: -9.714e-05
## F-statistic: 0.5551 on 1 and 4579 DF, p-value: 0.4563
ggplot(train_data, aes(x = dry_bulb_temp_f, y = precip)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
labs(
title = "Linear Model 2",
x = "dry_bulb_temp_f",
y = "Precip"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The third model will be trained using wind_speed as the predictor.
linear_model3<-lm(precip~wind_speed,data=train_data)
summary(linear_model3)
##
## Call:
## lm(formula = precip ~ wind_speed, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01244 -0.00650 -0.00542 0.00005 1.18053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.532e-03 1.109e-03 3.186 0.00145 **
## wind_speed 2.700e-04 8.697e-05 3.105 0.00192 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03642 on 4579 degrees of freedom
## Multiple R-squared: 0.002101, Adjusted R-squared: 0.001883
## F-statistic: 9.639 on 1 and 4579 DF, p-value: 0.001916
ggplot(train_data, aes(x = wind_speed, y = precip)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
labs(
title = "Linear Model 3",
x = "Wind Speed",
y = "Precip"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The fourth model will be trained using station_pressure as the predictor.
linear_model4<-lm(precip~station_pressure,data=train_data)
summary(linear_model4)
##
## Call:
## lm(formula = precip ~ station_pressure, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02058 -0.00736 -0.00406 0.00004 1.17851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.501903 0.068231 7.356 2.23e-13 ***
## station_pressure -0.016518 0.002275 -7.260 4.51e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03625 on 4579 degrees of freedom
## Multiple R-squared: 0.01138, Adjusted R-squared: 0.01116
## F-statistic: 52.71 on 1 and 4579 DF, p-value: 4.515e-13
ggplot(train_data, aes(x = station_pressure, y = precip)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
labs(
title = "Linear Model 4",
x = "Station Pressure",
y = "Precip"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The fifth model will be trained using two features, wind_speed and station_pressure.
linear_model5<-lm(precip~wind_speed + station_pressure,data=train_data)
summary(linear_model5)
##
## Call:
## lm(formula = precip ~ wind_speed + station_pressure, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02235 -0.00731 -0.00410 0.00016 1.17758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.778e-01 7.119e-02 6.712 2.16e-11 ***
## wind_speed 1.068e-04 8.996e-05 1.187 0.235
## station_pressure -1.575e-02 2.364e-03 -6.663 3.00e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03625 on 4578 degrees of freedom
## Multiple R-squared: 0.01168, Adjusted R-squared: 0.01125
## F-statistic: 27.06 on 2 and 4578 DF, p-value: 2.069e-12
ggplot(train_data, aes(x = wind_speed, y = precip, color = station_pressure)) +
geom_point() +
labs(title = "Trained Model with Two Numeric Predictors") +
theme_minimal()
The last model will be built using all the features.
multiple_reg_model<-lm(precip~.,data=train_data)
summary(multiple_reg_model)
##
## Call:
## lm(formula = precip ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03033 -0.00818 -0.00343 0.00120 1.16781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.019e-01 7.755e-02 2.604 0.00925 **
## relative_humidity 3.032e-04 2.930e-05 10.350 < 2e-16 ***
## wind_speed 4.315e-04 9.636e-05 4.478 7.71e-06 ***
## dry_bulb_temp_f -4.515e-06 3.269e-05 -0.138 0.89015
## station_pressure -7.344e-03 2.536e-03 -2.896 0.00380 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03584 on 4576 degrees of freedom
## Multiple R-squared: 0.03431, Adjusted R-squared: 0.03346
## F-statistic: 40.64 on 4 and 4576 DF, p-value: < 2.2e-16
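Reading R-squared off six separate summaries is error-prone, so it can help to collect the values in one place. The sketch below shows one way to do that with `sapply()`. It uses simulated data and made-up model names, so the numbers it prints are not the ones from this project.

```r
set.seed(42)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 0.3 * toy$x1 + rnorm(200)

# A named list of fitted models, analogous to linear_model1 ... multiple_reg_model
models <- list(
  only_x1 = lm(y ~ x1, data = toy),
  only_x2 = lm(y ~ x2, data = toy),
  both    = lm(y ~ x1 + x2, data = toy)
)

r2 <- sapply(models, function(m) summary(m)$r.squared)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(cbind(r2, adj_r2), 4)
```

On the training data, a model with more predictors never has a lower R-squared than a nested smaller one, so adjusted R-squared and the test-set error computed later are the fairer basis for comparison.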
For each of the models created, check the model performance using the testing set.
Step 15: Predict Using the Testing Data
actual_values<-test_data$precip
linear_model1_predictions<-predict(linear_model1,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model1_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.0004680524 -0.0004680524
## 2 0.000000000 0.0100572526 -0.0100572526
## 3 0.000000000 0.0059891070 -0.0059891070
## 4 0.000000000 -0.0021471841 0.0021471841
## 5 0.006547824 0.0077325980 -0.0011847740
## 6 0.006547824 -0.0024377659 0.0089855898
actual_values<-test_data$precip
linear_model2_predictions<-predict(linear_model2,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model2_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.007197842 -0.0071978425
## 2 0.000000000 0.006195052 -0.0061950518
## 3 0.000000000 0.005789160 -0.0057891604
## 4 0.000000000 0.006266680 -0.0062666797
## 5 0.006547824 0.006911331 -0.0003635069
## 6 0.006547824 0.006338308 0.0002095163
actual_values<-test_data$precip
linear_model3_predictions<-predict(linear_model3,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model3_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.007042806 -0.0070428064
## 2 0.000000000 0.005152632 -0.0051526317
## 3 0.000000000 0.006502756 -0.0065027565
## 4 0.000000000 0.005422657 -0.0054226567
## 5 0.006547824 0.005152632 0.0013951922
## 6 0.006547824 0.007042806 -0.0004949825
actual_values<-test_data$precip
linear_model4_predictions<-predict(linear_model4,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model4_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.006536439 -0.0065364392
## 2 0.000000000 0.007692680 -0.0076926804
## 3 0.000000000 0.000590056 -0.0005900560
## 4 0.000000000 0.002241829 -0.0022418291
## 5 0.006547824 0.007362326 -0.0008145018
## 6 0.006547824 0.008848922 -0.0023010976
actual_values<-test_data$precip
linear_model5_predictions<-predict(linear_model5,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model5_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.0067343816 -0.0067343816
## 2 0.000000000 0.0070898301 -0.0070898301
## 3 0.000000000 0.0008495988 -0.0008495988
## 4 0.000000000 0.0019979238 -0.0019979238
## 5 0.006547824 0.0067747596 -0.0002269357
## 6 0.006547824 0.0089398751 -0.0023920511
actual_values<-test_data$precip
multiple_reg_predictions<-predict(multiple_reg_model,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=multiple_reg_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.0008754209 -0.0008754209
## 2 0.000000000 0.0085656690 -0.0085656690
## 3 0.000000000 0.0033967785 -0.0033967785
## 4 0.000000000 -0.0061761547 0.0061761547
## 5 0.006547824 0.0058573531 0.0006904708
## 6 0.006547824 -0.0009663807 0.0075142047
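Some of the predictions above are negative (for example, -0.0021 from linear_model1), which cannot happen for a precipitation amount. Linear regression does not know the target is bounded at zero. One simple post-processing option, which was not used in the original analysis and is shown here only as an assumption-laden sketch, is to clamp predictions at zero with `pmax()`:

```r
# A few predictions copied from the linear_model1 table above
predicted <- c(0.0004680524, 0.0100572526, -0.0021471841, -0.0024377659)

clamped <- pmax(predicted, 0)  # negative values become 0, others are unchanged
clamped
```

Clamping changes the error metrics slightly, so if it is applied it should be applied to every model before comparing them.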
Step 16: Model Evaluation using RMSE to get the best model
Let’s evaluate all the models, compare them, and identify the best one.
RMSE stands for Root Mean Square Error, which is widely used for evaluating the performance of regression models. It is the square root of the average squared difference between the values predicted by a model and the actual values, so it is in the same unit as the target variable (here, inches of precipitation).
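To make the definition concrete, RMSE can be computed by hand in one line. The function name `rmse_manual` and the four toy values below are illustrative only; for the same inputs it should give the same result as a packaged RMSE function.

```r
# RMSE written out from its definition: square the errors, average them,
# then take the square root.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual <- c(0, 0, 0.06, 0.01)
predicted <- c(0.01, 0, 0.03, 0.01)
rmse_manual(actual, predicted)
```

Because the errors are squared before averaging, a few large misses (such as an hour with heavy rain predicted as nearly dry) influence RMSE more than many small ones.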
library(Metrics)
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:rlang':
##
## ll
## The following objects are masked from 'package:yardstick':
##
## accuracy, mae, mape, mase, precision, recall, rmse, smape
class(actual_values)
## [1] "numeric"
class(multiple_reg_predictions)
## [1] "numeric"
rmse_value1_test<-Metrics::rmse(actual_values,linear_model1_predictions)
rmse_value1_test
## [1] 0.0285869
rmse_value2_test<-Metrics::rmse(actual_values,linear_model2_predictions)
rmse_value2_test
## [1] 0.02922909
rmse_value3_test<-Metrics::rmse(actual_values,linear_model3_predictions)
rmse_value3_test
## [1] 0.02899481
rmse_value4_test<-Metrics::rmse(actual_values,linear_model4_predictions)
rmse_value4_test
## [1] 0.0288756
rmse_value5_test<-Metrics::rmse(actual_values,linear_model5_predictions)
rmse_value5_test
## [1] 0.028801
rmse_value6_test<-Metrics::rmse(actual_values,multiple_reg_predictions)
rmse_value6_test
## [1] 0.02802316
Step 17: Use a table to compare the evaluation metrics
model_names <- c("linear_model1","linear_model2","linear_model3","linear_model4","linear_model5","multiple_reg_model")
RMSE <- c(rmse_value1_test, rmse_value2_test, rmse_value3_test, rmse_value4_test, rmse_value5_test, rmse_value6_test) #keep the RMSE values numeric so they can be sorted and compared
comparison_df <- data.frame(model_names, RMSE)
comparison_df
## model_names RMSE
## 1 linear_model1 0.02858690
## 2 linear_model2 0.02922909
## 3 linear_model3 0.02899481
## 4 linear_model4 0.02887560
## 5 linear_model5 0.02880100
## 6 multiple_reg_model 0.02802316
From the table above, multiple_reg_model has the lowest RMSE. The lower the RMSE, the more accurate the model is, so multiple_reg_model is the best model in this project.
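The same choice can be made programmatically rather than by eye. The sketch below copies the test RMSE values reported in Step 16 into a named vector (the vector name `rmse_by_model` is just for illustration) and picks the smallest:

```r
# Test-set RMSE values as reported in Step 16
rmse_by_model <- c(
  linear_model1      = 0.0285869,
  linear_model2      = 0.02922909,
  linear_model3      = 0.02899481,
  linear_model4      = 0.0288756,
  linear_model5      = 0.028801,
  multiple_reg_model = 0.02802316
)

best <- names(which.min(rmse_by_model))  # name of the model with the lowest RMSE
best

sort(rmse_by_model)                      # all models, best first
diff(range(rmse_by_model))               # gap between best and worst
```

The gap between the best and worst model is small, roughly 0.0012 inches, so the ranking should be read as a modest preference rather than a decisive win.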
Save the best model as an RDS file.
saveRDS(multiple_reg_model,file="my_multiple_linear_regression_model.rds")
file.exists("my_multiple_linear_regression_model.rds") #to confirm the model file was written to the working directory
Now we deploy the model using a Shiny app.
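Whatever the front end looks like, the core of deployment is reloading the saved model and calling `predict()` on new inputs. The sketch below shows that round trip on a toy model saved to a temporary file. The data, the column name `humidity`, and the file path are all made up for illustration and are not the project's model or the app's code.

```r
set.seed(7)
toy <- data.frame(humidity = runif(50, 20, 100))
toy$precip <- 0.0003 * toy$humidity + rnorm(50, sd = 0.01)
fit <- lm(precip ~ humidity, data = toy)

path <- tempfile(fileext = ".rds")  # temporary file, removed when R exits
saveRDS(fit, path)                  # what Step 17 does with the real model
reloaded <- readRDS(path)           # what a deployed app would do at startup

new_obs <- data.frame(humidity = c(40, 90))  # inputs a user might enter
predict(reloaded, newdata = new_obs)
```

For the model saved in this project, the new data frame would need the four predictor columns used in training: relative_humidity, wind_speed, dry_bulb_temp_f, and station_pressure.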
FINDINGS
All the features have a very weak correlation with precipitation.
Many models could be built from these features; 6 models were trained.
The multiple linear regression model (which used all the features) fitted best.
The model's prediction error (RMSE) on the unseen data is about 0.028 inches.
CONCLUSION/DISCUSSION
All the features have a very weak correlation with precipitation (none is close to 1).
Additional data collection is needed to develop an accurate forecasting system based on strongly correlated features.
Further analysis should be conducted to predict the exact type of precipitation.
More features should be added to identify the specific climatological factors that correlate strongly with precipitation (for example, by using the whole dataset or a larger sample, since only 5% was used for this project).