A Predictive Analysis of Weather Precipitation Using Key Climatological Variables (A Case Study of JFK Airport, New York)

INTRODUCTION

This project uses a dataset collated by the NOAA weather station located at John F. Kennedy (JFK) International Airport in New York. The original dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature, wind speed, humidity, dew point, and pressure) collected at JFK airport. A 5% sample of that dataset (5,727 observations) is used for this project.

AIM

Develop a model capable of predicting precipitation from key climatological factors.

QUESTIONS

Which features have the strongest correlation with precipitation?
How many models can be trained with the features in the dataset?
Which of the models fits best?
How does the best model perform on unseen data?

OBJECTIVES

Perform data preprocessing.
Develop linear regression models.
Evaluate the model performance.
Deploy the best model.

Step 1: Load the Required Packages

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──

## ✔ broom        1.0.10     ✔ recipes      1.3.1 
## ✔ dials        1.4.2      ✔ rsample      1.3.1 
## ✔ dplyr        1.1.4      ✔ tailor       0.1.0 
## ✔ ggplot2      4.0.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.9      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.3.3      ✔ workflowsets 1.1.1 
## ✔ purrr        1.2.0      ✔ yardstick    1.3.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()

library(rlang)

## 
## Attaching package: 'rlang'

## The following objects are masked from 'package:purrr':
## 
##     flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice

library(corrplot)

## corrplot 0.95 loaded

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor()  masks scales::col_factor()
## ✖ purrr::discard()     masks scales::discard()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ stringr::fixed()     masks recipes::fixed()
## ✖ rlang::flatten()     masks purrr::flatten()
## ✖ rlang::flatten_chr() masks purrr::flatten_chr()
## ✖ rlang::flatten_dbl() masks purrr::flatten_dbl()
## ✖ rlang::flatten_int() masks purrr::flatten_int()
## ✖ rlang::flatten_lgl() masks purrr::flatten_lgl()
## ✖ rlang::flatten_raw() masks purrr::flatten_raw()
## ✖ rlang::invoke()      masks purrr::invoke()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ readr::spec()        masks yardstick::spec()
## ✖ rlang::splice()      masks purrr::splice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Step 2: Read in the Dataset

pal<-read.csv("/Users/mac/Documents/jfk_weather_sample.csv")
glimpse(pal)#to see the different data types

## Rows: 5,727
## Columns: 9
## $ DATE                   <chr> "2015-07-25T13:51:00Z", "2016-11-18T23:51:00Z",…
## $ HOURLYDewPointTempF    <chr> "60", "34", "33", "18", "27", "35", "4", "14", …
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYWETBULBTEMPF     <int> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed        <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

Step 3: Select a Subset of Columns

The columns needed for this project are:

HOURLYRelativeHumidity: the relative humidity, given to the nearest whole percentage.
HOURLYDRYBULBTEMPF: the dry-bulb temperature, commonly used as the standard reported air temperature. It is given here in whole degrees Fahrenheit.
HOURLYPrecip: the amount of precipitation over the past hour, in inches to hundredths. For certain automated stations, precipitation is reported at sub-hourly intervals (e.g. every 15 or 20 minutes) as an accumulated amount of all precipitation within the preceding hour. A “T” indicates a trace amount of precipitation.
HOURLYWindSpeed: the wind speed at the time of observation, in miles per hour (mph).
HOURLYStationPressure: the atmospheric pressure observed at the station at the time of observation, in inches of mercury (in Hg).

Useful<-pal %>%select(HOURLYRelativeHumidity,HOURLYDRYBULBTEMPF,HOURLYPrecip,HOURLYWindSpeed,HOURLYStationPressure)#to select columns 

head(Useful)#to show the first six rows of the dataframe

##   HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1                     46                 83         0.00              13
## 2                     48                 53         0.00               6
## 3                     89                 36         0.00              13
## 4                     48                 36         0.00              14
## 5                     61                 39            T              11
## 6                     79                 41         0.00               6
##   HOURLYStationPressure
## 1                 29.99
## 2                 30.03
## 3                 30.12
## 4                 29.80
## 5                 30.50
## 6                 29.92

tail(Useful)#to show the last six rows of the dataframe

##      HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 5722                     64                 72         0.00               9
## 5723                     67                 61         0.00               6
## 5724                     NA                 NA         <NA>              NA
## 5725                     48                 53         0.00               3
## 5726                     75                 32         0.00               7
## 5727                     52                 38         0.00               0
##      HOURLYStationPressure
## 5722                 30.21
## 5723                 29.72
## 5724                    NA
## 5725                 30.26
## 5726                 29.69
## 5727                 29.97

glimpse(Useful)#to show the number of rows, columns and the data type

## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed        <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

Step 4: Check the Unique Values in the HOURLYPrecip Column

unique(Useful$HOURLYPrecip)  # list each distinct value in the column once

##  [1] "0.00"  "T"     "0.06"  NA      "0.03"  "0.02"  "0.08"  "0.01"  "0.07" 
## [10] "0.16"  "0.09"  "0.22"  "0.02s" "0.24"  "0.18"  "0.05"  "0.04"  "0.09s"
## [19] "0.11"  "0.14"  "0.25"  "0.10"  "0.01s" "0.58"  "0.12"  "0.13"  "0.46" 
## [28] "1.07"  "1.19"  "0.34"  "0.20"  "0.36s" "0.42"  "0.17"  "0.27"  "0.35" 
## [37] "0.31"  "0.33"  "0.23"  "0.26"  "0.28"  "0.75"  "0.19"  "0.36"  "0.03s"
## [46] "0.07s" "0.54"  "0.59"  "0.21"

From the dataframe (Useful), we can see that the column HOURLYPrecip, which is the hourly measure of precipitation levels, contains both NA and T values. T denotes a trace amount of precipitation (essentially no measurable precipitation), while NA means not available and denotes a missing value. Some values also end in “s”. This report treats the suffix as indicating snow, although NOAA's Local Climatological Data documentation lists “s” as a suspect-value flag; either way, the suffix has to be removed before the values can be used as numbers.

Having characters in values (like the “T” and “s” seen in the unique values) will cause problems; values for precipitation should be numerical.
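As a small illustration of the problem, the sketch below uses a made-up vector (`raw_precip` is hypothetical and only mimics the kinds of values printed above). Converting it directly with `as.numeric()` turns every flagged value into `NA`, whereas cleaning the text first preserves the measurements.

```r
# Hypothetical sample mimicking the raw HOURLYPrecip values shown above
raw_precip <- c("0.00", "T", "0.02s", NA, "0.06")

# Direct conversion: "T" and "0.02s" cannot be parsed and become NA
direct <- suppressWarnings(as.numeric(raw_precip))

# Clean first: map trace values to "0.0" and strip a trailing "s", then convert
cleaned   <- sub("s$", "", sub("^T$", "0.0", raw_precip))
converted <- as.numeric(cleaned)

sum(is.na(direct))     # number of NAs after direct conversion
sum(is.na(converted))  # number of NAs after cleaning first
```

Only the value that was missing to begin with stays `NA` after cleaning, which is why the cleaning step comes before the type conversion.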

Step 5: Replace “T” with 0.0 and remove “s”, since HOURLYPrecip does not specify whether the precipitation is snow, hail or rainfall.

library(tidyverse)
Useful <- Useful %>%
  mutate(HOURLYPrecip = HOURLYPrecip %>%   # update the existing column
           gsub("T", "0.0", .) %>%         # replace every "T" (trace) with "0.0"
           str_remove(pattern = "s$"))     # strip a trailing "s" using a regex pattern
unique(Useful$HOURLYPrecip)  # check that no value still contains "T" or "s"

##  [1] "0.00" "0.0"  "0.06" NA     "0.03" "0.02" "0.08" "0.01" "0.07" "0.16"
## [11] "0.09" "0.22" "0.24" "0.18" "0.05" "0.04" "0.11" "0.14" "0.25" "0.10"
## [21] "0.58" "0.12" "0.13" "0.46" "1.07" "1.19" "0.34" "0.20" "0.36" "0.42"
## [31] "0.17" "0.27" "0.35" "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19"
## [41] "0.54" "0.59" "0.21"

Step 6: Convert the Column (HOURLYPrecip) to a Numeric Type

Useful <- Useful %>% mutate(HOURLYPrecip = as.numeric(HOURLYPrecip))  # convert the column to numeric
glimpse(Useful$HOURLYPrecip)

##  num [1:5727] 0 0 0 0 0 0 0 0 0.06 NA ...

Step 7: Rename Columns

Useful<-Useful %>% rename(relative_humidity=HOURLYRelativeHumidity,
                          dry_bulb_temp_f=HOURLYDRYBULBTEMPF, 
                          precip=HOURLYPrecip,
                          wind_speed=HOURLYWindSpeed,
                          station_pressure=HOURLYStationPressure)# to rename all the columns 
head(Useful)

##   relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
## 1                46              83      0         13            29.99
## 2                48              53      0          6            30.03
## 3                89              36      0         13            30.12
## 4                48              36      0         14            29.80
## 5                61              39      0         11            30.50
## 6                79              41      0          6            29.92

Step 8: Check the Total Number of Missing Values for Each Column

colSums(is.na(Useful))#to check the sums of each column's missing values

## relative_humidity   dry_bulb_temp_f            precip        wind_speed 
##               151               151              1660               156 
##  station_pressure 
##               158

Step 9: Replace Missing Values with Statistical Measures

Useful$relative_humidity[is.na(Useful$relative_humidity)] <- mean(Useful$relative_humidity, na.rm = TRUE)
sum(is.na(Useful$relative_humidity))

## [1] 0

Useful$dry_bulb_temp_f[is.na(Useful$dry_bulb_temp_f)]<- median(Useful$dry_bulb_temp_f,na.rm=TRUE)
sum(is.na(Useful$dry_bulb_temp_f))

## [1] 0

Useful$precip[is.na(Useful$precip)] <- mean(Useful$precip, na.rm = TRUE)
sum(is.na(Useful$precip))

## [1] 0

Useful$wind_speed[is.na(Useful$wind_speed)] <- median(Useful$wind_speed, na.rm = TRUE)
sum(is.na(Useful$wind_speed))

## [1] 0

Useful$station_pressure[is.na(Useful$station_pressure)] <- mean(Useful$station_pressure, na.rm = TRUE)
sum(is.na(Useful$station_pressure))

## [1] 0

colSums(is.na(Useful))

## relative_humidity   dry_bulb_temp_f            precip        wind_speed 
##                 0                 0                 0                 0 
##  station_pressure 
##                 0

# We no longer have any missing values
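The five imputation statements above follow one pattern, so they could be folded into a small helper. The sketch below is one way to do that; `impute_na()` and the `toy` data frame are hypothetical names introduced for illustration and are not part of the original analysis.

```r
# Hypothetical helper: fill NAs in a numeric vector with a summary statistic
impute_na <- function(x, stat = mean) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

# Toy data frame standing in for the weather data (values are made up)
toy <- data.frame(
  relative_humidity = c(46, NA, 89, 48),
  wind_speed        = c(13, 6, NA, 14)
)

toy$relative_humidity <- impute_na(toy$relative_humidity, mean)    # mean imputation
toy$wind_speed        <- impute_na(toy$wind_speed, median)         # median imputation

colSums(is.na(toy))  # both columns should now report 0 missing values
```

A design point worth weighing: precip is the target variable, and filling its missing entries with the mean gives many rows the same artificial value (it shows up as the 0.006548 entries later in the report). Dropping rows with a missing target is a common alternative, though this report keeps them.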

Step 10: Check the summary of the dataset

summary(Useful)

##  relative_humidity dry_bulb_temp_f     precip           wind_speed   
##  Min.   : 13.00    Min.   : 5.00   Min.   :0.000000   Min.   : 0.00  
##  1st Qu.: 51.00    1st Qu.:43.00   1st Qu.:0.000000   1st Qu.: 7.00  
##  Median : 67.00    Median :56.00   Median :0.000000   Median :10.00  
##  Mean   : 66.89    Mean   :55.61   Mean   :0.006548   Mean   :11.16  
##  3rd Qu.: 84.00    3rd Qu.:70.00   3rd Qu.:0.006548   3rd Qu.:15.00  
##  Max.   :100.00    Max.   :99.00   Max.   :1.190000   Max.   :52.00  
##  station_pressure
##  Min.   :28.69   
##  1st Qu.:29.85   
##  Median :29.99   
##  Mean   :29.99   
##  3rd Qu.:30.14   
##  Max.   :30.82

Step 11: Check the strength of the linear relationship between the features and the target variable using a correlation matrix

library(ggplot2)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

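# Reorder the columns so that the target (precip) comes last
Useful <- subset(Useful, select = c(relative_humidity, wind_speed, dry_bulb_temp_f, station_pressure, precip))

# 1. Compute the pairwise Pearson correlation matrix
cor_matrix <- cor(Useful, use = "pairwise.complete.obs") # Handles NAs if any remain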

# 2. Melt the correlation matrix into a long format data frame for ggplot2
melted_cor_matrix <- melt(cor_matrix)

# 3. Create the heatmap using ggplot2
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") + # Adds white borders to tiles
  scale_fill_gradient2(low = "lightgreen", high = "red", mid = "purple",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() +
  labs(title = "Correlation matrix of Climatological Features") +
  coord_fixed() + # Ensures squares are square
  # Add the correlation coefficients as text labels
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
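
Each tile in the heatmap holds a Pearson correlation coefficient, which is the default method of `cor()`. As a rough sketch of what that number means, the code below computes it by hand on two made-up vectors and compares the result with `cor()`; the values are illustrative only and are not taken from the dataset.

```r
# Made-up vectors for illustration only
humidity <- c(46, 48, 89, 48, 61, 79)
precip   <- c(0.00, 0.00, 0.06, 0.00, 0.01, 0.03)

# Pearson correlation: covariance divided by the product of standard deviations
r_manual <- cov(humidity, precip) / (sd(humidity) * sd(precip))

# Built-in equivalent, as used to build the matrix above
r_builtin <- cor(humidity, precip)

all.equal(r_manual, r_builtin)  # the two calculations should agree
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.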

Step 12: Visualize the correlations with the target by plotting that single column of the correlation matrix as a heatmap.

library(dplyr)
library(corrr)
library(ggplot2)

target_correlations <- Useful %>% 
  correlate() %>% 
  focus(precip) %>% # Focus on the target variable column
  rename(Feature = term, Correlation = precip) %>% # Rename columns for ggplot
  filter(Feature != "precip") # Remove the 1.0 correlation with itself

## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'

ggplot(target_correlations, aes(x = "Correlation with Target", y = reorder(Feature, Correlation), fill = Correlation)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "deeppink", midpoint = 0, limit = c(-1, 1)) +
  geom_text(aes(label = round(Correlation, 3)), color = "blue", size = 4) +
  theme_minimal() +
  labs(title = "Feature Correlation with Target (precip)", y = NULL, x = NULL) +
  # Hide the legend and the x-axis text for a cleaner single-column look
  theme(legend.position = "none", # Hides the legend since numbers are on the plot
        axis.text.x = element_blank()) # Hides the x-axis tick label

Step 13: Split the Data into a Training and a Testing Set

library(rsample)
set.seed(1234)  # fix the starting point of the random number generator
# so that the same split is produced every time the code is run
weather_data <- initial_split(Useful, prop = 0.8)  # 80% training, 20% testing
train_data <- training(weather_data)  # the training set
test_data <- testing(weather_data)    # the testing set
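
For intuition, an 80/20 split can also be sketched in base R by sampling row indices. This is not how `initial_split()` is implemented and it will not select the same rows; the variable names `train_idx` and `test_idx` are hypothetical. The row count matches the 5,727-row sample, and the resulting training size of 4,581 is consistent with the 4,579 residual degrees of freedom reported for the single-predictor models (4,581 rows minus 2 coefficients).

```r
set.seed(1234)  # any fixed seed makes the split reproducible

n <- 5727                                       # number of rows in the sample
train_idx <- sample(n, size = floor(0.8 * n))   # row indices for the training set
test_idx  <- setdiff(seq_len(n), train_idx)     # remaining rows form the testing set

length(train_idx)  # 4581
length(test_idx)   # 1146
```

Indexing a data frame with `df[train_idx, ]` and `df[test_idx, ]` would then give the two subsets.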

Step 14: Train a Linear Regression Model Using Each Feature

The first model will be trained using relative_humidity as the predictor

linear_model1 <- lm(precip~relative_humidity, data=train_data)
summary(linear_model1)

## 
## Call:
## lm(formula = precip ~ relative_humidity, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01616 -0.00802 -0.00366  0.00069  1.17500 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.290e-02  1.877e-03   -6.87 7.26e-12 ***
## relative_humidity  2.906e-04  2.691e-05   10.80  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.036 on 4579 degrees of freedom
## Multiple R-squared:  0.02483,    Adjusted R-squared:  0.02462 
## F-statistic: 116.6 on 1 and 4579 DF,  p-value: < 2.2e-16
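
The Estimate column above comes from ordinary least squares. For a single predictor the estimates have a closed form, which the sketch below checks against `lm()` on made-up data (the vectors `x` and `y` are illustrative and not taken from the dataset).

```r
# Made-up data for illustration only
x <- c(46, 48, 89, 48, 61, 79)
y <- c(0.00, 0.00, 0.06, 0.00, 0.01, 0.03)

# Closed-form least-squares estimates for one predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# lm() should return the same two numbers
fit <- lm(y ~ x)
coef(fit)
c(intercept, slope)
```

Read this way, the fitted slope of 2.906e-04 in the summary above means that each additional percentage point of relative humidity is associated with roughly 0.0003 more inches of predicted precipitation.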

Visualize

ggplot(train_data, aes(x = relative_humidity, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
  labs(
    title = "Linear Model 1",
    x = "Relative Humidity",
    y = "Precip"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The second model will be trained using dry_bulb_temp_f as the predictor

linear_model2<-lm(precip~dry_bulb_temp_f,data=train_data)
summary(linear_model2)

## 
## Call:
## lm(formula = precip ~ dry_bulb_temp_f, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.00758 -0.00672 -0.00617 -0.00012  1.18316 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     5.216e-03  1.860e-03   2.804  0.00507 **
## dry_bulb_temp_f 2.388e-05  3.204e-05   0.745  0.45626   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03646 on 4579 degrees of freedom
## Multiple R-squared:  0.0001212,  Adjusted R-squared:  -9.714e-05 
## F-statistic: 0.5551 on 1 and 4579 DF,  p-value: 0.4563

ggplot(train_data, aes(x = dry_bulb_temp_f, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
  labs(
    title = "Linear Model 2",
    x = "dry_bulb_temp_f",
    y = "Precip"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The third model will be trained using wind_speed as the predictor

linear_model3<-lm(precip~wind_speed,data=train_data)
summary(linear_model3)

## 
## Call:
## lm(formula = precip ~ wind_speed, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01244 -0.00650 -0.00542  0.00005  1.18053 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 3.532e-03  1.109e-03   3.186  0.00145 **
## wind_speed  2.700e-04  8.697e-05   3.105  0.00192 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03642 on 4579 degrees of freedom
## Multiple R-squared:  0.002101,   Adjusted R-squared:  0.001883 
## F-statistic: 9.639 on 1 and 4579 DF,  p-value: 0.001916

ggplot(train_data, aes(x = wind_speed, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
  labs(
    title = "Linear Model 3",
    x = "Wind Speed",
    y = "Precip"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The fourth model will be trained using station_pressure as the predictor

linear_model4<-lm(precip~station_pressure,data=train_data)
summary(linear_model4)

## 
## Call:
## lm(formula = precip ~ station_pressure, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02058 -0.00736 -0.00406  0.00004  1.17851 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.501903   0.068231   7.356 2.23e-13 ***
## station_pressure -0.016518   0.002275  -7.260 4.51e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03625 on 4579 degrees of freedom
## Multiple R-squared:  0.01138,    Adjusted R-squared:  0.01116 
## F-statistic: 52.71 on 1 and 4579 DF,  p-value: 4.515e-13

ggplot(train_data, aes(x = station_pressure, y = precip)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "lightgreen") +
  labs(
    title = "Linear Model 4",
    x = "Station Pressure",
    y = "Precip"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The fifth model will be trained using two features: wind_speed and station_pressure

linear_model5<-lm(precip~wind_speed + station_pressure,data=train_data)
summary(linear_model5)

## 
## Call:
## lm(formula = precip ~ wind_speed + station_pressure, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02235 -0.00731 -0.00410  0.00016  1.17758 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.778e-01  7.119e-02   6.712 2.16e-11 ***
## wind_speed        1.068e-04  8.996e-05   1.187    0.235    
## station_pressure -1.575e-02  2.364e-03  -6.663 3.00e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03625 on 4578 degrees of freedom
## Multiple R-squared:  0.01168,    Adjusted R-squared:  0.01125 
## F-statistic: 27.06 on 2 and 4578 DF,  p-value: 2.069e-12

ggplot(train_data, aes(x = wind_speed, y = precip, color = station_pressure)) +
  geom_point() +
  labs(title = "Trained Model with Two Numeric Predictors") +
  theme_minimal()

The last model will be built using all the features

multiple_reg_model<-lm(precip~.,data=train_data)
summary(multiple_reg_model)

## 
## Call:
## lm(formula = precip ~ ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03033 -0.00818 -0.00343  0.00120  1.16781 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.019e-01  7.755e-02   2.604  0.00925 ** 
## relative_humidity  3.032e-04  2.930e-05  10.350  < 2e-16 ***
## wind_speed         4.315e-04  9.636e-05   4.478 7.71e-06 ***
## dry_bulb_temp_f   -4.515e-06  3.269e-05  -0.138  0.89015    
## station_pressure  -7.344e-03  2.536e-03  -2.896  0.00380 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03584 on 4576 degrees of freedom
## Multiple R-squared:  0.03431,    Adjusted R-squared:  0.03346 
## F-statistic: 40.64 on 4 and 4576 DF,  p-value: < 2.2e-16
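
Each summary reports both a Multiple and an Adjusted R-squared. The adjusted figure penalises the plain R-squared for the number of predictors, which matters when comparing models with different numbers of features. The sketch below recomputes it by hand on R's built-in `mtcars` data, used here only as a stand-in, and compares it with the value from `summary()`.

```r
# Fit a small multiple regression on R's built-in mtcars data (illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

n <- nrow(mtcars)            # number of observations
p <- length(coef(fit)) - 1   # number of predictors, excluding the intercept

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

all.equal(adj_r2_manual, s$adj.r.squared)  # the two values should agree
```

With thousands of training rows and at most four predictors, as in this report, the penalty is small, which is why the two figures in each summary are nearly identical.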

For each of the models created, check the model performance using the testing set.

Step 15: Predict Using the Testing Data

actual_values<-test_data$precip
linear_model1_predictions<-predict(linear_model1,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model1_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual     Predicted      Residual
## 1 0.000000000  0.0004680524 -0.0004680524
## 2 0.000000000  0.0100572526 -0.0100572526
## 3 0.000000000  0.0059891070 -0.0059891070
## 4 0.000000000 -0.0021471841  0.0021471841
## 5 0.006547824  0.0077325980 -0.0011847740
## 6 0.006547824 -0.0024377659  0.0089855898

actual_values<-test_data$precip
linear_model2_predictions<-predict(linear_model2,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model2_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual   Predicted      Residual
## 1 0.000000000 0.007197842 -0.0071978425
## 2 0.000000000 0.006195052 -0.0061950518
## 3 0.000000000 0.005789160 -0.0057891604
## 4 0.000000000 0.006266680 -0.0062666797
## 5 0.006547824 0.006911331 -0.0003635069
## 6 0.006547824 0.006338308  0.0002095163

actual_values<-test_data$precip
linear_model3_predictions<-predict(linear_model3,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model3_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual   Predicted      Residual
## 1 0.000000000 0.007042806 -0.0070428064
## 2 0.000000000 0.005152632 -0.0051526317
## 3 0.000000000 0.006502756 -0.0065027565
## 4 0.000000000 0.005422657 -0.0054226567
## 5 0.006547824 0.005152632  0.0013951922
## 6 0.006547824 0.007042806 -0.0004949825

actual_values<-test_data$precip
linear_model4_predictions<-predict(linear_model4,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model4_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual   Predicted      Residual
## 1 0.000000000 0.006536439 -0.0065364392
## 2 0.000000000 0.007692680 -0.0076926804
## 3 0.000000000 0.000590056 -0.0005900560
## 4 0.000000000 0.002241829 -0.0022418291
## 5 0.006547824 0.007362326 -0.0008145018
## 6 0.006547824 0.008848922 -0.0023010976

actual_values<-test_data$precip
linear_model5_predictions<-predict(linear_model5,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model5_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual    Predicted      Residual
## 1 0.000000000 0.0067343816 -0.0067343816
## 2 0.000000000 0.0070898301 -0.0070898301
## 3 0.000000000 0.0008495988 -0.0008495988
## 4 0.000000000 0.0019979238 -0.0019979238
## 5 0.006547824 0.0067747596 -0.0002269357
## 6 0.006547824 0.0089398751 -0.0023920511

actual_values<-test_data$precip
multiple_reg_predictions<-predict(multiple_reg_model,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=multiple_reg_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual     Predicted      Residual
## 1 0.000000000  0.0008754209 -0.0008754209
## 2 0.000000000  0.0085656690 -0.0085656690
## 3 0.000000000  0.0033967785 -0.0033967785
## 4 0.000000000 -0.0061761547  0.0061761547
## 5 0.006547824  0.0058573531  0.0006904708
## 6 0.006547824 -0.0009663807  0.0075142047

Step 16: Model Evaluation Using RMSE to Get the Best Model

Let’s evaluate all the models, compare them, and pick the best one.

RMSE stands for Root Mean Square Error, a metric widely used to evaluate regression models. It is the square root of the mean of the squared differences between the predicted and the actual values, so it summarises the typical size of a prediction error and is expressed in the same unit as the target variable (here, inches of precipitation).
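
As a minimal sketch of that definition, the helper below computes the metric by hand. `rmse_manual()` is a hypothetical name and the toy values are made up; the analysis itself uses `Metrics::rmse()`.

```r
# Hypothetical helper: root mean square error of a set of predictions
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values for illustration only
actual    <- c(0.00, 0.00, 0.06, 0.01)
predicted <- c(0.01, 0.00, 0.04, 0.01)

rmse_manual(actual, predicted)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.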

library(Metrics)

## 
## Attaching package: 'Metrics'

## The following object is masked from 'package:rlang':
## 
##     ll

## The following objects are masked from 'package:yardstick':
## 
##     accuracy, mae, mape, mase, precision, recall, rmse, smape

class(actual_values)

## [1] "numeric"

class(multiple_reg_predictions)

## [1] "numeric"

rmse_value1_test<-Metrics::rmse(actual_values,linear_model1_predictions)
rmse_value1_test

## [1] 0.0285869

rmse_value2_test<-Metrics::rmse(actual_values,linear_model2_predictions)
rmse_value2_test

## [1] 0.02922909

rmse_value3_test<-Metrics::rmse(actual_values,linear_model3_predictions)
rmse_value3_test

## [1] 0.02899481

rmse_value4_test<-Metrics::rmse(actual_values,linear_model4_predictions)
rmse_value4_test

## [1] 0.0288756

rmse_value5_test<-Metrics::rmse(actual_values,linear_model5_predictions)
rmse_value5_test

## [1] 0.028801

rmse_value6_test<-Metrics::rmse(actual_values,multiple_reg_predictions)
rmse_value6_test

## [1] 0.02802316

Step 17: Use a Table to Compare the Evaluation Metrics

model_names <- c("linear_model1","linear_model2","linear_model3","linear_model4","linear_model5","multiple_reg_model")
# Reuse the computed values so the column stays numeric instead of retyping them as text
RMSE <- c(rmse_value1_test, rmse_value2_test, rmse_value3_test, rmse_value4_test, rmse_value5_test, rmse_value6_test)
comparison_df <- data.frame(model_names, RMSE)
comparison_df

##          model_names       RMSE
## 1      linear_model1 0.02858690
## 2      linear_model2 0.02922909
## 3      linear_model3 0.02899481
## 4      linear_model4 0.02887560
## 5      linear_model5 0.02880100
## 6 multiple_reg_model 0.02802316

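From the table above, multiple_reg_model has the lowest test RMSE (0.02802). The lower the RMSE, the more accurate the model, so multiple_reg_model is the best model in this project, although the differences are small: every model has an RMSE between 0.028 and 0.029.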
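
The comparison can also be produced programmatically instead of model by model. The sketch below shows one way to do that on R's built-in `mtcars` data, used only as a stand-in; the model names and the 80/20 split are assumptions made for the example and the numbers it produces have nothing to do with the weather results.

```r
# Illustration on R's built-in mtcars data; model names here are hypothetical
set.seed(1234)
train_idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

models <- list(
  wt_only = lm(mpg ~ wt, data = train),
  hp_only = lm(mpg ~ hp, data = train),
  wt_hp   = lm(mpg ~ wt + hp, data = train)
)

# Test-set RMSE for every model, computed in one pass
rmse_by_model <- sapply(models, function(m) {
  sqrt(mean((test$mpg - predict(m, newdata = test))^2))
})

comparison <- data.frame(model = names(rmse_by_model),
                         RMSE  = unname(rmse_by_model))
comparison[order(comparison$RMSE), ]   # lowest RMSE first
names(which.min(rmse_by_model))        # name of the best model
```

Keeping the models in a named list means a new candidate only needs one extra line, and the RMSE column stays numeric so it can be sorted.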

Save the best model as an RDS file.

saveRDS(multiple_reg_model,file="my_multiple_linear_regression_model.rds")
list.files(pattern = "\\.rds$")  # list only the .rds files in the working directory to confirm the model was saved
## [105] "my_multiple_linear_regression_model.rds"          
## [106] "MY_SLIDE_BEAUTIFIED.pptx"                         
## [107] "MY_SLIDE_UPDATED.pptx"                            
## [108] "Nike Sales Cleaned Data Sales.Rmd"                
## [109] "Nike sales cleaned data.csv"                      
## [110] "Nike sales cleaned dataset.csv"                   
## [111] "Nike_Sales_Analysis_2026.pptx"                    
## [112] "Nike_Sales_Uncleaned.csv"                         
## [113] "Nike-Sales-Cleaned-Data-Code.html"                
## [114] "Nike-Sales-Cleaned-Data-Sales.html"               
## [115] "Nike-Sales-Trend-Analysis (1).Rmd"                
## [116] "Nike-Sales-Trend-Analysis.docx"                   
## [117] "Nike-Sales-Trend-Analysis.Rmd"                    
## [118] "noaa-weather-sample-data"                         
## [119] "noaa-weather-sample-data (1).tar.gz"              
## [120] "noaa-weather-sample-data.tar.gz"                  
## [121] "notconfirmed.Rmd"                                 
## [122] "output_table.png"                                 
## [123] "Positron-2025.11.0-234-arm64.dmg"                 
## [124] "Precipitation dashboard.R"                        
## [125] "Predictive Project.Rmd"                           
## [126] "Predictive-Project_files"                         
## [127] "Predictive-Project.html"                          
## [128] "Predictive-Project.Rmd"                           
## [129] "Presentation.pptx"                                
## [130] "Project Details .pdf"                             
## [131] "R-4.5.2-arm64.pkg"                                
## [132] "Real Project.Rmd"                                 
## [133] "Real-Project.html"                                
## [134] "Resources for ICAMMDA.docx"                       
## [135] "rsconnect"                                        
## [136] "RStudio-2025.09.2-418.dmg"                        
## [137] "SpotifyInstaller (1).zip"                         
## [138] "SpotifyInstaller (2).zip"                         
## [139] "SpotifyInstaller.zip"                             
## [140] "st.R"                                             
## [141] "Stock_Analysis_Project.html"                      
## [142] "Stock_Analysis_Project.Rmd"                       
## [143] "Student Mental health (1).csv"                    
## [144] "STUDY CORNER.docx"                                
## [145] "TASK.pdf"                                         
## [146] "The Traders Bible .pdf"                           
## [147] "Tidying data (1).qmd"                             
## [148] "Tidying data (2).qmd"                             
## [149] "Tidying data.qmd"                                 
## [150] "Toluwani ABIOLA CV_025611.docx"                   
## [151] "Toluwani edx Course 2 (1).Rmd"                    
## [152] "Toluwani edx Course 2 (2).Rmd"                    
## [153] "Toluwani edx Course 2.Rmd"                        
## [154] "TOLUWANI PRECIOUS fitness (1).pdf"                
## [155] "TOLUWANI PRECIOUS fitness.pdf"                    
## [156] "Toluwani-Chapter-7-and-8.html"                    
## [157] "Transformed Dataset.Rmd"                          
## [158] "Transformed-Dataset.html"                         
## [159] "UDS0028020498213.png"                             
## [160] "unclean.png"                                      
## [161] "VICTORIA PRESENTATION SLIDE .pptx"                
## [162] "Visual Studio Code 2.app"                         
## [163] "Visual Studio Code 3.app"                         
## [164] "Visual Studio Code.app"                           
## [165] "VSCode-darwin-universal (1).zip"                  
## [166] "VSCode-darwin-universal.zip"                      
## [167] "WB_ES_T_FIN12.csv"                                
## [168] "WB_FINDEX_FIN11_2A.csv"                           
## [169] "weather-weka.csv"                                 
## [170] "WhatsApp Image 2025-11-14 at 11.24.17 PM.jpeg"    
## [171] "WhatsApp Image 2025-11-14 at 11.24.18 PM (1).jpeg"
## [172] "WhatsApp Image 2025-11-14 at 11.24.18 PM.jpeg"    
## [173] "WhatsApp Image 2025-11-15 at 4.58.13 PM.jpeg"     
## [174] "WhatsApp Image 2025-11-15 at 8.37.29 PM.jpeg"     
## [175] "yahoo_data 2.xlsx"                                
## [176] "yahoo_data.xlsx"

Next, we deploy the model using a Shiny app.
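A minimal sketch of what such a Shiny app could look like is given below. It is not the app used in this project: the predictor names (`relative_humidity`, `dry_bulb_temp_f`), the response name `precip`, the helper `predict_precip()`, and the small synthetic training data are all hypothetical stand-ins so that the example runs on its own. In practice the fitted model would be loaded from a saved `.rds` file with `readRDS()` rather than refit.

```r
library(shiny)

# Hypothetical stand-in for the saved model: a small lm() fit on synthetic data.
# In a real app this would instead be something like
#   model <- readRDS("my_multiple_linear_regression_model.rds")
# (assumption: that file holds the fitted model).
set.seed(1)
toy <- data.frame(
  relative_humidity = runif(50, 30, 100),  # hypothetical predictor
  dry_bulb_temp_f   = runif(50, 20, 90)    # hypothetical predictor
)
toy$precip <- 0.001 * toy$relative_humidity + rnorm(50, sd = 0.01)  # synthetic response
model <- lm(precip ~ relative_humidity + dry_bulb_temp_f, data = toy)

# Turn one pair of user inputs into a single predicted value.
predict_precip <- function(model, humidity, temp) {
  new_obs <- data.frame(relative_humidity = humidity, dry_bulb_temp_f = temp)
  unname(predict(model, newdata = new_obs))
}

ui <- fluidPage(
  titlePanel("Precipitation prediction (sketch)"),
  numericInput("humidity", "Relative humidity (%)", value = 60, min = 0, max = 100),
  numericInput("temp", "Dry bulb temperature (F)", value = 50),
  textOutput("prediction")
)

server <- function(input, output, session) {
  output$prediction <- renderText({
    req(input$humidity, input$temp)  # wait until both inputs are filled in
    pred <- predict_precip(model, input$humidity, input$temp)
    paste("Predicted precipitation:", round(pred, 3))
  })
}

# Build the app object; launch it interactively with runApp(app).
app <- shinyApp(ui, server)
```

Keeping the prediction step in a plain function such as `predict_precip()` makes it possible to check the model's output outside the app before deploying it.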

FINDINGS

All of the features correlate only very weakly with precipitation.
Many models could be trained from the features in the dataset; six were trained in this project.
The best-fitting model was the multiple linear regression, which used all of the features.
On the unseen data, this model makes an average prediction error of 2.8%.
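To make the last finding concrete: an average prediction error on unseen data is usually summarised with a metric such as RMSE or MAE, computed by comparing the test-set predictions with the observed values. The sketch below uses base R and a handful of made-up numbers. The vectors `observed` and `predicted` are illustrative assumptions, not values from this project, and the sketch does not claim to reproduce the 2.8% figure.

```r
# Illustrative values only: not taken from the JFK dataset.
observed  <- c(0.00, 0.02, 0.10, 0.00, 0.05)
predicted <- c(0.01, 0.03, 0.07, 0.02, 0.05)

# Root mean squared error: penalises large misses more heavily.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Mean absolute error: the average size of a miss.
mae <- function(truth, estimate) mean(abs(truth - estimate))

rmse_value <- rmse(observed, predicted)
mae_value  <- mae(observed, predicted)

print(c(rmse = rmse_value, mae = mae_value))
```

With tidymodels, the same quantities are available through `yardstick::rmse()` and `yardstick::mae()` applied to a data frame of test-set predictions.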

CONCLUSION/DISCUSSION

All of the features correlate only very weakly with precipitation (none of the correlation coefficients is close to 1).
Additional data collection is needed to build a more accurate forecasting system on features that correlate more strongly with precipitation.
Further analysis should be conducted to predict the exact type of precipitation.
More features should be added to identify the specific climatological factors that correlate strongly with precipitation, and the whole dataset or a larger sample should be used, since only 5% was used for this project.
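The correlation check behind this conclusion can be sketched as follows, and the same check can be rerun whenever a larger sample or new features are added. The data frame and its column names (`precip`, `relative_humidity`, `wind_speed`, `station_pressure`) are hypothetical and randomly generated; they only show the mechanics and say nothing about the real JFK data.

```r
set.seed(42)

# Synthetic stand-in for the weather sample; column names are hypothetical.
n <- 200
weather <- data.frame(
  relative_humidity = runif(n, 30, 100),
  wind_speed        = runif(n, 0, 30),
  station_pressure  = rnorm(n, mean = 30, sd = 0.3)
)
weather$precip <- 0.0005 * weather$relative_humidity + rnorm(n, sd = 0.05)

# Pearson correlation of every predictor with the response,
# using only rows where both values are present.
predictors <- setdiff(names(weather), "precip")
cors <- sapply(predictors, function(v) {
  cor(weather[[v]], weather$precip, use = "complete.obs")
})

# Rank predictors by absolute correlation, strongest first.
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 3))
```

An absolute value near 0 indicates a weak linear relationship and a value near 1 a strong one. A weak Pearson correlation does not rule out a non-linear relationship, which is one reason additional features and more data are worth exploring.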

Abiola Toluwani

2025-11-13