A Predictive Analysis On Weather Precipitation Using Key Climatological Variables (A Case Study of JFK Airport, New York)

TASK 1: LOAD THE REQUIRED PACKAGES

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──

## ✔ broom        1.0.10     ✔ recipes      1.3.1 
## ✔ dials        1.4.2      ✔ rsample      1.3.1 
## ✔ dplyr        1.1.4      ✔ tailor       0.1.0 
## ✔ ggplot2      4.0.0      ✔ tidyr        1.3.1 
## ✔ infer        1.0.9      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.3.3      ✔ workflowsets 1.1.1 
## ✔ purrr        1.2.0      ✔ yardstick    1.3.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()

library(rlang)

## 
## Attaching package: 'rlang'

## The following objects are masked from 'package:purrr':
## 
##     flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice

library(corrplot)

## corrplot 0.95 loaded

TASK 2: READ IN THE DATASET

pal<-read.csv("/Users/mac/Documents/jfk_weather_sample.csv")
head(pal)#to view the first six rows of the dataset

##                   DATE HOURLYDewPointTempF HOURLYRelativeHumidity
## 1 2015-07-25T13:51:00Z                  60                     46
## 2 2016-11-18T23:51:00Z                  34                     48
## 3 2013-01-06T08:51:00Z                  33                     89
## 4 2011-01-27T16:51:00Z                  18                     48
## 5 2015-01-03T12:16:00Z                  27                     61
## 6 2013-02-15T20:51:00Z                  35                     79
##   HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1                 83                 68         0.00              13
## 2                 53                 44         0.00               6
## 3                 36                 35         0.00              13
## 4                 36                 30         0.00              14
## 5                 39                 34            T              11
## 6                 41                 38         0.00               6
##   HOURLYSeaLevelPressure HOURLYStationPressure
## 1                  30.01                 29.99
## 2                  30.05                 30.03
## 3                  30.14                 30.12
## 4                  29.82                 29.80
## 5                     NA                 30.50
## 6                  29.94                 29.92

tail(pal)#to view the last six rows of the dataset

##                      DATE HOURLYDewPointTempF HOURLYRelativeHumidity
## 5722 2015-05-26T08:51:00Z                  59                     64
## 5723 2018-06-05T23:51:00Z                  50                     67
## 5724 2016-09-20T23:59:00Z                <NA>                     NA
## 5725 2014-04-19T18:51:00Z                  34                     48
## 5726 2010-01-08T10:51:00Z                  25                     75
## 5727 2016-01-14T18:51:00Z                  22                     52
##      HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 5722                 72                 64         0.00               9
## 5723                 61                 55         0.00               6
## 5724                 NA                 NA         <NA>              NA
## 5725                 53                 44         0.00               3
## 5726                 32                 29         0.00               7
## 5727                 38                 32         0.00               0
##      HOURLYSeaLevelPressure HOURLYStationPressure
## 5722                  30.23                 30.21
## 5723                  29.74                 29.72
## 5724                     NA                    NA
## 5725                  30.28                 30.26
## 5726                  29.72                 29.69
## 5727                  29.99                 29.97

glimpse(pal)#to see the different data types

## Rows: 5,727
## Columns: 9
## $ DATE                   <chr> "2015-07-25T13:51:00Z", "2016-11-18T23:51:00Z",…
## $ HOURLYDewPointTempF    <chr> "60", "34", "33", "18", "27", "35", "4", "14", …
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYWETBULBTEMPF     <int> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed        <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

DATA PREPROCESSING

TASK 3: SELECT SUBSET OF COLUMNS

The first step in data preprocessing is to select a subset of the columns and inspect their types. For this project, I need the HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip, HOURLYWindSpeed and HOURLYStationPressure columns.

THE NEEDED DATA GLOSSARY

HOURLYRelativeHumidity: the relative humidity, given to the nearest whole percentage.
HOURLYDRYBULBTEMPF: the dry-bulb temperature, commonly used as the standard reported air temperature. It is given here in whole degrees Fahrenheit.
HOURLYPrecip: the amount of precipitation, in inches to hundredths, over the past hour. For certain automated stations, precipitation is reported at sub-hourly intervals (e.g. every 15 or 20 minutes) as an accumulated amount of all precipitation within the preceding hour. A “T” indicates a trace amount of precipitation.
HOURLYWindSpeed: the speed of the wind at the time of observation, given in miles per hour (mph).
HOURLYStationPressure: the atmospheric pressure observed at the station at the time of observation, given in inches of mercury (inHg).

Useful <- pal %>%
  select(HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip,
         HOURLYWindSpeed, HOURLYStationPressure) # keep only the columns needed

head(Useful)#to show the first six rows of the dataframe

##   HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1                     46                 83         0.00              13
## 2                     48                 53         0.00               6
## 3                     89                 36         0.00              13
## 4                     48                 36         0.00              14
## 5                     61                 39            T              11
## 6                     79                 41         0.00               6
##   HOURLYStationPressure
## 1                 29.99
## 2                 30.03
## 3                 30.12
## 4                 29.80
## 5                 30.50
## 6                 29.92

tail(Useful)#to show the last six rows of the dataframe

##      HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 5722                     64                 72         0.00               9
## 5723                     67                 61         0.00               6
## 5724                     NA                 NA         <NA>              NA
## 5725                     48                 53         0.00               3
## 5726                     75                 32         0.00               7
## 5727                     52                 38         0.00               0
##      HOURLYStationPressure
## 5722                 30.21
## 5723                 29.72
## 5724                    NA
## 5725                 30.26
## 5726                 29.69
## 5727                 29.97

glimpse(Useful)#to show the number of rows, columns and the data type

## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed        <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

TASK 4: CLEAN UP COLUMNS

unique(Useful$HOURLYPrecip)#to list the distinct values that appear in the column

##  [1] "0.00"  "T"     "0.06"  NA      "0.03"  "0.02"  "0.08"  "0.01"  "0.07" 
## [10] "0.16"  "0.09"  "0.22"  "0.02s" "0.24"  "0.18"  "0.05"  "0.04"  "0.09s"
## [19] "0.11"  "0.14"  "0.25"  "0.10"  "0.01s" "0.58"  "0.12"  "0.13"  "0.46" 
## [28] "1.07"  "1.19"  "0.34"  "0.20"  "0.36s" "0.42"  "0.17"  "0.27"  "0.35" 
## [37] "0.31"  "0.33"  "0.23"  "0.26"  "0.28"  "0.75"  "0.19"  "0.36"  "0.03s"
## [46] "0.07s" "0.54"  "0.59"  "0.21"

From the data frame (Useful), we can see that the column HOURLYPrecip, which is the hourly measure of precipitation levels, contains both NA and T values. T specifies trace amounts of precipitation (meaning essentially no precipitation), while NA means not available and is used to denote missing values. Additionally, some values have an “s” at the end, indicating that the precipitation was snow.

Having characters in the values (like the “T” and “s” seen in the unique values) will cause problems when I create a model, because precipitation values should be numeric.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ readr     2.1.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor()  masks scales::col_factor()
## ✖ purrr::discard()     masks scales::discard()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ stringr::fixed()     masks recipes::fixed()
## ✖ rlang::flatten()     masks purrr::flatten()
## ✖ rlang::flatten_chr() masks purrr::flatten_chr()
## ✖ rlang::flatten_dbl() masks purrr::flatten_dbl()
## ✖ rlang::flatten_int() masks purrr::flatten_int()
## ✖ rlang::flatten_lgl() masks purrr::flatten_lgl()
## ✖ rlang::flatten_raw() masks purrr::flatten_raw()
## ✖ rlang::invoke()      masks purrr::invoke()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ readr::spec()        masks yardstick::spec()
## ✖ rlang::splice()      masks purrr::splice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Useful <- Useful %>%
  mutate(HOURLYPrecip = HOURLYPrecip %>% # update the existing column
           gsub("T", "0.0", .) %>%       # replace every "T" (trace) with "0.0"
           str_remove(pattern = "s$"))   # remove a trailing "s" using a regex pattern
unique(Useful$HOURLYPrecip) # check that no value of HOURLYPrecip still contains "T" or "s"

##  [1] "0.00" "0.0"  "0.06" NA     "0.03" "0.02" "0.08" "0.01" "0.07" "0.16"
## [11] "0.09" "0.22" "0.24" "0.18" "0.05" "0.04" "0.11" "0.14" "0.25" "0.10"
## [21] "0.58" "0.12" "0.13" "0.46" "1.07" "1.19" "0.34" "0.20" "0.36" "0.42"
## [31] "0.17" "0.27" "0.35" "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19"
## [41] "0.54" "0.59" "0.21"
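
As a quick sanity check, the same cleanup can be tried on a small hand-made vector. The sketch below uses only base R. The vector `raw_precip` is a made-up example, not data from the JFK sample, and the anchored patterns `^T$` and `s$` are just one possible way to make sure that only a lone "T" and a trailing "s" are touched.

```r
# Hypothetical toy values that mimic the formats seen in HOURLYPrecip
raw_precip <- c("0.00", "T", "0.02s", NA, "1.07")

# Replace a lone "T" (trace) with "0.0", then drop a trailing "s"
clean_precip <- sub("^T$", "0.0", raw_precip)
clean_precip <- sub("s$", "", clean_precip)

# Convert to numeric; missing values stay NA
precip_num <- as.numeric(clean_precip)
print(precip_num)
```

If the conversion had produced extra NA values, that would point to a leftover non-numeric character in the column.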

TASK 5: CONVERT COLUMN (HOURLYPrecip) TO NUMERIC TYPE

Useful <- Useful %>% mutate(HOURLYPrecip = as.numeric(HOURLYPrecip)) # convert the column to numeric
glimpse(Useful)

## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ HOURLYWindSpeed        <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

TASK 6: RENAME COLUMNS

Useful<-Useful %>% rename(relative_humidity=HOURLYRelativeHumidity,
                          dry_bulb_temp_f=HOURLYDRYBULBTEMPF, 
                          precip=HOURLYPrecip,
                          wind_speed=HOURLYWindSpeed,
                          station_pressure=HOURLYStationPressure)# to rename all the columns 
head(Useful)

##   relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
## 1                46              83      0         13            29.99
## 2                48              53      0          6            30.03
## 3                89              36      0         13            30.12
## 4                48              36      0         14            29.80
## 5                61              39      0         11            30.50
## 6                79              41      0          6            29.92

REPLACE THE MISSING VALUES

colSums(is.na(Useful))#to count the missing values in each column

## relative_humidity   dry_bulb_temp_f            precip        wind_speed 
##               151               151              1660               156 
##  station_pressure 
##               158

Useful$relative_humidity[is.na(Useful$relative_humidity)] <- mean(Useful$relative_humidity, na.rm = TRUE)

Useful$dry_bulb_temp_f[is.na(Useful$dry_bulb_temp_f)]<- median(Useful$dry_bulb_temp_f,na.rm=TRUE)
Useful$precip[is.na(Useful$precip)] <- mean(Useful$precip, na.rm = TRUE)
Useful$wind_speed[is.na(Useful$wind_speed)] <- median(Useful$wind_speed, na.rm = TRUE)
Useful$station_pressure[is.na(Useful$station_pressure)] <- mean(Useful$station_pressure, na.rm = TRUE)
colSums(is.na(Useful))

## relative_humidity   dry_bulb_temp_f            precip        wind_speed 
##                 0                 0                 0                 0 
##  station_pressure 
##                 0

We no longer have any missing values.
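
The five replacement lines above all follow the same pattern, so they could be wrapped in a small function. The sketch below is only an illustration: `impute_with` is a hypothetical helper (it does not come from any package), and the two vectors are made-up values rather than columns from the weather data.

```r
# Fill missing values in a numeric vector with a summary statistic.
# `impute_with` is a hypothetical helper, not part of any package.
impute_with <- function(x, stat = mean) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

# Made-up example values
humidity <- c(46, NA, 89, 48)
wind     <- c(13, 6, NA, 14, 100)

humidity_filled <- impute_with(humidity)            # uses the mean
wind_filled     <- impute_with(wind, stat = median) # uses the median

print(humidity_filled)
print(wind_filled)
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for a skewed column.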

TASK 7: EXPLORATORY DATA ANALYSIS

Let’s check the summary of the dataset.

summary(Useful)

##  relative_humidity dry_bulb_temp_f     precip           wind_speed   
##  Min.   : 13.00    Min.   : 5.00   Min.   :0.000000   Min.   : 0.00  
##  1st Qu.: 51.00    1st Qu.:43.00   1st Qu.:0.000000   1st Qu.: 7.00  
##  Median : 67.00    Median :56.00   Median :0.000000   Median :10.00  
##  Mean   : 66.89    Mean   :55.61   Mean   :0.006548   Mean   :11.16  
##  3rd Qu.: 84.00    3rd Qu.:70.00   3rd Qu.:0.006548   3rd Qu.:15.00  
##  Max.   :100.00    Max.   :99.00   Max.   :1.190000   Max.   :52.00  
##  station_pressure
##  Min.   :28.69   
##  1st Qu.:29.85   
##  Median :29.99   
##  Mean   :29.99   
##  3rd Qu.:30.14   
##  Max.   :30.82

Let’s check the strength of the linear relationship between the features and the target variable, using a correlation matrix.

library(ggplot2)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

Useful <- subset(Useful, select = c(relative_humidity, wind_speed, dry_bulb_temp_f,station_pressure,precip))

# 1. Compute the correlation matrix (pairwise.complete.obs handles NAs, if any remain)
cor_matrix <- cor(Useful, use = "pairwise.complete.obs")

# 2. Melt the correlation matrix into a long format data frame for ggplot2
melted_cor_matrix <- melt(cor_matrix)

# 3. Create the heatmap using ggplot2
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") + # Adds white borders to tiles
  scale_fill_gradient2(low = "lightgreen", high = "red", mid = "purple",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() +
  labs(title = "Correlation matrix of Climatological Features") +
  coord_fixed() + # Ensures squares are square
  # Add the correlation coefficients as text labels
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Next, let’s look at a single column of this correlation matrix, the correlation of each feature with the target, and plot it as a heatmap.

library(dplyr)
library(corrr)
library(ggplot2)

target_correlations <- Useful %>% 
  correlate() %>% 
  focus(precip) %>% # Focus on the target variable column
  rename(Feature = term, Correlation = precip) %>% # Rename columns for ggplot
  filter(Feature != "precip") # Remove the 1.0 correlation with itself

## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'

ggplot(target_correlations, aes(x = "Correlation with Target", y = reorder(Feature, Correlation), fill = Correlation)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "deeppink", midpoint = 0, limit = c(-1, 1)) +
  geom_text(aes(label = round(Correlation, 3)), color = "blue", size = 4) +
  theme_minimal() +
  labs(title = "Feature Correlation with Target (precip)", y = NULL, x = NULL) +
  # Theme adjustments for a cleaner look
  theme(legend.position = "none", # Hides the legend since the values are printed on the tiles
        axis.text.x = element_blank()) # Hides the x-axis tick label
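
The feature-to-target correlations can also be obtained with base R alone, without corrr. The sketch below is a minimal illustration under stated assumptions: the built-in `mtcars` data stands in for the weather data, `mpg` plays the role of the target (precip), and the three columns chosen as features are arbitrary.

```r
# mtcars stands in for the weather data here; mpg plays the role of precip
features <- mtcars[, c("wt", "hp", "qsec")]
target   <- mtcars$mpg

# cor() with a data frame and a vector returns one Pearson correlation per column
target_cor <- cor(features, target)
print(round(target_cor, 3))
```

The result is a one-column matrix with one row per feature, which is the same information the single-column heatmap displays.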

Now, let’s split the data into a training and testing set.

set.seed(1234)#sets the starting number used to generate a sequence of random numbers.
#This ensures that the same split is produced each time the code is run with the same seed.
weather_data <- initial_split(Useful,prop=0.8)#to split the data, with 80% going to training
train_data <- training(weather_data)#the training set
test_data <- testing(weather_data)#the testing set
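
To see what the seed does, the base R sketch below draws the same random sample twice. It is only an illustration and does not use rsample. The last line assumes that the training size is the 80% share of the 5,727 rows rounded down, which is an assumption about how the split is computed.

```r
# Drawing "random" row indices twice with the same seed
set.seed(1234)
first_draw <- sample(1:10, size = 8)

set.seed(1234)
second_draw <- sample(1:10, size = 8)

# Same seed, same draw
same_draw <- identical(first_draw, second_draw)
print(same_draw)

# Expected training size, assuming the 80% share is rounded down
n_train <- floor(0.8 * 5727)
print(n_train)
```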

I want to explore the distributions of the variables in the training set and the testing set separately.

So, I will plot boxplots of the variables (relative_humidity, dry_bulb_temp_f, precip, wind_speed, station_pressure) for an initial look at their distributions.

To make this easier, I will convert train_data to a long data format.

library(dplyr)
library(tidyr)
library(ggplot2)

# Select only the columns I want and convert to long format
train_long <- train_data%>%
  select(relative_humidity, dry_bulb_temp_f, precip,wind_speed, station_pressure) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  )

head(train_long)

## # A tibble: 6 × 2
##   variable             value
##   <chr>                <dbl>
## 1 relative_humidity 94      
## 2 dry_bulb_temp_f   73      
## 3 precip             0.00655
## 4 wind_speed         7      
## 5 station_pressure  29.8    
## 6 relative_humidity 76

Create boxplots for all predictors in the training set

ggplot(train_long, aes(y = value)) +
  geom_boxplot(fill = "lightgreen", color = "black") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(
    title = "Boxplots of Training Set Variables",
    y = "Value"
  )

Box plots of Testing Set Predictors

test_long <- test_data%>%
  select(relative_humidity, dry_bulb_temp_f,precip, wind_speed, station_pressure) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  )

head(test_long)

## # A tibble: 6 × 2
##   variable          value
##   <chr>             <dbl>
## 1 relative_humidity  46  
## 2 dry_bulb_temp_f    83  
## 3 precip              0  
## 4 wind_speed         13  
## 5 station_pressure   30.0
## 6 relative_humidity  79

ggplot(test_long, aes(y = value)) +
  geom_boxplot(fill = "lightgreen", color = "black") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(
    title = "Boxplots of Testing Set Variables",
    y = "Value"
  )

TASK 8: Build a Linear Regression Model Using Each Feature

The first model will be built using relative_humidity as the predictor

linear_model1 <- lm(precip~relative_humidity, data=train_data)
summary(linear_model1)

## 
## Call:
## lm(formula = precip ~ relative_humidity, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01616 -0.00802 -0.00366  0.00069  1.17500 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.290e-02  1.877e-03   -6.87 7.26e-12 ***
## relative_humidity  2.906e-04  2.691e-05   10.80  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.036 on 4579 degrees of freedom
## Multiple R-squared:  0.02483,    Adjusted R-squared:  0.02462 
## F-statistic: 116.6 on 1 and 4579 DF,  p-value: < 2.2e-16

The second model will be built using dry_bulb_temp_f as the predictor

linear_model2<-lm(precip~dry_bulb_temp_f,data=train_data)
summary(linear_model2)

## 
## Call:
## lm(formula = precip ~ dry_bulb_temp_f, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.00758 -0.00672 -0.00617 -0.00012  1.18316 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     5.216e-03  1.860e-03   2.804  0.00507 **
## dry_bulb_temp_f 2.388e-05  3.204e-05   0.745  0.45626   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03646 on 4579 degrees of freedom
## Multiple R-squared:  0.0001212,  Adjusted R-squared:  -9.714e-05 
## F-statistic: 0.5551 on 1 and 4579 DF,  p-value: 0.4563

The third model will be built using wind_speed as the predictor

linear_model3<-lm(precip~wind_speed,data=train_data)
summary(linear_model3)

## 
## Call:
## lm(formula = precip ~ wind_speed, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01244 -0.00650 -0.00542  0.00005  1.18053 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 3.532e-03  1.109e-03   3.186  0.00145 **
## wind_speed  2.700e-04  8.697e-05   3.105  0.00192 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03642 on 4579 degrees of freedom
## Multiple R-squared:  0.002101,   Adjusted R-squared:  0.001883 
## F-statistic: 9.639 on 1 and 4579 DF,  p-value: 0.001916

The fourth model will be built using station_pressure as the predictor

linear_model4<-lm(precip~station_pressure,data=train_data)
summary(linear_model4)

## 
## Call:
## lm(formula = precip ~ station_pressure, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02058 -0.00736 -0.00406  0.00004  1.17851 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.501903   0.068231   7.356 2.23e-13 ***
## station_pressure -0.016518   0.002275  -7.260 4.51e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03625 on 4579 degrees of freedom
## Multiple R-squared:  0.01138,    Adjusted R-squared:  0.01116 
## F-statistic: 52.71 on 1 and 4579 DF,  p-value: 4.515e-13

The fifth model will be built using two features, wind_speed and station_pressure

linear_model5<-lm(precip~wind_speed + station_pressure,data=train_data)
summary(linear_model5)

## 
## Call:
## lm(formula = precip ~ wind_speed + station_pressure, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02235 -0.00731 -0.00410  0.00016  1.17758 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.778e-01  7.119e-02   6.712 2.16e-11 ***
## wind_speed        1.068e-04  8.996e-05   1.187    0.235    
## station_pressure -1.575e-02  2.364e-03  -6.663 3.00e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03625 on 4578 degrees of freedom
## Multiple R-squared:  0.01168,    Adjusted R-squared:  0.01125 
## F-statistic: 27.06 on 2 and 4578 DF,  p-value: 2.069e-12

The last model will be built using all the features

multiple_reg_model<-lm(precip~.,data=train_data)
summary(multiple_reg_model)

## 
## Call:
## lm(formula = precip ~ ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03033 -0.00818 -0.00343  0.00120  1.16781 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.019e-01  7.755e-02   2.604  0.00925 ** 
## relative_humidity  3.032e-04  2.930e-05  10.350  < 2e-16 ***
## wind_speed         4.315e-04  9.636e-05   4.478 7.71e-06 ***
## dry_bulb_temp_f   -4.515e-06  3.269e-05  -0.138  0.89015    
## station_pressure  -7.344e-03  2.536e-03  -2.896  0.00380 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03584 on 4576 degrees of freedom
## Multiple R-squared:  0.03431,    Adjusted R-squared:  0.03346 
## F-statistic: 40.64 on 4 and 4576 DF,  p-value: < 2.2e-16
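
Rather than reading the adjusted R-squared off each summary by eye, the values can be collected in one step. The sketch below is a minimal illustration: the built-in `mtcars` data stands in for train_data (an assumption made so the example runs on its own), and the three formulas are arbitrary.

```r
# mtcars stands in for train_data so that the example is self-contained
models <- list(
  wt_only = lm(mpg ~ wt, data = mtcars),
  hp_only = lm(mpg ~ hp, data = mtcars),
  wt_hp   = lm(mpg ~ wt + hp, data = mtcars)
)

# Pull the adjusted R-squared out of each model summary
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 3))
```

With the models from this report, the list would instead hold linear_model1 through linear_model5 and multiple_reg_model.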

TASK 9: Predict Using the Testing Data

For each of the models created, check the model performance using the testing set.

actual_values<-test_data$precip
linear_model1_predictions<-predict(linear_model1,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model1_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual     Predicted      Residual
## 1 0.000000000  0.0004680524 -0.0004680524
## 2 0.000000000  0.0100572526 -0.0100572526
## 3 0.000000000  0.0059891070 -0.0059891070
## 4 0.000000000 -0.0021471841  0.0021471841
## 5 0.006547824  0.0077325980 -0.0011847740
## 6 0.006547824 -0.0024377659  0.0089855898

actual_values<-test_data$precip
linear_model2_predictions<-predict(linear_model2,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model2_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual   Predicted      Residual
## 1 0.000000000 0.007197842 -0.0071978425
## 2 0.000000000 0.006195052 -0.0061950518
## 3 0.000000000 0.005789160 -0.0057891604
## 4 0.000000000 0.006266680 -0.0062666797
## 5 0.006547824 0.006911331 -0.0003635069
## 6 0.006547824 0.006338308  0.0002095163

actual_values<-test_data$precip
linear_model3_predictions<-predict(linear_model3,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model3_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual   Predicted      Residual
## 1 0.000000000 0.007042806 -0.0070428064
## 2 0.000000000 0.005152632 -0.0051526317
## 3 0.000000000 0.006502756 -0.0065027565
## 4 0.000000000 0.005422657 -0.0054226567
## 5 0.006547824 0.005152632  0.0013951922
## 6 0.006547824 0.007042806 -0.0004949825

actual_values<-test_data$precip
linear_model4_predictions<-predict(linear_model4,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model4_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual   Predicted      Residual
## 1 0.000000000 0.006536439 -0.0065364392
## 2 0.000000000 0.007692680 -0.0076926804
## 3 0.000000000 0.000590056 -0.0005900560
## 4 0.000000000 0.002241829 -0.0022418291
## 5 0.006547824 0.007362326 -0.0008145018
## 6 0.006547824 0.008848922 -0.0023010976

actual_values<-test_data$precip
linear_model5_predictions<-predict(linear_model5,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model5_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual    Predicted      Residual
## 1 0.000000000 0.0067343816 -0.0067343816
## 2 0.000000000 0.0070898301 -0.0070898301
## 3 0.000000000 0.0008495988 -0.0008495988
## 4 0.000000000 0.0019979238 -0.0019979238
## 5 0.006547824 0.0067747596 -0.0002269357
## 6 0.006547824 0.0089398751 -0.0023920511

actual_values<-test_data$precip
multiple_reg_predictions<-predict(multiple_reg_model,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=multiple_reg_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)

##        Actual     Predicted      Residual
## 1 0.000000000  0.0008754209 -0.0008754209
## 2 0.000000000  0.0085656690 -0.0085656690
## 3 0.000000000  0.0033967785 -0.0033967785
## 4 0.000000000 -0.0061761547  0.0061761547
## 5 0.006547824  0.0058573531  0.0006904708
## 6 0.006547824 -0.0009663807  0.0075142047
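
The same four lines are repeated for every model above, so they could be wrapped in a function. The sketch below is one possible way to do that. `residual_table` is a hypothetical helper, and the built-in `mtcars` data stands in for train_data and test_data so that the example runs on its own.

```r
# `residual_table` is a hypothetical helper that wraps the repeated
# predict / data.frame / residual steps.
residual_table <- function(model, data, target) {
  predicted <- unname(predict(model, newdata = data))
  data.frame(
    Actual    = data[[target]],
    Predicted = predicted,
    Residual  = data[[target]] - predicted
  )
}

# Self-contained demonstration on mtcars (standing in for the weather data)
train_rows <- 1:24
demo_train <- mtcars[train_rows, ]
demo_test  <- mtcars[-train_rows, ]

demo_model   <- lm(mpg ~ wt, data = demo_train)
demo_results <- residual_table(demo_model, demo_test, target = "mpg")
print(head(demo_results))
```

With the objects from this report, the call would look like `residual_table(linear_model1, test_data, target = "precip")`.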

TASK 9: Model Evaluation using RMSE to get the best model

Let’s evaluate all the models, compare and get the best model.

RMSE stands for Root Mean Square Error, a metric widely used for evaluating the performance of regression models. It is the square root of the average squared difference between the values predicted by a model and the actual values, and it is expressed in the same unit as the target variable.
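
To make the definition concrete, RMSE can be written out by hand in base R. The sketch below is only an illustration: `manual_rmse` is a hypothetical helper name, and the two vectors are made-up values, not predictions from the models in this report.

```r
# RMSE written out by hand: square the errors, average them,
# then take the square root.
manual_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values, in inches of precipitation
actual    <- c(0.00, 0.00, 0.10, 0.02)
predicted <- c(0.01, 0.00, 0.06, 0.03)

rmse_demo <- manual_rmse(actual, predicted)
print(rmse_demo)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.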

library(Metrics)

## 
## Attaching package: 'Metrics'

## The following object is masked from 'package:rlang':
## 
##     ll

## The following objects are masked from 'package:yardstick':
## 
##     accuracy, mae, mape, mase, precision, recall, rmse, smape

class(actual_values)

## [1] "numeric"

class(multiple_reg_predictions)

## [1] "numeric"

rmse_value1_test<-Metrics::rmse(actual_values,linear_model1_predictions)
rmse_value1_test

## [1] 0.0285869

rmse_value2_test<-Metrics::rmse(actual_values,linear_model2_predictions)
rmse_value2_test

## [1] 0.02922909

rmse_value3_test<-Metrics::rmse(actual_values,linear_model3_predictions)
rmse_value3_test

## [1] 0.02899481

rmse_value4_test<-Metrics::rmse(actual_values,linear_model4_predictions)
rmse_value4_test

## [1] 0.0288756

rmse_value5_test<-Metrics::rmse(actual_values,linear_model5_predictions)
rmse_value5_test

## [1] 0.028801

rmse_value6_test<-Metrics::rmse(actual_values,multiple_reg_predictions)
rmse_value6_test

## [1] 0.02802316

TASK 10: Use a table to compare the evaluation metrics

model_names <- c("linear_model1","linear_model2","linear_model3","linear_model4","linear_model5","multiple_reg_model")
# Use the computed values directly so that the RMSE column stays numeric
RMSE <- c(rmse_value1_test, rmse_value2_test, rmse_value3_test,
          rmse_value4_test, rmse_value5_test, rmse_value6_test)
comparison_df <- data.frame(model_names, RMSE)
comparison_df

##          model_names       RMSE
## 1      linear_model1 0.02858690
## 2      linear_model2 0.02922909
## 3      linear_model3 0.02899481
## 4      linear_model4 0.02887560
## 5      linear_model5 0.02880100
## 6 multiple_reg_model 0.02802316

From the table above, multiple_reg_model has the lowest RMSE. The lower the RMSE, the more accurate the model is, so multiple_reg_model is the best model in this project.
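
The best model can also be picked programmatically instead of by reading the table. The sketch below rebuilds the table with the RMSE values copied from the output above and sorts it. The object names `comparison`, `ranked` and `best_model` are introduced here for illustration only.

```r
# RMSE values copied from the comparison table above
comparison <- data.frame(
  model_names = c("linear_model1", "linear_model2", "linear_model3",
                  "linear_model4", "linear_model5", "multiple_reg_model"),
  RMSE = c(0.0285869, 0.02922909, 0.02899481,
           0.0288756, 0.028801, 0.02802316)
)

# Sort from lowest to highest RMSE and take the first row
ranked     <- comparison[order(comparison$RMSE), ]
best_model <- as.character(ranked$model_names[1])

print(ranked)
print(best_model)
```

Sorting only works as intended when the RMSE column is numeric rather than text.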

library(gt)
my_df<-comparison_df%>% gt() %>% tab_header(title="Model Performance Metrics")

gtsave(
  data = my_df,
  filename = "model_performance_table.png")

## file:////var/folders/ln/45c0bd2s61vbgrr4yjcb6cf00000gn/T//Rtmp7SzYL8/file11a46cf15f3e.html screenshot completed

Save the best model as an RDS file.

saveRDS(multiple_reg_model,file="my_multiple_linear_regression_model.rds")
list.files()

##  ... (unrelated files in the working directory omitted) ...
##  [96] "model_performance_table.png"
##  ... (unrelated files in the working directory omitted) ...
## [101] "my_multiple_linear_regression_model.rds"
##  ... (unrelated files in the working directory omitted) ...
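A saved RDS file can be read back with readRDS() in a later session. The sketch below illustrates the round trip with a stand-in lm() fit on the built-in mtcars data and a temporary file. Both are assumptions made so the example is self-contained. With the project's model, the path would be "my_multiple_linear_regression_model.rds".

```r
# Stand-in model: a small linear regression on a built-in dataset
stand_in_model <- lm(mpg ~ wt + hp, data = mtcars)

# Save to a temporary .rds file, then restore it into a new object
rds_path <- tempfile(fileext = ".rds")
saveRDS(stand_in_model, file = rds_path)
restored_model <- readRDS(rds_path)

# The restored model should give the same predictions as the original
new_cars <- data.frame(wt = c(2.5, 3.0), hp = c(100, 150))
original_pred <- predict(stand_in_model, newdata = new_cars)
restored_pred <- predict(restored_model, newdata = new_cars)
print(all.equal(original_pred, restored_pred))
```

Unlike save(), saveRDS() stores a single object without its name, so the restored model can be assigned to any variable.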

Now we deploy our model using a Shiny app.
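A minimal sketch of what such an app could look like is given below. It is not the project's actual dashboard. The predictor names (`relative_humidity`, `dry_bulb_temp_f`, `wind_speed`, `station_pressure`), the default input values, and the `predict_precip()` helper are assumptions and would need to match the variables the saved model was fitted on. The helper handles both a parsnip fit and a plain lm() fit, since the two use different argument names in predict().

```r
library(shiny)

# Path written by saveRDS() earlier in this document
model_path <- "my_multiple_linear_regression_model.rds"

# Hypothetical helper: predict with a parsnip fit (new_data) or a base lm fit (newdata)
predict_precip <- function(model, new_obs) {
  if (inherits(model, "model_fit")) {
    requireNamespace("parsnip")  # loads parsnip so predict() dispatches on the fit
    predict(model, new_data = new_obs)
  } else {
    predict(model, newdata = new_obs)
  }
}

ui <- fluidPage(
  titlePanel("Hourly precipitation prediction (JFK Airport)"),
  sidebarLayout(
    sidebarPanel(
      # Input ids are assumed predictor names; match them to the fitted model
      numericInput("relative_humidity", "Relative humidity (%)",
                   value = 60, min = 0, max = 100),
      numericInput("dry_bulb_temp_f", "Dry bulb temperature (F)", value = 55),
      numericInput("wind_speed", "Wind speed", value = 10, min = 0),
      numericInput("station_pressure", "Station pressure", value = 30)
    ),
    mainPanel(verbatimTextOutput("prediction"))
  )
)

server <- function(input, output, session) {
  # Load the saved model once per session
  model <- readRDS(model_path)

  output$prediction <- renderPrint({
    # Build a one-row data frame from the current inputs
    new_obs <- data.frame(
      relative_humidity = input$relative_humidity,
      dry_bulb_temp_f   = input$dry_bulb_temp_f,
      wind_speed        = input$wind_speed,
      station_pressure  = input$station_pressure
    )
    predict_precip(model, new_obs)
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch only in an interactive session
if (interactive()) runApp(app)
```

Saving this as app.R in the same folder as the .rds file and running it interactively should open the app in a browser, provided the input names are adjusted to the model's actual predictors.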

Abiola Toluwani

2025-11-13