This project relates to the NOAA Weather Dataset - JFK Airport (New York). The end goal of this project is to predict the precipitation using some of the available features.
We install the needed packages and import the necessary libraries: “tidymodels” and “tidyverse.
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.5.2
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.9 ✔ recipes 1.3.1
## ✔ dials 1.4.2 ✔ rsample 1.3.1
## ✔ dplyr 1.1.4 ✔ tailor 0.1.0
## ✔ ggplot2 4.0.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ purrr 1.1.0 ✔ yardstick 1.3.2
## Warning: package 'dials' was built under R version 4.5.2
## Warning: package 'infer' was built under R version 4.5.2
## Warning: package 'modeldata' was built under R version 4.5.2
## Warning: package 'parsnip' was built under R version 4.5.2
## Warning: package 'rsample' was built under R version 4.5.2
## Warning: package 'tailor' was built under R version 4.5.2
## Warning: package 'tune' was built under R version 4.5.2
## Warning: package 'workflows' was built under R version 4.5.2
## Warning: package 'workflowsets' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Download NOAA Weather Dataset
#we use the dowmload.file() function to download the file from the web; we pass in the web address of the file, and specify the destfile;the name where the downloaded file is saved.
download.file('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ENn4iRKnW2szuR-zPKslwg/noaa-weather-sample-data-tar.gz', destfile = "noaa-weather-sample-data-tar.gz")
The downloaded file is a bundle of multiple files; since we wish to work with the individual files inside, we need to untar before proceeding to extract its contents
#We use the untar() function, which takes in the same string in destfile
untar("noaa-weather-sample-data-tar.gz", list = T)
## [1] "._noaa-weather-sample-data"
## [2] "noaa-weather-sample-data/"
## [3] "noaa-weather-sample-data/jfk_weather_sample.csv"
## [4] "noaa-weather-sample-data/._README.txt"
## [5] "noaa-weather-sample-data/README.txt"
## [6] "noaa-weather-sample-data/._LICENSE.txt"
## [7] "noaa-weather-sample-data/LICENSE.txt"
#the "list=T" shows the file paths of files we have, from which we choose the file path we need.
Extract and Read into Project We want to read in the downloaded file. From the list of filenames we have, we already know the one that contains the data we need for this project; we just copy and paste it.
weather_dataset <- read.csv("noaa-weather-sample-data/jfk_weather_sample.csv")
#we check the first few rows to see what the data set contains
head(weather_dataset)
## DATE HOURLYDewPointTempF HOURLYRelativeHumidity
## 1 2015-07-25T13:51:00Z 60 46
## 2 2016-11-18T23:51:00Z 34 48
## 3 2013-01-06T08:51:00Z 33 89
## 4 2011-01-27T16:51:00Z 18 48
## 5 2015-01-03T12:16:00Z 27 61
## 6 2013-02-15T20:51:00Z 35 79
## HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1 83 68 0.00 13
## 2 53 44 0.00 6
## 3 36 35 0.00 13
## 4 36 30 0.00 14
## 5 39 34 T 11
## 6 41 38 0.00 6
## HOURLYSeaLevelPressure HOURLYStationPressure
## 1 30.01 29.99
## 2 30.05 30.03
## 3 30.14 30.12
## 4 29.82 29.80
## 5 NA 30.50
## 6 29.94 29.92
#we take a glimpse of the data set to see the different column data
glimpse(weather_dataset)
## Rows: 5,727
## Columns: 9
## $ DATE <chr> "2015-07-25T13:51:00Z", "2016-11-18T23:51:00Z",…
## $ HOURLYDewPointTempF <chr> "60", "34", "33", "18", "27", "35", "4", "14", …
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYWETBULBTEMPF <int> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
## $ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
## $ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
The key columns we need for this project are: HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip, HOURLYWindSpeed, HOURLYStationPressure. We extract them and store them.
#we select the 5 columns and store it as a new object
mod_weather <- weather_dataset %>%
select(HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip, HOURLYWindSpeed, HOURLYStationPressure)
#we want to see the first 10 rows of this new dataframe.
head(mod_weather, 10)
## HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1 46 83 0.00 13
## 2 48 53 0.00 6
## 3 89 36 0.00 13
## 4 48 36 0.00 14
## 5 61 39 T 11
## 6 79 41 0.00 6
## 7 51 19 0.00 0
## 8 65 24 0.00 11
## 9 90 54 0.06 11
## 10 94 73 <NA> 5
## HOURLYStationPressure
## 1 29.99
## 2 30.03
## 3 30.12
## 4 29.80
## 5 30.50
## 6 29.92
## 7 30.40
## 8 30.35
## 9 30.03
## 10 29.91
We notice that the precipitation column have some irregular values; we try to handle it.
#we check the unique values in HOURLYPrecip
unique(mod_weather$HOURLYPrecip)
## [1] "0.00" "T" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07"
## [10] "0.16" "0.09" "0.22" "0.02s" "0.24" "0.18" "0.05" "0.04" "0.09s"
## [19] "0.11" "0.14" "0.25" "0.10" "0.01s" "0.58" "0.12" "0.13" "0.46"
## [28] "1.07" "1.19" "0.34" "0.20" "0.36s" "0.42" "0.17" "0.27" "0.35"
## [37] "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19" "0.36" "0.03s"
## [46] "0.07s" "0.54" "0.59" "0.21"
#we see that there are NA and T values in HOURLYPrecip column, and values ending with "s".
#we replace all the T values with "0.0"
mod_weather$HOURLYPrecip[mod_weather$HOURLYPrecip == "T"] <- 0.0
#we remove the character "s" using the str_remove() function
head(str_remove(mod_weather$HOURLYPrecip, pattern="s$")) #pattern="s$" tells R to match to the end of values.
## [1] "0.00" "0.00" "0.00" "0.00" "0" "0.00"
We convert the data types to appropriate ones; first, we check the types of the columns:
glimpse(mod_weather) #we have 3 integer columns, a character column, and a double column
## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "0", "0.00", "0…
## $ HOURLYWindSpeed <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
#we want to have the columns as numerical data types
mod_weather <- as.data.frame(sapply(mod_weather, as.numeric))
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
str(mod_weather)
## 'data.frame': 5727 obs. of 5 variables:
## $ HOURLYRelativeHumidity: num 46 48 89 48 61 79 51 65 90 94 ...
## $ HOURLYDRYBULBTEMPF : num 83 53 36 36 39 41 19 24 54 73 ...
## $ HOURLYPrecip : num 0 0 0 0 0 0 0 0 0.06 NA ...
## $ HOURLYWindSpeed : num 13 6 13 14 11 6 0 11 11 5 ...
## $ HOURLYStationPressure : num 30 30 30.1 29.8 30.5 ...
#We proceed to rename the columns using the rename(dataframe, newname1=oldname1, newname1=oldname1,for each column) function
mod_weather_renamed <- mod_weather %>%
rename(relative_humidity =HOURLYRelativeHumidity,
dry_bulb_temp_f = HOURLYDRYBULBTEMPF,
precip = HOURLYPrecip,
wind_speed= HOURLYWindSpeed, station_pressure = HOURLYStationPressure)
We want to explore the data (EDA).
set.seed(1234) #we set seed to have consistent result such that whenever we split, we get the exact same training and testing data set
mod_weather_split <- initial_split(mod_weather_renamed, prop=0.8) #we use initial_split to split a data set into tes and train; prop=0.8 tells R to make the split 80:20, 80 for train, 20 for test
#we use the training set to plot histogram of the variables, using ggplot, to see their initial distribution
#we use the training() function to access the training data set.
mod_weather_split_train <- training(mod_weather_split)
#and the testing function to assess the test data
mod_weather_split_test <- testing(mod_weather_split)
#for the histogram;
mod_weather_split_train %>% pivot_longer(cols= everything(),
#this tells R to select all the columns in the data set instead of plotting each variable one-by-one; we use pivot_longer which takes all the selected columns and stacks them into two new columns; variable name, values.
names_to = "variable_name",
#this creates a new column for variable names
values_to = "value") %>%
#this creates a new column for values of each variables
ggplot(aes(x=value)) +
#we are plotting each variables against the value, since we're using histogram, we don't have to specify the y-axis
geom_histogram(bins=30, color="white", fill="blue") +
#the bin takes all the data points and divides them into 30 equal ranges; the fill fills the bar with blue; the color tapes the bar with white
facet_wrap(~variable_name, scales="free") +
#this function creates an histogram for each variable; scales="free" tells R to use appropriate x and y axis limit, and not generalize.
labs(title = "Initial Distribution of All Variables in Training Set") +
theme_minimal()
## Warning: Removed 1827 rows containing non-finite outside the scale range
## (`stat_bin()`).
#lets the plot look less busy and more modern
We create 4 simple linear regression models where precip is the response, since that’s what we want to predict. We use the lm() function to create a linear model; we pass in the target variable, predictor variable and the training data.
model1 <- lm(precip ~ relative_humidity, mod_weather_split_train)
#we use the summary function to see the intercept, slope F-test, and p-value
summary(model1)
##
## Call:
## lm(formula = precip ~ relative_humidity, data = mod_weather_split_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01997 -0.01081 -0.00484 0.00192 1.17163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.983e-02 2.576e-03 -7.70 1.79e-14 ***
## relative_humidity 3.980e-04 3.746e-05 10.62 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04225 on 3220 degrees of freedom
## (1359 observations deleted due to missingness)
## Multiple R-squared: 0.03387, Adjusted R-squared: 0.03357
## F-statistic: 112.9 on 1 and 3220 DF, p-value: < 2.2e-16
We visualize the models using scatter plot. For humidity:
mod_weather_split_train %>% ggplot(aes(relative_humidity, precip)) + #we map the relative_humidity column to the x-axis; the precip (precipitation) column to the y-axis
geom_point() +
#this tells R to draw a scatter plot
geom_smooth(method = "lm", na.rm = T)+ #This draws a straight line of best fit, ignoring missing (NA) values during calculation
labs(title = "Relative Humidity vs Precipitation")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1359 rows containing missing values or values outside the scale range
## (`geom_point()`).
The “Relative Humidity vs Precipitation” plot shows a large cluster of points along the bottom of the graph(precip=0);
For dry_bulb_temp_f:
model2 <- lm(precip ~ dry_bulb_temp_f, mod_weather_split_train)
summary(model2)
##
## Call:
## lm(formula = precip ~ dry_bulb_temp_f, data = mod_weather_split_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.00782 -0.00682 -0.00628 -0.00578 1.18322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.512e-03 2.557e-03 1.765 0.0777 .
## dry_bulb_temp_f 3.341e-05 4.401e-05 0.759 0.4479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04298 on 3220 degrees of freedom
## (1359 observations deleted due to missingness)
## Multiple R-squared: 0.0001789, Adjusted R-squared: -0.0001316
## F-statistic: 0.5762 on 1 and 3220 DF, p-value: 0.4479
#we visualize the model
mod_weather_split_train %>% ggplot(aes(dry_bulb_temp_f, precip)) + #we map the dry bulb temp column to the x-axis; the precip (precipitation) column to the y-axis
geom_point() + #this tells R to draw a scatter plot
geom_smooth(method = "lm", na.rm = T) + #This draws a straight line of best fit, ignoring missing (NA) values during calculation
labs(title = "Dry Bulb Temp vs Precipitation")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1359 rows containing missing values or values outside the scale range
## (`geom_point()`).
The “Dry Bulb Temp vs Precipitation” plot:
For wind speed:
model3 <- lm(precip ~ wind_speed, mod_weather_split_train)
summary(model3)
##
## Call:
## lm(formula = precip ~ wind_speed, data = mod_weather_split_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01378 -0.00730 -0.00560 -0.00458 1.17997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0025309 0.0015688 1.613 0.1068
## wind_speed 0.0003408 0.0001222 2.790 0.0053 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04293 on 3220 degrees of freedom
## (1359 observations deleted due to missingness)
## Multiple R-squared: 0.002412, Adjusted R-squared: 0.002102
## F-statistic: 7.785 on 1 and 3220 DF, p-value: 0.005301
#we visualize the model
mod_weather_split_train %>% ggplot(aes(wind_speed, precip)) +
#we map the wind speed column to the x-axis; the precip column to the y-axis
geom_point() +
geom_smooth(method = "lm", na.rm = T) +
labs(title = "Wind Speed vs Precipitation")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1359 rows containing missing values or values outside the scale range
## (`geom_point()`).
The “Wind Speed vs Precipitation” plot shows that:
For station pressure:
model4 <- lm(precip ~ station_pressure, mod_weather_split_train)
summary(model4)
##
## Call:
## lm(formula = precip ~ station_pressure, data = mod_weather_split_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02503 -0.00887 -0.00559 -0.00166 1.17698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66173 0.09420 7.025 2.60e-12 ***
## station_pressure -0.02185 0.00314 -6.958 4.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04269 on 3216 degrees of freedom
## (1363 observations deleted due to missingness)
## Multiple R-squared: 0.01483, Adjusted R-squared: 0.01452
## F-statistic: 48.41 on 1 and 3216 DF, p-value: 4.166e-12
mod_weather_split_train %>% ggplot(aes(station_pressure, precip)) +
geom_point() +
geom_smooth(method = "lm", na.rm = T) + #
labs(title = "Station Pressure vs Precipitation")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1363 rows containing missing values or values outside the scale range
## (`geom_point()`).
The “Station Pressure vs Precipitation” plot shows:
We improve the model We start with the multiple linear regression model
multiplemodel <- lm(precip ~ station_pressure + relative_humidity + dry_bulb_temp_f + wind_speed, data=mod_weather_split_train)
summary(multiplemodel)
##
## Call:
## lm(formula = precip ~ station_pressure + relative_humidity +
## dry_bulb_temp_f + wind_speed, data = mod_weather_split_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03528 -0.01064 -0.00436 0.00214 1.16227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.611e-01 1.074e-01 2.431 0.01510 *
## station_pressure -9.598e-03 3.509e-03 -2.735 0.00627 **
## relative_humidity 4.135e-04 4.093e-05 10.102 < 2e-16 ***
## dry_bulb_temp_f -6.866e-06 4.495e-05 -0.153 0.87861
## wind_speed 5.618e-04 1.358e-04 4.138 3.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04204 on 3213 degrees of freedom
## (1363 observations deleted due to missingness)
## Multiple R-squared: 0.04543, Adjusted R-squared: 0.04424
## F-statistic: 38.23 on 4 and 3213 DF, p-value: < 2.2e-16
Using the polynomial regression model
#the inital plot of temp showed a curvy relationship which wasn't captured by the linear model; we use the polynomial model of degree 2 to achieve that; but we filter out the NAs first because ploy() function cannot process data with missing values
clean_data <- mod_weather_split_train %>%
select(precip, dry_bulb_temp_f) %>%
na.omit()
polymodel <- lm(precip ~ poly(dry_bulb_temp_f, 2), data=clean_data)
summary(polymodel) #the model didn't perform well, the R^2 = 0.001006 showed a really low predictive power
##
## Call:
## lm(formula = precip ~ poly(dry_bulb_temp_f, 2), data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.00790 -0.00753 -0.00667 -0.00508 1.18267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0063656 0.0007568 8.411 <2e-16 ***
## poly(dry_bulb_temp_f, 2)1 0.0326260 0.0429571 0.760 0.4476
## poly(dry_bulb_temp_f, 2)2 -0.0928076 0.0429571 -2.160 0.0308 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04296 on 3219 degrees of freedom
## Multiple R-squared: 0.001627, Adjusted R-squared: 0.001006
## F-statistic: 2.622 on 2 and 3219 DF, p-value: 0.0728
We build a combined model, we combine temperature with other crucial variables; humidity, wind speed, and station pressure
#we clean the variables
clean_data <- mod_weather_split_train %>%
select(precip, dry_bulb_temp_f, wind_speed, station_pressure, relative_humidity) %>%
na.omit()
#we build a combined model
combined_model <- lm(precip ~ relative_humidity + poly(dry_bulb_temp_f, 2) + wind_speed +
station_pressure, data = clean_data)
summary(combined_model)
##
## Call:
## lm(formula = precip ~ relative_humidity + poly(dry_bulb_temp_f,
## 2) + wind_speed + station_pressure, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03528 -0.01061 -0.00433 0.00216 1.16228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.609e-01 1.068e-01 2.443 0.01464 *
## relative_humidity 4.147e-04 4.188e-05 9.902 < 2e-16 ***
## poly(dry_bulb_temp_f, 2)1 -6.914e-03 4.390e-02 -0.158 0.87486
## poly(dry_bulb_temp_f, 2)2 6.009e-03 4.347e-02 0.138 0.89007
## wind_speed 5.606e-04 1.360e-04 4.121 3.86e-05 ***
## station_pressure -9.605e-03 3.510e-03 -2.737 0.00624 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04204 on 3212 degrees of freedom
## Multiple R-squared: 0.04543, Adjusted R-squared: 0.04395
## F-statistic: 30.58 on 5 and 3212 DF, p-value: < 2.2e-16
The R-squared of the combined model is 0.04395, which shows that, based on the model, ~4.4% of the hourly change in precipitation can be attributed to the combination of relative humidity, temperature, wind speed, and station pressure. It also suggests that there are crucial factors that make up the remaining 95% missing from the analysis.
Evaluating the model We use the model to make predictions on the test data
clean_test <- mod_weather_split_test %>%
na.omit()
test_result <- clean_test %>% #using the cleaned test data
mutate(
#this creates a prediction column to the test data
predictions = predict(combined_model, newdata = clean_test))
head(test_result)
## relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
## 1 46 83 0 13 29.99
## 2 79 41 0 6 29.92
## 3 65 24 0 11 30.35
## 4 37 44 0 7 30.25
## 7 86 62 0 10 29.97
## 8 37 25 0 23 29.62
## predictions
## 1 -0.0008384586
## 2 0.0096764082
## 3 0.0029081417
## 4 -0.0103948128
## 7 0.0141550740
## 8 0.0050088617
We evaluate the metrics;
The RMSE tells us how far off the predictions are on average. A model with the lowest RMSE is the best.
rmsevalue <- test_result %>% rmse(truth=precip, estimate = predictions)
rmsevalue
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0324
We use the first SLR on relative humidity on the test data;
model.humidity <- clean_test %>% #we add a prediction column to the test data
mutate(predictions = predict(model1, newdata = clean_test))
# rmse value of relative humidity
rmsevalue <- model.humidity %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0333
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0333
We use the first SLR on dry bulb temperature on the test data;
model.temp <- clean_test %>%
mutate(predictions = predict(model2, newdata = clean_test))
# rmse value of dry bulb temperature
rmsevalue <- model.temp %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0343
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0343
We use the first SLR on wind speed on the test data;
model.wind <- clean_test %>%
mutate(predictions = predict(model3, newdata = clean_test))
# rmse value of wind speed
rmsevalue <- model.wind %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0339
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0339
We use the first SLR on station pressure on the test data;
model.pressure <- clean_test %>%
mutate(predictions = predict(model4, newdata = clean_test))
# rmse value of station pressure
rmsevalue <- model.pressure %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0337
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0337
Testing with MLR
test_MLR <- clean_test %>%
mutate(
# Use the trained model to predict values for the test data
predictions = predict(multiplemodel, newdata = clean_test))
#We check the rmse value
rmsevalue <- test_MLR %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0324
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0324
Testing with the polynomial regression model
test_poly <- clean_test %>%
mutate(
# Use the trained model to predict values for the test data
predictions = predict(polymodel, newdata = clean_test))
#We check the rmse value
rmsevalue <- test_poly %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0342
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0342
Based on the Root Mean Squared Error evaluated on the unseen test data;