2nd edX Project

This project relates to the NOAA Weather Dataset - JFK Airport (New York). The end goal of this project is to predict the precipitation using some of the available features.

The key columns that we will explore in this project are:

‘HOURLYRelativeHumidity’ is the relative humidity given to the nearest whole percentage.
‘HOURLYDRYBULBTEMPF’ is commonly used as the standard air temperature reported. It is given here in whole degrees Fahrenheit.
‘HOURLYPrecip’ is the amount of precipitation in inches to hundredths over the past hour. A “T” indicates a trace amount of precipitation.
‘HOURLYWindSpeed’ is the speed of the wind at the time of observation given in miles per hour (mph).
‘HOURLYStationPressure’ is the atmospheric pressure observed at the station during the time of observation. Given in inches of Mercury (in Hg).

We install the needed packages and import the necessary libraries: “tidymodels” and “tidyverse.

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.5.2

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──

## ✔ broom        1.0.9     ✔ recipes      1.3.1
## ✔ dials        1.4.2     ✔ rsample      1.3.1
## ✔ dplyr        1.1.4     ✔ tailor       0.1.0
## ✔ ggplot2      4.0.0     ✔ tidyr        1.3.1
## ✔ infer        1.0.9     ✔ tune         2.0.1
## ✔ modeldata    1.5.1     ✔ workflows    1.3.0
## ✔ parsnip      1.3.3     ✔ workflowsets 1.1.1
## ✔ purrr        1.1.0     ✔ yardstick    1.3.2

## Warning: package 'dials' was built under R version 4.5.2

## Warning: package 'infer' was built under R version 4.5.2

## Warning: package 'modeldata' was built under R version 4.5.2

## Warning: package 'parsnip' was built under R version 4.5.2

## Warning: package 'rsample' was built under R version 4.5.2

## Warning: package 'tailor' was built under R version 4.5.2

## Warning: package 'tune' was built under R version 4.5.2

## Warning: package 'workflows' was built under R version 4.5.2

## Warning: package 'workflowsets' was built under R version 4.5.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Download NOAA Weather Dataset

#we use the dowmload.file() function to download the file from the web; we pass in the web address of the file, and specify the destfile;the name where the downloaded file is saved. 
download.file('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ENn4iRKnW2szuR-zPKslwg/noaa-weather-sample-data-tar.gz', destfile = "noaa-weather-sample-data-tar.gz")

The downloaded file is a bundle of multiple files; since we wish to work with the individual files inside, we need to untar before proceeding to extract its contents

#We use the untar() function, which takes in the same string in destfile
untar("noaa-weather-sample-data-tar.gz", list = T)

## [1] "._noaa-weather-sample-data"                     
## [2] "noaa-weather-sample-data/"                      
## [3] "noaa-weather-sample-data/jfk_weather_sample.csv"
## [4] "noaa-weather-sample-data/._README.txt"          
## [5] "noaa-weather-sample-data/README.txt"            
## [6] "noaa-weather-sample-data/._LICENSE.txt"         
## [7] "noaa-weather-sample-data/LICENSE.txt"

#the "list=T" shows the file paths of files we have, from which we choose the file path we need.

Extract and Read into Project We want to read in the downloaded file. From the list of filenames we have, we already know the one that contains the data we need for this project; we just copy and paste it.

weather_dataset <- read.csv("noaa-weather-sample-data/jfk_weather_sample.csv")
#we check the first few rows to see what the data set contains
head(weather_dataset)

##                   DATE HOURLYDewPointTempF HOURLYRelativeHumidity
## 1 2015-07-25T13:51:00Z                  60                     46
## 2 2016-11-18T23:51:00Z                  34                     48
## 3 2013-01-06T08:51:00Z                  33                     89
## 4 2011-01-27T16:51:00Z                  18                     48
## 5 2015-01-03T12:16:00Z                  27                     61
## 6 2013-02-15T20:51:00Z                  35                     79
##   HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1                 83                 68         0.00              13
## 2                 53                 44         0.00               6
## 3                 36                 35         0.00              13
## 4                 36                 30         0.00              14
## 5                 39                 34            T              11
## 6                 41                 38         0.00               6
##   HOURLYSeaLevelPressure HOURLYStationPressure
## 1                  30.01                 29.99
## 2                  30.05                 30.03
## 3                  30.14                 30.12
## 4                  29.82                 29.80
## 5                     NA                 30.50
## 6                  29.94                 29.92

#we take a glimpse of the data set to see the different column data
glimpse(weather_dataset)

## Rows: 5,727
## Columns: 9
## $ DATE                   <chr> "2015-07-25T13:51:00Z", "2016-11-18T23:51:00Z",…
## $ HOURLYDewPointTempF    <chr> "60", "34", "33", "18", "27", "35", "4", "14", …
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYWETBULBTEMPF     <int> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed        <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

The key columns we need for this project are: HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip, HOURLYWindSpeed, HOURLYStationPressure. We extract them and store them.

#we select the 5 columns and store it as a new object
mod_weather <- weather_dataset %>%
  select(HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip, HOURLYWindSpeed, HOURLYStationPressure)
#we want to see  the first 10 rows of this new dataframe.
head(mod_weather, 10)

##    HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1                      46                 83         0.00              13
## 2                      48                 53         0.00               6
## 3                      89                 36         0.00              13
## 4                      48                 36         0.00              14
## 5                      61                 39            T              11
## 6                      79                 41         0.00               6
## 7                      51                 19         0.00               0
## 8                      65                 24         0.00              11
## 9                      90                 54         0.06              11
## 10                     94                 73         <NA>               5
##    HOURLYStationPressure
## 1                  29.99
## 2                  30.03
## 3                  30.12
## 4                  29.80
## 5                  30.50
## 6                  29.92
## 7                  30.40
## 8                  30.35
## 9                  30.03
## 10                 29.91

We notice that the precipitation column have some irregular values; we try to handle it.

#we check the unique values in HOURLYPrecip
unique(mod_weather$HOURLYPrecip)

##  [1] "0.00"  "T"     "0.06"  NA      "0.03"  "0.02"  "0.08"  "0.01"  "0.07" 
## [10] "0.16"  "0.09"  "0.22"  "0.02s" "0.24"  "0.18"  "0.05"  "0.04"  "0.09s"
## [19] "0.11"  "0.14"  "0.25"  "0.10"  "0.01s" "0.58"  "0.12"  "0.13"  "0.46" 
## [28] "1.07"  "1.19"  "0.34"  "0.20"  "0.36s" "0.42"  "0.17"  "0.27"  "0.35" 
## [37] "0.31"  "0.33"  "0.23"  "0.26"  "0.28"  "0.75"  "0.19"  "0.36"  "0.03s"
## [46] "0.07s" "0.54"  "0.59"  "0.21"

#we see that there are NA and T values in HOURLYPrecip column, and values ending with "s".
#we replace all the T values with "0.0" 
mod_weather$HOURLYPrecip[mod_weather$HOURLYPrecip == "T"] <- 0.0 
#we remove the character "s" using the str_remove() function
head(str_remove(mod_weather$HOURLYPrecip, pattern="s$")) #pattern="s$" tells R to match to the end of values.

## [1] "0.00" "0.00" "0.00" "0.00" "0"    "0.00"

We convert the data types to appropriate ones; first, we check the types of the columns:

glimpse(mod_weather) #we have 3 integer columns, a character column, and a double column

## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "0", "0.00", "0…
## $ HOURLYWindSpeed        <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

#we want to have the columns as numerical data types
mod_weather <- as.data.frame(sapply(mod_weather, as.numeric))

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

str(mod_weather)

## 'data.frame':    5727 obs. of  5 variables:
##  $ HOURLYRelativeHumidity: num  46 48 89 48 61 79 51 65 90 94 ...
##  $ HOURLYDRYBULBTEMPF    : num  83 53 36 36 39 41 19 24 54 73 ...
##  $ HOURLYPrecip          : num  0 0 0 0 0 0 0 0 0.06 NA ...
##  $ HOURLYWindSpeed       : num  13 6 13 14 11 6 0 11 11 5 ...
##  $ HOURLYStationPressure : num  30 30 30.1 29.8 30.5 ...

#We proceed to rename the columns using the rename(dataframe, newname1=oldname1, newname1=oldname1,for each column) function
mod_weather_renamed <- mod_weather %>%
                rename(relative_humidity  =HOURLYRelativeHumidity,
                       dry_bulb_temp_f = HOURLYDRYBULBTEMPF,
                       precip = HOURLYPrecip, 
                       wind_speed= HOURLYWindSpeed,                                       station_pressure = HOURLYStationPressure)

We want to explore the data (EDA).

set.seed(1234) #we set seed to have consistent result such that whenever we split, we get the exact same training and testing data set

mod_weather_split <- initial_split(mod_weather_renamed, prop=0.8) #we use initial_split to split a data set into tes and train; prop=0.8 tells R to make the split 80:20, 80 for train, 20 for test
#we use the training set to plot histogram of the variables, using ggplot, to see their initial distribution

#we use the training() function to access the training data set.
mod_weather_split_train <- training(mod_weather_split)
#and the testing function to assess the test data
mod_weather_split_test <- testing(mod_weather_split)
#for the histogram;
mod_weather_split_train %>% pivot_longer(cols= everything(),
#this tells R to select all the columns in the data set instead of plotting each variable one-by-one; we use pivot_longer which takes all the selected columns and stacks them into two new columns; variable name, values.
       names_to = "variable_name", 
#this creates a new column for variable names
        values_to = "value")  %>%
#this creates a new column for values of each variables
        ggplot(aes(x=value)) + 
#we are plotting each variables against the value, since we're using histogram, we don't have to specify the y-axis
        geom_histogram(bins=30, color="white", fill="blue") + 
#the bin takes all the data points and divides them into 30 equal ranges; the fill fills the bar with blue; the color tapes the bar with white
         facet_wrap(~variable_name, scales="free") +
#this function creates an histogram for each variable; scales="free" tells R to use appropriate x and y axis limit, and not generalize.
         labs(title = "Initial Distribution of All Variables in Training Set") +
        theme_minimal()

## Warning: Removed 1827 rows containing non-finite outside the scale range
## (`stat_bin()`).

#lets the plot look less busy and more modern

We create 4 simple linear regression models where precip is the response, since that’s what we want to predict. We use the lm() function to create a linear model; we pass in the target variable, predictor variable and the training data.

model1 <- lm(precip ~ relative_humidity, mod_weather_split_train)
#we use the summary function to see the intercept, slope F-test, and p-value
summary(model1)

## 
## Call:
## lm(formula = precip ~ relative_humidity, data = mod_weather_split_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01997 -0.01081 -0.00484  0.00192  1.17163 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.983e-02  2.576e-03   -7.70 1.79e-14 ***
## relative_humidity  3.980e-04  3.746e-05   10.62  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04225 on 3220 degrees of freedom
##   (1359 observations deleted due to missingness)
## Multiple R-squared:  0.03387,    Adjusted R-squared:  0.03357 
## F-statistic: 112.9 on 1 and 3220 DF,  p-value: < 2.2e-16

We visualize the models using scatter plot. For humidity:

mod_weather_split_train %>% ggplot(aes(relative_humidity, precip)) + #we map the relative_humidity column to the x-axis; the precip (precipitation) column to the y-axis
                            geom_point() + 
#this tells R to draw a scatter plot
                            geom_smooth(method = "lm", na.rm = T)+ #This draws a straight line of best fit, ignoring missing (NA) values during calculation
                            labs(title = "Relative Humidity vs Precipitation")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1359 rows containing missing values or values outside the scale range
## (`geom_point()`).

The “Relative Humidity vs Precipitation” plot shows a large cluster of points along the bottom of the graph(precip=0);

this shows that most of the time there is no rain, regardless of humidity;
the trend is a positive one; as humidity approaches 100%, the precipitation increases
over all, a relationship has been established between precipitation and humidity, but humidity alone is not a strong predictor.

For dry_bulb_temp_f:

model2 <- lm(precip ~ dry_bulb_temp_f, mod_weather_split_train)
summary(model2)

## 
## Call:
## lm(formula = precip ~ dry_bulb_temp_f, data = mod_weather_split_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.00782 -0.00682 -0.00628 -0.00578  1.18322 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     4.512e-03  2.557e-03   1.765   0.0777 .
## dry_bulb_temp_f 3.341e-05  4.401e-05   0.759   0.4479  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04298 on 3220 degrees of freedom
##   (1359 observations deleted due to missingness)
## Multiple R-squared:  0.0001789,  Adjusted R-squared:  -0.0001316 
## F-statistic: 0.5762 on 1 and 3220 DF,  p-value: 0.4479

#we visualize the model
mod_weather_split_train %>% ggplot(aes(dry_bulb_temp_f, precip)) + #we map the dry bulb temp column to the x-axis; the precip (precipitation) column to the y-axis
  geom_point() + #this tells R to draw a scatter plot
  geom_smooth(method = "lm", na.rm = T) + #This draws a straight line of best fit, ignoring missing (NA) values during calculation
  labs(title = "Dry Bulb Temp vs Precipitation")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1359 rows containing missing values or values outside the scale range
## (`geom_point()`).

The “Dry Bulb Temp vs Precipitation” plot:

shows a massive concentration of data along the bottom; there was no rain during most observation;
the points are spread across a wide range of temperatures; it can rain during cool or warm weather;
dry bulb temp is not an efficient predictor for the amount of hourly rainfall;
the blue line shows that the linear relationship is weak.

For wind speed:

model3 <- lm(precip ~ wind_speed, mod_weather_split_train)
summary(model3)

## 
## Call:
## lm(formula = precip ~ wind_speed, data = mod_weather_split_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01378 -0.00730 -0.00560 -0.00458  1.17997 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 0.0025309  0.0015688   1.613   0.1068   
## wind_speed  0.0003408  0.0001222   2.790   0.0053 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04293 on 3220 degrees of freedom
##   (1359 observations deleted due to missingness)
## Multiple R-squared:  0.002412,   Adjusted R-squared:  0.002102 
## F-statistic: 7.785 on 1 and 3220 DF,  p-value: 0.005301

#we visualize the model
mod_weather_split_train %>% ggplot(aes(wind_speed, precip)) + 
#we map the wind speed column to the x-axis; the precip column to the y-axis
                     geom_point() + 
                     geom_smooth(method = "lm", na.rm = T) +
  labs(title = "Wind Speed vs Precipitation")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1359 rows containing missing values or values outside the scale range
## (`geom_point()`).

The “Wind Speed vs Precipitation” plot shows that:

the points are really scattered; shows a weak relationship
high wind speed doesn’t translate to more rainfall
there is no linear correlation between them; wind speed alone is a poor predictor

For station pressure:

model4 <- lm(precip ~ station_pressure, mod_weather_split_train)
summary(model4)

## 
## Call:
## lm(formula = precip ~ station_pressure, data = mod_weather_split_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02503 -0.00887 -0.00559 -0.00166  1.17698 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.66173    0.09420   7.025 2.60e-12 ***
## station_pressure -0.02185    0.00314  -6.958 4.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04269 on 3216 degrees of freedom
##   (1363 observations deleted due to missingness)
## Multiple R-squared:  0.01483,    Adjusted R-squared:  0.01452 
## F-statistic: 48.41 on 1 and 3216 DF,  p-value: 4.166e-12

mod_weather_split_train %>% ggplot(aes(station_pressure, precip)) + 
  geom_point() +  
  geom_smooth(method = "lm", na.rm = T) + # 
  labs(title = "Station Pressure vs Precipitation")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1363 rows containing missing values or values outside the scale range
## (`geom_point()`).

The “Station Pressure vs Precipitation” plot shows:

the points are scattered around the trend line
station pressure alone is not a strong indicator
the linear relationship is weak

We improve the model We start with the multiple linear regression model

multiplemodel <- lm(precip ~ station_pressure + relative_humidity + dry_bulb_temp_f + wind_speed, data=mod_weather_split_train)
summary(multiplemodel)

## 
## Call:
## lm(formula = precip ~ station_pressure + relative_humidity + 
##     dry_bulb_temp_f + wind_speed, data = mod_weather_split_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03528 -0.01064 -0.00436  0.00214  1.16227 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.611e-01  1.074e-01   2.431  0.01510 *  
## station_pressure  -9.598e-03  3.509e-03  -2.735  0.00627 ** 
## relative_humidity  4.135e-04  4.093e-05  10.102  < 2e-16 ***
## dry_bulb_temp_f   -6.866e-06  4.495e-05  -0.153  0.87861    
## wind_speed         5.618e-04  1.358e-04   4.138 3.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04204 on 3213 degrees of freedom
##   (1363 observations deleted due to missingness)
## Multiple R-squared:  0.04543,    Adjusted R-squared:  0.04424 
## F-statistic: 38.23 on 4 and 3213 DF,  p-value: < 2.2e-16

Using the polynomial regression model

#the inital plot of temp showed a curvy relationship which wasn't captured by the linear model; we use the polynomial model of degree 2 to achieve that; but we filter out the NAs first because ploy() function cannot process data with missing values
clean_data <- mod_weather_split_train %>%
  select(precip, dry_bulb_temp_f) %>%
  na.omit()
polymodel <- lm(precip ~ poly(dry_bulb_temp_f, 2), data=clean_data)
summary(polymodel) #the model didn't perform well, the R^2 = 0.001006 showed a really low predictive power

## 
## Call:
## lm(formula = precip ~ poly(dry_bulb_temp_f, 2), data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.00790 -0.00753 -0.00667 -0.00508  1.18267 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.0063656  0.0007568   8.411   <2e-16 ***
## poly(dry_bulb_temp_f, 2)1  0.0326260  0.0429571   0.760   0.4476    
## poly(dry_bulb_temp_f, 2)2 -0.0928076  0.0429571  -2.160   0.0308 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04296 on 3219 degrees of freedom
## Multiple R-squared:  0.001627,   Adjusted R-squared:  0.001006 
## F-statistic: 2.622 on 2 and 3219 DF,  p-value: 0.0728

We build a combined model, we combine temperature with other crucial variables; humidity, wind speed, and station pressure

#we clean the variables
clean_data <- mod_weather_split_train %>%
  select(precip, dry_bulb_temp_f, wind_speed, station_pressure, relative_humidity) %>%
  na.omit()
#we build a combined model
combined_model <- lm(precip ~ relative_humidity +                                       poly(dry_bulb_temp_f, 2) + wind_speed + 
                       station_pressure, data = clean_data)
summary(combined_model)

## 
## Call:
## lm(formula = precip ~ relative_humidity + poly(dry_bulb_temp_f, 
##     2) + wind_speed + station_pressure, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03528 -0.01061 -0.00433  0.00216  1.16228 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.609e-01  1.068e-01   2.443  0.01464 *  
## relative_humidity          4.147e-04  4.188e-05   9.902  < 2e-16 ***
## poly(dry_bulb_temp_f, 2)1 -6.914e-03  4.390e-02  -0.158  0.87486    
## poly(dry_bulb_temp_f, 2)2  6.009e-03  4.347e-02   0.138  0.89007    
## wind_speed                 5.606e-04  1.360e-04   4.121 3.86e-05 ***
## station_pressure          -9.605e-03  3.510e-03  -2.737  0.00624 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04204 on 3212 degrees of freedom
## Multiple R-squared:  0.04543,    Adjusted R-squared:  0.04395 
## F-statistic: 30.58 on 5 and 3212 DF,  p-value: < 2.2e-16

The R-squared of the combined model is 0.04395, which shows that, based on the model, ~4.4% of the hourly change in precipitation can be attributed to the combination of relative humidity, temperature, wind speed, and station pressure. It also suggests that there are crucial factors that make up the remaining 95% missing from the analysis.

Evaluating the model We use the model to make predictions on the test data

clean_test <- mod_weather_split_test %>%
                na.omit()
test_result <- clean_test %>% #using the cleaned test data
               mutate(
             #this creates a prediction column to the test data
               predictions = predict(combined_model, newdata = clean_test))
head(test_result)

##   relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
## 1                46              83      0         13            29.99
## 2                79              41      0          6            29.92
## 3                65              24      0         11            30.35
## 4                37              44      0          7            30.25
## 7                86              62      0         10            29.97
## 8                37              25      0         23            29.62
##     predictions
## 1 -0.0008384586
## 2  0.0096764082
## 3  0.0029081417
## 4 -0.0103948128
## 7  0.0141550740
## 8  0.0050088617

We evaluate the metrics;

The RMSE tells us how far off the predictions are on average. A model with the lowest RMSE is the best.

rmsevalue <- test_result %>% rmse(truth=precip, estimate = predictions)
rmsevalue

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0324

Testing the Models

We use the first SLR on relative humidity on the test data;

model.humidity <- clean_test %>% #we add a prediction column to the test data
                  mutate(predictions = predict(model1, newdata = clean_test))
# rmse value of relative humidity
rmsevalue <- model.humidity %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0333

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0333

We use the first SLR on dry bulb temperature on the test data;

model.temp <- clean_test %>%
  mutate(predictions = predict(model2, newdata = clean_test))
# rmse value of dry bulb temperature
rmsevalue <- model.temp %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0343

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0343

We use the first SLR on wind speed on the test data;

model.wind <- clean_test %>%
  mutate(predictions = predict(model3, newdata = clean_test))
# rmse value of wind speed
rmsevalue <- model.wind %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0339

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0339

We use the first SLR on station pressure on the test data;

model.pressure <- clean_test %>%
  mutate(predictions = predict(model4, newdata = clean_test))
# rmse value of station pressure
rmsevalue <- model.pressure %>% rmse(truth=precip, estimate = predictions)
rmsevalue  #0.0337

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0337

Testing with MLR

test_MLR <- clean_test %>% 
             mutate(
# Use the trained model to predict values for the test data
               predictions = predict(multiplemodel, newdata = clean_test))
#We check the rmse value
rmsevalue <- test_MLR %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0324

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0324

Testing with the polynomial regression model

test_poly <- clean_test %>% 
             mutate(
# Use the trained model to predict values for the test data
               predictions = predict(polymodel, newdata = clean_test))
#We check the rmse value
rmsevalue <- test_poly %>% rmse(truth=precip, estimate = predictions)
rmsevalue #0.0342

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0342

Based on the Root Mean Squared Error evaluated on the unseen test data;

The Combined Model is the best linear regression predictor because it has the lowest average error (0.0324 inches), compared to the SLRs and polynomial model
However, there is a weak correlation between the variables.

2nd edX Project

Fadilah.O

2025-11-05

The key columns that we will explore in this project are:

Testing the Models