TASK 1: INSTALL AND LOAD THE REQUIRED PACKAGES
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ recipes 1.3.1
## ✔ dials 1.4.2 ✔ rsample 1.3.1
## ✔ dplyr 1.1.4 ✔ tailor 0.1.0
## ✔ ggplot2 4.0.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.9 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ purrr 1.2.0 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(rlang)
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
library(corrplot)
## corrplot 0.95 loaded
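The library() calls above assume these packages are already installed. If any are missing, a one-time installation step along these lines can be run first (a minimal sketch; the package names are simply the ones loaded above):
# One-time setup: install any of the required packages that are not yet present
pkgs <- c("tidymodels", "rlang", "corrplot")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)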
TASK 2: READ IN THE DATASET
pal<-read.csv("/Users/mac/Documents/jfk_weather_sample.csv")
head(pal)#to view the first six rows of the dataset
## DATE HOURLYDewPointTempF HOURLYRelativeHumidity
## 1 2015-07-25T13:51:00Z 60 46
## 2 2016-11-18T23:51:00Z 34 48
## 3 2013-01-06T08:51:00Z 33 89
## 4 2011-01-27T16:51:00Z 18 48
## 5 2015-01-03T12:16:00Z 27 61
## 6 2013-02-15T20:51:00Z 35 79
## HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1 83 68 0.00 13
## 2 53 44 0.00 6
## 3 36 35 0.00 13
## 4 36 30 0.00 14
## 5 39 34 T 11
## 6 41 38 0.00 6
## HOURLYSeaLevelPressure HOURLYStationPressure
## 1 30.01 29.99
## 2 30.05 30.03
## 3 30.14 30.12
## 4 29.82 29.80
## 5 NA 30.50
## 6 29.94 29.92
tail(pal)#to view the last six rows of the dataset
## DATE HOURLYDewPointTempF HOURLYRelativeHumidity
## 5722 2015-05-26T08:51:00Z 59 64
## 5723 2018-06-05T23:51:00Z 50 67
## 5724 2016-09-20T23:59:00Z <NA> NA
## 5725 2014-04-19T18:51:00Z 34 48
## 5726 2010-01-08T10:51:00Z 25 75
## 5727 2016-01-14T18:51:00Z 22 52
## HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 5722 72 64 0.00 9
## 5723 61 55 0.00 6
## 5724 NA NA <NA> NA
## 5725 53 44 0.00 3
## 5726 32 29 0.00 7
## 5727 38 32 0.00 0
## HOURLYSeaLevelPressure HOURLYStationPressure
## 5722 30.23 30.21
## 5723 29.74 29.72
## 5724 NA NA
## 5725 30.28 30.26
## 5726 29.72 29.69
## 5727 29.99 29.97
glimpse(pal)#to see the different data types
## Rows: 5,727
## Columns: 9
## $ DATE <chr> "2015-07-25T13:51:00Z", "2016-11-18T23:51:00Z",…
## $ HOURLYDewPointTempF <chr> "60", "34", "33", "18", "27", "35", "4", "14", …
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYWETBULBTEMPF <int> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
## $ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
## $ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
DATA PREPROCESSING
TASK 3: SELECT A SUBSET OF COLUMNS
The first step in this data preprocessing is to select a subset of the data columns and inspect the column types. For this project, I need the HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip, HOURLYWindSpeed, and HOURLYStationPressure columns.
THE NEEDED DATA GLOSSARY
HOURLYRelativeHumidity: the relative humidity, given to the nearest whole percentage.
HOURLYDRYBULBTEMPF: the dry-bulb temperature, commonly used as the standard air temperature reported. Given here in whole degrees Fahrenheit.
HOURLYPrecip: the amount of precipitation in inches, to hundredths, over the past hour. For certain automated stations, precipitation is reported at sub-hourly intervals (e.g. every 15 or 20 minutes) as an accumulated amount of all precipitation within the preceding hour. A “T” indicates a trace amount of precipitation.
HOURLYWindSpeed: the speed of the wind at the time of observation, given in miles per hour (mph).
HOURLYStationPressure: the atmospheric pressure observed at the station at the time of observation, given in inches of mercury (in Hg).
Useful<-pal %>% select(HOURLYRelativeHumidity, HOURLYDRYBULBTEMPF, HOURLYPrecip, HOURLYWindSpeed, HOURLYStationPressure)#to select the needed columns
head(Useful)#to show the first six rows of the dataframe
## HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 1 46 83 0.00 13
## 2 48 53 0.00 6
## 3 89 36 0.00 13
## 4 48 36 0.00 14
## 5 61 39 T 11
## 6 79 41 0.00 6
## HOURLYStationPressure
## 1 29.99
## 2 30.03
## 3 30.12
## 4 29.80
## 5 30.50
## 6 29.92
tail(Useful)#to show the last six rows of the dataframe
## HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
## 5722 64 72 0.00 9
## 5723 67 61 0.00 6
## 5724 NA NA <NA> NA
## 5725 48 53 0.00 3
## 5726 75 32 0.00 7
## 5727 52 38 0.00 0
## HOURLYStationPressure
## 5722 30.21
## 5723 29.72
## 5724 NA
## 5725 30.26
## 5726 29.69
## 5727 29.97
glimpse(Useful)#to show the number of rows, columns and the data type
## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
TASK 4: CLEAN UP COLUMNS
unique(Useful$HOURLYPrecip)#to return the distinct values in the column, so we can inspect what it contains
## [1] "0.00" "T" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07"
## [10] "0.16" "0.09" "0.22" "0.02s" "0.24" "0.18" "0.05" "0.04" "0.09s"
## [19] "0.11" "0.14" "0.25" "0.10" "0.01s" "0.58" "0.12" "0.13" "0.46"
## [28] "1.07" "1.19" "0.34" "0.20" "0.36s" "0.42" "0.17" "0.27" "0.35"
## [37] "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19" "0.36" "0.03s"
## [46] "0.07s" "0.54" "0.59" "0.21"
From the data frame (Useful), we can see that the column HOURLYPrecip, which is the hourly measure of precipitation levels, contains both NA and T values. T specifies trace amounts of precipitation (meaning essentially no precipitation), while NA means not available and is used to denote missing values. Additionally, some values have an “s” at the end, indicating that the precipitation was snow.
Having characters in values (like the “T” and “s” seen in the unique values) will cause problems when I create a model because values for precipitation should be numerical.
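Before cleaning, it can help to see how many observations carry these flags; a quick hedged sketch (the resulting counts are not shown here):
sum(Useful$HOURLYPrecip == "T", na.rm = TRUE)       # number of trace ("T") observations
sum(grepl("s$", Useful$HOURLYPrecip), na.rm = TRUE) # number of values with a trailing "s" (snow flag)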
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ rlang::flatten() masks purrr::flatten()
## ✖ rlang::flatten_chr() masks purrr::flatten_chr()
## ✖ rlang::flatten_dbl() masks purrr::flatten_dbl()
## ✖ rlang::flatten_int() masks purrr::flatten_int()
## ✖ rlang::flatten_lgl() masks purrr::flatten_lgl()
## ✖ rlang::flatten_raw() masks purrr::flatten_raw()
## ✖ rlang::invoke() masks purrr::invoke()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ✖ rlang::splice() masks purrr::splice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Useful <- Useful %>%
  mutate(HOURLYPrecip = HOURLYPrecip %>%   # update the existing column
           gsub("T", "0.0", .) %>%         # replace the "T" (trace) values with "0.0"
           str_remove(pattern = "s$"))     # remove a trailing "s" using a regex pattern
unique(Useful$HOURLYPrecip)# to check if unique values of HOURLYPrecip still contain any T or s.
## [1] "0.00" "0.0" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07" "0.16"
## [11] "0.09" "0.22" "0.24" "0.18" "0.05" "0.04" "0.11" "0.14" "0.25" "0.10"
## [21] "0.58" "0.12" "0.13" "0.46" "1.07" "1.19" "0.34" "0.20" "0.36" "0.42"
## [31] "0.17" "0.27" "0.35" "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19"
## [41] "0.54" "0.59" "0.21"
TASK 5: CONVERT THE HOURLYPrecip COLUMN TO A NUMERIC TYPE
Useful<-Useful %>% mutate(HOURLYPrecip=as.numeric(HOURLYPrecip))#to convert the column to numeric
glimpse(Useful)
## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <int> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF <int> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ HOURLYWindSpeed <int> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
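As a sanity check, this coercion should not introduce new NAs, since every distinct value left after Task 4 looks numeric. A hedged sketch of a guard that could be run on the character column before converting (the helper name below is mine, not part of the original workflow):
# Hypothetical guard: list any cleaned values that would still fail to parse as numbers.
# suppressWarnings() hides the "NAs introduced by coercion" warning during the check.
not_parseable <- function(x) unique(x[!is.na(x) & is.na(suppressWarnings(as.numeric(x)))])
# e.g. not_parseable(Useful$HOURLYPrecip), run before the conversion, would be expected to return character(0)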
TASK 6: RENAME COLUMNS
Useful<-Useful %>% rename(relative_humidity=HOURLYRelativeHumidity,
dry_bulb_temp_f=HOURLYDRYBULBTEMPF,
precip=HOURLYPrecip,
wind_speed=HOURLYWindSpeed,
station_pressure=HOURLYStationPressure)# to rename all the columns
head(Useful)
## relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
## 1 46 83 0 13 29.99
## 2 48 53 0 6 30.03
## 3 89 36 0 13 30.12
## 4 48 36 0 14 29.80
## 5 61 39 0 11 30.50
## 6 79 41 0 6 29.92
REPLACE THE MISSING VALUES
colSums(is.na(Useful))#to count the missing values in each column
## relative_humidity dry_bulb_temp_f precip wind_speed
## 151 151 1660 156
## station_pressure
## 158
Useful$relative_humidity[is.na(Useful$relative_humidity)] <- mean(Useful$relative_humidity, na.rm = TRUE)
Useful$dry_bulb_temp_f[is.na(Useful$dry_bulb_temp_f)]<- median(Useful$dry_bulb_temp_f,na.rm=TRUE)
Useful$precip[is.na(Useful$precip)] <- mean(Useful$precip, na.rm = TRUE)
Useful$wind_speed[is.na(Useful$wind_speed)] <- median(Useful$wind_speed, na.rm = TRUE)
Useful$station_pressure[is.na(Useful$station_pressure)] <- mean(Useful$station_pressure, na.rm = TRUE)
colSums(is.na(Useful))
## relative_humidity dry_bulb_temp_f precip wind_speed
## 0 0 0 0
## station_pressure
## 0
# We no longer have any missing values
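For reference, the same imputation can be written more compactly with dplyr's across(); this is only a sketch of an equivalent alternative to the base-R assignments above, not a second imputation step:
# Equivalent dplyr version (sketch): mean-impute relative_humidity, precip and
# station_pressure; median-impute dry_bulb_temp_f and wind_speed
Useful <- Useful %>%
  mutate(
    across(c(relative_humidity, precip, station_pressure),
           ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)),
    across(c(dry_bulb_temp_f, wind_speed),
           ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
  )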
TASK 7: EXPLORATORY DATA ANALYSIS
Let's check the summary of the dataset.
summary(Useful)
## relative_humidity dry_bulb_temp_f precip wind_speed
## Min. : 13.00 Min. : 5.00 Min. :0.000000 Min. : 0.00
## 1st Qu.: 51.00 1st Qu.:43.00 1st Qu.:0.000000 1st Qu.: 7.00
## Median : 67.00 Median :56.00 Median :0.000000 Median :10.00
## Mean : 66.89 Mean :55.61 Mean :0.006548 Mean :11.16
## 3rd Qu.: 84.00 3rd Qu.:70.00 3rd Qu.:0.006548 3rd Qu.:15.00
## Max. :100.00 Max. :99.00 Max. :1.190000 Max. :52.00
## station_pressure
## Min. :28.69
## 1st Qu.:29.85
## Median :29.99
## Mean :29.99
## 3rd Qu.:30.14
## Max. :30.82
Let's check the strength of the linear relationships between the features and the target variable using a correlation matrix.
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Useful <- subset(Useful, select = c(relative_humidity, wind_speed, dry_bulb_temp_f,station_pressure,precip))
cor_matrix <- cor(Useful, use = "pairwise.complete.obs") # Handles NAs if any remain
# Melt the correlation matrix into a long-format data frame for ggplot2
melted_cor_matrix <- melt(cor_matrix)
# Create the heatmap using ggplot2
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") + # Adds white borders to tiles
scale_fill_gradient2(low = "lightgreen", high = "red", mid = "purple",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
labs(title = "Correlation matrix of Climatological Features") +
coord_fixed() + # Ensures squares are square
# Add the correlation coefficients as text labels
geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
Let's also visualize just the correlations with the target, by plotting a single column of the correlation matrix as a heatmap.
library(dplyr)
library(corrr)
library(ggplot2)
target_correlations <- Useful %>%
correlate() %>%
focus(precip) %>% # Focus on the target variable column
rename(Feature = term, Correlation = precip) %>% # Rename columns for ggplot
filter(Feature != "precip") # Remove the 1.0 correlation with itself
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
ggplot(target_correlations, aes(x = "Correlation with Target", y = reorder(Feature, Correlation), fill = Correlation)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "deeppink", midpoint = 0, limit = c(-1, 1)) +
geom_text(aes(label = round(Correlation, 3)), color = "blue", size = 4) +
theme_minimal() +
labs(title = "Feature Correlation with Target (precip)", y = NULL, x = NULL) +
# Use a theme adjustment to get a similar clean look to the image
theme(legend.position = "none", # Hides the legend since numbers are on the plot
axis.text.x = element_blank()) # Hides the X-axis label
Now, let's split the data into training and testing sets.
set.seed(1234)#This function sets the starting number used to generate a sequence of random numbers.
#This ensures that you get the same result if you start with that same seed each time.
weather_data <- initial_split(Useful, prop = 0.8)#to split the data
train_data <- training(weather_data)#the training split
test_data <- testing(weather_data)#the testing split
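A quick check of the split sizes (with prop = 0.8, roughly 80% of the 5,727 rows should land in the training set):
nrow(train_data)#expected to be roughly 80% of the 5,727 rows
nrow(test_data)#the remaining ~20%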
I want to explore the distributions of the variables in the training set and the testing set separately.
So, let's plot boxplots of the variables (relative_humidity, dry_bulb_temp_f, precip, wind_speed, station_pressure) for an initial look at their distributions.
To make this easier, I will first convert train_data to a long format.
library(dplyr)
library(tidyr)
library(ggplot2)
# Select only the columns I want and convert to long format
train_long <- train_data%>%
select(relative_humidity, dry_bulb_temp_f, precip,wind_speed, station_pressure) %>%
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "value"
)
head(train_long)
## # A tibble: 6 × 2
## variable value
## <chr> <dbl>
## 1 relative_humidity 94
## 2 dry_bulb_temp_f 73
## 3 precip 0.00655
## 4 wind_speed 7
## 5 station_pressure 29.8
## 6 relative_humidity 76
Create boxplots for all predictors in the training set
ggplot(train_long, aes(y = value)) +
geom_boxplot(fill = "lightgreen", color = "black") +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(
title = "Boxplots of Training Set Variables",
y = "Value"
)
Box plots of Testing Set Predictors
test_long <- test_data%>%
select(relative_humidity, dry_bulb_temp_f,precip, wind_speed, station_pressure) %>%
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "value"
)
head(test_long)
## # A tibble: 6 × 2
## variable value
## <chr> <dbl>
## 1 relative_humidity 46
## 2 dry_bulb_temp_f 83
## 3 precip 0
## 4 wind_speed 13
## 5 station_pressure 30.0
## 6 relative_humidity 79
ggplot(test_long, aes(y = value)) +
geom_boxplot(fill = "lightgreen", color = "black") +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(
title = "Boxplots of Testing Set Variables",
y = "Value"
)
TASK 8: Build a Linear Regression Model Using Each Feature
The first model will be built using relative_humidity as the predictor
linear_model1 <- lm(precip~relative_humidity, data=train_data)
summary(linear_model1)
##
## Call:
## lm(formula = precip ~ relative_humidity, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01616 -0.00802 -0.00366 0.00069 1.17500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.290e-02 1.877e-03 -6.87 7.26e-12 ***
## relative_humidity 2.906e-04 2.691e-05 10.80 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.036 on 4579 degrees of freedom
## Multiple R-squared: 0.02483, Adjusted R-squared: 0.02462
## F-statistic: 116.6 on 1 and 4579 DF, p-value: < 2.2e-16
The second model will be built using dry_bulb_temp_f as the predictor.
linear_model2<-lm(precip~dry_bulb_temp_f,data=train_data)
summary(linear_model2)
##
## Call:
## lm(formula = precip ~ dry_bulb_temp_f, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.00758 -0.00672 -0.00617 -0.00012 1.18316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.216e-03 1.860e-03 2.804 0.00507 **
## dry_bulb_temp_f 2.388e-05 3.204e-05 0.745 0.45626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03646 on 4579 degrees of freedom
## Multiple R-squared: 0.0001212, Adjusted R-squared: -9.714e-05
## F-statistic: 0.5551 on 1 and 4579 DF, p-value: 0.4563
The third model will be built using wind_speed as the predictor.
linear_model3<-lm(precip~wind_speed,data=train_data)
summary(linear_model3)
##
## Call:
## lm(formula = precip ~ wind_speed, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01244 -0.00650 -0.00542 0.00005 1.18053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.532e-03 1.109e-03 3.186 0.00145 **
## wind_speed 2.700e-04 8.697e-05 3.105 0.00192 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03642 on 4579 degrees of freedom
## Multiple R-squared: 0.002101, Adjusted R-squared: 0.001883
## F-statistic: 9.639 on 1 and 4579 DF, p-value: 0.001916
The fourth model will be built using station_pressure as the predictor.
linear_model4<-lm(precip~station_pressure,data=train_data)
summary(linear_model4)
##
## Call:
## lm(formula = precip ~ station_pressure, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02058 -0.00736 -0.00406 0.00004 1.17851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.501903 0.068231 7.356 2.23e-13 ***
## station_pressure -0.016518 0.002275 -7.260 4.51e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03625 on 4579 degrees of freedom
## Multiple R-squared: 0.01138, Adjusted R-squared: 0.01116
## F-statistic: 52.71 on 1 and 4579 DF, p-value: 4.515e-13
The fifth model will be built using two features: wind_speed and station_pressure.
linear_model5<-lm(precip~wind_speed + station_pressure,data=train_data)
summary(linear_model5)
##
## Call:
## lm(formula = precip ~ wind_speed + station_pressure, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02235 -0.00731 -0.00410 0.00016 1.17758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.778e-01 7.119e-02 6.712 2.16e-11 ***
## wind_speed 1.068e-04 8.996e-05 1.187 0.235
## station_pressure -1.575e-02 2.364e-03 -6.663 3.00e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03625 on 4578 degrees of freedom
## Multiple R-squared: 0.01168, Adjusted R-squared: 0.01125
## F-statistic: 27.06 on 2 and 4578 DF, p-value: 2.069e-12
The last model will be built using all the features
multiple_reg_model<-lm(precip~.,data=train_data)
summary(multiple_reg_model)
##
## Call:
## lm(formula = precip ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03033 -0.00818 -0.00343 0.00120 1.16781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.019e-01 7.755e-02 2.604 0.00925 **
## relative_humidity 3.032e-04 2.930e-05 10.350 < 2e-16 ***
## wind_speed 4.315e-04 9.636e-05 4.478 7.71e-06 ***
## dry_bulb_temp_f -4.515e-06 3.269e-05 -0.138 0.89015
## station_pressure -7.344e-03 2.536e-03 -2.896 0.00380 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03584 on 4576 degrees of freedom
## Multiple R-squared: 0.03431, Adjusted R-squared: 0.03346
## F-statistic: 40.64 on 4 and 4576 DF, p-value: < 2.2e-16
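The same coefficient table can also be pulled out in tidy form with broom, which is loaded as part of tidymodels; a short sketch:
broom::tidy(multiple_reg_model)   # coefficients, standard errors and p-values as a tibble
broom::glance(multiple_reg_model) # model-level statistics such as R-squared and the F-statistic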
TASK 9: For each of the models created, check the model performance using the testing set.
First, predict on the testing data with each model.
actual_values<-test_data$precip
linear_model1_predictions<-predict(linear_model1,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model1_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.0004680524 -0.0004680524
## 2 0.000000000 0.0100572526 -0.0100572526
## 3 0.000000000 0.0059891070 -0.0059891070
## 4 0.000000000 -0.0021471841 0.0021471841
## 5 0.006547824 0.0077325980 -0.0011847740
## 6 0.006547824 -0.0024377659 0.0089855898
actual_values<-test_data$precip
linear_model2_predictions<-predict(linear_model2,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model2_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.007197842 -0.0071978425
## 2 0.000000000 0.006195052 -0.0061950518
## 3 0.000000000 0.005789160 -0.0057891604
## 4 0.000000000 0.006266680 -0.0062666797
## 5 0.006547824 0.006911331 -0.0003635069
## 6 0.006547824 0.006338308 0.0002095163
actual_values<-test_data$precip
linear_model3_predictions<-predict(linear_model3,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model3_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.007042806 -0.0070428064
## 2 0.000000000 0.005152632 -0.0051526317
## 3 0.000000000 0.006502756 -0.0065027565
## 4 0.000000000 0.005422657 -0.0054226567
## 5 0.006547824 0.005152632 0.0013951922
## 6 0.006547824 0.007042806 -0.0004949825
actual_values<-test_data$precip
linear_model4_predictions<-predict(linear_model4,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model4_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.006536439 -0.0065364392
## 2 0.000000000 0.007692680 -0.0076926804
## 3 0.000000000 0.000590056 -0.0005900560
## 4 0.000000000 0.002241829 -0.0022418291
## 5 0.006547824 0.007362326 -0.0008145018
## 6 0.006547824 0.008848922 -0.0023010976
actual_values<-test_data$precip
linear_model5_predictions<-predict(linear_model5,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=linear_model5_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.0067343816 -0.0067343816
## 2 0.000000000 0.0070898301 -0.0070898301
## 3 0.000000000 0.0008495988 -0.0008495988
## 4 0.000000000 0.0019979238 -0.0019979238
## 5 0.006547824 0.0067747596 -0.0002269357
## 6 0.006547824 0.0089398751 -0.0023920511
actual_values<-test_data$precip
multiple_reg_predictions<-predict(multiple_reg_model,newdata=test_data)
results_table<-data.frame(Actual=actual_values,Predicted=multiple_reg_predictions)
results_table$Residual<-results_table$Actual-results_table$Predicted
head(results_table)
## Actual Predicted Residual
## 1 0.000000000 0.0008754209 -0.0008754209
## 2 0.000000000 0.0085656690 -0.0085656690
## 3 0.000000000 0.0033967785 -0.0033967785
## 4 0.000000000 -0.0061761547 0.0061761547
## 5 0.006547824 0.0058573531 0.0006904708
## 6 0.006547824 -0.0009663807 0.0075142047
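The prediction blocks above repeat the same pattern for each model. A more compact alternative (shown only as a sketch, not the approach used above) is to hold the models in a named list and map over it with purrr:
# Sketch: generate all test-set predictions in one pass with purrr::map()
models <- list(
  linear_model1 = linear_model1, linear_model2 = linear_model2,
  linear_model3 = linear_model3, linear_model4 = linear_model4,
  linear_model5 = linear_model5, multiple_reg_model = multiple_reg_model
)
all_predictions <- purrr::map(models, ~ predict(.x, newdata = test_data))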
TASK 9: Model evaluation using RMSE to find the best model
Let's evaluate all the models, compare them, and pick the best one.
RMSE stands for Root Mean Square Error, a metric widely used for evaluating the performance of regression models. It measures the average difference between the values predicted by a model and the actual values, and it is in the same unit as the target variable.
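For reference, the formula behind the metric is simple enough to compute by hand; a minimal sketch (the helper below is mine, the evaluation itself uses the Metrics package):
# RMSE = square root of the mean squared difference between actual and predicted values
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))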
library(Metrics)
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:rlang':
##
## ll
## The following objects are masked from 'package:yardstick':
##
## accuracy, mae, mape, mase, precision, recall, rmse, smape
class(actual_values)
## [1] "numeric"
class(multiple_reg_predictions)
## [1] "numeric"
rmse_value1_test<-Metrics::rmse(actual_values,linear_model1_predictions)
rmse_value1_test
## [1] 0.0285869
rmse_value2_test<-Metrics::rmse(actual_values,linear_model2_predictions)
rmse_value2_test
## [1] 0.02922909
rmse_value3_test<-Metrics::rmse(actual_values,linear_model3_predictions)
rmse_value3_test
## [1] 0.02899481
rmse_value4_test<-Metrics::rmse(actual_values,linear_model4_predictions)
rmse_value4_test
## [1] 0.0288756
rmse_value5_test<-Metrics::rmse(actual_values,linear_model5_predictions)
rmse_value5_test
## [1] 0.028801
rmse_value6_test<-Metrics::rmse(actual_values,multiple_reg_predictions)
rmse_value6_test
## [1] 0.02802316
TASK 10: Use a table to compare the evaluation metrics
model_names <- c("linear_model1","linear_model2","linear_model3","linear_model4","linear_model5","multiple_reg_model")
RMSE <- c(rmse_value1_test, rmse_value2_test, rmse_value3_test, rmse_value4_test, rmse_value5_test, rmse_value6_test)#use the computed RMSE values rather than hard-coded strings
comparison_df <- data.frame(model_names, RMSE)
comparison_df
## model_names RMSE
## 1 linear_model1 0.0285869
## 2 linear_model2 0.02922909
## 3 linear_model3 0.02899481
## 4 linear_model4 0.0288756
## 5 linear_model5 0.028801
## 6 multiple_reg_model 0.02802316
From the table above, multiple_reg_model has the lowest RMSE. The lower the RMSE, the more accurate the model is, so multiple_reg_model is the best model in this project.
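The best model can also be picked programmatically rather than by inspection; a small sketch (as.numeric() keeps it robust even if RMSE were stored as text):
comparison_df[which.min(as.numeric(comparison_df$RMSE)), ]#row with the lowest RMSE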
library(gt)
my_df<-comparison_df%>% gt() %>% tab_header(title="Model Performance Metrics")
gtsave(
data = my_df,
filename = "model_performance_table.png")
## file:////var/folders/ln/45c0bd2s61vbgrr4yjcb6cf00000gn/T//RtmpqlNiiY/file17e0443472a9.html screenshot completed
Save the best model as an RDS file.
saveRDS(multiple_reg_model,file="my_multiple_linear_regression_model.rds")
list.files()
## [1] "1. Overview-of-Excel-Exercise-Solved.xlsx"
## [2] "1. Overview-of-Excel-Exercise-Unsolved.xlsx"
## [3] "2. Data-Entry-Exercise - Solved.xlsx"
## [4] "2. Data-Entry-Exercise - Unsolved.xlsx"
## [5] "3. Formatting- Exercise - Solved.xlsx"
## [6] "3. Formatting-Exercise - Unsolved.xlsx"
## [7] "3uTools_v3.10.002.dmg"
## [8] "3uTools_v3.11.001.dmg"
## [9] "3uToolsDownloads"
## [10] "4684_down.pdf"
## [11] "48. Round-Exercise - Solved.xlsx"
## [12] "48. Round-Exercise - Unsolved.xlsx"
## [13] "5. Formulas - Exercise - Unsolved.xlsx"
## [14] "5. Formulas- Exercise - Solved.xlsx"
## [15] "6. Functions- Exercise - Solved.xlsx"
## [16] "6. Functions-Exercise - Unsolved.xlsx"
## [17] "7. Copy-Cut-Paste-Exercise - Unsolved.xlsx"
## [18] "8. Paste-Special-Exercise - Solved.xlsx"
## [19] "8. Paste-Special-Exercise - Unsolved.xlsx"
## [20] "applicantImage_copy.pdf"
## [21] "applicantImage.pdf"
## [22] "archive (1).zip"
## [23] "archive (2).zip"
## [24] "archive (3).zip"
## [25] "archive.zip"
## [26] "avast_security_online.dmg"
## [27] "Basic R for Data Science Project 1 (1).Rmd"
## [28] "Basic-R-for-Data-Science-Project-1--1-.html"
## [29] "CamScanner 04-24-2025 11.29.pdf"
## [30] "CamScanner 05-02-2025 06.43 (1).pdf"
## [31] "CamScanner 9-13-25 15.19.pdf"
## [32] "certificate.pdf"
## [33] "clean copy.png"
## [34] "clean.png"
## [35] "climate_change_indicators.csv"
## [36] "covid.csv"
## [37] "covid1.csv"
## [38] "customer_data_cleaning_project.R"
## [39] "Customer_Purchase_Data.csv"
## [40] "Data Cleaning Practice (1).Rmd"
## [41] "Data Cleaning Practice.Rmd"
## [42] "Data Cleaning Project 1.Rmd"
## [43] "Data Cleaning Project 2.Rmd"
## [44] "Data Entry_Lecture.xlsx"
## [45] "Data-Cleaning-2-Practice.Rmd"
## [46] "Data-Cleaning-Practice.html"
## [47] "Data-Cleaning-Project-1.html"
## [48] "Data-Cleaning-Project-2.html"
## [49] "Data-Cleaning-Tolu.html"
## [50] "Data-Cleaning-Tolu.knit.md"
## [51] "djia_analysis.md"
## [52] "eDX course 3.Rmd"
## [53] "edX Project 1 (1).Rmd"
## [54] "edX Project 1 (3).Rmd"
## [55] "EDX R PROGRAMMING 1.R"
## [56] "edX-Project-1 (1).html"
## [57] "edX-Project-1--1-.html"
## [58] "edX-Project-1.html"
## [59] "Format cells_Lecture_after.xlsx"
## [60] "Formatting-Lecture_after.xlsx"
## [61] "Formatting-Lecture_before.xlsx"
## [62] "Formulas-Lecture_after.xlsx"
## [63] "Formulas-Lecture_before.xlsx"
## [64] "Functions-Lecture_after.xlsx"
## [65] "Functions-Lecture_before.xlsx"
## [66] "GIT Report format (1).docx"
## [67] "GIT Report format (2).docx"
## [68] "GIT Report format.docx"
## [69] "googlechrome.dmg"
## [70] "histogram.png"
## [71] "icarefone-mac_11735733101042763901.dmg"
## [72] "imm1294e.pdf"
## [73] "imm5645e (1).pdf"
## [74] "imm5645e.pdf"
## [75] "Install Spotify.app"
## [76] "Introduction to Excel - Course notes.pdf"
## [77] "introduction to ML (1) (1).pptm"
## [78] "introduction to ML (1) (2).pptm"
## [79] "introduction to ML (1) (3).pptm"
## [80] "introduction to ML (1).pptm"
## [81] "introduction_to_data_cleaning_with_R.pdf"
## [82] "lax_to_jfk"
## [83] "lax_to_jfk.tar.gz"
## [84] "Long_and_Wide_Data_Format_Updated.Rmd"
## [85] "Long_Wide_Corporate_Executive.pptx"
## [86] "Long_Wide_Data_Format.html"
## [87] "Long_Wide_Data_Format.pptx"
## [88] "Long_Wide_Data_Format.Rmd"
## [89] "Long_Wide_Format_Presentation.pptx"
## [90] "maradun.txt"
## [91] "median_average_wages.csv"
## [92] "messy_customer_data.csv"
## [93] "MetaTrader 5.pkg"
## [94] "MetaTrader5.pkg.zip"
## [95] "model_performance_table.png"
## [96] "MY SLIDE (1).pptx"
## [97] "MY SLIDE.pptx"
## [98] "my_correlation_matrix (1).png"
## [99] "my_correlation_matrix.png"
## [100] "my_multiple_linear_regression_model.rds"
## [101] "MY_SLIDE_BEAUTIFIED.pptx"
## [102] "MY_SLIDE_UPDATED.pptx"
## [103] "Nike Sales Cleaned Data Sales.Rmd"
## [104] "Nike sales cleaned data.csv"
## [105] "Nike sales cleaned dataset.csv"
## [106] "Nike_Sales_Uncleaned.csv"
## [107] "Nike-Sales-Cleaned-Data-Code.html"
## [108] "Nike-Sales-Cleaned-Data-Sales.html"
## [109] "Nike-Sales-Trend-Analysis (1).Rmd"
## [110] "Nike-Sales-Trend-Analysis.docx"
## [111] "Nike-Sales-Trend-Analysis.Rmd"
## [112] "noaa-weather-sample-data"
## [113] "noaa-weather-sample-data (1).tar.gz"
## [114] "noaa-weather-sample-data.tar.gz"
## [115] "notconfirmed.Rmd"
## [116] "output_table.png"
## [117] "Predictive Project.Rmd"
## [118] "Predictive-Project_files"
## [119] "Predictive-Project.html"
## [120] "Predictive-Project.Rmd"
## [121] "Project Details .pdf"
## [122] "R-4.5.2-arm64.pkg"
## [123] "Real Project.Rmd"
## [124] "Real-Project.html"
## [125] "Resources for ICAMMDA.docx"
## [126] "rsconnect"
## [127] "RStudio-2025.09.2-418.dmg"
## [128] "SpotifyInstaller (1).zip"
## [129] "SpotifyInstaller (2).zip"
## [130] "SpotifyInstaller.zip"
## [131] "st.R"
## [132] "Stock_Analysis_Project.html"
## [133] "Stock_Analysis_Project.Rmd"
## [134] "Student Mental health (1).csv"
## [135] "STUDY CORNER.docx"
## [136] "TASK.pdf"
## [137] "The Traders Bible .pdf"
## [138] "Tidying data (1).qmd"
## [139] "Tidying data (2).qmd"
## [140] "Tidying data.qmd"
## [141] "Toluwani ABIOLA CV_025611.docx"
## [142] "Toluwani edx Course 2 (1).Rmd"
## [143] "Toluwani edx Course 2 (2).Rmd"
## [144] "Toluwani edx Course 2.Rmd"
## [145] "TOLUWANI PRECIOUS fitness (1).pdf"
## [146] "TOLUWANI PRECIOUS fitness.pdf"
## [147] "Toluwani-Chapter-7-and-8.html"
## [148] "UDS0028020498213.png"
## [149] "unclean.png"
## [150] "VICTORIA PRESENTATION SLIDE .pptx"
## [151] "Visual Studio Code 2.app"
## [152] "Visual Studio Code 3.app"
## [153] "Visual Studio Code.app"
## [154] "VSCode-darwin-universal (1).zip"
## [155] "VSCode-darwin-universal.zip"
## [156] "WB_ES_T_FIN12.csv"
## [157] "WB_FINDEX_FIN11_2A.csv"
## [158] "weather-weka.csv"
## [159] "WhatsApp Image 2025-11-14 at 11.24.17 PM.jpeg"
## [160] "WhatsApp Image 2025-11-14 at 11.24.18 PM (1).jpeg"
## [161] "WhatsApp Image 2025-11-14 at 11.24.18 PM.jpeg"
## [162] "WhatsApp Image 2025-11-15 at 4.58.13 PM.jpeg"
## [163] "WhatsApp Image 2025-11-15 at 8.37.29 PM.jpeg"
## [164] "yahoo_data 2.xlsx"
## [165] "yahoo_data.xlsx"
Now, we deploy our model using a Shiny app.
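A minimal sketch of what such an app could look like, assuming app.R sits next to the saved .rds file and the inputs mirror the training columns (the labels and default values below are illustrative, not from the original project):
library(shiny)

# Load the fitted multiple regression model saved above
model <- readRDS("my_multiple_linear_regression_model.rds")

ui <- fluidPage(
  titlePanel("Precipitation predictor"),
  numericInput("relative_humidity", "Relative humidity (%)", value = 67),
  numericInput("dry_bulb_temp_f", "Dry-bulb temperature (F)", value = 56),
  numericInput("wind_speed", "Wind speed (mph)", value = 10),
  numericInput("station_pressure", "Station pressure (in Hg)", value = 29.99),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderPrint({
    # Assemble a one-row data frame with the same column names used in training
    newdata <- data.frame(
      relative_humidity = input$relative_humidity,
      dry_bulb_temp_f   = input$dry_bulb_temp_f,
      wind_speed        = input$wind_speed,
      station_pressure  = input$station_pressure
    )
    predict(model, newdata = newdata)
  })
}

shinyApp(ui = ui, server = server)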