Congratulations! You have just been hired by a US Weather forecast firm as a data scientist.
The company is considering the weather condition to help predict the possibility of precipitations, which involves using various local climatological variables, including temperature, wind speed, humidity, dew point, and pressure. The data you will be handling was collected by a NOAA weather station located at the John F. Kennedy International Airport in Queens, New York.
Your task is to provide a high level analysis of weather data in JFK Airport. Your stakeholders want to understand the current and historical record of precipitations based on different variables. For now they are mainly interested in a macro-view of JFK Airport Weather, and how it relates to the possibility to rain because it will affect flight delays and etc.
This project relates to the NOAA Weather Dataset - JFK Airport (New York). The original dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.
For this project, you will be using a subset dataset, which contains 5727 rows (about 5% or original rows) and 9 columns. The end goal will be to predict the precipitation using some of the available features. In this project, you will practice reading data files, preprocessing data, creating models, improving models and evaluating them to ultimately choose the best model.
Using this R notebook you will complete 10 tasks:
Below, install "tidymodels", additionally "rlang" should be updated in order to properly run "tidymodels".
# Install tidymodels if you haven't done so
install.packages("rlang")
install.packages("tidymodels")
Note: After installing the packages, restart the kernel. Without installing the packages again, load them. Tidyverse and Tidymodels will be the two main packages you will use.
# Library for modeling
library(tidymodels)
# Load tidyverse
library(tidyverse)
The original NOAA JFK dataset contains 114,546 hourly observations of various local climatological variables (including temperature, wind speed, humidity, dew point, and pressure).
In this project you will use a sample dataset, which is around 293 KB. Link to the sample dataset.
The sample contains 5727 rows (about 5% or original rows) and 9 columns, which are:
The original dataset is much bigger. Feel free to explore the original dataset. Link to the original dataset.
For more information about the dataset, checkout the preview of NOAA Weather - JFK Airport.
Use the download.file() function to download the sample
dataset from the URL below.
url <- "https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz"
download.file(url,destfile = "noaa-weather-sample-data.tar.gz")
Untar the zipped file.
untar("noaa-weather-sample-data.tar.gz",tar= "internal")
We start by reading in the raw dataset. You should specify the file name as "noaa-weather-sample-data/jfk_weather_sample.csv".
noaa_weather <- read_csv("noaa-weather-sample-data/jfk_weather_sample.csv")
Next, display the first few rows of the dataframe.
head(noaa_weather)
DATE HOURLYDewPointTempF HOURLYRelativeHumidity HOURLY…¹ HOURL…² HOURL…³ HOURL…⁴ HOURL…⁵ HOURL…⁶
<dttm> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 2015-07-25 13:51:00 60 46 83 68 0.00 13 30.0 30.0
2 2016-11-18 23:51:00 34 48 53 44 0.00 6 30.0 30.0
3 2013-01-06 08:51:00 33 89 36 35 0.00 13 30.1 30.1
4 2011-01-27 16:51:00 18 48 36 30 0.00 14 29.8 29.8
5 2015-01-03 12:16:00 27 61 39 34 T 11 NA 30.5
6 2013-02-15 20:51:00 35 79 41 38 0.00 6 29.9
# … with abbreviated variable names ¹HOURLYDRYBULBTEMPF, ²HOURLYWETBULBTEMPF, ³HOURLYPrecip, ⁴HOURLYWindSpeed,
# ⁵HOURLYSeaLevelPressure, ⁶HOURLYStationPressure
Also, take a glimpse of the dataset to see the different
column data types and make sure it is the correct subset dataset with
about 5700 rows and 9 columns.
glimpse(noaa_weather)
Rows: 5,727
Columns: 9
$ DATE <dttm> 2015-07-25 13:51:00, 2016-11-18 23:51:00, 2013-01-06 08:51:00, 2011-01-27 16:51…
$ HOURLYDewPointTempF <dbl> 60, 34, 33, 18, 27, 35, 4, 14, 51, 71, 76, 19, 48, 56, 70, 23, 45, 50, 30, 36, 6…
$ HOURLYRelativeHumidity <dbl> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37, 72, 47, 84, 57, 37, 75, 92, 79, …
$ HOURLYDRYBULBTEMPF <dbl> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44, 57, 78, 75, 37, 73, 58, 32, 42, …
$ HOURLYWETBULBTEMPF <dbl> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35, 52, 65, 72, 32, 58, 54, 31, 39, …
$ HOURLYPrecip <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0.00", "0.00", "0.06", NA, NA, "0.…
$ HOURLYWindSpeed <dbl> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, 8, 3, 11, 13, 9, 9, 3, 18, 6, 10,…
$ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 30.37, 30.05, 29.94, NA, 30.27, 29…
$ HOURLYStationPressure <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40, 30.35, 30.03, 29.91, 29.75, 30.…
The end goal of this project will be to predict
HOURLYprecip (precipitation) using a few other variables.
Before you can do this, you first need to preprocess the dataset.
Section 3 to section 6 focuses on preprocessing.
The first step in preprocessing is to select a subset of data columns and inspect the column types.
The key columns that we will explore in this project are:
Data Glossary:
Select those five columns and store the modified
dataframe as a new variable.
noaa_weather_sub <- noaa_weather %>% select(HOURLYRelativeHumidity,
HOURLYDRYBULBTEMPF,
HOURLYPrecip,
HOURLYWindSpeed,
HOURLYStationPressure)
Show the first 10 rows of this new dataframe.
head(noaa_weather_sub,10)
# A tibble: 10 × 5
HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed HOURLYStationPressure
<dbl> <dbl> <chr> <dbl> <dbl>
1 46 83 0.00 13 30.0
2 48 53 0.00 6 30.0
3 89 36 0.00 13 30.1
4 48 36 0.00 14 29.8
5 61 39 T 11 30.5
6 79 41 0.00 6 29.9
7 51 19 0.00 0 30.4
8 65 24 0.00 11 30.4
9 90 54 0.06 11 30.0
10 94 73 NA 5 29.9
From the dataframe preview above, we can see that the column
HOURLYPrecip - which is the hourly measure of precipitation
levels - contains both NA and T values.
T specifies trace amounts of precipitation
(meaning essentially no precipitation), while NA means
not available, and is used to denote missing values.
Additionally, some values also have "s" at the end of them, indicating
that the precipitation was snow.
Inspect the unique values present in the column
HOURLYPrecip (with unique(dataframe$column))
to see these values.
unique(noaa_weather_sub$HOURLYPrecip)
[1] "0.00" "T" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07" "0.16" "0.09" "0.22" "0.02s"
[14] "0.24" "0.18" "0.05" "0.04" "0.09s" "0.11" "0.14" "0.25" "0.10" "0.01s" "0.58" "0.12" "0.13"
[27] "0.46" "1.07" "1.19" "0.34" "0.20" "0.36s" "0.42" "0.17" "0.27" "0.35" "0.31" "0.33" "0.23"
[40] "0.26" "0.28" "0.75" "0.19" "0.36" "0.03s" "0.07s" "0.54" "0.59" "0.21"
Having characters in values (like the "T" and "s" that you see in the unique values) will cause problems when you create a model because values for precipitation should be numerical. So you need to fix these values that have characters.
Now, for the column HOURLYPrecip:
T values with "0.0" andstr_remove(column, pattern = "s$") to remove the character
"s" from the end of values. The "$" tells R to match to the end of
values. The pattern is a regex pattern. Look at here
for more information about regex and matching to strings in R.Remember that you can use tidyverse's
mutate() to update columns.
You can check your work by checking if unique values of
HOURLYPrecip still contain any T or
s. Store the modified dataframe as a new variable.
noaa_weather_df <- noaa_weather_sub %>%
mutate(HOURLYPrecip = as.character(str_replace(str_remove_all(
HOURLYPrecip, "[*s]"), "T", "0")))
unique(noaa_weather_df$HOURLYPrecip)
[1] "0.00" "0" "0.06" NA "0.03" "0.02" "0.08" "0.01" "0.07" "0.16" "0.09" "0.22" "0.24" "0.18" "0.05"
[16] "0.04" "0.11" "0.14" "0.25" "0.10" "0.58" "0.12" "0.13" "0.46" "1.07" "1.19" "0.34" "0.20" "0.36" "0.42"
[31] "0.17" "0.27" "0.35" "0.31" "0.33" "0.23" "0.26" "0.28" "0.75" "0.19" "0.54" "0.59" "0.21"
Now that you have removed the characters in the
HOURLYPrecip column, you can safely covert the column to a
numeric type.
First, check the types of the columns. You will notice that all are
dbl (double or numeric) except for
HOURLYPrecip, which is chr (character or
string). Use the glimpse function from Tidyverse.
str(noaa_weather_df)
tibble [5,727 × 5] (S3: tbl_df/tbl/data.frame)
$ HOURLYRelativeHumidity: num [1:5727] 46 48 89 48 61 79 51 65 90 94 ...
$ HOURLYDRYBULBTEMPF : num [1:5727] 83 53 36 36 39 41 19 24 54 73 ...
$ HOURLYPrecip : chr [1:5727] "0.00" "0.00" "0.00" "0.00" ...
$ HOURLYWindSpeed : num [1:5727] 13 6 13 14 11 6 0 11 11 5 ...
$ HOURLYStationPressure : num [1:5727] 30 30 30.1 29.8 30.5 ...
Convert HOURLYPrecip to the numeric type
and store the cleaned dataframe as a new variable.
noaa_weather_numeric <- as.data.frame(sapply(noaa_weather_df, as.numeric))
We can now see that all fields have numerical data type.
str(noaa_weather_numeric)
'data.frame': 5727 obs. of 5 variables:
$ HOURLYRelativeHumidity: num 46 48 89 48 61 79 51 65 90 94 ...
$ HOURLYDRYBULBTEMPF : num 83 53 36 36 39 41 19 24 54 73 ...
$ HOURLYPrecip : num 0 0 0 0 0 0 0 0 0.06 NA ...
$ HOURLYWindSpeed : num 13 6 13 14 11 6 0 11 11 5 ...
$ HOURLYStationPressure : num 30 30 30.1 29.8 30.5 ...
Let's rename the following columns as:
You can use dplyr::rename(). Then, store the final
dataframe as a new variable.
noaa_weather_clean <- noaa_weather_numeric %>%
rename(relative_humidity = HOURLYRelativeHumidity,
dry_bulb_temp_f = HOURLYDRYBULBTEMPF,
precip = HOURLYPrecip,
wind_speed = HOURLYWindSpeed,
station_pressure = HOURLYStationPressure)
head(noaa_weather_clean)
relative_humidity dry_bulb_temp_f precip wind_speed station_pressure
1 46 83 0 13 29.99
2 48 53 0 6 30.03
3 89 36 0 13 30.12
4 48 36 0 14 29.80
5 61 39 0 11 30.50
6 79 41 0 6 29.92
Now that you have finished preprocessing the dataset, you can can start exploring the columns more.
First, split the data into a training and testing set. Splitting a dataset is done randomly, so to have reproducible results set the seed = 1234. Also, use 80% of the data for training.
set.seed(1234)
noaa_weather_split <- initial_split(noaa_weather_clean, prop = 0.8)
train_data <- training(noaa_weather_split)
test_data <- testing(noaa_weather_split)
Next, looking at just the training set, plot
histograms or box plots of the variables
(relative_humidity, dry_bulb_temp_f,
precip, wind_speed,
station_pressure) for an intial look of their distributions
using tidyverse's ggplot. Leave the testing
set as is because it is good practice to not see the testing set until
evaluating the final model.
ggplot(data = train_data, mapping = aes(x = relative_humidity)) +
geom_histogram(color = "white", fill = "darkred")
ggplot(data = train_data, mapping = aes(x = dry_bulb_temp_f)) +
geom_histogram(color = "white", fill = "darkred")
ggplot(data = train_data, mapping = aes(x = precip)) +
geom_histogram(color = "white", fill = "darkred")
ggplot(data = train_data, mapping = aes(x = wind_speed)) +
geom_histogram(color = "white", fill = "darkred")
ggplot(data = train_data, mapping = aes(x = station_pressure)) +
geom_histogram(color = "white", fill = "darkred")
After exploring the dataset more, you are now ready to start creating
models to predict the precipitation (precip).
Create simple linear regression models where precip is
the response variable and each of relative_humidity,
dry_bulb_temp_f,wind_speed or
station_pressure will be a predictor variable, e.g.
precip ~ relative_humidity,
precip ~ dry_bulb_temp_f, etc. for a total of four simple
models. Additionally, visualize each simple model with a scatter
plot.
linear_model <- lm(precip ~ relative_humidity, data = train_data)
ggplot(data = train_data, mapping = aes(x = relative_humidity, y = precip)) +
geom_point()+ stat_smooth(method = "lm", col = "blue") +
geom_smooth(method = "lm", na.rm = TRUE) +
ggtitle("Relative humidity vs precipitation")
linear_model2 <- lm(precip ~ dry_bulb_temp_f, data = train_data)
ggplot(data = train_data, mapping = aes(x = dry_bulb_temp_f, y = precip)) +
geom_point()+ stat_smooth(method = "lm", col = "blue") +
geom_smooth(method = "lm", na.rm = TRUE)+
ggtitle("Air temperature in F vs precipitation")
linear_model3 <- lm(precip ~ wind_speed, data = train_data)
ggplot(data = train_data, mapping = aes(x = wind_speed, y = precip)) +
geom_point()+ stat_smooth(method = "lm", col = "blue") +
geom_smooth(method = "lm", na.rm = TRUE) +
ggtitle("Wind Speed vs precipitation")
linear_model4 <- lm(precip ~ station_pressure, data = train_data)
ggplot(train_data, aes(x = station_pressure, y = precip)) +
geom_point() + stat_smooth(method = "lm", col = "blue") +
geom_smooth(method = "lm", na.rm = TRUE) +
ggtitle("Station_pressure vs precipitation")
According to the above scatter plot results, it would seem as if the most meaningful correlation is between relative humidity and precipitation. In the next exercise I will be exploring the correlation between multiple factors using multiple linear regression and will be exploring weather a polynomial regression model using a single factor with the purpose of finding the plot which best fits the data using MSE and R-Squared metrics.
Now, try improving the simple models you created in the previous section.
Create at least two more models, each model should use at least one of the different techniques:
Also, for each of the models you create, check the model performance using the training set and a metric like MSE, RMSE, or R-squared.
Consider using tidymodels if you choose to add
regularization and tune lambda.
The multiple linear regression model visible below uses multiple factors in the dataset and to explore their correlation with precipitation.
mlr <- lm(precip ~ station_pressure + relative_humidity + dry_bulb_temp_f, data = train_data)
Multiple Linear Regression Plot
ggplot(train_data, aes(x = relative_humidity + station_pressure + dry_bulb_temp_f, y = precip)) +
geom_point() + stat_smooth(method = "lm", col = "blue")
Calculating the MSE value for the multiple regression model based on the training data set
mse_mlr <- mean(mlr$residuals^2)
mse_mlr
0.0018064
Calculating the R-Squared Value
summary(mlr)$r.squared
0.04170082
The Polynomial regression model visible below is used to describe curvilinear relationships and is helpful for tracing points in the data that are non-linear.
poly_reg <- lm(precip ~ poly(relative_humidity, 15, raw = TRUE), data = train_data)
Polynomial Regression Plot
ggplot(data = train_data, aes(relative_humidity, precip))+
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x,15))
Calculating the MSE value for the polynomial regression model based on the training data set
mse_poly_reg <- mean(poly_reg$residuals^2)
mse_poly_reg
0.001772175
Calculating the R-Squared Value
summary(poly_reg)$r.squared
0.05895911
Compare the regression metrics of each model from section 9 to find the best model overall. To do this,
::: {#554eb37d-add8-49f7-8130-bc5c9f4d89a2 .cell .code}
Multiple Regression Model (Test Data set):
mlr_test <- lm(precip ~ station_pressure + relative_humidity + dry_bulb_temp_f, data = test_data)
mse_mlr_test <- mean(mlr_test$residuals^2)
mse_mlr_test
MSE value of the Multiple Regression Model:
0.001080003
summary(mlr_test)$r.squared
R-Squared value of the Multiple Regression Model:
0.07885748
Polynomial Regression Model:
poly_reg_test <- lm(precip ~ poly(relative_humidity, 15, raw = TRUE), data = test_data)
python
mse_poly_reg_test <- mean(poly_reg_test$residuals^2)
mse_poly_reg_test
MSE value of the Polynomial Regression Model:
0.001028331
summary(poly_reg_test)$r.squared
R-Squared value of the Polynomial Regression Model:
0.122929
model_names <- c("mlr", "poly_reg")
mse_train_error <- c("0.0018064", "0.001772175")
mse_test_error <- c("0.001080003", "0.001028331")
mse_comparison_df <- data.frame(model_names, mse_train_error, mse_test_error)
mse_comparison_df
model_names mse_train_error mse_test_error
1 mlr 0.0018064 0.001080003
2 poly_reg 0.001772175 0.001028331
Rsquared_train <- c("0.04170082", "0.05895911")
Rsquared_test <- c("0.07885748", "0.122929")
Rsquared_comparison_df <- data.frame(model_names, Rsquared_train, Rsquared_test)
Rsquared_comparison_df
model_names Rsquared_train Rsquared_test
1 mlr 0.04170082 0.07885748
2 poly_reg 0.05895911 0.122929
The Polynomial Regression model (poly_reg) is the best overall model due to its higher R-squared values for both the training and the testing data set while having a lower MSE (mean square error) when compared to the Multiple Linear Regression Model indicating that it is a better performing model.
However in the context of predicting precipitation, these data plots showed a weak correlation between weather variables and precipitation resulting in inconclusive predictions.