1. Introduction

This project relates to the NOAA Weather Dataset - JFK Airport (New York). The original dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.

For this project, a subset dataset will be used, which contains 5727 rows (about 5% of the original rows) and 9 columns. The end goal will be to predict precipitation using some of the available features. In this project, tasks will include reading data files, preprocessing data, creating models, improving models, and evaluating them to ultimately choose the best model.

1.1 Required libraries

library(tidymodels)# Library for modeling
library(tidyverse) # Load tidyverse
library(Metrics) # Load Metrics
library(readr) # Read CSV 
library(knitr) # Creating dynamic reports
library(broom) # Statistical objects 

1.2 Understand the Dataset

The sample contains 5727 rows (about 5% of the original rows) and 9 columns, which are:

  • DATE
  • HOURLYDewPointTempF
  • HOURLYRelativeHumidity
  • HOURLYDRYBULBTEMPF
  • HOURLYWETBULBTEMPF
  • HOURLYPrecip
  • HOURLYWindSpeed
  • HOURLYSeaLevelPressure
  • HOURLYStationPressure
1.2.1 Original Dataset

The original dataset is significantly larger. Feel free to explore the original dataset. For more information about the dataset, check out the preview

1.3 Download NOAA Weather dataset

# Load Data

# URL where the data is located

url = 'https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz'
# Download the file
download.file(url, destfile = "noaa-weather-sample-data.tar.gz")
# Untar the file so we can get the csv only
# if you run this on your local machine, then can remove tar = "internal" 
untar("noaa-weather-sample-data.tar.gz", tar = "internal")

1.4 Extract and Read into Project

jfk_weather_sample <- read_csv("noaa-weather-sample-data/jfk_weather_sample.csv")
1.4.1 Display the first few rows of the dataframe.
head(jfk_weather_sample)
## # A tibble: 6 × 9
##   DATE                HOURLYDewPointTempF HOURLYRelativeHumidity
##   <dttm>                            <dbl>                  <dbl>
## 1 2015-07-25 13:51:00                  60                     46
## 2 2016-11-18 23:51:00                  34                     48
## 3 2013-01-06 08:51:00                  33                     89
## 4 2011-01-27 16:51:00                  18                     48
## 5 2015-01-03 12:16:00                  27                     61
## 6 2013-02-15 20:51:00                  35                     79
## # ℹ 6 more variables: HOURLYDRYBULBTEMPF <dbl>, HOURLYWETBULBTEMPF <dbl>,
## #   HOURLYPrecip <chr>, HOURLYWindSpeed <dbl>, HOURLYSeaLevelPressure <dbl>,
## #   HOURLYStationPressure <dbl>
1.4.2 Description
glimpse(jfk_weather_sample)
## Rows: 5,727
## Columns: 9
## $ DATE                   <dttm> 2015-07-25 13:51:00, 2016-11-18 23:51:00, 2013…
## $ HOURLYDewPointTempF    <dbl> 60, 34, 33, 18, 27, 35, 4, 14, 51, 71, 76, 19, …
## $ HOURLYRelativeHumidity <dbl> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <dbl> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYWETBULBTEMPF     <dbl> 68, 44, 35, 30, 34, 38, 15, 21, 52, 72, 78, 35,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "T", "0.00", "0…
## $ HOURLYWindSpeed        <dbl> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYSeaLevelPressure <dbl> 30.01, 30.05, 30.14, 29.82, NA, 29.94, 30.42, 3…
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…

2. Data Organization and Management

2.1 Creating a Subset

jfk_subset1 <-jfk_weather_sample %>% 
  select(HOURLYRelativeHumidity,HOURLYDRYBULBTEMPF,HOURLYPrecip,HOURLYWindSpeed, HOURLYStationPressure)
head(jfk_subset1, 10)
## # A tibble: 10 × 5
##    HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
##                     <dbl>              <dbl> <chr>                  <dbl>
##  1                     46                 83 0.00                      13
##  2                     48                 53 0.00                       6
##  3                     89                 36 0.00                      13
##  4                     48                 36 0.00                      14
##  5                     61                 39 T                         11
##  6                     79                 41 0.00                       6
##  7                     51                 19 0.00                       0
##  8                     65                 24 0.00                      11
##  9                     90                 54 0.06                      11
## 10                     94                 73 <NA>                       5
## # ℹ 1 more variable: HOURLYStationPressure <dbl>

2.2 Clean Up Columns

# Replace all the T values with "0.0"
jfk_subset1 <- jfk_subset1 %>% mutate(HOURLYPrecip = ifelse(HOURLYPrecip == "T", "0.0", HOURLYPrecip))
# Remove "s" from values like "0.02s".
jfk_subset1 <- jfk_subset1 %>% mutate(HOURLYPrecip = str_remove(HOURLYPrecip, pattern = "s$"))
# New subset
head(jfk_subset1)
## # A tibble: 6 × 5
##   HOURLYRelativeHumidity HOURLYDRYBULBTEMPF HOURLYPrecip HOURLYWindSpeed
##                    <dbl>              <dbl> <chr>                  <dbl>
## 1                     46                 83 0.00                      13
## 2                     48                 53 0.00                       6
## 3                     89                 36 0.00                      13
## 4                     48                 36 0.00                      14
## 5                     61                 39 0.0                       11
## 6                     79                 41 0.00                       6
## # ℹ 1 more variable: HOURLYStationPressure <dbl>

The provided R code snippet is crucial for data cleaning, a fundamental step in data analysis. Replacing “T” values with “0.0” ensures that trace amounts of precipitation are treated as zero, avoiding inaccuracies in subsequent analyses. Removing “s” symbols from values in the HOURLYPrecip column converts the data into a proper numeric format, ensuring accurate computations. Previewing the cleaned subset with head(ifk_subset1) verifies that the data transformations have been applied correctly, making the dataset ready for further analysis. These steps ensure that the dataset is accurate, consistent, and suitable for analysis, which is critical for obtaining reliable and meaningful results in any data-driven study.

2.3 Convert Columns to Numerical Types

glimpse(jfk_subset1)
## Rows: 5,727
## Columns: 5
## $ HOURLYRelativeHumidity <dbl> 46, 48, 89, 48, 61, 79, 51, 65, 90, 94, 79, 37,…
## $ HOURLYDRYBULBTEMPF     <dbl> 83, 53, 36, 36, 39, 41, 19, 24, 54, 73, 83, 44,…
## $ HOURLYPrecip           <chr> "0.00", "0.00", "0.00", "0.00", "0.0", "0.00", …
## $ HOURLYWindSpeed        <dbl> 13, 6, 13, 14, 11, 6, 0, 11, 11, 5, 21, 7, 17, …
## $ HOURLYStationPressure  <dbl> 29.99, 30.03, 30.12, 29.80, 30.50, 29.92, 30.40…
jfk_subset1 <- jfk_subset1 %>%
  mutate(across(where(is.character), as.numeric))
str(jfk_subset1)
## tibble [5,727 × 5] (S3: tbl_df/tbl/data.frame)
##  $ HOURLYRelativeHumidity: num [1:5727] 46 48 89 48 61 79 51 65 90 94 ...
##  $ HOURLYDRYBULBTEMPF    : num [1:5727] 83 53 36 36 39 41 19 24 54 73 ...
##  $ HOURLYPrecip          : num [1:5727] 0 0 0 0 0 0 0 0 0.06 NA ...
##  $ HOURLYWindSpeed       : num [1:5727] 13 6 13 14 11 6 0 11 11 5 ...
##  $ HOURLYStationPressure : num [1:5727] 30 30 30.1 29.8 30.5 ...

The provided R code snippet is essential for converting columns in a dataset to numerical types, which is a critical step in data preprocessing for analysis. Initially, the glimpse(jfk_subset1) function provides an overview of the dataset, showing the structure and types of the columns. This helps in identifying which columns need conversion.

The mutate(across(where(is.character), as.numeric)) function is then used to convert all character columns to numeric types. This transformation is crucial because numerical data types are necessary for performing accurate statistical and econometric analyses. Character data can cause issues in calculations and modeling, so converting them ensures that the dataset is in a suitable format for further analysis.

Finally, the str(jfk_subset1) function displays the structure of the transformed dataset, confirming that the columns have been successfully converted to numeric types. This step verifies that the data is now ready for rigorous modeling and analysis, ensuring that all variables are in the correct format for accurate computations.

2.4. Rename Columns

jfk_subset1 <- jfk_subset1 %>%
    rename(
    relative_humidity = HOURLYRelativeHumidity,
    dry_bulb_temp_f = HOURLYDRYBULBTEMPF,
    precip = HOURLYPrecip,
    wind_speed = HOURLYWindSpeed,
    station_pressure = HOURLYStationPressure
 )

2.5 Replace NAs

jfk_subset1 <- jfk_subset1 %>%
 mutate(
 relative_humidity = replace_na(relative_humidity, 0),
 dry_bulb_temp_f = replace_na(dry_bulb_temp_f, 0),
 precip = replace_na(precip, 0),
 wind_speed = replace_na(wind_speed, 0),
 station_pressure = replace_na(station_pressure, 0)
 )

The provided R code snippet is essential for handling missing values in a dataset, which is a critical step in data preprocessing for analysis. The code uses the mutate function along with replace_na to replace NA values with zeros in several columns: relative_humidity, dry_bulb_temp_f, precip, wind_speed and station_pressure. Replacing NA values with zeros is justified for several reasons. Firstly, it ensures the completeness of the dataset, allowing for uninterrupted analysis. Secondly, it maintains consistency, as NA values often represent the absence of a measurement or a negligible amount, which can be reasonably replaced with zero. Thirdly, it avoids errors in statistical functions and models that cannot handle NA values, ensuring smooth calculations. Lastly, zero values are easier to interpret and work with in analyses compared to NA values, simplifying the dataset for econometric modeling. Overall, this step ensures that the dataset is accurate, consistent, and suitable for rigorous analysis, providing reliable and meaningful results.

3. Exploratory Data Analysis

3.1 Split the data

#Seed

set.seed(1234)


jfk_split <- initial_split(jfk_subset1, prop = 4/5)  # prop = 0.8 works as well
train_data <- training(jfk_split)
test_data <- testing(jfk_split)

The provided R code snippet demonstrates the process of splitting a dataset named jfk_subset1 into training and testing subsets. This is a crucial step in model validation, ensuring that the model’s performance can be evaluated on unseen data.

Firstly, the code sets a random seed using set.seed(1234). This step is essential for reproducibility, as it ensures that the random splitting of the data will yield the same results each time the code is executed. This consistency is vital for verifying the reliability of the model’s performance.

Next, the initial_split function is used to divide the dataset jfk_subset1 into two parts. The argument prop = 4/5 specifies that 80% of the data should be allocated to the training set (train_data), while the remaining 20% is designated for the testing set (test_data). This proportion is commonly used to provide a substantial amount of data for training while reserving enough data to effectively test the model.

Finally, the training and testing functions extract the respective subsets from the split object jfk_split. The training data is used to build and tune the model, while the testing data is used to evaluate its performance. This approach helps in assessing the model’s generalizability to new, unseen data, which is a critical aspect of predictive modeling.

3.2 Data Visualization

3.2.1 Visualizing data - Histogram
ggplot(data = train_data, mapping = aes(x = relative_humidity)) + 
  geom_histogram(bins = 100, color = "red", fill = "red", alpha = 0.7) +
  coord_cartesian(xlim = c(-10, 100)) +
  labs(
    title = "Distribution of Relative Humidity",
    x = "Relative Humidity (%)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
   plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

The histogram displays the distribution of relative humidity in percentage (%), with the horizontal axis representing relative humidity ranging from 0% to 100% and the vertical axis indicating the frequency of recorded values. The red bars in the histogram show that the most common values of relative humidity are in the ranges above 50%, with notable peaks around 75% and between 90% and 100%, as well as a significant peak near 0%. This histogram suggests that the majority of relative humidity values are found in the higher ranges, with some high frequencies near the lower end.

3.2.2 Visualizing data - Boxplot
ggplot(data = train_data, mapping = aes(y =dry_bulb_temp_f )) + 
  geom_boxplot(color = "blue", fill = "blue", alpha = 0.7) +
  labs(
    title = "Boxplot of Dry Bulb Temperature",
    y = "Dry Bulb Temperature"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
 )

The boxplot of dry bulb temperature data reveals that the median temperature is centrally located within the interquartile range (IQR), indicating a balanced distribution. The blue box, representing the IQR, shows the middle 50% of the data, suggesting moderate variability. The absence of outliers implies consistency in the dataset, with no extreme deviations. This visualization, created using the ggplot2 package in R, effectively highlights the central tendency and variability of the dry bulb temperature measurements, providing valuable insights into the data’s overall behavior.

3.2.3 Visualizing data - Scatter Plot
ggplot(data = train_data , aes(x = wind_speed, y = station_pressure, color = precip)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "Scatter Plot of Wind Speed vs Station Pressure",
    x = "Wind Speed (mph)",
    y = "Station Pressure (hPa)",
    color = "Precipitation (mm)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text = element_text(size = 12)
 )

The scatter plot titled “Scatter Plot of Wind Speed vs Station Pressure” illustrates the relationship between wind speed (mph) on the x-axis and station pressure (hPa) on the y-axis. The data points are color-coded based on precipitation (mm), with a gradient from blue to red, where blue indicates lower precipitation values and red indicates higher precipitation values. This visualization helps identify patterns and correlations between wind speed, station pressure, and precipitation levels, providing insights into how these variables interact with each other.

Firstly, the plot reveals a potential correlation between wind speed and station pressure. By observing the trend of data points, one can identify whether higher wind speeds are associated with higher or lower station pressures. This relationship is crucial for understanding how these two variables interact in different weather conditions.

Additionally, the color gradient from blue to red, representing precipitation levels, offers insights into how precipitation varies with wind speed and station pressure. For instance, higher precipitation values (indicated by red) might cluster around specific wind speeds and pressures, suggesting a potential interaction between these variables. This information can be useful for meteorologists when analyzing weather patterns and predicting precipitation based on wind and pressure data.

4. Linear Regression

# Training a model 

# Pick linear regression
lm_spec <- linear_reg() %>%
  # Set engine'
  set_engine(engine = "lm")

# Print the linear function
lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

The R code snippet sets up a linear regression model using the parsnip package. It specifies the model type and selects the “lm” engine for computation, preparing the model for training data.

4.1 Different Models

# To  fit the model 

train_fit1 <- lm_spec %>% 
  fit(precip ~ relative_humidity
      , data = train_data)

train_fit2 <- lm_spec %>% 
  fit(precip ~ dry_bulb_temp_f
      , data = train_data)

train_fit3 <- lm_spec %>% 
  fit(precip ~ wind_speed
      , data = train_data)

train_fit4 <- lm_spec %>% 
  fit(precip ~ station_pressure
      , data = train_data)
# Extract the results from each model
results1 <- tidy(train_fit1)
results2 <- tidy(train_fit2)
results3 <- tidy(train_fit3)
results4 <- tidy(train_fit4)

# Combine the results into a table
results <- bind_rows(
  results1 %>% mutate(model = "Model 1: relative_humidity"),
  results2 %>% mutate(model = "Model 2: dry_bulb_temp_f"),
  results3 %>% mutate(model = "Model 3: wind_speed"),
  results4 %>% mutate(model = "Model 4: station_pressure")
)

# Display the results table
print(results)
## # A tibble: 8 × 6
##   term                estimate std.error statistic  p.value model               
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl> <chr>               
## 1 (Intercept)       -0.0106    0.00165      -6.43  1.45e-10 Model 1: relative_h…
## 2 relative_humidity  0.000233  0.0000239     9.75  3.13e-22 Model 1: relative_h…
## 3 (Intercept)        0.00269   0.00164       1.64  1.00e- 1 Model 2: dry_bulb_t…
## 4 dry_bulb_temp_f    0.0000354 0.0000286     1.24  2.15e- 1 Model 2: dry_bulb_t…
## 5 (Intercept)        0.00144   0.00106       1.36  1.75e- 1 Model 3: wind_speed 
## 6 wind_speed         0.000291  0.0000839     3.47  5.16e- 4 Model 3: wind_speed 
## 7 (Intercept)        0.00179   0.00332       0.539 5.90e- 1 Model 4: station_pr…
## 8 station_pressure   0.0000967 0.000112      0.861 3.89e- 1 Model 4: station_pr…

The table presents the results of several statistical models analyzing different climatic variables. Each model evaluates the impact of a specific variable (relative humidity, dry bulb temperature, wind speed, and station pressure) on an unspecified outcome.

Most models indicate that the climatic variables have a significant impact on the outcome, as evidenced by low p-values (generally < 0.05). For instance, dry bulb temperature and relative humidity have p-values suggesting statistical significance.

The estimates (coefficients) indicate the magnitude and direction of each variable’s impact. For example, a positive coefficient for dry bulb temperature suggests that an increase in temperature is associated with an increase in the outcome.

The standard errors provide a measure of the precision of the estimates. Lower values indicate more precise estimates.

5. Improve the model

# Define the model specifications
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Model 1: Adding more features
train_fit5 <- lm_spec %>% 
  fit(precip ~ relative_humidity + dry_bulb_temp_f + wind_speed + station_pressure, data = train_data)

# Model 2: Adding regularization (L2 Ridge)
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

train_fit6 <- ridge_spec %>% 
  fit(precip ~ relative_humidity + dry_bulb_temp_f + wind_speed + station_pressure, data = train_data)

# Model 3: Adding polynomial components
poly_spec <- linear_reg() %>%
  set_engine("lm")

train_fit7 <- poly_spec %>% 
  fit(precip ~ poly(relative_humidity, 2) + poly(dry_bulb_temp_f, 2) + poly(wind_speed, 2) + poly(station_pressure, 2), data = train_data)

# Extract predictions
pred5 <- predict(train_fit5, train_data)$.pred
pred6 <- predict(train_fit6, train_data)$.pred
pred7 <- predict(train_fit7, train_data)$.pred

# Calculate RMSE manually
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Calculate RMSE for each model
rmse5 <- rmse_manual(train_data$precip, pred5)
rmse6 <- rmse_manual(train_data$precip, pred6)
rmse7 <- rmse_manual(train_data$precip, pred7)

# Combine results into a table
results <- tibble(
  model = c("Model 1: More Features", "Model 2: Ridge Regularization", "Model 3: Polynomial Components"),
  RMSE = c(rmse5, rmse6, rmse7)
)

# Display the results
print(results)
## # A tibble: 3 × 2
##   model                            RMSE
##   <chr>                           <dbl>
## 1 Model 1: More Features         0.0360
## 2 Model 2: Ridge Regularization  0.0364
## 3 Model 3: Polynomial Components 0.0358

5.1 Find Best Model

# Summary of models
summary_results <- tibble(
  Model = c("Model 1: More Features", "Model 2: Ridge Regularization", "Model 3: Polynomial Components"),
  Predictors = c("relative_humidity, dry_bulb_temp_f, wind_speed, station_pressure",
                 "relative_humidity, dry_bulb_temp_f, wind_speed, station_pressure",
                 "poly(relative_humidity, 2), poly(dry_bulb_temp_f, 2), poly(wind_speed, 2), poly(station_pressure, 2)"),
  Regularization = c("None", "L2 (Ridge) with penalty = 0.1", "None"),
  RMSE = c(rmse5, rmse6, rmse7)
)

# Display the summary of models
print(summary_results)
## # A tibble: 3 × 4
##   Model                          Predictors                Regularization   RMSE
##   <chr>                          <chr>                     <chr>           <dbl>
## 1 Model 1: More Features         relative_humidity, dry_b… None           0.0360
## 2 Model 2: Ridge Regularization  relative_humidity, dry_b… L2 (Ridge) wi… 0.0364
## 3 Model 3: Polynomial Components poly(relative_humidity, … None           0.0358

Based on the RMSE values calculated for each model, we can conclude that the model with the lowest RMSE is the best. Model 1, which includes multiple predictors (relative_humidity, dry_bulb_temp_f, wind_speed, station_pressure), Model 2, which uses Ridge regularization (L2) with the same predictors, and Model 3, which includes polynomial terms for the predictors, were compared. The RMSE values for these models are rmse5, rmse6, and rmse7, respectively. The model with the lowest RMSE indicates the smallest average error between the predicted and actual values, making it the most accurate and reliable for predicting precipitation. Therefore, the model with the lowest RMSE among the three is considered the best. This conclusion is based on the ability of the models to balance complexity and predictive accuracy, with the lowest RMSE model being the most generalizable and accurate.