Homework 1

Question 1:

A machine learning model is only as good as the data it is fed, as the popular saying "garbage in, garbage out" captures. A crucial practice is therefore data pre-processing and cleaning. Real datasets often contain duplicates, errors, and missing observations that can significantly impact model performance, so a thorough exploration of missing values, duplicates, and data inconsistencies, followed by appropriate handling strategies such as imputation, removal, or transformation, is very important. Another important practice is randomly splitting the data into training and test sets. This separation is fundamental because we need an unbiased estimate of how the model will perform on unseen data: the training set is used to fit the model, while the test set serves as a final evaluation to estimate the test error rate. Without this separation, we risk overfitting, where the model works well on the training data but fails on unseen test data.
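
As a minimal sketch of such a random split in R (illustrative only; the data frame df below is synthetic, not from the assignment):

set.seed(42)                                                 # reproducible split
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)                                # toy response for illustration
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))  # random 80/20 split
train <- df[train_idx, ]                                     # used to fit the model
test  <- df[-train_idx, ]                                    # held out for the final error estimate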

Cross-validation and model selection represent another critical best practice when working with real data. Instead of relying solely on a single train-test split, techniques like k-fold cross-validation provide more robust estimates of model performance by repeatedly training and testing on different subsets of the data. This approach is particularly valuable for model selection: choosing between different algorithms or tuning hyperparameters. The goal is to find the right balance in the bias-variance tradeoff: models that are too simple (high bias) underfit the data, while overly complex models (high variance) overfit. Cross-validation helps identify the sweet spot where the model captures the true underlying patterns without fitting to noise, ultimately leading to better performance in real-world applications.
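
A minimal sketch of 5-fold cross-validation, reusing the synthetic df above (the fold count and the linear model are illustrative assumptions):

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))   # random fold assignment
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = df[folds != i, ])        # train on the other k-1 folds
  pred <- predict(fit, newdata = df[folds == i, ]) # predict the held-out fold
  mean((df$y[folds == i] - pred)^2)                # test MSE for this fold
})
mean(cv_mse)                                       # cross-validated error estimate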

Question 2:

Data description of the storms dataset: The storms dataset, included in the dplyr package, provides observational data on hurricanes and tropical storms from the National Oceanic and Atmospheric Administration's (NOAA) Atlantic hurricane database (HURDAT2).

The data spans 1975 to 2022. Each observation represents a specific storm's position and intensity at a six-hour interval. The spatial scale is the Atlantic basin, with storm locations recorded by latitude and longitude. The dataset contains 19,537 observations and 13 variables: name (storm name); year, month, day (date of report); hour (hour of report, in UTC); lat, long (location of the storm center); status (storm classification); category (Saffir-Simpson hurricane category); wind (maximum sustained wind speed, in knots); pressure (air pressure at the storm's center, in millibars); tropicalstorm_force_diameter (diameter, in nautical miles, of the area experiencing tropical-storm-force winds); and hurricane_force_diameter (diameter, in nautical miles, of the area experiencing hurricane-force winds).

The data can be used for a range of analyses, such as tracing storm paths, examining storm trends across different months, and building a prediction model for storm categories based on temporal and spatial data. It is therefore useful in areas such as risk assessment for disaster management, meteorological modeling, and climate research. This dataset exhibits high reliability, as its source is the National Hurricane Center's HURDAT2 database, the official record of Atlantic tropical cyclones maintained by NOAA. One possible cause for concern, however, is that storms in earlier years (before 1979) have some missing data.

Question 3:

library(dplyr)
library(ggplot2)

storms_data <- storms

cat("Rows:", nrow(storms_data), "Columns:", ncol(storms_data), "\n")
## Rows: 19537 Columns: 13
cat("Number of NA values :", sum(is.na(storms_data)))
## Number of NA values : 33758
# Basic structure information
str(storms_data)
## tibble [19,537 × 13] (S3: tbl_df/tbl/data.frame)
##  $ name                        : chr [1:19537] "Amy" "Amy" "Amy" "Amy" ...
##  $ year                        : num [1:19537] 1975 1975 1975 1975 1975 ...
##  $ month                       : num [1:19537] 6 6 6 6 6 6 6 6 6 6 ...
##  $ day                         : int [1:19537] 27 27 27 27 28 28 28 28 29 29 ...
##  $ hour                        : num [1:19537] 0 6 12 18 0 6 12 18 0 6 ...
##  $ lat                         : num [1:19537] 27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
##  $ long                        : num [1:19537] -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
##  $ status                      : Factor w/ 9 levels "disturbance",..: 7 7 7 7 7 7 7 7 8 8 ...
##  $ category                    : num [1:19537] NA NA NA NA NA NA NA NA NA NA ...
##  $ wind                        : int [1:19537] 25 25 25 25 25 25 25 30 35 40 ...
##  $ pressure                    : int [1:19537] 1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
##  $ tropicalstorm_force_diameter: int [1:19537] NA NA NA NA NA NA NA NA NA NA ...
##  $ hurricane_force_diameter    : int [1:19537] NA NA NA NA NA NA NA NA NA NA ...

Based on the number of rows and columns, this dataset is large enough for comprehensive analysis. However, there is a significant number of NA values, so we can take a deeper dive into the variables that have missing values to see whether they are really a cause for concern, and then consider ways to handle them.

# NA counts for each column
na_by_column <- sapply(storms_data, function(x) sum(is.na(x)))
print(paste("NA values of ", names(na_by_column),":", na_by_column))
##  [1] "NA values of  name : 0"                           
##  [2] "NA values of  year : 0"                           
##  [3] "NA values of  month : 0"                          
##  [4] "NA values of  day : 0"                            
##  [5] "NA values of  hour : 0"                           
##  [6] "NA values of  lat : 0"                            
##  [7] "NA values of  long : 0"                           
##  [8] "NA values of  status : 0"                         
##  [9] "NA values of  category : 14734"                   
## [10] "NA values of  wind : 0"                           
## [11] "NA values of  pressure : 0"                       
## [12] "NA values of  tropicalstorm_force_diameter : 9512"
## [13] "NA values of  hurricane_force_diameter : 9512"

Even though about 13% of all values are NA, they are concentrated in category, hurricane_force_diameter, and tropicalstorm_force_diameter. For category, NA means the observation was not a hurricane, so we should treat it as its own class rather than try to fix it. The genuinely missing data are tropicalstorm_force_diameter and hurricane_force_diameter, which are only available from 2004 onward. We can use the data that is present to fill in the missing values with MICE (multivariate imputation by chained equations): it uses the other variables to predict the diameters and generates realistic values, which is a better approximation than mean imputation or simply omitting rows.
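
A minimal sketch of how this imputation could look with the mice package (illustrative: the predictor selection, number of imputations, and the predictive-mean-matching method are assumptions, not prescribed settings):

library(mice)

# Keep only variables plausibly predictive of the diameters (an assumption)
diam_vars <- storms_data %>%
  select(year, month, lat, long, wind, pressure,
         tropicalstorm_force_diameter, hurricane_force_diameter)

imp <- mice(diam_vars, m = 5, method = "pmm", seed = 1)  # chained-equation imputation
storms_completed <- complete(imp)                        # extract one completed dataset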

Question 4:

variable_types <- data.frame(
  R_Type = sapply(storms_data, class)
)
print(variable_types)
##                                 R_Type
## name                         character
## year                           numeric
## month                          numeric
## day                            integer
## hour                           numeric
## lat                            numeric
## long                           numeric
## status                          factor
## category                       numeric
## wind                           integer
## pressure                       integer
## tropicalstorm_force_diameter   integer
## hurricane_force_diameter       integer

So the qualitative variables are: name (character), status (factor), and category (stored as numeric, but ordinal in meaning, since it encodes the Saffir-Simpson category).

The quantitative variables are: year, month, hour (numeric); day (integer); lat, long (numeric); wind (integer); pressure (integer); and tropicalstorm_force_diameter, hurricane_force_diameter (integer).

Question 5:

cat5_storms <- storms_data %>%
  filter(category == 5)

cat5_before_2000 <- cat5_storms %>%
  filter(year < 2000) %>%
  distinct(name, year) %>%  # collapse multiple six-hour readings of the same storm
  nrow()


cat5_after_2000 <- cat5_storms %>%
  filter(year >= 2000) %>%
  distinct(name, year) %>%  # collapse multiple six-hour readings of the same storm
  nrow()

cat("Number of Category-5 hurricanes before 2000:", cat5_before_2000, "\n")
## Number of Category-5 hurricanes before 2000: 7
cat("Number of Category-5 hurricanes during or after 2000:", cat5_after_2000, "\n")
## Number of Category-5 hurricanes during or after 2000: 15

Question 6:

tdp <- storms_data %>%
  filter(status == "tropical depression")

tdp_1990s <- tdp %>%
  filter(year >= 1990 & year < 2000) %>%
  distinct(name, year) %>%  # collapse repeated readings; names are re-used across years
  nrow()

cat("Number of Tropical Depressions in 1990s :", tdp_1990s , "\n")
## Number of Tropical Depressions in 1990s : 124
upgraded_to_hurricane <- storms_data %>%
  filter(year >= 1990 & year <= 1999) %>%
  group_by(name, year) %>%
  summarise(
    had_td = any(status == "tropical depression"),
    became_hurricane = any(status == "hurricane"),
    .groups = "drop"
  ) %>%
  filter(had_td & became_hurricane) %>%  # storms recorded at both statuses
  nrow()

cat ("Number of these tropical depressions that were eventually upgraded to hurricanes:", upgraded_to_hurricane, "\n")
## Number of these tropical depressions that were eventually upgraded to hurricanes: 61

Question 7:

major_hurricanes_sep <- storms_data %>%
  filter(month == 9, category %in% c(3, 4, 5))  # September major hurricanes (category 3+)

# Create scatter plot
ggplot(major_hurricanes_sep, aes(x = pressure, y = wind, color = as.factor(category))) +
  geom_point(alpha = 0.7, size = 2) +
  labs(
    title = "Wind Speed vs Air Pressure for Major Hurricanes in September",
    x = "Air Pressure (millibars)",
    y = "Wind Speed (knots)",
    color = "Hurricane Category"
  ) +
  theme_light() +
  theme(text = element_text(size = 12)) +
  scale_color_manual(values = c("3" = "#228B22", "4" = "#ff7f0e", "5" = "#d62728"))

From the plot we can conclude that air pressure and wind speed are strongly negatively related: the lower the central pressure, the higher the wind speed. This makes physical sense, as air flows toward a low-pressure zone, and a deeper low (a steeper pressure gradient) drives stronger winds. We also see that higher wind speeds correspond to higher hurricane categories, which makes sense because the Saffir-Simpson category is defined by maximum sustained wind speed.
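
As a quick check on this negative relationship (not part of the original output), we can compute the correlation directly:

# Correlation between central pressure and wind for the September major hurricanes
cor(major_hurricanes_sep$pressure, major_hurricanes_sep$wind)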

Question 8:

ggplot(storms_data ,
       aes(x = reorder(status, pressure, FUN = median), y = pressure)) +
  geom_boxplot(aes(fill = status), alpha = 0.7) +
  labs(
    title = "Distribution of Atmospheric Pressure Across Storm Statuses",
    x = "Storm Status",
    y = "Atmospheric Pressure (millibars)",
    fill = "Storm Status"
  ) +
  theme_light() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  guides(fill = "none")  # remove redundant legend ("none" replaces the deprecated FALSE)

pressure_variance <- storms_data %>%
  group_by(status) %>%
  summarise(
    variance = var(pressure),
    mean_pressure = mean(pressure),
    .groups = "drop"
  ) %>%
  arrange(desc(variance))
print(pressure_variance)
## # A tibble: 9 × 3
##   status                 variance mean_pressure
##   <fct>                     <dbl>         <dbl>
## 1 hurricane                349.            969.
## 2 extratropical            204.            993.
## 3 subtropical storm         52.4           998.
## 4 tropical storm            47.9           999.
## 5 other low                 27.7          1009.
## 6 disturbance               15.7          1009.
## 7 tropical depression       15.0          1008.
## 8 subtropical depression    11.9          1008.
## 9 tropical wave              3.38         1009.

From the table we can see that hurricanes have the highest pressure variance (about 349) and tropical waves the lowest (about 3.4). A general trend emerges: the variance of the air pressure decreases as the mean air pressure increases. Hurricanes also span Saffir-Simpson categories 1 through 5 (roughly 64 to 137+ knots), a huge range of intensities and hence of central pressures, which explains their high variance.
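
As a quick check on this wide intensity range (not part of the original output):

# Minimum and maximum wind speed among observations classified as hurricanes
range(storms_data$wind[storms_data$status == "hurricane"])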