Project 2 - Dataset 3

Tidying

Load the Dataset

The following code loads the data set into r and fills NA into empty cells.

data3 <- read.csv("https://raw.githubusercontent.com/mraynolds/data_607/refs/heads/main/data_607_project_2_dataset_3.csv", na = c("","NA"))

glimpse(data3)

## Rows: 458
## Columns: 10
## $ X              <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Name           <chr> "One", "Two", "Four", "Three", "Four", "Six", "Five", "…
## $ Duration       <chr> "June 16 – 26, 1921", "September 4 – 8, 1921", "Septemb…
## $ Wind.speed     <chr> "90 mph (150 km/h)", "80 mph (130 km/h)", "90 mph (150 …
## $ Pressure       <chr> "980 hPa (28.94 inHg)", "985 hPa (29.09 inHg)", "979 hP…
## $ Areas.affected <chr> "Central America, Gulf of Mexico", "Mexico", "None", "N…
## $ Deaths         <chr> "0", "215", "0", "0", "0", "0", "Unknown", "Unknown", "…
## $ Damage         <chr> NA, "$19 million", NA, NA, NA, NA, NA, NA, NA, NA, "$4.…
## $ REf            <chr> "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]",…
## $ Category       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

The code has several issues that keep it from being tidy, and even more issues that keep it from being clean.

To tidy the following transformations will be executed: - Separate duration into a start and end date and create an actual duration column - Separate out environmental metrics (wind speed and pressure) so that each column only has one data type and only contains numeric data - Separate and convert monetary damage into separate columns for notes and numeric values, while eliminating the text based values - Pivot long areas affected so they can be evaluated individually

Most columns also need some additional cleaning.

The code block below begins separating out the date data. The date data is mostly a range of two dates, with very inconsistent formatting. The code below separates the original duration column into individual components for the start month, year, day and the end month, year, day, and a footnote column. The original dataset did not contain the actual footnote.

data3 <- data3 |> 
  select(!X) |> 
  separate_wider_regex(
    Duration,
    patterns = c(
     start_month = "[A-Za-z]+",
     "\\s*",
     start_day = "[0-9]+",
     ",*",
     "\\s*",
     start_year = "[0-9]*",
     "\\W*",
     end_month = "[A-Za-z]*",
     "\\W*",
     end_day = "[0-9]*",
     "\\W*",
     end_year = "[0-9]*",
     "\\s*",
     date_footnote = ".*"
    )
  )

glimpse(data3)

## Rows: 458
## Columns: 15
## $ Name           <chr> "One", "Two", "Four", "Three", "Four", "Six", "Five", "…
## $ start_month    <chr> "June", "September", "September", "September", "Septemb…
## $ start_day      <chr> "16", "4", "8", "18", "10", "12", "13", "5", "18", "2",…
## $ start_year     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_month      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_day        <chr> "26", "8", "14", "24", "13", "17", "17", "14", "24", "1…
## $ end_year       <chr> "1921", "1921", "1921", "1922", "1923", "1923", "1924",…
## $ date_footnote  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ Wind.speed     <chr> "90 mph (150 km/h)", "80 mph (130 km/h)", "90 mph (150 …
## $ Pressure       <chr> "980 hPa (28.94 inHg)", "985 hPa (29.09 inHg)", "979 hP…
## $ Areas.affected <chr> "Central America, Gulf of Mexico", "Mexico", "None", "N…
## $ Deaths         <chr> "0", "215", "0", "0", "0", "0", "Unknown", "Unknown", "…
## $ Damage         <chr> NA, "$19 million", NA, NA, NA, NA, NA, NA, NA, NA, "$4.…
## $ REf            <chr> "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]",…
## $ Category       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

The following code copies the year into the start year from the end year columns wherever there is no start year. It also copies the start month into the end month column anywhere that an end month does not already exist.

data3 <- data3 |> mutate(
  start_year = if_else(start_year == "", end_year, start_year),
  end_month = if_else(end_month == "", start_month, end_month)
)

glimpse(data3)

## Rows: 458
## Columns: 15
## $ Name           <chr> "One", "Two", "Four", "Three", "Four", "Six", "Five", "…
## $ start_month    <chr> "June", "September", "September", "September", "Septemb…
## $ start_day      <chr> "16", "4", "8", "18", "10", "12", "13", "5", "18", "2",…
## $ start_year     <chr> "1921", "1921", "1921", "1922", "1923", "1923", "1924",…
## $ end_month      <chr> "June", "September", "September", "September", "Septemb…
## $ end_day        <chr> "26", "8", "14", "24", "13", "17", "17", "14", "24", "1…
## $ end_year       <chr> "1921", "1921", "1921", "1922", "1923", "1923", "1924",…
## $ date_footnote  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ Wind.speed     <chr> "90 mph (150 km/h)", "80 mph (130 km/h)", "90 mph (150 …
## $ Pressure       <chr> "980 hPa (28.94 inHg)", "985 hPa (29.09 inHg)", "979 hP…
## $ Areas.affected <chr> "Central America, Gulf of Mexico", "Mexico", "None", "N…
## $ Deaths         <chr> "0", "215", "0", "0", "0", "0", "Unknown", "Unknown", "…
## $ Damage         <chr> NA, "$19 million", NA, NA, NA, NA, NA, NA, NA, NA, "$4.…
## $ REf            <chr> "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]",…
## $ Category       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

The following code block converts the months to their numeric equivalent and then forms individual columns that contain the full start and date, as well as the difference between the two so that there is a duration column.

data3 <- data3 |>
  mutate(
    start_month = match(start_month, month.name),
    start_date = make_date(start_year, start_month, start_day),
    end_month = match(end_month, month.name),
    end_date = make_date(end_year, end_month, end_day),
    duration = as.duration(end_date - start_date)
    ) |> 
  select(!start_month:end_year) |> 
  relocate(start_date:duration, .after = Name)

glimpse(data3)

## Rows: 458
## Columns: 12
## $ Name           <chr> "One", "Two", "Four", "Three", "Four", "Six", "Five", "…
## $ start_date     <date> 1921-06-16, 1921-09-04, 1921-09-08, 1922-09-18, 1923-0…
## $ end_date       <date> 1921-06-26, 1921-09-08, 1921-09-14, 1922-09-24, 1923-0…
## $ duration       <Duration> 864000s (~1.43 weeks), 345600s (~4 days), 518400s …
## $ date_footnote  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ Wind.speed     <chr> "90 mph (150 km/h)", "80 mph (130 km/h)", "90 mph (150 …
## $ Pressure       <chr> "980 hPa (28.94 inHg)", "985 hPa (29.09 inHg)", "979 hP…
## $ Areas.affected <chr> "Central America, Gulf of Mexico", "Mexico", "None", "N…
## $ Deaths         <chr> "0", "215", "0", "0", "0", "0", "Unknown", "Unknown", "…
## $ Damage         <chr> NA, "$19 million", NA, NA, NA, NA, NA, NA, NA, NA, "$4.…
## $ REf            <chr> "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]",…
## $ Category       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

The following code separates wide wind speed and pressure data so each column only contains a single numeric value.

data3 <- data3 |> separate_wider_regex(Wind.speed,
                           patterns = c(
                             wind_speed_mph = "[0-9]+",
                             "\\D+",
                             wind_speed_km_per_hr = "[0-9]+",
                             "\\D+")) |> 
    mutate(Pressure = str_remove_all(Pressure,",")) |> 
  separate_wider_regex(Pressure,
                       patterns = c(
                         pressure_hPa = "[0-9]+",
                         "\\D+",
                         pressure_inHg = "[0-9+\\.0-9]+",
                         "\\D+"), too_few = "align_start")

The following code block cleans up and converts the damages information into numeric values. The formatting is very inconsistent as loaded. This code makes it consistent so that it can be used in data analysis.

data3 <- data3 |> 
  separate_wider_regex(Damage,
                       patterns = c(damage_notes = "^\\[.*|>*",
                                    "\\$?",
                                    damage_amount_dollars = "\\$?\\s*\\d*\\.*\\d*",
                                    "\\s*",
                                    damage_modifier = "[A-Za-z]*"
                                    ), too_few = "align_start") |> 
  mutate(
    damage_amount_dollars = as.numeric(damage_amount_dollars),
    damage_amount_dollars = if_else(damage_modifier == "thousand", damage_amount_dollars*1000, damage_amount_dollars),
    damage_amount_dollars = if_else(damage_modifier == "million", damage_amount_dollars*1000000, damage_amount_dollars),
    damage_amount_dollars = if_else(damage_modifier == "billion", damage_amount_dollars*1000000000, damage_amount_dollars),
    damage_notes = if_else(damage_modifier == "Unknown", "Unknown", damage_notes),
    damage_notes = if_else(damage_notes == ">", "Damage greater than damage_amount", damage_notes)
  )

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `damage_amount_dollars = as.numeric(damage_amount_dollars)`.
## Caused by warning:
## ! NAs introduced by coercion

The following code separates longer the areas affected column. This is necessary because the areas affected consist of multiple inconsistent variables. This is necessary but it also needs to be something that is clear and aware of during data analysis, because it duplicates deaths and monetary damages, it has the potential, during analysis to skew the numbers. As a result its necessary to separate the data into two separate data frames so as to prevent this potential duplication of data.

data_damage <- data3 |> 
  select(!Areas.affected)

data3 <- data3 |> 
  select(!Deaths :damage_modifier) |> 
  separate_longer_delim(
   Areas.affected, delim = regex("\\w+[a-z][A-Z]\\w+|, | and")
  ) |> 
  rename(area_affected = Areas.affected)

The following code attempts to clean and transform the areas affected data so that it is consistent. In a true analysis it would be important to explore and study and define more about how the areas affected are structured and what each area actually means. The data, as is, is very inconsistent. It uses different terms for what sound like the same region.

data3 <- data3 |> 
  mutate(
    area_affected = str_remove_all(area_affected, "^\\s|\\s$"),
   area_affected = str_replace_all(area_affected, "Caicos$", "Caicos Islands"),
   area_affected = str_replace_all(area_affected, "Atlantic Canada", "Canada, Atlantic"),
   area_affected = str_replace_all(area_affected, "Central Mexico", "Mexico, Central"),
   area_affected = str_replace_all(area_affected, "Canadian", "Canada,"),
   area_affected = str_replace_all(area_affected, "Cape Verde$", "Cape Verde Islands"),
   area_affected = str_replace_all(area_affected, "Cayman $", "Cayman Islands"),
   area_affected = str_replace_all(area_affected, "Central United States|central United States", "United States, Central"),
   area_affected = str_replace_all(area_affected, "East Coast of the United States|Eastern Coast of the United States|Eastern United States|United States East Coast|United States East coast", "United States, Eastern"),
   area_affected = str_replace_all(area_affected, "East Coast of the United States|Eastern Coast of the United States|Eastern United States|eastern United States", "United States, Eastern"),
   area_affected = str_replace_all(area_affected, "Eastern Canada", "Canada, Eastern"),
   area_affected = str_replace_all(area_affected, "Greater Antilles", "Antilles, Greater"),
   area_affected = str_replace_all(area_affected, "Gulf Coast of the United States|United States Gulf Coast", "United States, Gulf Coast"),
   area_affected = str_replace_all(area_affected, "Gulf of Mexico", "Mexico, Gulf"),
   area_affected = str_replace_all(area_affected, "Leeward$", "Antilles, Leeward"),
   area_affected = str_replace_all(area_affected, "Great Britain", "United Kingdom"),
   area_affected = str_replace_all(area_affected, "Leeward Antilles|Leeward Islands", "Antilles, Leeward"),
   area_affected = str_replace_all(area_affected, "Leeward Antilles|Leeward Islands|Lesser Antilles", "Antilles, Leeward"),
   area_affected = str_replace_all(area_affected, "Mid-Atlantic|Mid-Atlantic States|Mid-Atlantic states", "United States, Mid-Atlantic"),
   area_affected = str_replace_all(area_affected, "Midwestern Unites States", "United States, Midwestern"),
   area_affected = str_replace_all(area_affected, "Northeastern Caribbean", "Caribbean, Northeastern"),
   area_affected = str_replace_all(area_affected, "Northeastern United States", "United States, Northeastern"),
   area_affected = str_replace_all(area_affected, "South Florida", "Florida, South"),
   area_affected = str_replace_all(area_affected, "South Texas|Southern Texas", "Texas, South"),
   area_affected = str_replace_all(area_affected, "South United States, Central", "United States, South Central"),
   area_affected = str_replace_all(area_affected, "Southeast Mexico", "Mexico, Southeast"),
   area_affected = str_replace_all(area_affected, "Southeastern United States", "Unite Sates, Southeastern"),
   area_affected = str_replace_all(area_affected, "Southern Portugal", "Portugal, Southern"),
   area_affected = str_replace_all(area_affected, "Southwestern Florida", "Florida, Southwestern"),
   area_affected = str_replace_all(area_affected, "Southwestern Quebec", "Quebec, Southwestern"),
   area_affected = str_replace_all(area_affected, "Northeastern Caribbean", "Caribbean, Northeastern"),
   area_affected = str_replace_all(area_affected, "The Bahamas", "Bahamas"),
   area_affected = str_replace_all(area_affected, "The Caribbean", "Caribbean"),
   area_affected = str_replace_all(area_affected, "The Carolinas", "Carolinas"),
   area_affected = str_replace_all(area_affected, "Northeastern Caribbean", "Caribbean, Northeastern"),
   area_affected = str_replace_all(area_affected, "West Africa", "Africa, West"),
   area_affected = str_replace_all(area_affected, "Western Europe", "Europe, Western"),
   area_affected = str_replace_all(area_affected, "Western Mexico", "Mexico, Western"),
   area_affected = str_replace_all(area_affected, "^Yucatá.*", "Yucatán Peninsula"),
   area_affected = str_replace_all(area_affected, "western Cuba", "Cuba, Western")
  )

The following code replaces “Unknown” deaths with NA. In order to provide more consistency during analysis and not have any non-numeric data within the column.

data_damage <- data_damage |> 
  mutate(
  Deaths = na_if(Deaths, "Unknown")
    )

The following code block does some additional cleanup to both data frames.It renames column names and eliminates a helper column.

data_damage <- data_damage |> 
  relocate(damage_notes, .after = damage_amount_dollars) |> 
  mutate(
    Deaths = as.numeric(Deaths),
    wind_speed_mph = as.numeric(wind_speed_mph),
    wind_speed_km_per_hr = as.numeric(wind_speed_km_per_hr),
    pressure_hPa = as.numeric(pressure_hPa),
    pressure_inHg = as.numeric(pressure_inHg)
  ) |> 
  select(!damage_modifier) |>
  rename(deaths = Deaths, ref = REf, name = Name)

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Deaths = as.numeric(Deaths)`.
## Caused by warning:
## ! NAs introduced by coercion

data3 <- data3 |> 
  mutate(
    wind_speed_mph = as.numeric(wind_speed_mph),
    wind_speed_km_per_hr = as.numeric(wind_speed_km_per_hr),
    pressure_hPa = as.numeric(pressure_hPa),
    pressure_inHg = as.numeric(pressure_inHg)
  ) |> 
  rename(ref = REf, name = Name)

glimpse(data3)

## Rows: 1,127
## Columns: 12
## $ name                 <chr> "One", "One", "Two", "Four", "Three", "Four", "Si…
## $ start_date           <date> 1921-06-16, 1921-06-16, 1921-09-04, 1921-09-08, …
## $ end_date             <date> 1921-06-26, 1921-06-26, 1921-09-08, 1921-09-14, …
## $ duration             <Duration> 864000s (~1.43 weeks), 864000s (~1.43 weeks)…
## $ date_footnote        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ wind_speed_mph       <dbl> 90, 90, 80, 90, 80, 80, 80, 80, 85, 80, 80, 80, 9…
## $ wind_speed_km_per_hr <dbl> 150, 150, 130, 150, 130, 130, 130, 130, 140, 130,…
## $ pressure_hPa         <dbl> 980, 980, 985, 979, 987, 986, 983, 983, 980, 994,…
## $ pressure_inHg        <dbl> 28.94, 28.94, 29.09, 28.91, 29.15, 29.12, 29.03, …
## $ area_affected        <chr> "Central America", "Mexico, Gulf", "Mexico", "Non…
## $ ref                  <chr> "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", …
## $ Category             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

glimpse(data_damage)

## Rows: 458
## Columns: 14
## $ name                  <chr> "One", "Two", "Four", "Three", "Four", "Six", "F…
## $ start_date            <date> 1921-06-16, 1921-09-04, 1921-09-08, 1922-09-18,…
## $ end_date              <date> 1921-06-26, 1921-09-08, 1921-09-14, 1922-09-24,…
## $ duration              <Duration> 864000s (~1.43 weeks), 345600s (~4 days), 5…
## $ date_footnote         <chr> "", "", "", "", "", "", "", "", "", "", "", "", …
## $ wind_speed_mph        <dbl> 90, 80, 90, 80, 80, 80, 85, 80, 80, 90, 80, 90, …
## $ wind_speed_km_per_hr  <dbl> 150, 130, 150, 130, 130, 130, 140, 130, 130, 150…
## $ pressure_hPa          <dbl> 980, 985, 979, 987, 986, 983, 980, 994, 994, 100…
## $ pressure_inHg         <dbl> 28.94, 29.09, 28.91, 29.15, 29.12, 29.03, 28.94,…
## $ deaths                <dbl> 0, 215, 0, 0, 0, 0, NA, NA, 0, 0, 19, 210, 0, NA…
## $ damage_amount_dollars <dbl> NA, 19000000, NA, NA, NA, NA, NA, NA, NA, NA, 43…
## $ damage_notes          <chr> NA, "", NA, NA, NA, NA, NA, NA, NA, NA, "", "", …
## $ ref                   <chr> "[2]", "[2]", "[2]", "[2]", "[2]", "[2]", "[2]",…
## $ Category              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Analysis

Below is a cursory analysis of the now tidy data.

The following code sorts and displays, then charts the Hurricanes by the ones that cost the most human life.

data_deaths <- data_damage |> 
  arrange(desc(deaths)) |>
  distinct(name,.keep_all = TRUE) |> 
  slice_head(n = 10)

ggplot(data_deaths, aes(x = reorder(name, -deaths),y = deaths)) + geom_col()

The following code displays then plots the top ten hurricanes by the estimated cost of damage for all huricanes that had data.

data_cost <- data_damage |> 
  arrange(desc(damage_amount_dollars)) |>
  distinct(name,.keep_all = TRUE) |> 
  slice_head(n = 10)

ggplot(data_cost, aes(x = reorder(name, -damage_amount_dollars), y = damage_amount_dollars)) + geom_col()

The following code displays then plots the top ten areas that have the most hurricanes in the dataset

data_areas_affected <- data3 |> 
  filter(area_affected != "None") |>
  count(area_affected, sort = TRUE) |>
  top_n(10) |> 
  print()

## Selecting by n

## # A tibble: 11 × 2
##    area_affected                 n
##    <chr>                     <int>
##  1 Bermuda                      57
##  2 Bahamas                      52
##  3 United States, Eastern       52
##  4 Florida                      51
##  5 Cuba                         45
##  6 Antilles, Leeward            39
##  7 Canada, Atlantic             37
##  8 Mexico                       37
##  9 Puerto Rico                  27
## 10 Newfoundland                 25
## 11 United States, Gulf Coast    25

ggplot(data_areas_affected, aes(x = reorder(area_affected, -n), y = n)) + geom_col()

data_607_project_3_dataset_3

Maxfield Raynolds

2025-03-08

Project 2 - Dataset 3

Tidying

Load the Dataset

Analysis