#INTRODUCTION This report analyzes the NOAA Storm Database to determine which types of severe weather events are most harmful to population health and which have the greatest economic consequences across the United States. The analysis focuses on the EVTYPE variable to categorize events and uses FATALITIES and INJURIES to assess population health impacts. Economic consequences are evaluated using PROPDMG and CROPDMG, adjusted for their respective exponents. The data is loaded directly from the raw CSV file and processed within this document. The results are presented through tables and figures, highlighting the most significant event types in terms of both human and economic impact. This analysis aims to provide information relevant to government and municipal managers for resource allocation in severe weather preparedness.

The data was downloaded from the course website to a local computer. It was then loaded into R usint its format

data <- read.csv("repdata_data_stormData.csv.bz2")
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
summary(data)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0.000   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1.484                                                           
##  3rd Qu.:   1.000                                                           
##  Max.   :3749.000                                                           
##                                                                             
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE       
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0.0000  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0.0000  
##  Mode  :character   Median :0                   Median :  0.0000  
##                     Mean   :0                   Mean   :  0.9862  
##                     3rd Qu.:0                   3rd Qu.:  0.0000  
##                     Max.   :0                   Max.   :925.0000  
##                                                                   
##    END_AZI           END_LOCATI            LENGTH              WIDTH         
##  Length:902297      Length:902297      Min.   :   0.0000   Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.0000   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.0000   Median :   0.000  
##                                        Mean   :   0.2301   Mean   :   7.503  
##                                        3rd Qu.:   0.0000   3rd Qu.:   0.000  
##                                        Max.   :2315.0000   Max.   :4400.000  
##                                                                              
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##                                                                           
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

Since the data analysis is to evaluate the effect of severe weather on health and economics, only variables of interest to these were selected for the analysis.

Selected_data <- data %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

The Eventype variable was cleaned by converting it to an uppercase and removing leading/trailing spaces in between.

Selected_data$EVTYPE <- toupper(str_trim(Selected_data$EVTYPE))

Converting the damage exponent variables to numeric and create columns for the actual damage values

unique(Selected_data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(Selected_data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
# Standardize exponents to uppercase and handle unexpected values
Selected_data <- Selected_data %>%
  mutate(
    PROPDMGEXP = toupper(PROPDMGEXP),
    CROPDMGEXP = toupper(CROPDMGEXP),
    PROPDMGEXP_VAL = case_when(
      PROPDMGEXP == "K" ~ 1000,
      PROPDMGEXP == "M" ~ 1000000,
      PROPDMGEXP == "B" ~ 1000000000,
      PROPDMGEXP == "H" ~ 100,
      PROPDMGEXP %in% c("", "+", "-", "?", "0", "1", "2", "3", "4", "5", "6", "7", "8") ~ 1, # Treat as 1
      TRUE ~ 1 # Default to 1 for any other case
    ),
    CROPDMGEXP_VAL = case_when(
      CROPDMGEXP == "K" ~ 1000,
      CROPDMGEXP == "M" ~ 1000000,
      CROPDMGEXP == "B" ~ 1000000000,
      CROPDMGEXP == "H" ~ 100,
      CROPDMGEXP %in% c("", "+", "-", "?", "0", "1", "2", "3", "4", "5", "6", "7", "8") ~ 1, # Treat as 1
      TRUE ~ 1 # Default to 1 for any other case
    ),
    PROPDMG_ADJ = PROPDMG * PROPDMGEXP_VAL,
    CROPDMG_ADJ = CROPDMG * CROPDMGEXP_VAL
  )

calculating the total health impact and economic impact for each event type

health_impact <- Selected_data %>%
  group_by(EVTYPE) %>%
  summarize(
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    total_injuries = sum(INJURIES, na.rm = TRUE),
    total_health_impact = total_fatalities + total_injuries
  ) %>%
  arrange(desc(total_health_impact))

economic_impact <- Selected_data %>%
  group_by(EVTYPE) %>%
  summarize(
    total_prop_dmg = sum(PROPDMG_ADJ, na.rm = TRUE),
    total_crop_dmg = sum(CROPDMG_ADJ, na.rm = TRUE),
    total_economic_impact = total_prop_dmg + total_crop_dmg
  ) %>%
  arrange(desc(total_economic_impact))

Event type with high population health impact

head(health_impact, 10)
## # A tibble: 10 × 4
##    EVTYPE            total_fatalities total_injuries total_health_impact
##    <chr>                        <dbl>          <dbl>               <dbl>
##  1 TORNADO                       5633          91346               96979
##  2 EXCESSIVE HEAT                1903           6525                8428
##  3 TSTM WIND                      504           6957                7461
##  4 FLOOD                          470           6789                7259
##  5 LIGHTNING                      816           5230                6046
##  6 HEAT                           937           2100                3037
##  7 FLASH FLOOD                    978           1777                2755
##  8 ICE STORM                       89           1975                2064
##  9 THUNDERSTORM WIND              133           1488                1621
## 10 WINTER STORM                   206           1321                1527

Plot of eventypes with the greatest population health impact.

ggplot(head(health_impact, 10), aes(x = reorder(EVTYPE, total_health_impact), y = total_health_impact)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Events by Population Health Impact", x = "Event Type", y = "Total Health Impact (Fatalities + Injuries)")

Event types with greatest economics consequences

head(economic_impact, 10)
## # A tibble: 10 × 4
##    EVTYPE            total_prop_dmg total_crop_dmg total_economic_impact
##    <chr>                      <dbl>          <dbl>                 <dbl>
##  1 FLOOD              144657709807      5661968450         150319678257 
##  2 HURRICANE/TYPHOON   69305840000      2607872800          71913712800 
##  3 TORNADO             56937160779.      414953270          57352114049.
##  4 STORM SURGE         43323536000            5000          43323541000 
##  5 HAIL                15732267543.     3025954473          18758222016.
##  6 FLASH FLOOD         16140862067.     1421317100          17562179167.
##  7 DROUGHT              1046106000     13972566000          15018672000 
##  8 HURRICANE           11868319010      2741910000          14610229010 
##  9 RIVER FLOOD          5118945500      5029459000          10148404500 
## 10 ICE STORM            3944927860      5022113500           8967041360

Plot of event types with the greatest economic impact

ggplot(head(economic_impact, 10), aes(x = reorder(EVTYPE, total_economic_impact), y = total_economic_impact)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Events by Economic Impact", x = "Event Type", y = "Total Economic Impact (Property + Crop Damage)")

The analysis and the result above shows that Tornado events have the most significant impact on population health while flood events have the greatest economic impact and consequences. These findings can help municipal managers to prioritixe resources in preparing for severe weather. Also, it could help the Government not only in preparedness or resources allocation but in policies and efforts that could mitigate those events in the future.

##Sensitivity Analysis ###Sensitivity Analysis on Health Impact

The analysis above ignored the missing values and since the primary area of uncertainty in data analysis is how missing data are treated. Thus a sensitivity analysis is conducted to have a deeper understanding of the robustness of the result presented above and to ascertain the reliability of the analyis.

In the case of impact on population health, the sensitivity analysis is done by using regular expressions to standardize event names. The aim is to combine similar event types that might be recorded with slight variation.

# More aggressive EVTYPE cleaning using regular expressions
Selected_data_cleaned2 <- Selected_data %>%
  mutate(
    EVTYPE_CLEANED2 = case_when(
      str_detect(EVTYPE, "TORNADO") ~ "TORNADO",
      str_detect(EVTYPE, "FLOOD") ~ "FLOOD",
      str_detect(EVTYPE, "HURRICANE|TYPHOON") ~ "HURRICANE/TYPHOON",
      str_detect(EVTYPE, "EXCESSIVE HEAT") ~ "EXCESSIVE HEAT",
      str_detect(EVTYPE, "LIGHTNING") ~ "LIGHTNING",
      str_detect(EVTYPE, "TSTM WIND|THUNDERSTORM WIND") ~ "THUNDERSTORM WIND",
      str_detect(EVTYPE, "FLASH FLOOD") ~ "FLASH FLOOD",
      str_detect(EVTYPE, "HEAT") ~ "HEAT",
      str_detect(EVTYPE, "ICE STORM") ~ "ICE STORM",
      str_detect(EVTYPE, "WINTER STORM") ~ "WINTER STORM",
      TRUE ~ "OTHER"
    )
  )

health_impact_cleaned2 <- Selected_data_cleaned2 %>%
  group_by(EVTYPE_CLEANED2) %>%
  summarize(
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    total_injuries = sum(INJURIES, na.rm = TRUE),
    total_health_impact = total_fatalities + total_injuries
  ) %>%
  arrange(desc(total_health_impact))

# Compare original and refined cleaning results
head(health_impact, 10)
## # A tibble: 10 × 4
##    EVTYPE            total_fatalities total_injuries total_health_impact
##    <chr>                        <dbl>          <dbl>               <dbl>
##  1 TORNADO                       5633          91346               96979
##  2 EXCESSIVE HEAT                1903           6525                8428
##  3 TSTM WIND                      504           6957                7461
##  4 FLOOD                          470           6789                7259
##  5 LIGHTNING                      816           5230                6046
##  6 HEAT                           937           2100                3037
##  7 FLASH FLOOD                    978           1777                2755
##  8 ICE STORM                       89           1975                2064
##  9 THUNDERSTORM WIND              133           1488                1621
## 10 WINTER STORM                   206           1321                1527
head(health_impact_cleaned2, 10)
## # A tibble: 10 × 4
##    EVTYPE_CLEANED2   total_fatalities total_injuries total_health_impact
##    <chr>                        <dbl>          <dbl>               <dbl>
##  1 TORNADO                       5661          91407               97068
##  2 OTHER                         2835          11892               14727
##  3 THUNDERSTORM WIND              728           9493               10221
##  4 FLOOD                         1525           8604               10129
##  5 EXCESSIVE HEAT                1922           6525                8447
##  6 LIGHTNING                      817           5232                6049
##  7 HEAT                          1216           2699                3915
##  8 ICE STORM                       89           1990                2079
##  9 WINTER STORM                   217           1353                1570
## 10 HURRICANE/TYPHOON              135           1333                1468

The result above shows a difference in the first 10 event impact on population health when a different data cleaning method approach is used. Although it didn’t change the result of the first analyis in that Tornado is still the leading event that mostly impact population health but it is important to pay attention to “Other” event as well when considering allocation of resources or any Goverment policies that could mitigate the events.

ggplot(head(health_impact_cleaned2, 10), aes(x = reorder(EVTYPE_CLEANED2, total_health_impact), y = total_health_impact)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Events by Population Health Impact (Refined EVTYPE Cleaning)", x = "Event Type", y = "Total Health Impact (Fatalities + Injuries)")

Comparing the figure above to the previous figure generated for the impact on population health, we will see clearyly that there are some discrepancy in the figures. This gives us more insight into how the result of an analyis could change when a missing data is handled differently.

# Assuming you have health_impact and health_impact_cleaned2 data frames

# Find common event types
common_events <- intersect(health_impact$EVTYPE, health_impact_cleaned2$EVTYPE_CLEANED2)

# Filter data frames to include only common events
health_impact_common <- health_impact %>%
  filter(EVTYPE %in% common_events)

health_impact_cleaned2_common <- health_impact_cleaned2 %>%
  filter(EVTYPE_CLEANED2 %in% common_events)

# Join the data frames for comparison (using a simplified join for demonstration)
comparison_data <- data.frame(
  EVTYPE = health_impact_common$EVTYPE,
  Original_Impact = health_impact_common$total_health_impact,
  Cleaned_Impact = health_impact_cleaned2_common$total_health_impact
)

# Calculate percentage change
comparison_data <- comparison_data %>%
  mutate(
    Percentage_Change = ((Cleaned_Impact - Original_Impact) / Original_Impact) * 100
  )

# Display the comparison data
print(comparison_data)
##               EVTYPE Original_Impact Cleaned_Impact Percentage_Change
## 1            TORNADO           96979          97068      9.177245e-02
## 2     EXCESSIVE HEAT            8428          14727      7.473897e+01
## 3              FLOOD            7259          10221      4.080452e+01
## 4          LIGHTNING            6046          10129      6.753225e+01
## 5               HEAT            3037           8447      1.781363e+02
## 6          ICE STORM            2064           6049      1.930717e+02
## 7  THUNDERSTORM WIND            1621           3915      1.415176e+02
## 8       WINTER STORM            1527           2079      3.614931e+01
## 9  HURRICANE/TYPHOON            1339           1570      1.725168e+01
## 10             OTHER               4           1468      3.660000e+04
# You can also visualize this data, for example, using a bar chart
library(ggplot2)

ggplot(comparison_data, aes(x = EVTYPE, y = Percentage_Change)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Percentage Change in Health Impact After Cleaning",
    x = "Event Type",
    y = "Percentage Change"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

###Sensitivity Analysis on Economic Impact

Although, in the analyis above the exponent values both numerical and symbols are treated as multipliers of 1. It is important to explore if there will be a change in result if the missing data were handled differently.

In the sensitivity analysis, missing data were imputed by replacing missing numerical damage values with mean or median of the non-missing values for each event type

# Mean Imputation for PROPDMG_ADJ
Selected_data_imputed <- Selected_data %>%
  group_by(EVTYPE) %>%
  mutate(
    PROPDMG_ADJ_IMPUTED = ifelse(is.na(PROPDMG_ADJ), mean(PROPDMG_ADJ, na.rm = TRUE), PROPDMG_ADJ),
    CROPDMG_ADJ_IMPUTED = ifelse(is.na(CROPDMG_ADJ), mean(CROPDMG_ADJ, na.rm = TRUE), CROPDMG_ADJ)
  ) %>%
  ungroup()

economic_impact_imputed <- Selected_data_imputed %>%
  group_by(EVTYPE) %>%
  summarize(
    total_prop_dmg_imputed = sum(PROPDMG_ADJ_IMPUTED, na.rm = TRUE),
    total_crop_dmg_imputed = sum(CROPDMG_ADJ_IMPUTED, na.rm = TRUE),
    total_economic_impact_imputed = total_prop_dmg_imputed + total_crop_dmg_imputed
  ) %>%
  arrange(desc(total_economic_impact_imputed))

# Compare original and imputed results
head(economic_impact, 10)
## # A tibble: 10 × 4
##    EVTYPE            total_prop_dmg total_crop_dmg total_economic_impact
##    <chr>                      <dbl>          <dbl>                 <dbl>
##  1 FLOOD              144657709807      5661968450         150319678257 
##  2 HURRICANE/TYPHOON   69305840000      2607872800          71913712800 
##  3 TORNADO             56937160779.      414953270          57352114049.
##  4 STORM SURGE         43323536000            5000          43323541000 
##  5 HAIL                15732267543.     3025954473          18758222016.
##  6 FLASH FLOOD         16140862067.     1421317100          17562179167.
##  7 DROUGHT              1046106000     13972566000          15018672000 
##  8 HURRICANE           11868319010      2741910000          14610229010 
##  9 RIVER FLOOD          5118945500      5029459000          10148404500 
## 10 ICE STORM            3944927860      5022113500           8967041360
head(economic_impact_imputed, 10)
## # A tibble: 10 × 4
##    EVTYPE   total_prop_dmg_imputed total_crop_dmg_imputed total_economic_impac…¹
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 FLOOD             144657709807              5661968450          150319678257 
##  2 HURRICA…           69305840000              2607872800           71913712800 
##  3 TORNADO            56937160779.              414953270           57352114049.
##  4 STORM S…           43323536000                    5000           43323541000 
##  5 HAIL               15732267543.             3025954473           18758222016.
##  6 FLASH F…           16140862067.             1421317100           17562179167.
##  7 DROUGHT             1046106000             13972566000           15018672000 
##  8 HURRICA…           11868319010              2741910000           14610229010 
##  9 RIVER F…            5118945500              5029459000           10148404500 
## 10 ICE STO…            3944927860              5022113500            8967041360 
## # ℹ abbreviated name: ¹​total_economic_impact_imputed

The result shows that there is no difference in the way the missing data was treated. Either by ignoring the data or by imputing the result seems the sames

ggplot(head(economic_impact_imputed, 10), aes(x = reorder(EVTYPE, total_economic_impact_imputed), y = total_economic_impact_imputed)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Events by Economic Impact", x = "Event Type", y = "Total Economic Impact (Property + Crop Damage)")