Exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database

Synopsis:

This report investigates some key parameters of the storm conditions from the 1950 till 2011 by using the NOAA storm database. The main goal is to analyse the most harmful events in terms of population health and economic aftermath.

Data Processing

This section shows how data was pre-processed, first by importing the packages and then how data is loaded into R and processed for analysis

Importing packages

Before pre-processing data, I imported the packages “tidyverse” for data analysis and “ggrepel” for adding label texts to ggplots.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)

Importing database

I downloaded the dataset and used “setwd” to set work directory, I loaded the dataset “repdata_data_StormData.csv.bz2” by using the “read.csv()” function, which will read this bz2 compressed file. Importantly, this script requires that the work directory is set to the directory where the storm database is located.

#setwd("#####/####/miriam_assignment_2/data")
setwd("C:\\Users\\glori\\Downloads")
STORM_db <- read.csv("repdata_data_StormData.csv.bz2")

Results

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

# Print column names and summary of the dataset to learn which type of data we have
colnames(STORM_db)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
summary(STORM_db)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0.000   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1.484                                                           
##  3rd Qu.:   1.000                                                           
##  Max.   :3749.000                                                           
##                                                                             
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE       
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0.0000  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0.0000  
##  Mode  :character   Median :0                   Median :  0.0000  
##                     Mean   :0                   Mean   :  0.9862  
##                     3rd Qu.:0                   3rd Qu.:  0.0000  
##                     Max.   :0                   Max.   :925.0000  
##                                                                   
##    END_AZI           END_LOCATI            LENGTH              WIDTH         
##  Length:902297      Length:902297      Min.   :   0.0000   Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.0000   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.0000   Median :   0.000  
##                                        Mean   :   0.2301   Mean   :   7.503  
##                                        3rd Qu.:   0.0000   3rd Qu.:   0.000  
##                                        Max.   :2315.0000   Max.   :4400.000  
##                                                                              
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##                                                                           
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

Because we want to investigate the population damage, we will analyse “FATALITIES” and “INJURIES”.

Summarizing the data by total “FATALITIES” and “INJURIES”

By using “group_by” and “summarise” functions, we can create a new dataset (“summarize_damage”) with the total number of “FATALITIES” and “INJURIES” based on event type “EVTYPE”.

summarize_damage <- STORM_db %>%
  group_by(EVTYPE) %>%
  summarise(total_fatalities = sum(FATALITIES, na.rm = TRUE),
            total_injuries = sum(INJURIES, na.rm = TRUE))

Filtering the top 5 events with the most human damage by either fatalities (“top5_fatalities”) or injuries (“top5_injuries”) by using “top_n()” function.

#Filter top 5 events with most human damage
top5_fatalities <- summarize_damage %>%
  top_n(total_fatalities, n = 5) %>%
  arrange(desc(total_fatalities))

print(top5_fatalities)
## # A tibble: 5 × 3
##   EVTYPE         total_fatalities total_injuries
##   <chr>                     <dbl>          <dbl>
## 1 TORNADO                    5633          91346
## 2 EXCESSIVE HEAT             1903           6525
## 3 FLASH FLOOD                 978           1777
## 4 HEAT                        937           2100
## 5 LIGHTNING                   816           5230
# Filter top 5 events with most human damage
top5_injuries <- summarize_damage %>%
  top_n(total_injuries, n = 5) %>%
  arrange(desc(total_injuries))
  
print(top5_injuries)
## # A tibble: 5 × 3
##   EVTYPE         total_fatalities total_injuries
##   <chr>                     <dbl>          <dbl>
## 1 TORNADO                    5633          91346
## 2 TSTM WIND                   504           6957
## 3 FLOOD                       470           6789
## 4 EXCESSIVE HEAT             1903           6525
## 5 LIGHTNING                   816           5230

Creating new table with a new column (“is_top5_damage”) that will show if the event type is in between the top 5 of injuries, fatalities, both (“Both_damages”) or none (NA).

summarize_damage_top <- summarize_damage %>%
  mutate(is_top5_damage = case_when(EVTYPE %in% top5_fatalities$EVTYPE & EVTYPE %in% top5_injuries$EVTYPE ~ "Both_damages",
                                    EVTYPE %in% top5_injuries$EVTYPE ~ "Injuries",
                                    EVTYPE %in% top5_fatalities$EVTYPE ~ "Fatalities"))

Filtering the previously created table so that only shows the top 5 damages (by injuries, fatalities and both).

only_top_humandamage <- summarize_damage_top %>%
  filter(is_top5_damage %in% c("Both_damages", "Injuries","Fatalities"))

print(only_top_humandamage)
## # A tibble: 7 × 4
##   EVTYPE         total_fatalities total_injuries is_top5_damage
##   <chr>                     <dbl>          <dbl> <chr>         
## 1 EXCESSIVE HEAT             1903           6525 Both_damages  
## 2 FLASH FLOOD                 978           1777 Fatalities    
## 3 FLOOD                       470           6789 Injuries      
## 4 HEAT                        937           2100 Fatalities    
## 5 LIGHTNING                   816           5230 Both_damages  
## 6 TORNADO                    5633          91346 Both_damages  
## 7 TSTM WIND                   504           6957 Injuries

Plotting fatalities and injuries together and label top 5

For the visualization of this analysis, a scatter plot is created to show the correlation between the top injuries versus fatalities.

I used functions like geom_text_repel() to label the top event types that caused the highest human damage, where fatalities is in green, injuries in blue and both in red.

ggplot(summarize_damage_top, aes(total_fatalities, total_injuries)) +
  geom_point() +
  coord_cartesian(expand = TRUE) +
  geom_text_repel(data = only_top_humandamage, aes(label = EVTYPE, color = is_top5_damage)) +
  xlab("Fatalities") +
  ylab("Injuries") +
labs(caption = "Figure 1. Scatter plot showing the top events that lead to most population health damage")

Printing top human damage

print(only_top_humandamage %>% 
        select(EVTYPE, total_fatalities, total_injuries, is_top5_damage) %>% 
        arrange(desc(total_fatalities)))
## # A tibble: 7 × 4
##   EVTYPE         total_fatalities total_injuries is_top5_damage
##   <chr>                     <dbl>          <dbl> <chr>         
## 1 TORNADO                    5633          91346 Both_damages  
## 2 EXCESSIVE HEAT             1903           6525 Both_damages  
## 3 FLASH FLOOD                 978           1777 Fatalities    
## 4 HEAT                        937           2100 Fatalities    
## 5 LIGHTNING                   816           5230 Both_damages  
## 6 TSTM WIND                   504           6957 Injuries      
## 7 FLOOD                       470           6789 Injuries

As shown can see, the top human damage event is tornadoes, with 5633 total fatalities and 91346 total injuries, which are also in the top 5 of both fatalities and tornadoes from the storm dataset.

Question 2

Across the United States, which types of events have the greatest economic consequences?

For that, we need to analyse the data from the variables “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”, which we saw at the beginning when looking at the data. However, 2 of the variables (“PROPDMGEXP” and “CROPDMGEXP”) represent data in exponential characters.

Checking exponents with “unique()” function

unique(STORM_db$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
unique(STORM_db$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

We can see that the data has characters, where K/k is thousands, M/m is millions and B/b is billions.

Summarizing the data

Creating a new data table that first transforms the data of property damage and crop damage to numbers by multiplying by either 1e3 (thousands), 1e6 (millions) or 1e9 (billions).Then, this is transformed to new columns representing the sum of property and crop damage, grouped by event type (EVTYPE).

All the plots and numbers included in the new data table are calculated in billions, for a more clear visualization of the data in Figure 2.

summarize_economic_damage <- STORM_db %>%
  mutate(propertydamage = case_when(PROPDMGEXP %in% c("K","k") ~ PROPDMG*1e3,
                                    PROPDMGEXP %in% c("M","m") ~ PROPDMG*1e6,
                                    PROPDMGEXP %in% c("B","b") ~ PROPDMG*1e9,
                                    TRUE ~ PROPDMG),
         cropdamage =  case_when(CROPDMGEXP %in% c("K","k") ~ CROPDMG*1e3,
                                    CROPDMGEXP %in% c("M","m") ~ CROPDMG*1e6,
                                    CROPDMGEXP %in% c("B","b") ~ CROPDMG*1e9,
                                    TRUE ~ CROPDMG)) %>%
  group_by(EVTYPE) %>%
  summarise(total_propertydamage = sum(propertydamage, na.rm = TRUE)/1e9,
                   total_cropdamage = sum(cropdamage, na.rm = TRUE)/1e9)

Filtering top 5 events with most property and crop damage

top5_propertydamage <- summarize_economic_damage %>%
  top_n(total_propertydamage, n = 5)

top5_cropdamage <- summarize_economic_damage %>%
  top_n(total_cropdamage, n = 5)

print(top5_propertydamage)
## # A tibble: 5 × 3
##   EVTYPE            total_propertydamage total_cropdamage
##   <chr>                            <dbl>            <dbl>
## 1 FLASH FLOOD                       16.1         1.42    
## 2 FLOOD                            145.          5.66    
## 3 HURRICANE/TYPHOON                 69.3         2.61    
## 4 STORM SURGE                       43.3         0.000005
## 5 TORNADO                           56.9         0.415
print(top5_cropdamage)
## # A tibble: 5 × 3
##   EVTYPE      total_propertydamage total_cropdamage
##   <chr>                      <dbl>            <dbl>
## 1 DROUGHT                     1.05            14.0 
## 2 FLOOD                     145.               5.66
## 3 HAIL                       15.7              3.03
## 4 ICE STORM                   3.94             5.02
## 5 RIVER FLOOD                 5.12             5.03

Creating a new table with a new column (“is_top5_damage”) that will show if the event type is in between the top 5 for property damage, crop damage, both (“Both_damages”) or none (NA).

summarize_economic_damage_top <- summarize_economic_damage %>%
  mutate(is_top5_damage = case_when(EVTYPE %in% top5_propertydamage$EVTYPE & EVTYPE %in% top5_cropdamage$EVTYPE ~ "Both_damages",
                                    EVTYPE %in% top5_propertydamage$EVTYPE ~ "Property_damage",
                                    EVTYPE %in% top5_cropdamage$EVTYPE ~ "Crop_damage"))

Filtering the previously created table so that only shows the top 5 damages (by injuries, fatalities and both), and adding a new column of total damage for both the crop and property damage.

only_top_damage <- summarize_economic_damage_top %>%
  filter(is_top5_damage %in% c("Both_damages", "Property_damage","Crop_damage")) %>%
  mutate(total_damage = total_cropdamage + total_propertydamage)

Plotting crop and property damage

For the visualization of this analysis, a scatter plot is created to show the correlation between the top crop damage versus the property damage.

I used functions like geom_text_repel() to label the top event types that caused the highest economic damage, where crop damage is in green, property in blue and both in red.

ggplot(summarize_economic_damage_top, aes(x = total_propertydamage, y = total_cropdamage)) +
  geom_point() +
  coord_cartesian(expand = TRUE) +
  geom_text_repel(data = only_top_damage, aes(x = total_propertydamage, y = total_cropdamage, label = EVTYPE, color = is_top5_damage)) +
  xlab("Property Damage (billions of Dollars)") +
  ylab("Crop Damage (billions of Dollars)") + 
  labs(caption = "Figure 2. Scatter plot showing the top events that lead to most economic damage")

Printing top economic damage

print(only_top_damage %>% 
        select(EVTYPE, total_damage, total_propertydamage, total_cropdamage, is_top5_damage) %>%
        arrange(desc(total_damage)))
## # A tibble: 9 × 5
##   EVTYPE       total_damage total_propertydamage total_cropdamage is_top5_damage
##   <chr>               <dbl>                <dbl>            <dbl> <chr>         
## 1 FLOOD              150.                 145.           5.66     Both_damages  
## 2 HURRICANE/T…        71.9                 69.3          2.61     Property_dama…
## 3 TORNADO             57.4                 56.9          0.415    Property_dama…
## 4 STORM SURGE         43.3                 43.3          0.000005 Property_dama…
## 5 HAIL                18.8                 15.7          3.03     Crop_damage   
## 6 FLASH FLOOD         17.6                 16.1          1.42     Property_dama…
## 7 DROUGHT             15.0                  1.05        14.0      Crop_damage   
## 8 RIVER FLOOD         10.1                  5.12         5.03     Crop_damage   
## 9 ICE STORM            8.97                 3.94         5.02     Crop_damage

As we can see, the top economic damage event if we put together the crop and the property damage are floods, with 150 billion dollars, with 145 billion dollars loss for property and 5.66 in crop damage. Interestingly, the event that lead to the most crop damage was drought, with 14 billions of dollars, while only 1 billion dollar in property damage.

Conclusion

The analysis of this data shows that tornadoes cause the biggest human damage, specially in number of reported injuries, while floods and drought have the biggest impact in crops and floods, in addition have a harmful effect in property.