Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The goal of this assignment is to use the database to answer two basic questions about severe weather events, showing the code for the entire analysis.

Questions to be answered:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

In this section we describe the steps performed on the dataset, including downloading, reading, and transforming it, to make sure the data is tidy before we run the analysis.
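
The download step is described but not shown in the code. A minimal sketch of how it could be done is given here. The helper name `fetch_if_missing` is hypothetical, and the data URL is an assumption (it is not stated in this report); only the local file name matches the one read in the next step.

```r
# Assumed location of the compressed storm data; adjust if the file lives elsewhere.
storm_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_file <- "repdata_data_StormData.csv.bz2"

# Hypothetical helper: download the file only when it is not already on disk,
# then return the local path so it can be passed straight to read.csv().
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb")  # binary mode keeps the .bz2 intact
  }
  destfile
}

# Usage (commented out because the first run downloads a large file):
# data <- read.csv(fetch_if_missing(storm_url, storm_file))
```

Skipping the download when the file already exists keeps repeated runs of the report fast and avoids needless network traffic.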

data<-read.csv("repdata_data_StormData.csv.bz2")

#Visualizing the data and its structure
head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
dim(data)
## [1] 902297     37

The data has 902,297 observations of 37 variables. The columns record, among other things, the event location, magnitude (MAG), number of fatalities and injuries, and property and crop damage. More information can be found in the documentation: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

summary(is.na(data))
##   STATE__         BGN_DATE        BGN_TIME       TIME_ZONE      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    FALSE:902297    FALSE:902297   
##                                                                 
##    COUNTY        COUNTYNAME        STATE           EVTYPE       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    FALSE:902297    FALSE:902297   
##                                                                 
##  BGN_RANGE        BGN_AZI        BGN_LOCATI       END_DATE      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    FALSE:902297    FALSE:902297   
##                                                                 
##   END_TIME       COUNTY_END      COUNTYENDN     END_RANGE        END_AZI       
##  Mode :logical   Mode :logical   Mode:logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    TRUE:902297    FALSE:902297    FALSE:902297   
##                                                                                
##  END_LOCATI        LENGTH          WIDTH             F          
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    FALSE:902297    FALSE:58734    
##                                                  TRUE :843563   
##     MAG          FATALITIES       INJURIES        PROPDMG       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    FALSE:902297    FALSE:902297   
##                                                                 
##  PROPDMGEXP       CROPDMG        CROPDMGEXP         WFO         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    FALSE:902297    FALSE:902297   
##                                                                 
##  STATEOFFIC      ZONENAMES        LATITUDE       LONGITUDE      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902297    FALSE:902297    FALSE:902250    FALSE:902297   
##                                  TRUE :47                       
##  LATITUDE_E      LONGITUDE_       REMARKS          REFNUM       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:902257    FALSE:902297    FALSE:902297    FALSE:902297   
##  TRUE :40
#Changing date column
data$BGN_DATE<- as.POSIXct(data$BGN_DATE, format = "%m/%d/%Y")
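
As a possible alternative to the conversion above: in the rows printed earlier the time part of BGN_DATE is always 0:00:00, so a plain `Date` may be enough and avoids attaching a time zone. This is a sketch on two sample strings copied from the output, not a change to the analysis; the variable names are illustrative.

```r
# Two BGN_DATE strings as they appear in the head() output
bgn <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# Parse only the date part; the trailing time text is ignored by the format
bgn_date <- as.Date(bgn, format = "%m/%d/%Y")

# Extract the year, e.g. for grouping events by year later on
bgn_year <- as.integer(format(bgn_date, "%Y"))
print(data.frame(bgn_date, bgn_year))
```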

Only a few columns contain NA values: COUNTYENDN (entirely NA), F (843,563 of 902,297 rows), LATITUDE (47 rows), and LATITUDE_E (40 rows).
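
The wide `summary(is.na(data))` table can be hard to scan. A more compact check is sketched below on a small made-up data frame (the values are illustrative, not taken from the storm data). It also counts empty strings separately, since many character columns in the `str()` output hold `""` rather than NA.

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(
  F        = c(3, NA, NA, 2),
  LATITUDE = c(3040, NA, 3340, 3458),
  END_DATE = c("", "", "4/18/1950", ""),
  stringsAsFactors = FALSE
)

# Number of NA values per column, then only the columns that have any
na_counts <- colSums(is.na(toy))
print(na_counts[na_counts > 0])

# Empty strings are not NA, so count them separately for character columns
empty_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})
print(empty_counts)
```

On the real data, the same two calls would be applied to `data` instead of `toy`.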

To start answering the questions, we first look at summaries of fatalities and injuries, then total them by event type.

summary(data$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000
summary(data$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
summary_data <- data %>%
  group_by(EVTYPE) %>%
  summarise(
    Total_Fatalities = sum(FATALITIES, na.rm = TRUE),
    Total_Injuries = sum(INJURIES, na.rm = TRUE),
    Total_Harm = Total_Fatalities + Total_Injuries
  ) %>%
  arrange(desc(Total_Harm)) %>%
  slice_head(n = 10)

print(summary_data)
## # A tibble: 10 × 4
##    EVTYPE            Total_Fatalities Total_Injuries Total_Harm
##    <chr>                        <dbl>          <dbl>      <dbl>
##  1 TORNADO                       5633          91346      96979
##  2 EXCESSIVE HEAT                1903           6525       8428
##  3 TSTM WIND                      504           6957       7461
##  4 FLOOD                          470           6789       7259
##  5 LIGHTNING                      816           5230       6046
##  6 HEAT                           937           2100       3037
##  7 FLASH FLOOD                    978           1777       2755
##  8 ICE STORM                       89           1975       2064
##  9 THUNDERSTORM WIND              133           1488       1621
## 10 WINTER STORM                   206           1321       1527
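
The table lists TSTM WIND and THUNDERSTORM WIND as separate event types, which suggests that EVTYPE labels are not fully standardized and that some totals are split across variants. The sketch below shows one way the labels could be normalized before grouping. The cleaning rules and the helper name `clean_evtype` are assumptions for illustration, and the numbers are made up.

```r
library(dplyr)

# Toy records (made-up numbers) with label variants like those in the table
toy <- data.frame(
  EVTYPE     = c("TSTM WIND", "THUNDERSTORM WIND", "Thunderstorm Winds ", "TORNADO"),
  FATALITIES = c(1, 2, 3, 4),
  INJURIES   = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Hypothetical cleaning rule: upper-case, trim whitespace,
# expand the TSTM abbreviation, and fold the plural WINDS into WIND
clean_evtype <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)
  x <- gsub("\\bWINDS\\b", "WIND", x, perl = TRUE)
  x
}

toy_summary <- toy %>%
  mutate(EVTYPE = clean_evtype(EVTYPE)) %>%
  group_by(EVTYPE) %>%
  summarise(
    Total_Fatalities = sum(FATALITIES),
    Total_Injuries   = sum(INJURIES),
    .groups = "drop"
  ) %>%
  mutate(Total_Harm = Total_Fatalities + Total_Injuries) %>%
  arrange(desc(Total_Harm))

print(toy_summary)
```

Whether merging such variants changes the ranking on the full dataset would need to be checked by rerunning the analysis.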


Results

With the data summarized, we now create figures to answer the two questions.

library(dplyr)
library(tidyr)
library(ggplot2)


# Convert to long format for grouped bar chart
summary_long <- summary_data %>%
  pivot_longer(cols = c(Total_Fatalities, Total_Injuries, Total_Harm),
               names_to = "Harm_Type",
               values_to = "Count")

# Plot
ggplot(summary_long, aes(x = reorder(EVTYPE, -Count), y = Count, fill = Harm_Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Fatalities, Injuries, and Harm by Event Type",
       x = "Event Type",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("salmon", "skyblue", "darkseagreen"))

The figure shows the 10 event types that cause the most harm to the population, counting both injuries and fatalities. Tornadoes lead by a wide margin (96,979 combined, against 8,428 for excessive heat). Now let’s look at property and crop damage.

summary(data$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00
summary(data$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.527   0.000 990.000
library(dplyr)

dmg_summary <- data %>%
  group_by(EVTYPE) %>%
  summarise(
    Total_PropDmg = sum(PROPDMG, na.rm = TRUE),
    Total_CropDmg = sum(CROPDMG, na.rm = TRUE),
    Total_Dmg = Total_PropDmg + Total_CropDmg
  ) %>%
  arrange(desc(Total_Dmg)) %>%
  slice_head(n = 10)

print(dmg_summary)
## # A tibble: 10 × 4
##    EVTYPE             Total_PropDmg Total_CropDmg Total_Dmg
##    <chr>                      <dbl>         <dbl>     <dbl>
##  1 TORNADO                 3212258.       100019.  3312277.
##  2 FLASH FLOOD             1420125.       179200.  1599325.
##  3 TSTM WIND               1335966.       109203.  1445168.
##  4 HAIL                     688693.       579596.  1268290.
##  5 FLOOD                    899938.       168038.  1067976.
##  6 THUNDERSTORM WIND        876844.        66791.   943636.
##  7 LIGHTNING                603352.         3581.   606932.
##  8 THUNDERSTORM WINDS       446293.        18685.   464978.
##  9 HIGH WIND                324732.        17283.   342015.
## 10 WINTER STORM             132721.         1979.   134700.
# Convert to long format for grouped bar chart
summary_dmg_long <- dmg_summary %>%
  pivot_longer(cols = c(Total_Dmg, Total_PropDmg, Total_CropDmg),
               names_to = "Damage_Type",
               values_to = "Count")

# Plot
ggplot(summary_dmg_long, aes(x = reorder(EVTYPE, -Count), y = Count, fill = Damage_Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Property, Crop, and Combined Damage by Event Type",
       x = "Event Type",
       y = "Damage (raw PROPDMG/CROPDMG units)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("salmon", "skyblue", "darkseagreen"))

In these raw totals (PROPDMG and CROPDMG are summed as recorded, without applying the PROPDMGEXP and CROPDMGEXP multipliers), tornadoes and flash floods are the leading causes of property damage, while hail is the largest source of crop damage.
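
The totals above sum PROPDMG and CROPDMG directly, although the dataset also carries the PROPDMGEXP and CROPDMGEXP columns, whose letter codes give the magnitude of each value. A sketch of how those codes could be applied before summing is shown here on made-up rows. The helper name `exp_multiplier` is hypothetical. K, M, and B stand for thousands, millions, and billions; treating H as hundreds and every other code (blank, digits, symbols) as a multiplier of 1 is an assumption.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Codes outside the lookup table (including "") fall back to 1 by assumption.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(code)]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy rows (made-up values) using the same column names as the storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "B", ""),
  CROPDMG    = c(0, 1, 5),
  CROPDMGEXP = c("", "m", "K"),
  stringsAsFactors = FALSE
)

# Scale each damage figure to dollars, then add property and crop damage
toy$PropDmg_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$CropDmg_USD <- toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
toy$Total_USD   <- toy$PropDmg_USD + toy$CropDmg_USD

print(toy[, c("EVTYPE", "PropDmg_USD", "CropDmg_USD", "Total_USD")])
```

In the toy rows, a value of 2.5 with code B outweighs a value of 25 with code K by several orders of magnitude, so applying the multipliers could change the ranking of event types on the full dataset.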

Conclusions

From this analysis, we conclude that tornadoes are by far the main cause of injuries and fatalities. Tornadoes are also the leading cause of property damage, while hail is the biggest source of crop damage.