Analysis on the Disasters in the United States

2024/11/07

By Satoshi Ohnishi

1. The aim of the analysis

In this analysis, I will analyze the disasters in the United States. I will use the storm data set from the National Oceanic and Atmospheric Administration (NOAA).

NOAA Storm Events Database

The analysis aims to answer two questions:

  • Which types of events are most harmful to population health?

  • Which types of events have the greatest economic consequences?

So, let’s start the analysis.

2. Data Processing

Before starting the analysis, I will load the data and check the data structure. If the data is not tidy, I will clean it up and improve the accuracy of the analysis. First, load the necessary libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Next, load the data and check the structure.

storm <- read.csv("C:/Users/s-ohn/OneDrive/デスクトップ/repdata_data_StormData.csv.bz2")
# check the structure of the data
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The data has 902297 rows and 37 columns.

Data Cleaning 1: drop past data which is not matched with the recent data

The BGN_DATE column shows that the records go back to 1950. According to the explanation of the data on the website, the set of recorded event types changed several times over the life of the data set, and only from January 1996 onward are all event types recorded in the EVTYPE column. I therefore decided to use only the data from 1996 onward, so that event types are comparable across years.

# create a year column from BGN_DATE (character, e.g. "4/18/1950 0:00:00")
storm$year <- as.Date(storm$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
storm$year <- as.numeric(format(storm$year, "%Y"))
# filter the data from 1996
storm <- storm %>% filter(year >= 1996)
# Print unique values of year
unique(storm$year)
##  [1] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## [16] 2011

The year column now contains only the years 1996 to 2011.
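The year extraction and filter can be checked on a few hand-written date strings before running it on the full data set. This is a minimal sketch: the vector `bgn_date` and the names `event_year` and `keep` are made up for illustration, and only the date format is taken from the code above.

```r
# Toy BGN_DATE values in the same "month/day/year time" layout as the raw data
bgn_date <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "12/31/2011 0:00:00")

# Parse the strings to Date, then pull out the four-digit year as an integer
parsed <- as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S")
event_year <- as.integer(format(parsed, "%Y"))

# Logical mask for rows that would survive the "1996 onward" filter
keep <- event_year >= 1996

print(data.frame(bgn_date, event_year, keep))
```

A date string that does not match the format parses to NA, so counting `sum(is.na(parsed))` on the real column is a quick way to confirm that nothing was silently lost.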

Data Cleaning 2: drop unnecessary columns

Now, I will drop unnecessary columns for this analysis. In this analysis I will focus on the following columns:

-“STATE” The state (two-letter abbreviation) in which the event occurred

-“EVTYPE” The type of event

-“FATALITIES” The number of fatalities

-“INJURIES” The number of injuries

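-“PROPDMG” The estimated amount of damage to property incurred by the weather event, as the raw figure recorded in the data (the magnitude code in PROPDMGEXP is not applied in this analysis)

-“CROPDMG” The estimated amount of damage to crops incurred by the weather event, as the raw figure recorded in the data (the magnitude code in CROPDMGEXP is not applied in this analysis)

-“year” The year in which the event began, created in the previous step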

storm <- storm[, c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG", "year")]
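The selection above keeps PROPDMG and CROPDMG but not the PROPDMGEXP and CROPDMGEXP columns, which hold a magnitude code for each figure. If dollar amounts were wanted, the conversion could look roughly like the sketch below. It is not part of the analysis in this report. The helper `exp_to_multiplier`, the toy data frame, and the column `PROPDMG_USD` are hypothetical, and treating every code other than K, M, or B as a multiplier of 1 is an assumption that would need checking against the data documentation.

```r
# Toy rows standing in for the raw data; the values are illustrative only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a magnitude code to a numeric multiplier.
# K = thousands, M = millions, B = billions.
# ASSUMPTION: any other code (including the empty string) maps to 1.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  multiplier <- unname(lookup[toupper(code)])
  multiplier[is.na(multiplier)] <- 1
  multiplier
}

# Damage in dollars = recorded figure times its multiplier
toy$PROPDMG_USD <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
print(toy)
```

Applied to the full data, this step would have to run before the column selection, while PROPDMGEXP and CROPDMGEXP are still present.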

Data Cleaning 3: Check the missing values

head(storm)
##   STATE       EVTYPE FATALITIES INJURIES PROPDMG CROPDMG year
## 1    AL WINTER STORM          0        0     380      38 1996
## 2    AL      TORNADO          0        0     100       0 1996
## 3    AL    TSTM WIND          0        0       3       0 1996
## 4    AL    TSTM WIND          0        0       5       0 1996
## 5    AL    TSTM WIND          0        0       2       0 1996
## 6    AL         HAIL          0        0       0       0 1996
str(storm)
## 'data.frame':    653530 obs. of  7 variables:
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "WINTER STORM" "TORNADO" "TSTM WIND" "TSTM WIND" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMG   : num  380 100 3 5 2 0 400 12 8 12 ...
##  $ CROPDMG   : num  38 0 0 0 0 0 0 0 0 0 ...
##  $ year      : num  1996 1996 1996 1996 1996 ...
# check the missing values
sum(is.na(storm))
## [1] 0

The data has no missing values.
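Note that `is.na()` only detects values stored as NA. Character columns read by `read.csv()` can also hold empty strings, which it does not count. A per-column check that covers both cases might look like the following sketch, where the toy data frame and the names `na_per_column` and `blank_per_column` are made up for illustration.

```r
# Toy data with one NA per affected column and one empty string
toy <- data.frame(
  STATE      = c("AL", "", "TX"),
  EVTYPE     = c("HAIL", "TORNADO", NA),
  FATALITIES = c(0, NA, 2),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))

# Number of empty or whitespace-only strings in each character column
# (non-character columns are reported as 0)
blank_per_column <- sapply(toy, function(col) {
  if (is.character(col)) sum(!is.na(col) & trimws(col) == "") else 0L
})

print(na_per_column)
print(blank_per_column)
```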

3. Results

The data is tidy now, so let’s start answering the two questions. First, I will check the number of events recorded in the data set and the summary statistics (minimum, quartiles, median, mean, and maximum) of the damage columns.

# how many events are recorded in the data set?
nrow(storm)
## [1] 653530
damage_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
# print the min, quartiles, median, mean, and max of damage_cols
summary(storm[, damage_cols])
##    FATALITIES           INJURIES           PROPDMG           CROPDMG       
##  Min.   :  0.00000   Min.   :0.00e+00   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:  0.00000   1st Qu.:0.00e+00   1st Qu.:   0.00   1st Qu.:  0.000  
##  Median :  0.00000   Median :0.00e+00   Median :   0.00   Median :  0.000  
##  Mean   :  0.01336   Mean   :8.87e-02   Mean   :  11.69   Mean   :  1.839  
##  3rd Qu.:  0.00000   3rd Qu.:0.00e+00   3rd Qu.:   1.00   3rd Qu.:  0.000  
##  Max.   :158.00000   Max.   :1.15e+03   Max.   :5000.00   Max.   :990.000
# Box plot of damage_cols
par(mfrow=c(2,2))
for (col in damage_cols) {
  boxplot(storm[, col], main = col)
}

# histogram of damage_cols
par(mfrow=c(2,2))
for (col in damage_cols) {
  hist(storm[, col], main = col)
}

# Print correlation matrix of damage_cols
cor(storm[, damage_cols])
##            FATALITIES   INJURIES    PROPDMG    CROPDMG
## FATALITIES 1.00000000 0.42623579 0.01619302 0.01008034
## INJURIES   0.42623579 1.00000000 0.02694289 0.02149989
## PROPDMG    0.01619302 0.02694289 1.00000000 0.09909225
## CROPDMG    0.01008034 0.02149989 0.09909225 1.00000000
# make pair plot
pairs(storm[, damage_cols])

From the box plots and histograms, we can see that the damage columns contain many outliers: almost all values are at or near 0, but a few are very large. The correlation matrix shows that the damage columns are only weakly linearly related to each other. The number of fatalities is moderately correlated with the number of injuries (r ≈ 0.43) and essentially uncorrelated with the amount of damage to property or crops (r < 0.03). Low correlation alone does not prove that the columns are independent, but it does suggest that each column captures a different aspect of harm.
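As a caution when reading a correlation matrix, a Pearson correlation near zero only rules out a linear relationship. The toy example below uses made-up numbers that are unrelated to the storm data. It shows two variables where one is completely determined by the other and the correlation is still zero.

```r
# y is an exact function of x, so the two are clearly not independent
x <- c(-2, -1, 0, 1, 2)
y <- x^2

# Pearson correlation measures only the linear part of the relationship
pearson <- cor(x, y)
print(pearson)
```

For heavily skewed columns such as the damage figures, a rank-based measure like `cor(..., method = "spearman")` is a common complement to the default Pearson coefficient.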

Therefore, I will analyze the damage columns separately, calculating the total for each column and finding its top 5 event types.

# calculate the total damage for each damage column
storm_agg <- storm %>%
  group_by(EVTYPE, STATE) %>%
  summarise(FATALITIES = sum(FATALITIES),
            INJURIES = sum(INJURIES),
            PROPDMG = sum(PROPDMG),
            CROPDMG = sum(CROPDMG))
## `summarise()` has grouped output by 'EVTYPE'. You can override using the
## `.groups` argument.
# Find top 5 events for each damage column
top5_fatalities <- storm_agg %>% 
  # sum the damage columns for each event type
  group_by(EVTYPE) %>%
  summarise(FATALITIES = sum(FATALITIES)) %>%
  arrange(desc(FATALITIES)) %>% 
  head(5)

top5_injuries <- storm_agg %>%
  group_by(EVTYPE) %>%
  summarise(INJURIES = sum(INJURIES)) %>% 
  arrange(desc(INJURIES)) %>% 
  head(5)

top5_propdmg <- storm_agg %>%
  group_by(EVTYPE) %>%
  summarise(PROPDMG = sum(PROPDMG)) %>%
  arrange(desc(PROPDMG)) %>%
  head(5)

top5_cropdmg <- storm_agg %>%
  group_by(EVTYPE) %>%
  summarise(CROPDMG = sum(CROPDMG)) %>%
  arrange(desc(CROPDMG)) %>%
  head(5)

# plot with bar chart for each damage column(2x2)
par(mfrow=c(2,2))
# set figure size
options(repr.plot.width=15, repr.plot.height=12)

barplot(top5_fatalities$FATALITIES, names.arg = top5_fatalities$EVTYPE, main = "Top 5 events for Fatalities", las = 2, cex.names = 0.7)
barplot(top5_injuries$INJURIES, names.arg = top5_injuries$EVTYPE, main = "Top 5 events for Injuries", las = 2, cex.names = 0.7)
barplot(top5_propdmg$PROPDMG, names.arg = top5_propdmg$EVTYPE, main = "Top 5 events for Property Damage", las = 2, cex.names = 0.7)
barplot(top5_cropdmg$CROPDMG, names.arg = top5_cropdmg$EVTYPE, main = "Top 5 events for Crop Damage", las = 2, cex.names = 0.7)

The top 3 events for each damage column are as follows:

  • Fatalities: Excessive Heat, Tornado, Flash Flood

  • Injuries: Tornado, Flood, Excessive Heat

  • Property Damage: Tstm Wind, Flash Flood, Tornado

  • Crop Damage: Hail, Flash Flood, Flood
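The four top-5 pipelines used for these rankings differ only in the column they sum, so they could be folded into one small helper. The sketch below is one possible way to do that, not the code that produced the results above: the function `top_events`, its arguments, and the toy data are hypothetical. Setting `.groups = "drop"` also silences the `summarise()` grouping message.

```r
library(dplyr)

# Hypothetical helper: total one damage column per event type, keep the top n
top_events <- function(data, column, n = 5) {
  data %>%
    group_by(EVTYPE) %>%
    summarise(total = sum(.data[[column]]), .groups = "drop") %>%
    arrange(desc(total)) %>%
    head(n)
}

# Toy data with made-up counts, only to exercise the helper
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL", "FLOOD", "FLOOD", "HAIL"),
  FATALITIES = c(3, 2, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

result <- top_events(toy, "FATALITIES", n = 2)
print(result)
```

On the real data, a call such as `top_events(storm, "INJURIES")` would take the place of one of the repeated pipelines.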

In which states do the most harmful events occur? To answer this, I aggregate the data by state and calculate the total fatalities, injuries, property damage, and crop damage. I then find the top 5 states for each damage column and plot the results as stacked bar charts. In each chart, the x-axis shows the state and the y-axis shows the total damage. Each bar is stacked by event type: the top 5 event types for that damage column are shown individually, and all remaining event types are grouped into a single “other” category.

# Plot the top 5 states for Fatalities
state_fatalities <- storm_agg %>%
  # mutate the event type to 'other' if it is not in top 5
  mutate(EVTYPE = ifelse(EVTYPE %in% top5_fatalities$EVTYPE, as.character(EVTYPE), "other")) %>%
  group_by(STATE, EVTYPE) %>%
  summarise(FATALITIES = sum(FATALITIES))
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
# add the total number of fatalities for each state
top5_fatalities <- state_fatalities %>%
  group_by(STATE) %>%
  summarise(FATALITIES = sum(FATALITIES)) %>%
  arrange(desc(FATALITIES)) %>%
  head(5)  %>%
  # rename the FATALITIES column to Total_Fatalities
  rename(Total_Fatalities = FATALITIES)

# inner join the top5_fatalities and state_fatalities
top5_fatalities <- inner_join(top5_fatalities, state_fatalities, by = c("STATE" = "STATE")) %>%
  arrange(desc(Total_Fatalities))

# plot the stacked bar chart(x=state, y=total fatalities, color=EVTYPES)
ggplot(top5_fatalities, aes(x = STATE, y = FATALITIES, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 5 states for Fatalities", x = "State", y = "Total Fatalities") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# plot the top 5 states for Injuries
state_injuries <- storm_agg %>%
  mutate(EVTYPE = ifelse(EVTYPE %in% top5_injuries$EVTYPE, as.character(EVTYPE), "other")) %>%
  group_by(STATE, EVTYPE) %>%
  summarise(INJURIES = sum(INJURIES))
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
top5_injuries <- state_injuries %>%
  group_by(STATE) %>%
  summarise(INJURIES = sum(INJURIES)) %>%
  arrange(desc(INJURIES)) %>%
  head(5) %>%
  rename(Total_Injuries = INJURIES)

top5_injuries <- inner_join(top5_injuries, state_injuries, by = c("STATE" = "STATE")) %>%
  arrange(desc(Total_Injuries))

ggplot(top5_injuries, aes(x = STATE, y = INJURIES, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 5 states for Injuries", x = "State", y = "Total Injuries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# plot the top 5 states for Property Damage
state_propdmg <- storm_agg %>%
  mutate(EVTYPE = ifelse(EVTYPE %in% top5_propdmg$EVTYPE, as.character(EVTYPE), "other")) %>%
  group_by(STATE, EVTYPE) %>%
  summarise(PROPDMG = sum(PROPDMG))
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
top5_propdmg <- state_propdmg %>%
  group_by(STATE) %>%
  summarise(PROPDMG = sum(PROPDMG)) %>%
  arrange(desc(PROPDMG)) %>%
  head(5) %>%
  rename(Total_Propdmg = PROPDMG)

top5_propdmg <- inner_join(top5_propdmg, state_propdmg, by = c("STATE" = "STATE")) %>%
  arrange(desc(Total_Propdmg))

ggplot(top5_propdmg, aes(x = STATE, y = PROPDMG, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 5 states for Property Damage", x = "State", y = "Total Property Damage") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# plot the top 5 states for Crop Damage
state_cropdmg <- storm_agg %>%
  mutate(EVTYPE = ifelse(EVTYPE %in% top5_cropdmg$EVTYPE, as.character(EVTYPE), "other")) %>%
  group_by(STATE, EVTYPE) %>%
  summarise(CROPDMG = sum(CROPDMG))
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
top5_cropdmg <- state_cropdmg %>%
  group_by(STATE) %>%
  summarise(CROPDMG = sum(CROPDMG)) %>%
  arrange(desc(CROPDMG)) %>%
  head(5) %>%
  rename(Total_Cropdmg = CROPDMG)

top5_cropdmg <- inner_join(top5_cropdmg, state_cropdmg, by = c("STATE" = "STATE")) %>%
  arrange(desc(Total_Cropdmg))

ggplot(top5_cropdmg, aes(x = STATE, y = CROPDMG, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 5 states for Crop Damage", x = "State", y = "Total Crop Damage") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
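The stacked charts above use ggplot2's default fill colors, so the “other” category gets a hue like any named event type. If it should recede visually, one option is a manual palette that maps it to a neutral gray. The sketch below uses a made-up toy data frame. The name `fill_colors`, the chosen base palette, and all values are assumptions for illustration and not results from the storm data.

```r
library(ggplot2)

# Toy state totals with an "other" bucket, made up for illustration
toy <- data.frame(
  STATE      = c("TX", "TX", "IL", "IL"),
  EVTYPE     = c("TORNADO", "other", "EXCESSIVE HEAT", "other"),
  FATALITIES = c(10, 4, 8, 3),
  stringsAsFactors = FALSE
)

# One distinct color per named event type, plus gray for "other"
named_events <- setdiff(unique(toy$EVTYPE), "other")
fill_colors <- c(
  setNames(hcl.colors(length(named_events), palette = "Dark 2"), named_events),
  other = "grey70"
)

p <- ggplot(toy, aes(x = STATE, y = FATALITIES, fill = EVTYPE)) +
  geom_col() +
  scale_fill_manual(values = fill_colors) +
  labs(title = "Toy example: gray 'other' category",
       x = "State", y = "Total Fatalities")
print(p)
```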

4. Conclusion

In this analysis, I analyzed the disasters in the United States using the storm data set from the National Oceanic and Atmospheric Administration (NOAA).

The main purpose of this analysis was to find out which types of events are most harmful to population health and which types of events have the greatest economic consequences.

The results are as follows:

  • The top 3 events that caused the most fatalities were Excessive Heat, Tornado, and Flash Flood.

  • The top 3 events that caused the most injuries were Tornado, Flood, and Excessive Heat.

  • The top 3 events that caused the most property damage were Tstm Wind, Flash Flood, and Tornado.

  • The top 3 events that caused the most crop damage were Hail, Flash Flood, and Flood.

Texas is the state where the most harmful events occurred in terms of fatalities, injuries, and property damage.

The kinds of events that cause harm differ from state to state, so it is important for each state to take measures against the events that affect it most.
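To follow up on the state-by-state point, the most harmful event type in each state could be extracted with a grouped `slice_max()`. This is a minimal sketch on made-up numbers: the toy data frame and the name `worst_by_state` are hypothetical, and ties are broken arbitrarily by `with_ties = FALSE`.

```r
library(dplyr)

# Toy records with made-up fatality counts
toy <- data.frame(
  STATE      = c("TX", "TX", "TX", "IL", "IL"),
  EVTYPE     = c("FLASH FLOOD", "TORNADO", "FLASH FLOOD", "EXCESSIVE HEAT", "TORNADO"),
  FATALITIES = c(4, 3, 2, 9, 1),
  stringsAsFactors = FALSE
)

# Total fatalities per state and event type, then keep the largest per state
worst_by_state <- toy %>%
  group_by(STATE, EVTYPE) %>%
  summarise(FATALITIES = sum(FATALITIES), .groups = "drop") %>%
  group_by(STATE) %>%
  slice_max(FATALITIES, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(STATE)

print(worst_by_state)
```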

Thank you for reading this analysis report.