The goal of this analysis is to explore the NOAA Storm Database and answer basic questions about severe weather events in order to protect human lives and property.
The analysis presented here focuses on two questions: which types of events cause the most fatalities and injuries, and which types of events cause the greatest economic damage to property and crops.
This analysis is prepared for local and federal authorities and NGOs to support decisions on resource allocation and funding, so that they can prepare for these types of natural events and minimize their impact on human life and on the local and U.S. economies.
Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents:
Data can be downloaded via the following code. The code checks the working directory for the compressed file “data.csv.bz2”. If the file does not exist in the folder, the code downloads it from the web.
suppressWarnings(suppressMessages(library(dplyr)))
library(tidyr)
library(ggplot2)
# The compressed data file ("data.csv.bz2") is expected in the working directory.
# If it does not exist, it is downloaded there. read.table() reads the bz2 file
# directly, so no separate unzip step is needed.
if (!file.exists("data.csv.bz2")) {
  # mode = "wb" keeps the binary download intact on Windows
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "data.csv.bz2", mode = "wb")
}
Raw_Data <- read.table("data.csv.bz2", header = TRUE, sep = ",", stringsAsFactors = FALSE, na.strings = "NA")
The data has the following structure.
str(Raw_Data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The data has 37 columns (variables) and 902297 rows (records).
Of these data fields, we only need the following fields to answer the aforementioned questions.
EVTYPE: Type of the event
FATALITIES: Number of fatalities
INJURIES: Number of injuries
PROPDMG: Property damage (numeric value)
PROPDMGEXP: Magnitude code for the property damage (K = thousand, M = million, B = billion)
CROPDMG: Damage to crops (numeric value)
CROPDMGEXP: Magnitude code for the crop damage (K = thousand, M = million, B = billion)
Source: NOAA Storm Database Data Dictionary
To answer the first question, we need to extract, group, summarize and plot the data by event type for fatalities and injuries.
# Extract data to answer question #1
Q1_Data <- select(Raw_Data, EVTYPE, FATALITIES, INJURIES)
# Summarize, organize and sort the data by event type for fatalities. Take records
# where fatalities >= 100 (to reduce data for the bar plot)
Q1_Fatalities <- Q1_Data %>%
  group_by(EVTYPE) %>%
  summarise(Total_Fatalities = sum(FATALITIES)) %>%
  arrange(desc(Total_Fatalities)) %>%
  filter(Total_Fatalities >= 100)
#Let's review the result of the data transformation for fatalities
head(Q1_Fatalities)
## # A tibble: 6 x 2
## EVTYPE Total_Fatalities
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
# Summarize, organize and sort the data by event type for injuries >= 500 (to simplify the plot)
Q1_Injuries <- Q1_Data %>%
  group_by(EVTYPE) %>%
  summarise(Total_Injuries = sum(INJURIES)) %>%
  arrange(desc(Total_Injuries)) %>%
  filter(Total_Injuries >= 500)
#Let's review the result of the data transformation for injuries
head(Q1_Injuries)
## # A tibble: 6 x 2
## EVTYPE Total_Injuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
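As a variation on the two pipelines above, both totals can be computed in one grouped pass and the top few event types taken by rank instead of by a fixed threshold. The sketch below is illustrative only: the data frame `toy` and its values are made up, and `health_summary` and `top_fatalities` are hypothetical names that do not appear in the analysis.

```r
library(dplyr)

# Toy data standing in for Raw_Data (values are illustrative, not from NOAA)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(3, 2, 4, 1, 0),
  INJURIES   = c(10, 5, 0, 2, 1),
  stringsAsFactors = FALSE
)

# One grouped pass yields both totals per event type
health_summary <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    Total_Fatalities = sum(FATALITIES, na.rm = TRUE),
    Total_Injuries   = sum(INJURIES, na.rm = TRUE)
  )

# Keep the two event types with the most fatalities
top_fatalities <- health_summary %>%
  arrange(desc(Total_Fatalities)) %>%
  head(2)

print(top_fatalities)
```

Selecting by rank keeps the number of bars in a plot fixed, whereas a threshold such as `>= 100` lets the number of bars depend on the data.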
To answer the second question, we need to extract and transform the data as well.
# Extract data to answer question #2
# To calculate the total damage by event, we will add Property and Crop damage
Q2_Data <- select(Raw_Data, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(Q2_Data)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 25.0 K 0
## 2 TORNADO 2.5 K 0
## 3 TORNADO 25.0 K 0
## 4 TORNADO 2.5 K 0
## 5 TORNADO 2.5 K 0
## 6 TORNADO 2.5 K 0
The data indicate that to find the total cost, we need to compute the total property and crop damage. To calculate the cost for each row, we multiply the numeric PROPDMG and CROPDMG values by the magnitudes encoded in PROPDMGEXP and CROPDMGEXP (K = thousand, M = million, B = billion). To achieve this, we first replace the K, M and B codes with the corresponding numeric multipliers, then sum the resulting costs by event type.
## # A tibble: 6 x 2
## EVTYPE Total_Cost
## <chr> <dbl>
## 1 TORNADOES, TSTM WIND, HAIL 1602500000
## 2 HIGH WINDS/COLD 117500000
## 3 HURRICANE OPAL/HIGH WINDS 110000000
## 4 WINTER STORM HIGH WINDS 65000000
## 5 Heavy Rain/High Surf 15000000
## 6 LAKESHORE FLOOD 7540000
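The code that produced the summary above is not shown in this report. The following is a minimal sketch of the conversion described in the text, run on a made-up data frame. The names `exp_multiplier`, `to_multiplier`, `toy` and `cost_summary` are hypothetical, and the choice to treat blank or unrecognised codes as a multiplier of 1 is an assumption, not something the source specifies.

```r
library(dplyr)

# Hypothetical lookup: only the K, M and B codes named in the text are mapped
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Convert exponent codes to numeric multipliers.
# Assumption: blank or unrecognised codes leave the value unscaled (x1).
to_multiplier <- function(code) {
  m <- unname(exp_multiplier[toupper(code)])
  ifelse(is.na(m), 1, m)
}

# Toy data with the same columns as Q2_Data (values are illustrative)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  PROPDMG    = c(25, 2.5, 1.5, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  CROPDMG    = c(0, 1, 2, 0),
  CROPDMGEXP = c("", "K", "M", ""),
  stringsAsFactors = FALSE
)

# Row cost = property damage + crop damage, then total per event type
cost_summary <- toy %>%
  mutate(Cost = PROPDMG * to_multiplier(PROPDMGEXP) +
                CROPDMG * to_multiplier(CROPDMGEXP)) %>%
  group_by(EVTYPE) %>%
  summarise(Total_Cost = sum(Cost)) %>%
  arrange(desc(Total_Cost))

print(cost_summary)
```

How blank codes are handled matters: if a blank code were mapped to `NA` instead of 1, a single such row would make `sum()` return `NA` for the whole event type unless `na.rm = TRUE` is set, and `arrange(desc(...))` places `NA` totals last.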
To find out which natural events are the most harmful, we plot the data manipulated above.
pl <- ggplot(Q1_Fatalities, aes(x = reorder(EVTYPE, -Total_Fatalities), y = Total_Fatalities)) +
  geom_bar(stat = "identity", fill = "red") +
  ggtitle("Top 20 Fatalities by Event Type in the US") +
  labs(x = "Event Type", y = "Fatalities") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5))
print(pl)
This plot indicates that tornadoes, excessive heat and flash floods are the leading causes of weather-related fatalities in the U.S.
pl <- ggplot(Q1_Injuries, aes(x = reorder(EVTYPE, -Total_Injuries), y = Total_Injuries)) +
  geom_bar(stat = "identity", fill = "blue") +
  ggtitle("Top Injury Causes by Event Type in the US") +
  labs(x = "Event Type", y = "Injuries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5))
print(pl)
According to the plot above, most injuries are caused by tornadoes, thunderstorm winds, flooding and excessive heat.
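The two bar charts above share the same structure and differ only in the data, labels and fill colour. One way to avoid repeating the plotting code is a small helper function. The sketch below is one possible shape for such a helper; `plot_event_totals` and its parameters are hypothetical names, and the toy data is made up.

```r
library(ggplot2)

# Hypothetical helper: bar chart of `value` by `category`, bars in descending order
plot_event_totals <- function(data, category, value, title, y_label, fill) {
  # Copy the two columns under fixed names so aes() can refer to them directly
  plot_df <- data.frame(category = data[[category]], value = data[[value]])
  ggplot(plot_df, aes(x = reorder(category, -value), y = value)) +
    geom_bar(stat = "identity", fill = fill) +
    ggtitle(title) +
    labs(x = "Event Type", y = y_label) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          plot.title = element_text(hjust = 0.5))
}

# Usage with toy data (values are illustrative only)
toy <- data.frame(EVTYPE = c("TORNADO", "HEAT", "FLOOD"),
                  Total_Fatalities = c(5, 4, 1))
pl <- plot_event_totals(toy, "EVTYPE", "Total_Fatalities",
                        title = "Fatalities by Event Type (toy data)",
                        y_label = "Fatalities", fill = "red")
print(pl)
```

Passing the column names as strings keeps the helper free of non-standard evaluation, at the cost of copying the two columns into a temporary data frame.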
Natural events also damage personal property and crops. The plot below shows the total cost of this damage to the U.S. economy by event type.
pl <- ggplot(Q2_Summary, aes(x = reorder(EVTYPE, -Total_Cost), y = Total_Cost)) +
  geom_bar(stat = "identity", fill = "green") +
  ggtitle("Top Damage Causes by Event Type in the US") +
  labs(x = "Event Type", y = "Total Damage ($)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5))
print(pl)
This plot indicates that events combining strong winds and tornadoes cause the greatest economic loss. Within this group, the combined "TORNADOES, TSTM WIND, HAIL" category alone accounts for roughly $1.6 billion in damage.
Educating the general public and local authorities may reduce this damage.