In this project, we explore the NOAA Storm Database to answer two basic questions about severe weather events: which types of events are most harmful to population health, and which have the greatest economic consequences in the United States. All results are shown as tables and plots produced by R code.
Download the data
# Download the compressed file only if it is not already present
if (!file.exists("StormData.csv.bz2")) {
url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(url, destfile = "StormData.csv.bz2", method = "curl")
}
# The download can be slow, so this chunk is set to eval = FALSE in the report;
# run it by hand if you need to fetch the file.
Read the data
data <- read.csv(bzfile("StormData.csv.bz2"))
# Inspect the structure of the data frame
# to decide what preprocessing is needed
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
A brief explanation of the variables:
We want to measure the cost of each type of event in terms of population health and economic consequences.
EVTYPE, the type of event, is the key variable for both questions.
Population health is reflected by FATALITIES and INJURIES, the numbers of deaths and injuries. Economic consequences are reflected by PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP, which record the damage to property and crops.
According to the Storm Data Documentation, the magnitude of a damage figure is given by an alphabetical character: “K” for thousands, “M” for millions, and “B” for billions. The preprocessing step therefore builds a subset with the variables we need and adds two columns, propmagnitude and cropmagnitude, which hold property and crop damage in dollars. (Expressing both in dollars keeps the plots on a common scale.)
Preprocess the data
subdata <- subset(data, select = c('EVTYPE','FATALITIES','INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP'))
# Keep only the variables used in this report, then add two new columns for damage in dollars
subdata[, c("propmagnitude", "cropmagnitude")] <- 0
subdata[subdata$PROPDMGEXP=="K",]$propmagnitude <- subdata[subdata$PROPDMGEXP=="K",]$PROPDMG * 10^3
subdata[subdata$PROPDMGEXP=="M",]$propmagnitude <- subdata[subdata$PROPDMGEXP=="M",]$PROPDMG * 10^6
subdata[subdata$PROPDMGEXP=="B",]$propmagnitude <- subdata[subdata$PROPDMGEXP=="B",]$PROPDMG * 10^9
subdata[subdata$CROPDMGEXP=="K",]$cropmagnitude <- subdata[subdata$CROPDMGEXP=="K",]$CROPDMG * 10^3
subdata[subdata$CROPDMGEXP=="M",]$cropmagnitude <- subdata[subdata$CROPDMGEXP=="M",]$CROPDMG * 10^6
subdata[subdata$CROPDMGEXP=="B",]$cropmagnitude <- subdata[subdata$CROPDMGEXP=="B",]$CROPDMG * 10^9
# Preprocessing is done; use View(subdata) to check that propmagnitude and cropmagnitude exist
# Crop damage is zero in these first rows, but head() shows that property damage is filled in
head(subdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
## propmagnitude cropmagnitude
## 1 25000 0
## 2 2500 0
## 3 25000 0
## 4 2500 0
## 5 2500 0
## 6 2500 0
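The six assignments above repeat the same conversion for each exponent code. As a minimal sketch, the same idea can be written with a named lookup vector. The helper name exp_multiplier and the toy data are assumptions for illustration, not part of the original report. Note one deliberate difference: this sketch also treats lowercase codes such as "m" as valid, whereas the code above leaves them at 0.

```r
# Hypothetical helper (not part of the original report): map an exponent
# code to its dollar multiplier using a named lookup vector.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # toupper() makes "k" and "m" count as "K" and "M"; as.character()
  # prevents factor input from being used as integer indices.
  mult <- lookup[toupper(as.character(code))]
  # Codes outside K/M/B (blank, "?", digits, ...) are treated as zero here,
  # the same default used in the preprocessing above.
  mult[is.na(mult)] <- 0
  unname(mult)
}

# Toy data for illustration only
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "", "?"))
toy$propmagnitude <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

On the real data, the equivalent call would be subdata$PROPDMG * exp_multiplier(subdata$PROPDMGEXP), and likewise for the crop columns. How codes other than K, M, and B should be handled is a judgment call that this sketch does not settle.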
Question 1:
Across the United States, which types of events are most harmful with respect to population health?
1: The most harmful events with respect to fatalities.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
library(ggplot2)
# forcats provides fct_reorder(), which reorders the event types for the ggplot2 plot
# dplyr provides mutate() and the %>% pipe used together with fct_reorder()
fatality <- aggregate(FATALITIES ~ EVTYPE, subdata, sum)
fatality <- fatality[with(fatality,order(-FATALITIES)),]
fatality <- fatality[1:15,]
# Alternatively, fatality$EVTYPE <- factor(fatality$EVTYPE, levels = fatality$EVTYPE)
# resets the levels so that they follow the sorted row order
# Hint: check the current level order with levels(fatality$EVTYPE)
# Sorting the rows alone does not change the levels, so the plot would not be ordered by total
fatality %>%
mutate(EVTYPE = fct_reorder(EVTYPE, FATALITIES)) %>%
ggplot(aes(EVTYPE, FATALITIES)) +
geom_bar(stat="identity", fill="dodgerblue", width=0.5) +
xlab("Types of events") +
ylab("Number of fatalities") +
ggtitle("Top 15 most harmful events with respect to fatalities") +
coord_flip() +
theme_bw()
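The comments in the chunk above mention that sorting the rows does not reorder the factor levels, which is why fct_reorder() is needed before plotting. A minimal base R sketch of that behaviour, using made-up totals rather than the storm data:

```r
# Toy totals for illustration only (not taken from the storm data)
toy <- data.frame(EVTYPE = factor(c("FLOOD", "HEAT", "TORNADO")),
                  FATALITIES = c(50, 30, 90))

# Sorting the rows leaves the factor levels in alphabetical order
toy <- toy[order(-toy$FATALITIES), ]
before <- levels(toy$EVTYPE)
before

# Base R alternative to fct_reorder(): rebuild the factor with its levels
# ordered by the totals, smallest first
toy$EVTYPE <- factor(toy$EVTYPE,
                     levels = as.character(toy$EVTYPE)[order(toy$FATALITIES)])
after <- levels(toy$EVTYPE)
after
```

ggplot2 draws a discrete axis in level order, so only the second ordering gives bars sorted by total.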
2: The most harmful events with respect to injuries.
library(dplyr)
library(forcats)
library(ggplot2)
injury <- aggregate(INJURIES ~ EVTYPE, subdata, sum)
injury <- injury[with(injury,order(-INJURIES)),]
injury <- injury[1:15,]
# Alternatively, injury$EVTYPE <- factor(injury$EVTYPE, levels = injury$EVTYPE)
# resets the levels so that they follow the sorted row order
injury %>%
mutate(EVTYPE = fct_reorder(EVTYPE, INJURIES)) %>%
ggplot(aes(EVTYPE, INJURIES)) +
geom_bar(stat="identity", fill="dodgerblue", width=0.5) +
xlab("Types of events") +
ylab("Number of injuries") +
ggtitle("Top 15 most harmful events with respect to injuries") +
coord_flip() +
theme_bw()
Conclusion: TORNADO is the most harmful event type with respect to population health across the United States, in terms of both fatalities and injuries. The bar plots show the 15 most harmful event types for each measure.
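The fatality and injury chunks follow the same three steps: total a column per event type, sort in decreasing order, and keep the first 15 rows. As a sketch only, those steps could be wrapped in one function. The name top_events and the toy data are assumptions for illustration and do not appear in the original analysis.

```r
# Hypothetical helper (the name top_events is not from the original report):
# total one numeric column per event type and keep the n largest totals.
top_events <- function(df, column, n = 15) {
  totals <- aggregate(df[[column]], by = list(EVTYPE = df$EVTYPE), FUN = sum)
  names(totals)[2] <- column                      # aggregate() names it "x"
  totals <- totals[order(-totals[[column]]), ]    # largest total first
  head(totals, n)
}

# Toy data for illustration only
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                  INJURIES = c(15, 2, 6, 4, 1))
top2 <- top_events(toy, "INJURIES", n = 2)
top2
```

With the real data, top_events(subdata, "FATALITIES") and top_events(subdata, "INJURIES") would be expected to produce the same tables as the two chunks above.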
Question 2:
Across the United States, which types of events have the greatest economic consequences?
library(dplyr)
library(forcats)
library(ggplot2)
expense <- aggregate(propmagnitude + cropmagnitude ~ EVTYPE, subdata, sum)
# Rename the second column; its default name, "propmagnitude + cropmagnitude", is awkward to reference
names(expense)[2] <- "EXPENSE"
expense <- expense[with(expense,order(-EXPENSE)),]
expense <- expense[1:15,]
expense %>%
mutate(EVTYPE = fct_reorder(EVTYPE, EXPENSE)) %>%
ggplot(aes(EVTYPE, EXPENSE)) +
geom_bar(stat="identity", fill="dodgerblue", width=0.5) +
xlab("Types of events") +
ylab("Total costs in dollars") +
ggtitle("Top 15 events with the greatest economic consequences") +
coord_flip() +
theme_bw()
Conclusion: FLOOD is the event type with the greatest economic consequences across the United States, followed by other costly event types such as HURRICANE and TORNADO.