All over the world, severe weather events have been occuring every year, and cause considerable damage to people’s health and economy. We want to figure out ‘What type of weather event has caused the most damage to people’s health, economy in U.S. from 1950 to 2011. As an extension of this exploration, it is expected to be able to prevent such damage by preparing precautions. We use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database for our analysis, there are ’913’ weather events types such as ‘Tornado’, ‘Hail’, ‘Lightning’. The damage to people’s health is divided into ‘fatality’ and ‘injury’, and the damage to economy is divided into ‘property damage’ and ‘crop damage’. As the result of this analysis, ‘Tornado’ is the most harmful type in both ‘fatality’ and ‘injury’ over the 61 years we’ve tallied, and in terms of economy, the weather type that has cost the most economic damage that combine ‘property damage’ and ‘crop damage’ is ‘Hurricane’. By investigating the mechanism that these two types of event occur damage to health and economy, it is expected to be able to prepare precaution that will reduce damage from these events.
Before analysis, We’ll load some packages.
Necessary packages
are ‘dplyr’, ‘stringr’, ‘ggplot2’.
We’ll load some additional
packages that are dependent.
library(base)
library(stats)
library(dplyr)
library(stringr)
library(ggplot2)
In this part, We will read raw data(‘repdata_data_StormData.csv.bz2’)
and process it to make our future analysis easier.
First,
Let’s read raw data.
Storm <- read.csv('repdata_data_StormData.csv.bz2')
And let’s take a quick look at the data!
dim(Storm)
## [1] 902297 37
names(Storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
head(Storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
There are 902297 rows and 37 columns. And this data contain
information about severe weather events(like Hurricane, Flood..)
including the event’s location, damage it caused.
So, There are kinds of columns like ‘event type’(EVTYPE), ‘state
where the events occured’(STATE), ‘when they occured’(BGN_DATE,
BGN_TIME, END_DATE, END_TIME).
Among this columns, We extract only 7 columns we need for our
analysis.
The list of columns are “EVTYPE”, “FATALITIES”,
“INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP”.
needed_columns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
selected_Storm <- Storm %>% select(all_of(needed_columns))
dim(selected_Storm)
## [1] 902297 7
names(selected_Storm)
## [1] "EVTYPE" "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP"
## [6] "CROPDMG" "CROPDMGEXP"
So, selection of columns we want is completed.
Also, there are some inappropriate EVTYPE in this dataset. There are 2 cases about it.
So, we will trim corresponding rows in first case, and in second case, remove corresponding rows.
selected_Storm$EVTYPE <- str_trim(selected_Storm$EVTYPE)
good <- !grepl("Summary", selected_Storm$EVTYPE)
Storm3 <- selected_Storm[good, ]
So, in our first question(Across the United States, which types of
events are most harmful with respect to population health?), we will use
this processed ‘Storm3’ data.
And for our second Question(Across the United States, which types of
events have the greatest economic consequences?), we need some
additional processing procedure.
In ‘Storm3’ data, There are
2 variables(“PROPDMGEXP”, “CROPDMGEXP”). This 2 variables mean the Unit
of the number of property damage(“PROPDMG”), crop damage(“CROPDMG”).(in
dollar amount).
As the example value of this 2 variables,
“K”(Kilo) means 10^3, “M”(Million) means 10^6, “b”(Billion) means 10^9.
And the “” Unit is used when the “PROPDMG”, “CROPDMG” value is zero.
Other than these 4 values of Unit, there are also values like
“m”, “+”, “5”, “?”, “h”, “7”, “H”, “-”. Because we could not able to
figure out meaning of these variables, We will remove rows that have
value like them.
And also remove the rows that both “PROPDMG”,
“CROPDMG” value are zero.
filter_Storm <- Storm3 %>%
filter(PROPDMG != 0 | PROPDMG != 0) %>%
filter(PROPDMGEXP %in% c("K", "M", "B", "")) %>%
filter(CROPDMGEXP %in% c("K", "M", "B", ""))
unique(filter_Storm$PROPDMGEXP)
## [1] "K" "M" "B" ""
unique(filter_Storm$CROPDMGEXP)
## [1] "" "M" "K" "B"
So, Only the rows that has one of the Unit value among (““,”K”, “M”,
“B”).
In our second question, we will use this ‘filter_Storm’
data.
Specifically, which types of events have caused the highest number of
fatalities and injuries?
For answering it, we will group our
processed data(Storm3) by ‘EVTYPE’ variable and calculate sum of the
fatality and injury for each group.
group_Storm <- Storm3 %>% group_by(EVTYPE) %>%
summarize(fatality = sum(FATALITIES), injury = sum(INJURIES))
and arrange the result in order of the fatalities and injuries.
ordering1 <-
group_Storm[order(group_Storm$fatality, decreasing = TRUE),]
ordering2 <-
group_Storm[order(group_Storm$injury, decreasing = TRUE),]
The result about this Question 1 is suggested in ‘Result’ part.
In our raw dataset(Storm), we can identify the economic consequences
through 2 variables.(“PROPDMG”(property damage), “CROPDMG”(crop
damage)). And we calculate the total economic damage caused by specific
event by adding these 2 variables. For this, we will make 2 new colomns
that contain the damage amount in a unified unit.(each name is
“property” and “crop”). And after it, we will make an additional column
that means total amount of damage in dollar by adding 2
columns(“property”, “crop”).
Let’s do it!
for(i in 1:nrow(filter_Storm)){
if(filter_Storm$PROPDMGEXP[i] == ""){
filter_Storm$property[i] <- 0
}else if(filter_Storm$PROPDMGEXP[i] == "K"){
filter_Storm$property[i] <- filter_Storm$PROPDMG[i] * 1000
}else if(filter_Storm$PROPDMGEXP[i] == "M"){
filter_Storm$property[i] <- filter_Storm$PROPDMG[i] * 1000000
}else if(filter_Storm$PROPDMGEXP[i] == "B"){
filter_Storm$property[i] <- filter_Storm$PROPDMG[i] * 1000000000
}
}
First make ‘property’ column by combining ‘PROPDMG’ column with ‘PROPDMGEXP’. And also do the same thing with ‘CROPDMG’, ‘CROPDMGEXP’ to make ‘crop’ column.
for(i in 1:nrow(filter_Storm)){
if(filter_Storm$CROPDMGEXP[i] == ""){
filter_Storm$crop[i] <- 0
}else if(filter_Storm$CROPDMGEXP[i] == "K"){
filter_Storm$crop[i] <- filter_Storm$CROPDMG[i] * 1000
}else if(filter_Storm$PROPDMGEXP[i] == "M"){
filter_Storm$crop[i] <- filter_Storm$CROPDMG[i] * 1000000
}else if(filter_Storm$PROPDMGEXP[i] == "B"){
filter_Storm$crop[i] <- filter_Storm$CROPDMG[i] * 1000000000
}
}
head(unique(filter_Storm$property))
## [1] 25000 2500 2500000 250000 30 250
head(unique(filter_Storm$crop))
## [1] 0e+00 1e+10 5e+05 1e+06 4e+06 5e+04
So, 2 new columns has made.
And let’s make ‘total’ column
by adding ‘property’ and ‘crop’ columns.
filter_Storm$total <- filter_Storm$property + filter_Storm$crop
head(unique(filter_Storm$total))
## [1] 25000 2500 2500000 250000 30 250
Okay, the new column we could use to calculate the amount of damage
has made.
So, which types of events have the greatest economic
consequences?
For answering it, we will group ‘filter_Storm’ data
by ‘EVTYPE’ variable and calculate sum of ‘total’ column for each
group.
final_Storm <- filter_Storm %>% group_by(EVTYPE) %>%
summarize(economic = sum(total))
And arrange it in order of ‘economic’.
ordering3 <- final_Storm[order(final_Storm$economic, decreasing = TRUE),]
The result about this Question 2 is also suggested in ‘Result’ part.
Let’s see the result.
##Plot it
par(mfrow = c(1,2), cex.axis=0.5, cex.main=.75)
barplot(ordering1$fatality[1:3], names.arg = ordering1$EVTYPE[1:3],
xlab = "Type", ylab = "Fatality")
title(main = "Top3 Sum of fatalities by types of event 1950-2011")
barplot(ordering2$injury[1:3], names.arg = ordering2$EVTYPE[1:3],
xlab = "Type", ylab = "Injury")
title(main = "Top3 Sum of Injuries by types of event 1950-2011")
Among various types of events, Tornado has caused the most fatalities
and injuries.
It caused 5633 fatalities and 91346 injuries in
1950-2011.
Let’s see the result by plotting it!
g <- ggplot(ordering3[1:3,], aes(reorder(EVTYPE, economic, decreasing=TRUE), economic/10^9))
g + geom_bar(stat='identity') + labs(x='Type', y='Economic Damage(Billon dollars)',
title = 'Top3 Type of Event that caused Economic
Damage 1950-2011') +
theme(plot.title = element_text(hjust = 0.5))
So, the type that caused the most economic damage across United States in 1950-2011 is, ‘Hurricane’. It cost almost 814.750229 Billion dollars!