Synopsis

All over the world, severe weather events have been occuring every year, and cause considerable damage to people’s health and economy. We want to figure out ‘What type of weather event has caused the most damage to people’s health, economy in U.S. from 1950 to 2011. As an extension of this exploration, it is expected to be able to prevent such damage by preparing precautions. We use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database for our analysis, there are ’913’ weather events types such as ‘Tornado’, ‘Hail’, ‘Lightning’. The damage to people’s health is divided into ‘fatality’ and ‘injury’, and the damage to economy is divided into ‘property damage’ and ‘crop damage’. As the result of this analysis, ‘Tornado’ is the most harmful type in both ‘fatality’ and ‘injury’ over the 61 years we’ve tallied, and in terms of economy, the weather type that has cost the most economic damage that combine ‘property damage’ and ‘crop damage’ is ‘Hurricane’. By investigating the mechanism that these two types of event occur damage to health and economy, it is expected to be able to prepare precaution that will reduce damage from these events.

Load necessary packages

Before analysis, We’ll load some packages.
Necessary packages are ‘dplyr’, ‘stringr’, ‘ggplot2’.
We’ll load some additional packages that are dependent.

library(base)
library(stats)
library(dplyr)
library(stringr)
library(ggplot2)

Data Processing

In this part, We will read raw data(‘repdata_data_StormData.csv.bz2’) and process it to make our future analysis easier.

First, Let’s read raw data.

Storm <- read.csv('repdata_data_StormData.csv.bz2')

And let’s take a quick look at the data!

dim(Storm)
## [1] 902297     37
names(Storm)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
head(Storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

There are 902297 rows and 37 columns. And this data contain information about severe weather events(like Hurricane, Flood..) including the event’s location, damage it caused.

So, There are kinds of columns like ‘event type’(EVTYPE), ‘state where the events occured’(STATE), ‘when they occured’(BGN_DATE, BGN_TIME, END_DATE, END_TIME).

Among this columns, We extract only 7 columns we need for our analysis.
The list of columns are “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP”.

needed_columns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
                    "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
selected_Storm <- Storm %>% select(all_of(needed_columns))
dim(selected_Storm)
## [1] 902297      7
names(selected_Storm)
## [1] "EVTYPE"     "FATALITIES" "INJURIES"   "PROPDMG"    "PROPDMGEXP"
## [6] "CROPDMG"    "CROPDMGEXP"

So, selection of columns we want is completed.

Also, there are some inappropriate EVTYPE in this dataset. There are 2 cases about it.

  1. name containing blank space like ” HIGH SURF ADVISORY”.
  2. name that doesn’t mean any type of event like “Summary August 2-3”(containing ‘summary’)

So, we will trim corresponding rows in first case, and in second case, remove corresponding rows.

selected_Storm$EVTYPE <- str_trim(selected_Storm$EVTYPE)
good <- !grepl("Summary", selected_Storm$EVTYPE)
Storm3 <- selected_Storm[good, ]

So, in our first question(Across the United States, which types of events are most harmful with respect to population health?), we will use this processed ‘Storm3’ data.

And for our second Question(Across the United States, which types of events have the greatest economic consequences?), we need some additional processing procedure.

In ‘Storm3’ data, There are 2 variables(“PROPDMGEXP”, “CROPDMGEXP”). This 2 variables mean the Unit of the number of property damage(“PROPDMG”), crop damage(“CROPDMG”).(in dollar amount).
As the example value of this 2 variables, “K”(Kilo) means 10^3, “M”(Million) means 10^6, “b”(Billion) means 10^9. And the “” Unit is used when the “PROPDMG”, “CROPDMG” value is zero.

Other than these 4 values of Unit, there are also values like “m”, “+”, “5”, “?”, “h”, “7”, “H”, “-”. Because we could not able to figure out meaning of these variables, We will remove rows that have value like them.
And also remove the rows that both “PROPDMG”, “CROPDMG” value are zero.

filter_Storm <- Storm3 %>% 
  filter(PROPDMG != 0 | PROPDMG != 0) %>%
  filter(PROPDMGEXP %in% c("K", "M", "B", "")) %>%
  filter(CROPDMGEXP %in% c("K", "M", "B", ""))
unique(filter_Storm$PROPDMGEXP)
## [1] "K" "M" "B" ""
unique(filter_Storm$CROPDMGEXP)
## [1] ""  "M" "K" "B"

So, Only the rows that has one of the Unit value among (““,”K”, “M”, “B”).

In our second question, we will use this ‘filter_Storm’ data.

Question 1. Across the United States, which types of events are most harmful with respect to population health?

Specifically, which types of events have caused the highest number of fatalities and injuries?
For answering it, we will group our processed data(Storm3) by ‘EVTYPE’ variable and calculate sum of the fatality and injury for each group.

group_Storm <- Storm3 %>% group_by(EVTYPE) %>% 
  summarize(fatality = sum(FATALITIES), injury = sum(INJURIES))

and arrange the result in order of the fatalities and injuries.

ordering1 <-
  group_Storm[order(group_Storm$fatality, decreasing = TRUE),]

ordering2 <-
  group_Storm[order(group_Storm$injury, decreasing = TRUE),]

The result about this Question 1 is suggested in ‘Result’ part.

Question 2. Across the United States, which types of events have the greatest economic consequences?

In our raw dataset(Storm), we can identify the economic consequences through 2 variables.(“PROPDMG”(property damage), “CROPDMG”(crop damage)). And we calculate the total economic damage caused by specific event by adding these 2 variables. For this, we will make 2 new colomns that contain the damage amount in a unified unit.(each name is “property” and “crop”). And after it, we will make an additional column that means total amount of damage in dollar by adding 2 columns(“property”, “crop”).

Let’s do it!

for(i in 1:nrow(filter_Storm)){
  if(filter_Storm$PROPDMGEXP[i] == ""){
    filter_Storm$property[i] <- 0
  }else if(filter_Storm$PROPDMGEXP[i] == "K"){
    filter_Storm$property[i] <- filter_Storm$PROPDMG[i] * 1000
  }else if(filter_Storm$PROPDMGEXP[i] == "M"){
    filter_Storm$property[i] <- filter_Storm$PROPDMG[i] * 1000000
  }else if(filter_Storm$PROPDMGEXP[i] == "B"){
    filter_Storm$property[i] <- filter_Storm$PROPDMG[i] * 1000000000
  }
}

First make ‘property’ column by combining ‘PROPDMG’ column with ‘PROPDMGEXP’. And also do the same thing with ‘CROPDMG’, ‘CROPDMGEXP’ to make ‘crop’ column.

for(i in 1:nrow(filter_Storm)){
  if(filter_Storm$CROPDMGEXP[i] == ""){
    filter_Storm$crop[i] <- 0
  }else if(filter_Storm$CROPDMGEXP[i] == "K"){
    filter_Storm$crop[i] <- filter_Storm$CROPDMG[i] * 1000
  }else if(filter_Storm$PROPDMGEXP[i] == "M"){
    filter_Storm$crop[i] <- filter_Storm$CROPDMG[i] * 1000000
  }else if(filter_Storm$PROPDMGEXP[i] == "B"){
    filter_Storm$crop[i] <- filter_Storm$CROPDMG[i] * 1000000000
  }
}
head(unique(filter_Storm$property))
## [1]   25000    2500 2500000  250000      30     250
head(unique(filter_Storm$crop))
## [1] 0e+00 1e+10 5e+05 1e+06 4e+06 5e+04

So, 2 new columns has made.

And let’s make ‘total’ column by adding ‘property’ and ‘crop’ columns.

filter_Storm$total <- filter_Storm$property + filter_Storm$crop
head(unique(filter_Storm$total))
## [1]   25000    2500 2500000  250000      30     250

Okay, the new column we could use to calculate the amount of damage has made.
So, which types of events have the greatest economic consequences?
For answering it, we will group ‘filter_Storm’ data by ‘EVTYPE’ variable and calculate sum of ‘total’ column for each group.

final_Storm <- filter_Storm %>% group_by(EVTYPE) %>%
  summarize(economic = sum(total))

And arrange it in order of ‘economic’.

ordering3 <- final_Storm[order(final_Storm$economic, decreasing = TRUE),]

The result about this Question 2 is also suggested in ‘Result’ part.

Results

Question 1. Across the United States, which types of events are most harmful with respect to population health?

Let’s see the result.

##Plot it
par(mfrow = c(1,2), cex.axis=0.5, cex.main=.75)
barplot(ordering1$fatality[1:3], names.arg = ordering1$EVTYPE[1:3],
        xlab = "Type", ylab = "Fatality")
title(main = "Top3 Sum of fatalities by types of event 1950-2011")
barplot(ordering2$injury[1:3], names.arg = ordering2$EVTYPE[1:3],
        xlab = "Type", ylab = "Injury")

title(main = "Top3 Sum of Injuries by types of event 1950-2011")

Among various types of events, Tornado has caused the most fatalities and injuries.
It caused 5633 fatalities and 91346 injuries in 1950-2011.

Question 2. Across the United States, which types of events have the greatest economic consequences?

Let’s see the result by plotting it!

g <- ggplot(ordering3[1:3,], aes(reorder(EVTYPE, economic, decreasing=TRUE), economic/10^9))
g + geom_bar(stat='identity') + labs(x='Type', y='Economic Damage(Billon dollars)',
                      title = 'Top3 Type of Event that caused Economic
                      Damage 1950-2011') + 
  theme(plot.title = element_text(hjust = 0.5))

So, the type that caused the most economic damage across United States in 1950-2011 is, ‘Hurricane’. It cost almost 814.750229 Billion dollars!