Data Processing

Let’s initialize data file name, zip_file name, and valid URL for raw datasets.

file_name <- "repdata-data-StormData.csv"
zip_file  <- "repdata-data-StormData.csv.bz2"
url       <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

Before input data, let’s check whether the data is loaded in current directory.

# Check if the data is downloaded and download when applicable
if (!file.exists(file_name)) {
    download.file(url, destfile = zip_file)
    unzip(zip_file)
    file.remove(zip_file)
}

Now, Let’s load the data for preprocessing and show first 5 events

full_data <- read.csv(file_name)
dim(full_data)

## [1] 902297     37

head(full_data)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Since the data is loaded, let’s input dplyr library for manipulating data

library(dplyr)

The objective of this analysis is to explore the damage to society from all events, and we majorly focus on the damage to either human and economy.
Let’s see how many different events were collected.

length(unique(full_data$EVTYPE))

## [1] 985

We have total 985 different event types in the data. Let’s sort the event name and see the first 50 events

sort(unique(full_data$EVTYPE))[1:50]

##  [1]    HIGH SURF ADVISORY           COASTAL FLOOD                
##  [3]  FLASH FLOOD                    LIGHTNING                    
##  [5]  TSTM WIND                      TSTM WIND (G45)              
##  [7]  WATERSPOUT                     WIND                         
##  [9] ?                              ABNORMAL WARMTH               
## [11] ABNORMALLY DRY                 ABNORMALLY WET                
## [13] ACCUMULATED SNOWFALL           AGRICULTURAL FREEZE           
## [15] APACHE COUNTY                  ASTRONOMICAL HIGH TIDE        
## [17] ASTRONOMICAL LOW TIDE          AVALANCE                      
## [19] AVALANCHE                      BEACH EROSIN                  
## [21] Beach Erosion                  BEACH EROSION                 
## [23] BEACH EROSION/COASTAL FLOOD    BEACH FLOOD                   
## [25] BELOW NORMAL PRECIPITATION     BITTER WIND CHILL             
## [27] BITTER WIND CHILL TEMPERATURES Black Ice                     
## [29] BLACK ICE                      BLIZZARD                      
## [31] BLIZZARD AND EXTREME WIND CHIL BLIZZARD AND HEAVY SNOW       
## [33] Blizzard Summary               BLIZZARD WEATHER              
## [35] BLIZZARD/FREEZING RAIN         BLIZZARD/HEAVY SNOW           
## [37] BLIZZARD/HIGH WIND             BLIZZARD/WINTER STORM         
## [39] BLOW-OUT TIDE                  BLOW-OUT TIDES                
## [41] BLOWING DUST                   blowing snow                  
## [43] Blowing Snow                   BLOWING SNOW                  
## [45] BLOWING SNOW & EXTREME WIND CH BLOWING SNOW- EXTREME WIND CHI
## [47] BLOWING SNOW/EXTREME WIND CHIL BREAKUP FLOODING              
## [49] BRUSH FIRE                     BRUSH FIRES                   
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

We found that the event names have not been organized well. There are some leading
and trailing spaces, some punctuations, many similarties, and etc.
Let’s make them all upper cases and remove punctuations and irrelevant spaces.

full_data$EVTYPE <- toupper(full_data$EVTYPE)
full_data$EVTYPE <- gsub("[[:punct:]]", " ",full_data$EVTYPE)# remove punctuaion
full_data$EVTYPE <- gsub("\\s+"," ",full_data$EVTYPE) # remove consequtive spaces
full_data$EVTYPE <- trimws(full_data$EVTYPE, which = "both") # remove leading and trailing spaces
sort(unique(full_data$EVTYPE))[1:50]

##  [1] ""                               "ABNORMAL WARMTH"               
##  [3] "ABNORMALLY DRY"                 "ABNORMALLY WET"                
##  [5] "ACCUMULATED SNOWFALL"           "AGRICULTURAL FREEZE"           
##  [7] "APACHE COUNTY"                  "ASTRONOMICAL HIGH TIDE"        
##  [9] "ASTRONOMICAL LOW TIDE"          "AVALANCE"                      
## [11] "AVALANCHE"                      "BEACH EROSIN"                  
## [13] "BEACH EROSION"                  "BEACH EROSION COASTAL FLOOD"   
## [15] "BEACH FLOOD"                    "BELOW NORMAL PRECIPITATION"    
## [17] "BITTER WIND CHILL"              "BITTER WIND CHILL TEMPERATURES"
## [19] "BLACK ICE"                      "BLIZZARD"                      
## [21] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"       
## [23] "BLIZZARD FREEZING RAIN"         "BLIZZARD HEAVY SNOW"           
## [25] "BLIZZARD HIGH WIND"             "BLIZZARD SUMMARY"              
## [27] "BLIZZARD WEATHER"               "BLIZZARD WINTER STORM"         
## [29] "BLOW OUT TIDE"                  "BLOW OUT TIDES"                
## [31] "BLOWING DUST"                   "BLOWING SNOW"                  
## [33] "BLOWING SNOW EXTREME WIND CH"   "BLOWING SNOW EXTREME WIND CHI" 
## [35] "BLOWING SNOW EXTREME WIND CHIL" "BREAKUP FLOODING"              
## [37] "BRUSH FIRE"                     "BRUSH FIRES"                   
## [39] "COASTAL EROSION"                "COASTAL FLOOD"                 
## [41] "COASTAL FLOODING"               "COASTAL FLOODING EROSION"      
## [43] "COASTAL STORM"                  "COASTAL SURGE"                 
## [45] "COASTAL TIDAL FLOOD"            "COASTALFLOOD"                  
## [47] "COASTALSTORM"                   "COLD"                          
## [49] "COLD AIR FUNNEL"                "COLD AIR FUNNELS"

We can see there is still similarities, such as singular or plurals, TSTM or Thunderstorm, Flash Flood, Flood Flash, Flash Flood Flood or etc.

full_data$EVTYPE <- gsub("S$", "", full_data$EVTYPE) # plural to singular by removing trailing S
full_data[which(full_data$EVTYPE %in% c("FLASH FLOOODING","FLASH FLOOD FLOOD",
                                        "FLOOD FLASH","FLOOD FLASH FLOOD",
                                        "FLOOD FLASHFLOOD","FLOOD FLOOD FLASH",
                                        "FLASH FLODDING")),]$EVTYPE <- "FLASH FLOOD"
full_data$EVTYPE <- gsub("TSTM", "THUNDERSTORM", full_data$EVTYPE, fixed = TRUE)

Even though there is still many similarties, in this project, we only focus on the top 10 the most influecial events. In orer to do that, let’s group the data by event and
summarize by the total count and see top 20 occurracne.

(full_data %>% group_by(EVTYPE) %>% summarise(CNT = n()) %>% arrange(desc(CNT)))[1:20,]

## # A tibble: 20 x 2
##    EVTYPE                      CNT
##    <chr>                     <int>
##  1 THUNDERSTORM WIND        323370
##  2 HAIL                     288661
##  3 TORNADO                   60653
##  4 FLASH FLOOD               54969
##  5 FLOOD                     25330
##  6 HIGH WIND                 21749
##  7 LIGHTNING                 15756
##  8 HEAVY SNOW                15708
##  9 MARINE THUNDERSTORM WIND  11987
## 10 HEAVY RAIN                11768
## 11 WINTER STORM              11436
## 12 WINTER WEATHER             7045
## 13 FUNNEL CLOUD               6932
## 14 WATERSPOUT                 3845
## 15 STRONG WIND                3773
## 16 URBAN SML STREAM FLD       3392
## 17 WILDFIRE                   2769
## 18 BLIZZARD                   2719
## 19 DROUGHT                    2488
## 20 ICE STORM                  2006

For the rest similarties, we correct the EVTYPE when we analize the rest data based on
whether correction of EVTYPE is really necessary.

First, let’s explore the damage to human from all the events during past half century.
In order to do that, let’s group the data by all events types and sum up all the injuries based on the event type, then store the top 10 largest number of injuries in to variable large_injuries

large_injuries <- full_data %>%
    group_by(EVTYPE) %>%
    summarise(INJURIES = sum(INJURIES, na.rm = TRUE)) %>%
    filter(INJURIES %in% sort(INJURIES,decreasing = TRUE)[1:10])

Let’s see the result in descending order by total injuries.

print(arrange(large_injuries, desc(INJURIES)))

## # A tibble: 10 x 2
##    EVTYPE            INJURIES
##    <chr>                <dbl>
##  1 TORNADO              91346
##  2 THUNDERSTORM WIND     9363
##  3 FLOOD                 6789
##  4 EXCESSIVE HEAT        6525
##  5 LIGHTNING             5230
##  6 HEAT                  2100
##  7 ICE STORM             1975
##  8 FLASH FLOOD           1792
##  9 HIGH WIND             1439
## 10 HAIL                  1361

Do similar to total fatalities and see the result

large_fatalities <- full_data %>%
    group_by(EVTYPE) %>%
    summarise(FATALITIES = sum(FATALITIES, na.rm = TRUE)) %>%
    filter(FATALITIES %in% sort(FATALITIES,decreasing = TRUE)[1:10])
print(arrange(large_fatalities, desc(FATALITIES)))

## # A tibble: 10 x 2
##    EVTYPE            FATALITIES
##    <chr>                  <dbl>
##  1 TORNADO                 5633
##  2 EXCESSIVE HEAT          1903
##  3 FLASH FLOOD             1011
##  4 HEAT                     937
##  5 LIGHTNING                817
##  6 THUNDERSTORM WIND        701
##  7 RIP CURRENT              572
##  8 FLOOD                    470
##  9 HIGH WIND                283
## 10 AVALANCHE                224

We can see that Tornado gives the most damage to people in death or injuries.

Let’s plot both fatalities and injuries, and visualize the differences. In order to do that, let’s combine the injuries and fatalities data by event type

fatal_injur <- inner_join(large_fatalities, large_injuries, by = "EVTYPE")
library(reshape2) # need reshape2 library for melt function betlow
fatal_injur<- melt(fatal_injur, id.vars = c("EVTYPE"))
print(fatal_injur)

##               EVTYPE   variable value
## 1     EXCESSIVE HEAT FATALITIES  1903
## 2        FLASH FLOOD FATALITIES  1011
## 3              FLOOD FATALITIES   470
## 4               HEAT FATALITIES   937
## 5          HIGH WIND FATALITIES   283
## 6          LIGHTNING FATALITIES   817
## 7  THUNDERSTORM WIND FATALITIES   701
## 8            TORNADO FATALITIES  5633
## 9     EXCESSIVE HEAT   INJURIES  6525
## 10       FLASH FLOOD   INJURIES  1792
## 11             FLOOD   INJURIES  6789
## 12              HEAT   INJURIES  2100
## 13         HIGH WIND   INJURIES  1439
## 14         LIGHTNING   INJURIES  5230
## 15 THUNDERSTORM WIND   INJURIES  9363
## 16           TORNADO   INJURIES 91346

Now, let’s visualize the total fatalites and injureis caused by major event type

library(ggplot2)
ggplot(data = fatal_injur, aes(x = EVTYPE, y = value,
                               group = variable, fill = variable)) + 
    geom_bar(stat = "identity",width = 0.5, position = "dodge") + 
    coord_flip() + 
    labs(title = "Total number of fatalities and injuires by event", 
         x = "Event Type", y = "Number of causalties")

From the graph, we can see that Tornado injuired human the most significant than any other event type; also, it took the most people’s life.

Now, let’s talk about economical influence from event.
From the data, we see that PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP were recorded all the property or crop damage caused by event. Let’s take out the data has economic influnces, which is DMG >0, from full_data to
“P_CROPDMG_data”

P_CROPDMG_data <- full_data[which(full_data$PROPDMG > 0 | full_data$PROPDMG > 0),
                            c("EVTYPE","PROPDMG","PROPDMGEXP",
                              "CROPDMG","CROPDMGEXP","REFNUM")]
summary(P_CROPDMG_data)

##     EVTYPE             PROPDMG          PROPDMGEXP        CROPDMG       
##  Length:239174      Min.   :   0.01   K      :227481   Min.   :  0.000  
##  Class :character   1st Qu.:   2.50   M      : 11319   1st Qu.:  0.000  
##  Mode  :character   Median :  10.00   0      :   209   Median :  0.000  
##                     Mean   :  45.51          :    76   Mean   :  4.398  
##                     3rd Qu.:  25.00   B      :    40   3rd Qu.:  0.000  
##                     Max.   :5000.00   5      :    18   Max.   :978.000  
##                                       (Other):    31                    
##    CROPDMGEXP         REFNUM      
##         :145037   Min.   :     1  
##  K      : 92734   1st Qu.:281994  
##  M      :  1356   Median :479228  
##  k      :    21   Mean   :487226  
##  0      :    16   3rd Qu.:707229  
##  ?      :     6   Max.   :902260  
##  (Other):     4

As we see, the PROPDMG and CROPDMG are only numerical values that stand for total economical
damage, and PROPDMGEXP and CROPDMGEXP are categorial values that stand for units of amouts.
“B” stands for billion, “K” stands for thousands, “M” stands for million, and 1,2,3,4 stands
for the power of 10s.
First, let’s organize the CROPDMGEXP and PROPDMGXP by removing all punctuation
and make them all capital

P_CROPDMG_data$PROPDMGEXP <- toupper(gsub("[[:punct:]]",0,P_CROPDMG_data$PROPDMGEXP))
P_CROPDMG_data$CROPDMGEXP <- toupper(gsub("[[:punct:]]",0,P_CROPDMG_data$CROPDMGEXP))

Also, lets’s correct letter to specific numbers, such as K to 3, M to 6,

P_CROPDMG_data[grepl("M", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "6"
P_CROPDMG_data[grepl("K", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "3"
P_CROPDMG_data[grepl("B", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "9"
P_CROPDMG_data[grepl("H", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "2"

P_CROPDMG_data[grepl("M", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "6"
P_CROPDMG_data[grepl("K", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "3"
P_CROPDMG_data[grepl("B", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "9"
P_CROPDMG_data[grepl("H", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "2"

table(P_CROPDMG_data$PROPDMGEXP)

## 
##             0      2      3      4      5      6      7      9 
##     76    215      8 227482      4     18  11329      2     40

table(P_CROPDMG_data$CROPDMGEXP)

## 
##             0      3      6      9 
## 145037     22  92755   1357      3

P_CROPDMG_data[which(P_CROPDMG_data$PROPDMGEXP == ""),"PROPDMGEXP"] <- "0"
P_CROPDMG_data[which(P_CROPDMG_data$CROPDMGEXP == ""),"CROPDMGEXP"] <- "0"

Let’s only focus on property damage.

PROP_DMG <- P_CROPDMG_data[,c("EVTYPE","PROPDMG","PROPDMGEXP","REFNUM")]

Then, adjust the PROPDMG with actual property damgage

PROP_DMG$PROPDMGEXP <- as.numeric(PROP_DMG$PROPDMGEXP)
PROP_DMG$PROPDMG <- PROP_DMG$PROPDMG*10^PROP_DMG$PROPDMGEXP

Now, lets’s group by the EVTYPE and sum up property damage by EVTYPE,
and see top 10 events

TOTAL_PROP_DMG <- PROP_DMG %>% 
    group_by(EVTYPE) %>% 
    summarise(PROP_SUM = sum(PROPDMG)) %>%
    filter(PROP_SUM %in% sort(PROP_SUM,decreasing = TRUE)[1:10])
print(arrange(TOTAL_PROP_DMG, desc(PROP_SUM)))

## # A tibble: 10 x 2
##    EVTYPE                 PROP_SUM
##    <chr>                     <dbl>
##  1 FLOOD             144663709807 
##  2 HURRICANE TYPHOON  69305840000 
##  3 TORNADO            56947380676.
##  4 STORM SURGE        43323536000 
##  5 FLASH FLOOD        17279483492.
##  6 HAIL               15735267513.
##  7 HURRICANE          11868319010 
##  8 THUNDERSTORM WIND   9920860826.
##  9 TROPICAL STORM      7703890550 
## 10 WINTER STORM        6688997251

We can see the Flood makes the greatest property damage, which is total $144,663,709,807. Let’s see the difference from graph

ggplot(data = TOTAL_PROP_DMG, aes(x = EVTYPE, y = PROP_SUM)) +
    geom_bar(stat = "identity", width = 0.5, position = "dodge") + 
    coord_flip()

For Crop Damage, lets do similary.

CROP_DMG <- P_CROPDMG_data[,c("EVTYPE","CROPDMG","CROPDMGEXP","REFNUM")]
CROP_DMG$CROPDMGEXP <- as.numeric(CROP_DMG$CROPDMGEXP)
CROP_DMG$CROPDMG <- CROP_DMG$CROPDMG*10^CROP_DMG$CROPDMGEXP

Now, lets’s group by the EVTYPE and sum up property damage by EVTYPE,
and see top 10 events

TOTAL_CROP_DMG <- CROP_DMG %>% 
    group_by(EVTYPE) %>% 
    summarise(CROP_SUM = sum(CROPDMG)) %>%
    filter(CROP_SUM %in% sort(CROP_SUM,decreasing = TRUE)[1:10])
print(arrange(TOTAL_CROP_DMG, desc(CROP_SUM)))

## # A tibble: 10 x 2
##    EVTYPE              CROP_SUM
##    <chr>                  <dbl>
##  1 RIVER FLOOD       5028734000
##  2 ICE STORM         5022113500
##  3 FLOOD             4073493450
##  4 HURRICANE         2688910000
##  5 HURRICANE TYPHOON 2607822800
##  6 HAIL              1878569103
##  7 FLASH FLOOD       1477879350
##  8 DROUGHT           1231755000
##  9 THUNDERSTORM WIND 1017159388
## 10 HIGH WIND          636085650

We can see the river flood makes the greatest crop damage, which is total $50,228,734,000.

Damage to Society from all natural disater in past 50 years in US

Introduction

Data

Data Processing

Result