Synopsis

In the United States, storms and other severe weather events occur across the whole country and cause substantial public health and economic damage, including fatalities, injuries, and property damage. This report analyses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks severe weather events in the United States, including storms.

The results of the analysis are presented as figures later in this report. They show the scale of the impact that the highest-ranking severe weather events have on public health and the economy.

The analysis of the public health data indicates that tornadoes are the most harmful severe weather events, causing by far the most fatalities and injuries, while floods have the greatest economic impact, with the largest combined property and crop damage.

Project Target

The data analysis must address the following questions:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Data cleaning

Getting and loading the weather impacts data

Here is the main data pre-processing procedure.

#Downloading the data file only if it is not already present
filename<-"repdata_data_StormData.csv.bz2"
if(!file.exists(filename)){
        fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileUrl,filename,method = "curl")
}

#Loading the raw data
storm_data <- read.csv(bzfile(filename))
weather_event <- as.data.frame(storm_data)

#Loading the relevant packages
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Data

Introduction

According to the Course Project 2 instructions on Coursera, the data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file is 47MB.

This file includes data from 1950 through November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
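The claim about sparser early records can be checked by counting events per year. The following is a minimal sketch on a toy vector, assuming the date format shown by str() for BGN_DATE; the sample values (including the 2011 date) are illustrative only, and in the analysis itself the input would be weather_event$BGN_DATE.

```r
# Toy stand-in for the BGN_DATE column; the values are illustrative only.
bgn_date <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
              "6/8/1951 0:00:00", "11/28/2011 0:00:00")

# The format string covers only the date part; the trailing time is ignored.
event_year <- as.integer(format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y"))

# Number of recorded events per year.
events_per_year <- table(event_year)
print(events_per_year)
```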

The instructions also link to the relevant documentation, and further documentation is available on the database website. These documents explain how some of the variables are constructed and defined.

Data cleaning

Based on the loaded data, the cleaning process starts with the following code.

Pre-processing

#Checking the whole data
str(weather_event)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
dim(weather_event)
## [1] 902297     37
#Subsetting the useful data fragment
weather_impact <- select(weather_event,c("EVTYPE","FATALITIES","INJURIES",
                                  "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"))
head(weather_impact)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0
tail(weather_impact)
##                EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER          0        0       0          K       0          K
## 902293      HIGH WIND          0        0       0          K       0          K
## 902294      HIGH WIND          0        0       0          K       0          K
## 902295      HIGH WIND          0        0       0          K       0          K
## 902296       BLIZZARD          0        0       0          K       0          K
## 902297     HEAVY SNOW          0        0       0          K       0          K
#check the missing value in the subset
sum(is.na(weather_impact$FATALITIES))
## [1] 0
sum(is.na(weather_impact$INJURIES))
## [1] 0
sum(is.na(weather_impact$PROPDMG))
## [1] 0
sum(is.na(weather_impact$PROPDMGEXP))
## [1] 0
sum(is.na(weather_impact$CROPDMG))
## [1] 0
sum(is.na(weather_impact$CROPDMGEXP))
## [1] 0
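The per-column checks above can also be written as a single call. Because the character columns use empty strings rather than NA for missing codes, it may be worth counting those as well. The sketch below uses an invented toy data frame, not the storm data.

```r
# Toy data frame; the values are invented for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
  PROPDMG    = c(25, NA, 2.5),
  PROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# NA count for every column in one call.
na_counts <- colSums(is.na(toy))

# Empty strings are not NA, so count them separately for character columns.
is_chr <- vapply(toy, is.character, logical(1))
empty_counts <- vapply(toy[is_chr],
                       function(col) sum(col == "", na.rm = TRUE),
                       integer(1))

print(na_counts)
print(empty_counts)
```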

Re-organizing and re-grouping the data fragment

#Transforming the subset data
sort(table(weather_impact$EVTYPE),decreasing=TRUE)[1:20]
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE 
##                     3392                     2761

Here I regroup the weather events by frequently occurring keywords such as “HAIL”, “WIND”, “FLOOD”, and “RAIN”. The grouped event types are stored in a new variable called BRIEF_EVENT, which condenses the many EVTYPE spellings into a few categories. The keywords are applied in sequence, so an event that matches several keywords keeps the last match (for example, “WINTER STORM” ends up as “WINTER”). “OTHERS” covers every event type that matches none of the keywords.

#Weather events re-group
weather_impact$BRIEF_EVENT<-"OTHERS"
weather_impact$BRIEF_EVENT[grep("HAIL", weather_impact$EVTYPE, ignore.case = TRUE)] <- "HAIL"
weather_impact$BRIEF_EVENT[grep("FLOOD", weather_impact$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
weather_impact$BRIEF_EVENT[grep("WIND", weather_impact$EVTYPE, ignore.case = TRUE)] <- "WIND"
weather_impact$BRIEF_EVENT[grep("STORM", weather_impact$EVTYPE, ignore.case = TRUE)] <- "STORM"
weather_impact$BRIEF_EVENT[grep("TORNADO", weather_impact$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
weather_impact$BRIEF_EVENT[grep("LIGHTNING", weather_impact$EVTYPE, ignore.case = TRUE)] <- "LIGHTNING"
weather_impact$BRIEF_EVENT[grep("SNOW", weather_impact$EVTYPE, ignore.case = TRUE)] <- "SNOW"
weather_impact$BRIEF_EVENT[grep("RAIN", weather_impact$EVTYPE, ignore.case = TRUE)] <- "RAIN"
weather_impact$BRIEF_EVENT[grep("HEAT", weather_impact$EVTYPE, ignore.case = TRUE)] <- "HEAT"
weather_impact$BRIEF_EVENT[grep("WINTER", weather_impact$EVTYPE, ignore.case = TRUE)] <- "WINTER"
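#Sorting the labels in decreasing order shows only the alphabetically last group;
#table(weather_impact$BRIEF_EVENT) would give the number of events per group
sort(weather_impact$BRIEF_EVENT,decreasing = TRUE)[1:20]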
##  [1] "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER"
##  [9] "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER"
## [17] "WINTER" "WINTER" "WINTER" "WINTER"
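Because the keyword assignments above run one after another, a later keyword overwrites an earlier one. The sketch below reproduces that behaviour on a few event names taken from the frequency table; group_events is a hypothetical helper written for illustration, not part of the original analysis.

```r
# Hypothetical helper: applies the keywords in order, so a later match
# overwrites an earlier one. Unmatched events stay "OTHERS".
group_events <- function(evtype, keywords) {
  brief <- rep("OTHERS", length(evtype))
  for (kw in keywords) {
    brief[grepl(kw, evtype, ignore.case = TRUE)] <- kw
  }
  brief
}

# Same keyword order as in the analysis above.
keywords <- c("HAIL", "FLOOD", "WIND", "STORM", "TORNADO", "LIGHTNING",
              "SNOW", "RAIN", "HEAT", "WINTER")

evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "WINTER STORM",
            "FLASH FLOOD", "WILDFIRE")
brief <- group_events(evtype, keywords)

# Counts per group are more informative than sorting the labels.
print(data.frame(evtype, brief))
print(table(brief))
```

Note that "TSTM WIND" lands in WIND while "THUNDERSTORM WIND" lands in STORM, because only the latter contains the later keyword "STORM". The keyword order therefore affects the group totals.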
#Economic damage re-group
#check the contents distributions in PROPDMGEXP and CROPDMGEXP variable
sort(table(weather_impact$PROPDMGEXP),decreasing = TRUE)
## 
##             K      M      0      B      5      1      2      ?      m      H 
## 465934 424665  11330    216     40     28     25     13      8      7      6 
##      +      7      3      4      6      -      8      h 
##      5      5      4      4      4      1      1      1
sort(table(weather_impact$CROPDMGEXP),decreasing = TRUE)
## 
##             K      M      k      0      B      ?      2      m 
## 618413 281832   1994     21     19      9      7      1      1

According to the Storm Data documentation, the PROPDMGEXP and CROPDMGEXP variables encode the magnitude of the damage figures: “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions”. In the following sub-section, I convert the damage figures to a common unit. Any other code (blank, digits, “H”, “?”, “+”, “-”) is treated as a multiplier of 1.

#Property damage: map the magnitude codes to powers of ten
weather_impact$PROPDMGEXP <- as.character(weather_impact$PROPDMGEXP)
weather_impact$PROPDMGEXP[is.na(weather_impact$PROPDMGEXP)] <- 0
#Codes other than K, M, B (blank, digits, H, ?, +, -) get exponent 0
weather_impact$PROPDMGEXP[!grepl("K|M|B", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- 0
weather_impact$PROPDMGEXP[grep("K", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- "3"
weather_impact$PROPDMGEXP[grep("M", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- "6"
weather_impact$PROPDMGEXP[grep("B", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- "9"
weather_impact$PROPDMGEXP <- as.numeric(as.character(weather_impact$PROPDMGEXP))
#Damage in a common unit = amount * 10^exponent
weather_impact$PROPERTY <- weather_impact$PROPDMG * 10^weather_impact$PROPDMGEXP

#Crop damage: same procedure as for property damage
weather_impact$CROPDMGEXP <- as.character(weather_impact$CROPDMGEXP)
weather_impact$CROPDMGEXP[is.na(weather_impact$CROPDMGEXP)] <- 0
weather_impact$CROPDMGEXP[!grepl("K|M|B", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- 0
weather_impact$CROPDMGEXP[grep("K", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- "3"
weather_impact$CROPDMGEXP[grep("M", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- "6"
weather_impact$CROPDMGEXP[grep("B", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- "9"
weather_impact$CROPDMGEXP <- as.numeric(as.character(weather_impact$CROPDMGEXP))
weather_impact$CROP <- weather_impact$CROPDMG * 10^weather_impact$CROPDMGEXP

sort(table(weather_impact$PROPERTY),decreasing = TRUE)[1:20]
## 
##       0    5000   10000    1000    2000   25000   50000    3000   20000   15000 
##  663123   31731   21787   17544   17186   17104   13596   10364    9179    8617 
##  250000     500   1e+05    2500   30000   5e+05    4000    8000   75000 2500000 
##    8439    6707    6302    5807    4391    4000    3202    2877    2419    2411
sort(table(weather_impact$CROP),decreasing = TRUE)[1:20]
## 
##      0   5000  10000  50000  1e+05   1000   2000  25000  20000  5e+05  15000 
## 880198   4097   2349   1984   1233    956    951    830    758    721    598 
##    500   3000 250000  2e+05  1e+06  30000  75000 150000  3e+05 
##    568    554    513    479    447    317    290    268    250

Data Analysis

This section presents the data analysis process.

#Clean and calculate the data of Public Health variables

#Calculate the total fatalities and injuries numbers
total_health_loss<-ddply(weather_impact, .(BRIEF_EVENT), summarize, 
                         Total = sum(FATALITIES + INJURIES,  na.rm = TRUE))
total_health_loss$Type <- "Fatalities and Injuries"

#Fatalities
Fatalities <- ddply(weather_impact, .(BRIEF_EVENT), summarize, 
                    Total = sum(FATALITIES, na.rm = TRUE))
Fatalities$Type <- "Fatalities"

#Injuries
Injuries <- ddply(weather_impact, .(BRIEF_EVENT), summarize, 
                  Total = sum(INJURIES, na.rm = TRUE))
Injuries$Type <- "Injuries"

#Health damage
Health_damage <- rbind(Fatalities,Injuries)

Health_Event<-join(Fatalities, Injuries, by="BRIEF_EVENT")
Health_Event
##    BRIEF_EVENT Total       Type Total     Type
## 1        FLOOD  1524 Fatalities  8602 Injuries
## 2         HAIL    15 Fatalities  1371 Injuries
## 3         HEAT  3138 Fatalities  9224 Injuries
## 4    LIGHTNING   817 Fatalities  5232 Injuries
## 5       OTHERS  1809 Fatalities  6993 Injuries
## 6         RAIN   114 Fatalities   305 Injuries
## 7         SNOW   164 Fatalities  1164 Injuries
## 8        STORM   416 Fatalities  5338 Injuries
## 9      TORNADO  5661 Fatalities 91407 Injuries
## 10        WIND  1209 Fatalities  9001 Injuries
## 11      WINTER   278 Fatalities  1891 Injuries
#Clean and calculate the data of economic variables
# total economic damage
total_economic_damage <- ddply(weather_impact, .(BRIEF_EVENT), 
                               summarize, Total = sum(PROPERTY + CROP,  na.rm = TRUE))
total_economic_damage$Type <- "Property and Crop Damage"

# Property Damage 
PROPERTY <- ddply(weather_impact, .(BRIEF_EVENT), 
                  summarize, Total = sum(PROPERTY, na.rm = TRUE))
PROPERTY$Type <- "Property"

# Crop Damage
CROP <- ddply(weather_impact, .(BRIEF_EVENT), 
              summarize, Total = sum(CROP, na.rm = TRUE))
CROP$Type <- "crop"

# Economic Damage
Economic_Damage <- rbind(PROPERTY, CROP)
Economic_Event <- join(PROPERTY, CROP,by="BRIEF_EVENT")
Economic_Event
##    BRIEF_EVENT        Total     Type       Total Type
## 1        FLOOD 167502193929 Property 12266906100 crop
## 2         HAIL  15733043048 Property  3046837473 crop
## 3         HEAT     20325750 Property   904469280 crop
## 4    LIGHTNING    933974947 Property    12097090 crop
## 5       OTHERS  96313022890 Property 23576788780 crop
## 6         RAIN   3270230192 Property   919315800 crop
## 7         SNOW   1024169752 Property   134683100 crop
## 8        STORM  66304209893 Property  6374469888 crop
## 9      TORNADO  58593098029 Property   417461520 crop
## 10        WIND  10847086618 Property  1403719150 crop
## 11      WINTER   6777295251 Property    47444000 crop
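The joined tables above end up with two columns named Total and two named Type, which makes them harder to read and to sort. As a possible alternative, the sketch below builds one row per group with uniquely named columns and ranks the groups by their combined total. It uses base R and an invented toy data frame, not the storm data.

```r
# Toy data; the values are invented for illustration.
toy <- data.frame(
  BRIEF_EVENT = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES  = c(1, 0, 2, 0),
  INJURIES    = c(14, 6, 3, 1)
)

# One row per group, with both totals as separate, uniquely named columns.
health <- aggregate(cbind(FATALITIES, INJURIES) ~ BRIEF_EVENT,
                    data = toy, FUN = sum)
health$TOTAL <- health$FATALITIES + health$INJURIES

# Rank the groups from most to least harmful.
health <- health[order(health$TOTAL, decreasing = TRUE), ]
print(health)
```

The same pattern would apply to the PROPERTY and CROP columns for the economic totals.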

Result Output

Based on the Project Target, the results of this analysis answer the following questions:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

For Question 1, the following plot is presented.

#Plot the health damage 
library(ggplot2)
Health_damage$BRIEF_EVENT <- as.factor(Health_damage$BRIEF_EVENT)

HP <- ggplot(Health_damage,aes(x=BRIEF_EVENT,y=Total,fill=Type))+ geom_bar(stat = "identity") +
        coord_flip() +
        labs(x="Event",y="Number of Fatalities and Injuries",title="Severe Weather Impacts on Public Health") +
        #theme_bw() is a complete theme, so it goes before theme() to keep the title adjustment
        theme_bw() +
        theme(plot.title = element_text(hjust = 1))
HP

For Question 2, the following plot is offered.

#Plot the economic damage
Economic_Damage$BRIEF_EVENT <- as.factor(Economic_Damage$BRIEF_EVENT)
EP <- ggplot(Economic_Damage,aes(x=BRIEF_EVENT,y=Total,fill=Type))+ geom_bar(stat = "identity") +
        coord_flip() +
        labs(x="Event",y="Amount of Dollar Loss",title="Severe Weather Impacts on Economy") +
        #theme_bw() is a complete theme, so it goes before theme() to keep the title adjustment
        theme_bw() +
        theme(plot.title = element_text(hjust = 1))
EP
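Both plots list the event groups in alphabetical order. If the ranking should be visible at a glance, the factor levels can be ordered by the combined total per group before plotting. The sketch below is one way to do that; it uses a few totals copied from the Health_Event table, and the name health_long is introduced here for illustration only.

```r
# A few totals copied from the Health_Event table above.
health_long <- data.frame(
  BRIEF_EVENT = rep(c("FLOOD", "HAIL", "HEAT", "TORNADO"), times = 2),
  Total       = c(1524, 15, 3138, 5661, 8602, 1371, 9224, 91407),
  Type        = rep(c("Fatalities", "Injuries"), each = 4)
)

# Order the factor levels by the combined total of each group.
group_total <- tapply(health_long$Total, health_long$BRIEF_EVENT, sum)
health_long$BRIEF_EVENT <- factor(health_long$BRIEF_EVENT,
                                  levels = names(sort(group_total)))

# Draw the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(health_long, aes(x = BRIEF_EVENT, y = Total, fill = Type)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme_bw()
  print(p)
}
```

With coord_flip(), the last factor level is drawn at the top, so the group with the largest total appears first.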

Conclusion

Based on the results above, tornadoes are the most harmful severe weather events for population health, causing by far the most fatalities and injuries. Floods cause the largest combined property and crop damage, making them the greatest threat to the economy.