Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data can be obtained from

[https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2]

Some documentation of the database is also available at the links below:

-National Weather Service Storm Data Documentation

[https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf]

-National Climatic Data Center Storm Events FAQ

[https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf]

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

The analysis was conducted with R version 4.0.3 (2020-10-10), platform x86_64-w64-mingw32/x64 (64-bit), on a Windows 10 system.

Question for Analysis

Based on the introduction, there are various kinds of questions that scientists could ask of these data.

This project sets the following two questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

With these two questions in mind, we can start the analysis with data processing.

R package used for the analysis

The R packages data.table, dplyr, ggplot2, and plyr are required to run the analysis.

library(data.table)
library(plyr)    # load plyr before dplyr so that dplyr's functions are not masked
library(dplyr)
library(ggplot2)

Data Processing

The analysis begins by downloading the source file and loading the data into the variable StormRawData.

Download source file data

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                      destfile ="./stormdata.csv.bz2")

StormRawData<-fread("./stormdata.csv.bz2")
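The download step above fetches the archive every time the report is run. A minimal sketch of one way to skip the download when the file is already on disk is shown here; the helper name download_if_missing is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when it is not already present on disk.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" requests a binary transfer, which matters for
    # compressed files on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage (commented out so that the sketch itself does not touch the network):
# download_if_missing(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "./stormdata.csv.bz2"
# )
```

The function returns the destination path invisibly, so it can be passed straight on to a reader such as fread.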

names(StormRawData)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

str(StormRawData)

## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

With the names and str functions, we can identify the column names and the variable types of the raw data. According to str, the raw data contain 902297 observations (rows) and 37 variables (columns).

Subset Data for Analysis

ColumnnameH<-c("EVTYPE","FATALITIES","INJURIES")
ColumnnameE<-c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")

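# all_of() states explicitly that the column names come from an external character vector
StormHealthData<-select(StormRawData,all_of(ColumnnameH))
StormEconomyData<-select(StormRawData,all_of(ColumnnameE))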

In this analysis, two data sets are subset from the raw data and then processed further into four tiny data sets for analysis. The first subset, StormHealthData, holds the event type, fatalities, and injuries, and is used to answer question 1. The second subset, StormEconomyData, holds the event type, property damage, property damage exponent, crop damage, and crop damage exponent, and is used to answer question 2.

Data Processing On Subset Data 1 StormHealthData

print(table(is.na(StormHealthData$FATALITIES)))

## 
##  FALSE 
## 902297

print(table(is.na(StormHealthData$INJURIES)))

## 
##  FALSE 
## 902297

The is.na function is used to confirm that there are no missing values in either the fatalities or the injuries column. Both tables show 902297 under FALSE, which matches the number of rows reported by str, so no values are missing.
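The same kind of check can be run on every column at once. The following is a minimal sketch on a made-up toy table (toy_health and its values are assumptions for illustration, not taken from the storm data):

```r
# Toy data frame standing in for StormHealthData (values are made up).
toy_health <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, NA, 3),
  INJURIES   = c(15, 0, 2)
)

# Count the missing values in every column at once.
na_counts <- colSums(is.na(toy_health))
print(na_counts)

# anyNA() gives a single TRUE/FALSE answer for the whole table.
print(anyNA(toy_health))
```

A column with a non-zero count would need to be handled before summing, for example with na.rm = TRUE.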

Data Processing On Subset Data 2 StormEconomyData

print(unique(StormEconomyData$PROPDMGEXP))

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

print(unique(StormEconomyData$CROPDMGEXP))

## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

The unique function shows that both the property damage exponent and the crop damage exponent contain empty values, unknown symbols such as "-", "+", and "?", and a mix of upper- and lower-case letters. The symbols therefore need to be converted to numeric multipliers as follows:

"0": 1
"1": 10
"2": 100
"3": 1,000
"4": 10,000
"5": 100,000
"6": 1,000,000
"7": 10,000,000
"8": 100,000,000
"9": 1,000,000,000
"H": 100
"K": 1,000
"M": 1,000,000
"B": 1,000,000,000

The conversion was carried out with the code below, which also maps empty values and the unknown symbols to a multiplier of 1:

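# Exponent symbols in order of first appearance in each column
tomap1<-unique(StormEconomyData$PROPDMGEXP)
tomap2<-unique(StormEconomyData$CROPDMGEXP)

# Multipliers listed in the same order as the symbols returned by unique()
map1<-c(10^3,10^6,1,10^9,10^6,1,1,10^5,10^6,1,10^4,10^2,10^3,10^2,10^7,10^2,1,10,10^8)
map2<-c(1,10^6,10^3,10^6,10^9,1,1,10^3,10^2)

# Replace each symbol with its multiplier (mapvalues() comes from plyr)
StormEconomyData$PROPDMGEXP<-mapvalues(StormEconomyData$PROPDMGEXP,from = tomap1, to = map1)
StormEconomyData$CROPDMGEXP<-mapvalues(StormEconomyData$CROPDMGEXP,from = tomap2, to = map2)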

str(StormEconomyData)

## Classes 'data.table' and 'data.frame':   902297 obs. of  5 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "1000" "1000" "1000" "1000" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "1" "1" "1" "1" ...
##  - attr(*, ".internal.selfref")=<externalptr>

According to str, the property damage exponent and crop damage exponent columns are still of type character. The following code converts them from character to numeric to ease the later steps that generate the tiny data sets.

StormEconomyData$PROPDMGEXP<-as.numeric(as.character(StormEconomyData$PROPDMGEXP))
StormEconomyData$CROPDMGEXP<-as.numeric(as.character(StormEconomyData$CROPDMGEXP))
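The mapping above depends on the order in which unique() returns the symbols. As an alternative, the conversion can be sketched with a named lookup vector, which does not depend on that order and returns numeric values directly. The names exp_lookup and exp_to_multiplier are hypothetical and not part of the original analysis; the treatment of empty and unknown symbols as 1 follows the mapping used above.

```r
# Multipliers keyed by exponent symbol. Symbols that are not in the lookup
# ("", "+", "-", "?") are assumed to mean a multiplier of 1.
exp_lookup <- c(
  "0" = 1, "1" = 10, "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5,
  "6" = 1e6, "7" = 1e7, "8" = 1e8, "9" = 1e9,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

# Hypothetical helper: convert exponent symbols to numeric multipliers.
exp_to_multiplier <- function(symbols) {
  # toupper() folds "h", "k", and "m" into their upper-case forms
  multiplier <- unname(exp_lookup[toupper(symbols)])
  # Symbols without a match (including the empty string) come back as NA
  multiplier[is.na(multiplier)] <- 1
  multiplier
}

print(exp_to_multiplier(c("K", "m", "", "?", "5", "h")))
```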

Tiny Data set for Answering Question 1

Across the United States, which types of events are most harmful with respect to population health?

To answer this question I generated two tiny data sets: StormFatal, the sum of fatalities by event type, and StormInjure, the sum of injuries by event type. From these two data sets, the top 10 event types by fatalities and the top 10 by injuries were picked for the analysis.

StormFatal<-aggregate(FATALITIES~EVTYPE,data=StormHealthData,FUN=sum)
StormFatal<-arrange(StormFatal,desc(FATALITIES))
Top10StormFatal<-StormFatal[1:10,]

StormInjure<-aggregate(INJURIES~EVTYPE,StormHealthData,FUN=sum)
StormInjure<-arrange(StormInjure,desc(INJURIES))
Top10StormInjure<-StormInjure[1:10,]
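The aggregate, sort, and select steps above can be illustrated on a small scale. The following is a minimal base R sketch on made-up data (toy_health and its values are assumptions, not real storm records):

```r
# Toy data (made-up values) with several records per event type.
toy_health <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(3, 1, 2, 4, 0, 0)
)

# Sum the fatalities per event type.
fatal_by_type <- aggregate(FATALITIES ~ EVTYPE, data = toy_health, FUN = sum)

# Sort in decreasing order and keep the top 2. head() never returns more rows
# than exist, whereas indexing with 1:10 pads a shorter table with NA rows.
top_fatal <- head(fatal_by_type[order(-fatal_by_type$FATALITIES), ], 2)
print(top_fatal)
```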

Tiny Data set for Answering Question 2

Across the United States, which types of events have the greatest economic consequences?

To answer this question I generated two tiny data sets: StormPropertyDamage, the sum of property damage multiplied by the property damage exponent for each event type, and StormCropDamage, the sum of crop damage multiplied by the crop damage exponent for each event type. In both data sets one new variable was created, which is the damage cost in billions. From these two data sets, the top 10 event types by property damage and the top 10 by crop damage were picked for the analysis.

StormPropertyDamage<-aggregate(PROPDMG*PROPDMGEXP~EVTYPE,StormEconomyData,sum)
colnames(StormPropertyDamage)<-c("EVTYPE","TotalDamage")
StormPropertyDamage<-arrange(StormPropertyDamage,desc(TotalDamage))
StormPropertyDamage<-mutate(StormPropertyDamage,TotalInBillion= TotalDamage/10^9)
StormPropertyDamage<-StormPropertyDamage[1:10,]

StormCropDamage<-aggregate(CROPDMG*CROPDMGEXP~EVTYPE,StormEconomyData,sum)
colnames(StormCropDamage)<-c("EVTYPE","TotalDamage")
StormCropDamage<-arrange(StormCropDamage,desc(TotalDamage))
StormCropDamage<-mutate(StormCropDamage,TotalInBillion=TotalDamage/10^9)
StormCropDamage<-StormCropDamage[1:10,]
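The cost calculation can be checked by hand on a few rows. The following is a minimal base R sketch on made-up data (toy_econ and its values are assumptions), in which the exponent column already holds numeric multipliers:

```r
# Toy data (made-up values): damage figure and its already-converted multiplier.
toy_econ <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(2.5, 25, 1.2, 500),
  PROPDMGEXP = c(1e9, 1e6, 1e9, 1e3)
)

# Compute the dollar cost per record first, then sum it per event type.
toy_econ$Cost <- toy_econ$PROPDMG * toy_econ$PROPDMGEXP
cost_by_type <- aggregate(Cost ~ EVTYPE, data = toy_econ, FUN = sum)

# Express the totals in billions and sort from largest to smallest.
cost_by_type$TotalInBillion <- cost_by_type$Cost / 1e9
cost_by_type <- cost_by_type[order(-cost_by_type$Cost), ]
print(cost_by_type)
```

In this toy table the two flood records add up to 2.5 + 1.2 = 3.7 billion, which puts FLOOD in the first row.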

Analysis Result

graph1<-ggplot(data=Top10StormFatal,aes(x=reorder(EVTYPE,-FATALITIES),y=FATALITIES,fill=EVTYPE))
graph1<-graph1+geom_bar(stat="identity")
graph1<-graph1+theme(axis.text.x = element_text(angle =90))+xlab("Type of Event")
graph1<-graph1+ylab("Number of Fatal Cases")+ggtitle("Top 10 Fatalities Event")+theme(legend.position="none")
graph1<-graph1+ylim(c(0,6000))
graph1

graph2<-ggplot(data = Top10StormInjure,aes(x=reorder(EVTYPE,-INJURIES),y=INJURIES,fill=EVTYPE))
graph2<-graph2+geom_bar(stat="identity")
graph2<-graph2+theme(axis.text.x= element_text(angle= 90))+xlab("Type of Event")
graph2<-graph2+ylab("Number of Injury Cases")+ggtitle("Top 10 Injuries Event")
graph2<-graph2+theme(legend.position = "none")
graph2<-graph2+ylim(c(0,100000))
graph2

In the two plots above, “Top 10 Fatalities Event” and “Top 10 Injuries Event”, tornado has the tallest bar for both fatalities and injuries, which means that tornadoes were the most harmful event type for population health.

graph3<-ggplot(StormPropertyDamage,aes(x=reorder(EVTYPE,-TotalInBillion),y=TotalInBillion,fill=EVTYPE))
graph3<-graph3+geom_bar(stat="identity")
graph3<-graph3+theme(axis.text.x = element_text(angle=90))+xlab("Type of Event")
graph3<-graph3+ylab("Property Damage Cost in Billion")+ggtitle("Top 10 Property Damage Event")
graph3<-graph3+theme(legend.position = "none")
graph3

graph4<-ggplot(StormCropDamage,aes(x=reorder(EVTYPE,-TotalInBillion),y=TotalInBillion,fill=EVTYPE))
graph4<-graph4+geom_bar(stat="identity")
graph4<-graph4+theme(axis.text.x = element_text(angle = 90))+xlab("Type of Event")
graph4<-graph4+ylab("Crop Damage Cost in Billion")+ggtitle("Top 10 Crop Damage Event")
graph4<-graph4+theme(legend.position = "none")
graph4

In the “Top 10 Property Damage Event” plot above, flood has the tallest bar, which means that flood was the event type with the greatest property damage. The “Top 10 Crop Damage Event” plot is read the same way: the tallest bar marks the event type with the greatest crop damage.

Conclusion

Tornadoes are the most harmful event type for population health, while floods have the greatest economic consequences.

Reproducible Research Project 2

Yee Thong