Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data can be obtained from
[https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2]
Some documentation of the database is also available:
-National Weather Service Storm Data Documentation
[https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf]
-National Climatic Data Center Storm Events FAQ
[https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf]
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The analysis was conducted in R version 4.0.3 (2020-10-10) on platform x86_64-w64-mingw32/x64 (64-bit), running Windows 10.
Based on the introduction above, many kinds of questions can be asked of this data.
This project addresses the following two questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
With these two questions in mind, the analysis begins with data processing.
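The R packages data.table, plyr, dplyr, and ggplot2 are required to run the analysis. plyr is loaded before dplyr so that it does not mask dplyr functions of the same name, such as arrange and mutate.
library(data.table)
library(plyr)
library(dplyr)
library(ggplot2)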
The analysis begins by downloading the source file and loading the data into the variable StormRawData.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile ="./stormdata.csv.bz2")
StormRawData<-fread("./stormdata.csv.bz2")
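Re-running the report downloads the file again each time. A minimal sketch of one way to avoid that is shown below; the helper name fetch_once is hypothetical and not part of the original analysis, and it uses only base R.

```r
# Hypothetical helper: download `url` to `destfile` only when the file
# is not already present on disk.
# Returns TRUE if a download happened, FALSE if the local copy was reused.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  # mode = "wb" writes the compressed file in binary mode
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

# Usage with the URL from this report (not run here):
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "./stormdata.csv.bz2")
# StormRawData <- fread("./stormdata.csv.bz2")
```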
names(StormRawData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(StormRawData)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
The names and str functions show the column names and the type of each variable in the raw data. According to str, the raw data contains 902297 observations (rows) and 37 variables (columns).
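The earlier remark that the first years of the database contain fewer events could be checked by counting events per year from BGN_DATE. The sketch below is not part of the original analysis; it uses four date strings copied from the str output above instead of the full column, and base R only.

```r
# Sample BGN_DATE strings copied from the str() output above
bgn_date <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00",
              "2/20/1951 0:00:00", "6/8/1951 0:00:00")

# Parse month/day/year; as.Date ignores the trailing time part
parsed <- as.Date(bgn_date, format = "%m/%d/%Y")
event_year <- as.integer(format(parsed, "%Y"))

# Count events per year
events_per_year <- table(event_year)
print(events_per_year)
```

Applied to StormRawData$BGN_DATE, the same steps would give one count per year from 1950 to 2011.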
ColumnnameH<-c("EVTYPE","FATALITIES","INJURIES")
ColumnnameE<-c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
StormHealthData<-select(StormRawData,all_of(ColumnnameH))
StormEconomyData<-select(StormRawData,all_of(ColumnnameE))
Two subsets are taken from the raw data and are later aggregated into four small summary data sets for analysis. StormHealthData holds the event type, fatalities, and injuries, and is used for question 1. StormEconomyData holds the event type, property damage, property damage exponent, crop damage, and crop damage exponent, and is used for question 2.
print(table(is.na(StormHealthData$FATALITIES)))
##
## FALSE
## 902297
print(table(is.na(StormHealthData$INJURIES)))
##
## FALSE
## 902297
The is.na function is used to confirm that there are no missing values in the FATALITIES and INJURIES columns. Both tables show 902297 under FALSE, which matches the 902297 rows reported by str, so neither column has missing values.
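The same check can be done for several columns at once. The sketch below is an alternative to the calls above, not the original method. It runs on a small made-up data frame (one NA is inserted on purpose so the count is visible) and uses base R only.

```r
# Toy data frame standing in for StormHealthData; the NA is deliberate
health <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, 1, NA),
  INJURIES = c(15, 0, 2)
)

# is.na() on a data frame gives a logical matrix;
# colSums() then counts the missing values in each column
na_counts <- colSums(is.na(health[, c("FATALITIES", "INJURIES")]))
print(na_counts)
```

A result of zero for every column corresponds to the all-FALSE tables shown above.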
print(unique(StormEconomyData$PROPDMGEXP))
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
print(unique(StormEconomyData$CROPDMGEXP))
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
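The unique function shows that both the property damage exponent and the crop damage exponent contain empty values and unrecognized symbols such as "-", "?", and "+", as well as a mix of upper-case and lower-case letters. The codes therefore need to be converted to numeric multipliers as follows:
"0": 1
"1": 10
"2": 100
"3": 1,000
"4": 10,000
"5": 100,000
"6": 1,000,000
"7": 10,000,000
"8": 100,000,000
"9": 1,000,000,000
"H": 100
"K": 1,000
"M": 1,000,000
"B": 1,000,000,000
Lower-case letters are treated the same as their upper-case forms, and the empty value and the symbols "+", "-", and "?" are mapped to 1.
The conversion is carried out with the code below. The values in map1 and map2 are listed in the same order as the codes returned by unique above.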
tomap1<-unique(StormEconomyData$PROPDMGEXP)
tomap2<-unique(StormEconomyData$CROPDMGEXP)
map1<-c(10^3,10^6,1,10^9,10^6,1,1,10^5,10^6,1,10^4,10^2,10^3,10^2,10^7,10^2,1,10,10^8)
map2<-c(1,10^6,10^3,10^6,10^9,1,1,10^3,10^2)
StormEconomyData$PROPDMGEXP<-mapvalues(StormEconomyData$PROPDMGEXP,from = tomap1, to = map1)
StormEconomyData$CROPDMGEXP<-mapvalues(StormEconomyData$CROPDMGEXP,from = tomap2, to = map2)
str(StormEconomyData)
## Classes 'data.table' and 'data.frame': 902297 obs. of 5 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "1000" "1000" "1000" "1000" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "1" "1" "1" "1" ...
## - attr(*, ".internal.selfref")=<externalptr>
According to str, the property damage exponent and crop damage exponent columns are still of type character. The following code converts them from character to numeric to make the later aggregation easier.
StormEconomyData$PROPDMGEXP<-as.numeric(as.character(StormEconomyData$PROPDMGEXP))
StormEconomyData$CROPDMGEXP<-as.numeric(as.character(StormEconomyData$CROPDMGEXP))
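The positional mapping above depends on the order in which unique returns the codes. A minimal sketch of an alternative is a named lookup vector, which does not depend on that order and returns numeric values directly. The names exp_lookup and exponent_to_multiplier are hypothetical and not part of the original analysis; the multipliers follow the conversion table above.

```r
# Named lookup: exponent code -> numeric multiplier
exp_lookup <- c(
  "0" = 1, "1" = 10, "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5,
  "6" = 1e6, "7" = 1e7, "8" = 1e8, "9" = 1e9,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

# Hypothetical helper: convert a character vector of codes to multipliers.
# Codes not in the lookup ("", "+", "-", "?") fall back to 1.
exponent_to_multiplier <- function(code) {
  multiplier <- unname(exp_lookup[toupper(code)])
  multiplier[is.na(multiplier)] <- 1
  multiplier
}

codes <- c("K", "m", "", "B", "?", "h", "5")
print(exponent_to_multiplier(codes))
```

With this approach the later as.numeric conversion would not be needed, because the result is already numeric.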
Across the United States, which types of events are most harmful with respect to population health?
To answer this question, two small data sets are generated: StormFatal, the sum of fatalities by event type, and StormInjure, the sum of injuries by event type. From these two data sets, the top 10 event types by fatalities and the top 10 by injuries are selected for the analysis.
StormFatal<-aggregate(FATALITIES~EVTYPE,data=StormHealthData,FUN=sum)
StormFatal<-arrange(StormFatal,desc(FATALITIES))
Top10StormFatal<-StormFatal[1:10,]
StormInjure<-aggregate(INJURIES~EVTYPE,StormHealthData,FUN=sum)
StormInjure<-arrange(StormInjure,desc(INJURIES))
Top10StormInjure<-StormInjure[1:10,]
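The aggregate, sort, and take-the-top-rows steps are the same for fatalities and injuries, so they could be wrapped in one function. The sketch below is an illustration on made-up data, using base R only; the helper name top_events is hypothetical.

```r
# Toy data standing in for StormHealthData (values are made up)
health <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0)
)

# Hypothetical helper: sum `value_col` per event type and
# return the n event types with the largest totals
top_events <- function(data, value_col, n = 10) {
  totals <- aggregate(data[[value_col]],
                      by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- value_col
  # Sort in descending order of the summed value
  totals <- totals[order(-totals[[value_col]]), ]
  head(totals, n)
}

top2 <- top_events(health, "FATALITIES", n = 2)
print(top2)
```

On the real data, top_events(StormHealthData, "FATALITIES") and top_events(StormHealthData, "INJURIES") would play the role of Top10StormFatal and Top10StormInjure.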
Across the United States, which types of events have the greatest economic consequences?
To answer this question, two small data sets are generated: StormPropertyDamage, the sum of property damage multiplied by the property damage exponent by event type, and StormCropDamage, the sum of crop damage multiplied by the crop damage exponent by event type. In both data sets a new variable, TotalInBillion, is created, which holds the damage cost in billions. From these two data sets, the top 10 event types by property damage and the top 10 by crop damage are selected for the analysis.
StormPropertyDamage<-aggregate(PROPDMG*PROPDMGEXP~EVTYPE,StormEconomyData,sum)
colnames(StormPropertyDamage)<-c("EVTYPE","TotalDamage")
StormPropertyDamage<-arrange(StormPropertyDamage,desc(TotalDamage))
StormPropertyDamage<-mutate(StormPropertyDamage,TotalInBillion= TotalDamage/10^9)
StormPropertyDamage<-StormPropertyDamage[1:10,]
StormCropDamage<-aggregate(CROPDMG*CROPDMGEXP~EVTYPE,StormEconomyData,sum)
colnames(StormCropDamage)<-c("EVTYPE","TotalDamage")
StormCropDamage<-arrange(StormCropDamage,desc(TotalDamage))
StormCropDamage<-mutate(StormCropDamage,TotalInBillion=TotalDamage/10^9)
StormCropDamage<-StormCropDamage[1:10,]
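As a worked example of the arithmetic, the sketch below multiplies the damage figure by its multiplier, sums per event type, and rescales to billions. The two TORNADO rows use the first two PROPDMG values from the str output above (25 and 2.5, both with exponent "K"); the FLOOD row is made up for illustration. Only base R is used.

```r
# Toy data: PROPDMGEXP is already converted to a numeric multiplier
economy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "FLOOD"),
  PROPDMG = c(25, 2.5, 1.5),
  PROPDMGEXP = c(1e3, 1e3, 1e9)
)

# Damage per row = figure * multiplier, e.g. 25 * 1000 = 25000
economy$PropDamage <- economy$PROPDMG * economy$PROPDMGEXP

# Sum per event type, then express the total in billions
totals <- aggregate(PropDamage ~ EVTYPE, data = economy, FUN = sum)
totals$TotalInBillion <- totals$PropDamage / 1e9

# Largest total first
totals <- totals[order(-totals$PropDamage), ]
print(totals)
```

Here the two tornado rows sum to 27500, or 0.0000275 billion, while the single made-up flood row gives 1.5 billion.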
graph1<-ggplot(data=Top10StormFatal,aes(x=reorder(EVTYPE,-FATALITIES),y=FATALITIES,fill=EVTYPE))
graph1<-graph1+geom_bar(stat="identity")
graph1<-graph1+theme(axis.text.x = element_text(angle =90))+xlab("Type of Event")
graph1<-graph1+ylab("Number of Fatal Cases")+ggtitle("Top 10 Fatalities Event")+theme(legend.position="none")
graph1<-graph1+ylim(c(0,6000))
graph1
graph2<-ggplot(data = Top10StormInjure,aes(x=reorder(EVTYPE,-INJURIES),y=INJURIES,fill=EVTYPE))
graph2<-graph2+geom_bar(stat="identity")
graph2<-graph2+theme(axis.text.x= element_text(angle= 90))+xlab("Type of Event")
graph2<-graph2+ylab("Number of Injuries Cases")+ggtitle("Top 10 Injuries Event")
graph2<-graph2+theme(legend.position = "none")
graph2<-graph2+ylim(c(0,100000))
graph2
In both plots above, “Top 10 Fatalities Event” and “Top 10 Injuries Event”, tornado has the tallest bar, which means that tornadoes are the most harmful event type for population health.
graph3<-ggplot(StormPropertyDamage,aes(x=reorder(EVTYPE,-TotalInBillion),y=TotalInBillion,fill=EVTYPE))
graph3<-graph3+geom_bar(stat="identity")
graph3<-graph3+theme(axis.text.x = element_text(angle=90))+xlab("Type of Event")
graph3<-graph3+ylab("Property Damage Cost in Billion")+ggtitle("Top 10 Property Damage Event")
graph3<-graph3+theme(legend.position = "none")
graph3
graph4<-ggplot(StormCropDamage,aes(x=reorder(EVTYPE,-TotalInBillion),y=TotalInBillion,fill=EVTYPE))
graph4<-graph4+geom_bar(stat="identity")
graph4<-graph4+theme(axis.text.x = element_text(angle = 90))+xlab("Type of Event")
graph4<-graph4+ylab("Crop Damage Cost in Billion")+ggtitle("Top 10 Crop Damage Event")
graph4<-graph4+theme(legend.position = "none")
graph4
In the “Top 10 Property Damage Event” plot above, flood has the tallest bar, which means that flood is the event type with the greatest economic consequences in terms of property damage. The “Top 10 Crop Damage Event” plot gives the corresponding ranking for crop losses.
Tornadoes are the most harmful event type for population health, while floods have the greatest economic consequences.