This is the report to show Health Casualty and Economic Damage in US due to Storms and other severe weather events occurred between 1950 and 2011.This report involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.In this report I have carried out analysis to find out top 10 event types that are most harmful with respect to population health causing fatalities/injuries and causing Economic damage in US $ Billions.
For Data processing I have downloaded the data from storm database from link https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.Below is the code to download and read raw csv data into a table called data.For the ease of calcualtions in later steps I have converted few columns of interest to character and numeric.
temp <- tempfile()
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url,temp,method="curl")
# Reading CSV file in a table and loading data with colClasses as Character and numeric for columns of interest i.e EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG
data <- read.table(temp,header = TRUE,sep = ",",colClasses = c(EVTYPE="character",FATALITIES = "numeric",INJURIES ="numeric",PROPDMG = "numeric",CROPDMG="numeric"),strip.white = TRUE)
Below is the Sample contents of the file
# Display the contents of the file
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
To get the actual list of the Events names i have manual created a list from the link https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf.
# Preparing a list for the actual Event names from the National Weather Service documents
list <- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")
# Changing the list to upper case
list <- toupper(list)
Below is the output of the list created.
# Display the contents of the file
list
## [1] "ASTRONOMICAL LOW TIDE" "AVALANCHE"
## [3] "BLIZZARD" "COASTAL FLOOD"
## [5] "COLD/WIND CHILL" "DEBRIS FLOW"
## [7] "DENSE FOG" "DENSE SMOKE"
## [9] "DROUGHT" "DUST DEVIL"
## [11] "DUST STORM" "EXCESSIVE HEAT"
## [13] "EXTREME COLD/WIND CHILL" "FLASH FLOOD"
## [15] "FLOOD" "FROST/FREEZE"
## [17] "FUNNEL CLOUD" "FREEZING FOG"
## [19] "HAIL" "HEAT"
## [21] "HEAVY RAIN" "HEAVY SNOW"
## [23] "HIGH SURF" "HIGH WIND"
## [25] "HURRICANE (TYPHOON)" "ICE STORM"
## [27] "LAKE-EFFECT SNOW" "LAKESHORE FLOOD"
## [29] "LIGHTNING" "MARINE HAIL"
## [31] "MARINE HIGH WIND" "MARINE STRONG WIND"
## [33] "MARINE THUNDERSTORM WIND" "RIP CURRENT"
## [35] "SEICHE" "SLEET"
## [37] "STORM SURGE/TIDE" "STRONG WIND"
## [39] "THUNDERSTORM WIND" "TORNADO"
## [41] "TROPICAL DEPRESSION" "TROPICAL STORM"
## [43] "TSUNAMI" "VOLCANIC ASH"
## [45] "WATERSPOUT" "WILDFIRE"
## [47] "WINTER STORM" "WINTER WEATHER"
To filter the data based on the actual List of Events,first i need to remove all Junk values to space by using str_replace_all command available in stringdist package.Once we have much cleaner set of data in column EVTYPE we will do amatch command available in stringdist with the actual list of Events.This will futher clean up data by doing a close match on the data we original had in the raw csv file with the actuall list of EVENTS.Futher we also need to convert the PROPDMGEXP and CROPDMGEXP which has alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions to K = 1000,M=1000000 and B=1000000000.The actual expense is equal to PROPDMGEXP mutiply by PROPDMG for property damage and CROPDMGEXP mutiply by CROPDMG for crop damage.Creating a new dataset as newdata with 2 new columns added PROPDMGEXP NEW and CROPDMGEXP NEW containg data coverion of PROPDMG and CROPDMG in $
The below code provides snapshot of cleaning up data as creating a new data set explained above.
# Using stringr for str_replace_all and stringdist for amatch commands
library(stringr)
## Warning: package 'stringr' was built under R version 3.1.2
library(stringdist)
## Warning: package 'stringdist' was built under R version 3.1.2
# Removing Junk Values from EVTYPE columns and replacing them with blank
data$EVTYPE <- str_replace_all(data$EVTYPE, "[[:punct:]]"," ")
# Doing a partial match to get the closest match with the EVTYPE data and actual Event list.
data$EVTYPE <- list[amatch(str_trim(toupper(data$EVTYPE)), list, maxDist=Inf)]
# Converting PROPDMGEXP to CROPDMGEXP to K = 1000,M=1000000 and B=1000000000
data$PROPDMGEXP <- c(1000,1000000,1000000000)[match(str_trim(toupper(data$PROPDMGEXP)),c("K","M","B"))]
data$CROPDMGEXP <- c(1000,1000000,1000000000)[match(str_trim(toupper(data$CROPDMGEXP)),c("K","M","B"))]
# Creating new dataset new_data with 2 new columns PROPDMGEXP_NEW and CROPDMGEXP_NEW containg data coverion of PROPDMG and CROPDMG in $
newdata <- cbind(cbind(data,"PROPDMGEXPNEW" = (as.numeric(data$PROPDMGEXP)*data$PROPDMG)),"CROPDMGEXPNEW" = (as.numeric(data$CROPDMGEXP)*data$CROPDMG))
After creating the new dataset I have created a sperate dataset for Health related impact due to Event type by aggregating the data from 1950 till 2011 for FATALITIES and INJURIES.First I have created a dataset dataFATALITIES, I have aggreagated data for EVTYPE and FATALITIES and added a new column HEALTH having a default value of FATALITIES, I have also renamed columan FATALITIES as COUNTS.Similarly I have creatd a data set dataINJURIES and I have aggreagated data for EVTYPE and INJURIES and added a new column HEALTH having a default value of INJURIES, I have also renamed columns INJURIES as COUNTS.The above steps were done to create a new dataset Health, having FATALITIES and INJURIES combined together and extracting top 10 records by ordering data from dataFATALITIES and dataINJURIES in descending order. The Health data set will give us top 10 records for EVTYPE causing FATALITIES and INJURIES.
# Aggreagated data for EVTYPE and FATALITIES and added a new column HEALTH having a default value of FATALITIES
dataFATALITIES <- cbind(setNames(aggregate(FATALITIES ~ EVTYPE,newdata,sum),c("EVTYPE","COUNT")),"HEALTH" = 'FATALITIES')
# Aggreagated data for EVTYPE and INJURIES and added a new column HEALTH having a default value of INJURIES
dataINJURIES <- cbind(setNames(aggregate(INJURIES ~ EVTYPE,newdata,sum),c("EVTYPE","COUNT")),"HEALTH" = 'INJURIES')
#create a new dataset Health, having FATALITIES and INJURIES combined together and extracting top 10 records by ordering data from dataFATALITIES and dataINJURIES #in descending order
Health <- rbind(head(dataFATALITIES[order(-dataFATALITIES$COUNT),],10),head(dataINJURIES[order(-dataINJURIES$COUNT),],10))
Similarly I have created a sperate dataset for Economic related impact due to Event type by aggregating the data from 1950 till 2011 for PROPDMG and CROPDMG.First I have created a dataset dataPROPDMG, I have aggreagated data for EVTYPE and PROPDMGEXPNEW and added a new column ECODAMAGE having a default value of PROPERTY DAMAGE, I have also renamed columan PROPDMGEXPNEW as COUNTS.Similarly I have creatd a data set dataCROPDMG and I have aggreagated data for EVTYPE and CROPDMGEXPNEW and added a new column ECODAMAGE having a default value of CROP DAMAGE, I have also renamed columns CROPDMGEXPNEW as COUNTS.The above steps were done to create a new dataset Property, having PROPDMGEXPNEW and CROPDMGEXPNEW combined together and extracting top 10 records by ordering data from dataPROPDMG and dataCROPDMG in descending order. The Health data set will give us top 10 records for EVTYPE causing PROPDMGEXPNEW and CROPDMGEXPNEW.
#Aggreagated data for EVTYPE and PROPDMGEXPNEW and added a new column ECODAMAGE having a default value of PROPERTY DAMAGE
dataPROPDMG <- cbind(setNames(aggregate(PROPDMGEXPNEW ~ EVTYPE,newdata,sum),c("EVTYPE","COUNT")),"ECODAMAGE" = 'PROPERTY DAMAGE')
#Aggreagated data for EVTYPE and CROPDMGEXPNEW and added a new column ECODAMAGE having a default value of CROP DAMAGE
dataCROPDMG <- cbind(setNames(aggregate(CROPDMGEXPNEW ~ EVTYPE,newdata,sum),c("EVTYPE","COUNT")),"ECODAMAGE" = 'CROP DAMAGE')
#New dataset Property, having PROPDMGEXPNEW and CROPDMGEXPNEW combined together and extracting top 10 records by ordering data from dataPROPDMG and dataCROPDMG in #descending order
Property <- rbind(head(dataPROPDMG[order(-dataPROPDMG$COUNT),],10),head(dataCROPDMG[order(-dataCROPDMG$COUNT),],10))
Finally creating a dataset Propertytotal containing total Economic damage due to prperty and crop damage due to top 10 Event Types.
#Aggreagated data for EVTYPE and PROPDMGEXPNEW and added a new column ECODAMAGE having a default value of PROPERTY DAMAGE
Propertytotal <- aggregate(COUNT ~ EVTYPE,Property,sum)
1.Below is the graph showing top 10 Event types causing FATALITIES and INJURIES
library(ggplot2)
p = ggplot(Health, aes(x = EVTYPE,y = COUNT,fill=HEALTH)) +
geom_bar(stat="identity") +
facet_grid(HEALTH ~ .) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = COUNT),position=position_jitter(width=0),size =3,vjust=-0.25) +
xlab("Event Types") +
ylab("Population health impact") +
ggtitle("Top 10 most harmful Event Types with respect to population health")
print(p)
As per above graph Tornado is the most Harmful to population Health with 5633 FATALITIES and 91364 INJURIES
2.Below is the graph showing top 10 Event types causing Property and Crop damages in US Billion $.
library(ggplot2)
g = ggplot(Property, aes(x = EVTYPE,y = COUNT/1000000000)) + geom_bar(stat="identity") + facet_grid(ECODAMAGE ~ .) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = round(COUNT/1000000000,digits = 0)),position=position_jitter(width=0),size =3,vjust=-0.25) +
xlab("Event Types") +
ylab("Economic Damage in US Billion $") +
ggtitle("Property AND Crop damage due to top 10 different Events Types in US Billion $")
print(g)
As per above graph Flood is the most Harmful to US Economic with 145 Billion $ of Property Damage and Draught causing 14 Billion US $ crop damage
3.Below is the graph showing top 10 Event types causing total Economic damage in US Billion $.
library(ggplot2)
q = ggplot(Propertytotal, aes(x = EVTYPE,y = COUNT/1000000000)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = round(COUNT/1000000000,digits = 0)),position=position_jitter(width=0),size =3,vjust=-0.25) +
xlab("Event Types") +
ylab("Economic Damage in US Billion $") +
ggtitle("Total damage due to top 10 different Events Types in US Billion $")
print(q)
As per above graph Flood is the most Harmful to US Economic with 151 Billion $ of total damage