Synopsis

This report contains the analysis of Storm Data and its impact on population health and economy. Storm Database contains detailed information about Storms and other similar events that have occurred in United States and have affected public health and had great consequences on the economy.

This analysis contains code distributed across R chunks for downloading and reading the data, cleaning the data, exlporing the data by some basic heuristics and by creating some plots we will finally be inferring about which type of events are most harmful with respect to population health and economy.

The analysis was done by analysing first the population health and then economic data.

In each analysis data was grouped by Event Types (EVTYPE) column and aggregated against different type of parameters available to measure a particular effect. Like Fatalities and Injuries for population health and Crop and Property damage for economy.

Data in different column were also normalized to fit all the data in a range of 0-1 that gives us better understanding both in reading and viewing it as a plot.

Note:

For this report to compile and run will require Storm Data present in your working directory. If its present it will proceed further If not the R chunk for downloading the data will download the required file.
The data contains over 90k rows hence certain chunks for reading and processing the data have been cached. But it will be take some time first time this document is knitted.

The Storm Data has been downloaded from this location whereas documentation and FAQs were used for support.

Data Processing

This section covers the entire analysis part. This has sub-sections that covers data processing, cleaning, analysis and plotting part.

Required Libraries

require(dplyr)
require(knitr)
require(reshape2)
require(ggplot2)
require(grid)
require(gridExtra)
require(lubridate)

Downloading

## Downloading and Extracting file. This checks if file is already present in the directory.
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2";
fileName<-"repdata-data-StormData.csv.bz2";
## Checking If Storm Data is present in the current workspace
if(!(file.exists(fileName))) {
        print("Required Storm Data Not Present..Downloading..")
        download.file(url = fileUrl,destfile = fileName)
} else {
        print("Storm Data Present")
}

## [1] "Storm Data Present"

Reading Data

stormdata<-read.csv(fileName)

Cleaning Data

There are two things that needs to be fixed in this data.

Filter all the data that were recorded prior to 1996

This is needed because prior to 1996 only data for Tornadoes (references here), Thunderstorm Winds and Hail were recorded. If we will not filter out these years this will give undue bias to these events as compared to other events.

stormdata$BGN_DATE<-as.Date(stormdata$BGN_DATE,"%m/%d/%Y")
stormdata<-stormdata[year(stormdata$BGN_DATE)>=1996,]

Group Same Events Recorded Differently

The next cleaning the we can see in this data is regarding grouping of event types that are same but are recorded differently.

Some Example includes

ACTUAL EVENT	DIFFERENT SYNONYMS PRESENT IN DATA
WIND	WIND,HIGH WIND, HIGH WIND/BLIZZARD
TORNADO	TORNADO F0, TORNADOS, WATERSPROUT/TORNADO

These event types which are actually same but are document different with little variations should be collapsed as one. The known event types in which they should be collapsed can be found in Page 6 of faq page

This cleaning is important because we may group the events which are exactly same but are recorded differently into one and can avoid redundant events to show up when we are analysing top events.

Grouping of these similar event types can be achieved by replacing all similar variations by their actual Event Name.

Finding similar events are done with the help of regex search.

##First changing the case of all event types to uppercase
stormdata$EVTYPE<-toupper(stormdata$EVTYPE)

##For Tornado Only One Event Type 'TORNADO' should be there
stormdata$EVTYPE[grep(pattern = "TORNADO",x=stormdata$EVTYPE)]<-"TORNADO"

##For Wind there are six unique type of wind events
##High Wind, Marine High Wind, Marine Strong Wind, Marine Thunderstorm Wind, Strong Wind, Thunderstorm Wind
##Grouping Events For Thunderstorm Wind First
stormdata$EVTYPE[grep('(((?=.*TSTM)|(?=.*T(.*)ORM)|(?=.*T(.*)ROM)))^(?!.*MARINE)(?!.*NON)',stormdata$EVTYPE,perl = T)]<-"THUNDERSTORM WIND"
stormdata$EVTYPE[grep('(?=.*WIND)(?=.*HIGH)(?!.*MARINE)',stormdata$EVTYPE,perl = T)]<-"HIGH WIND"
stormdata$EVTYPE[grep('((?=.*TSTM)|(?=.*THUNDERSTORM))(?=.*MARINE)',stormdata$EVTYPE,perl = T)]<-"MARINE THUNDERSTORM WIND"
stormdata$EVTYPE[grep('(?=.*STRONG)(?=.*WIND)(?!.*MARINE)',stormdata$EVTYPE,perl = T)]<-"STRONG WIND"

##Grouping Events For  Hail Type
stormdata$EVTYPE[grep('(?=.*HAIL)^(?!.*MARINE)',stormdata$EVTYPE,perl = T)]<-"HAIL"

##Grouping Events For cold/Wind Chill and Extreme Cold/Wind Chill
stormdata$EVTYPE[grep('(?=.*WIND)(?=.*CHILL)^(?!.*EXTREME)',stormdata$EVTYPE,perl = T)]<-"COLD/WIND CHILL"
stormdata$EVTYPE[grep('(?=.*WIND)(?=.*CHILL)(?=.*EXTREME)',stormdata$EVTYPE,perl = T)]<-"EXTREME COLD/WIND CHILL"

##Grouping Events For Heavy Snow
stormdata$EVTYPE[grep('(?=.*SNOW)(?=.*HEAVY)^(?!.*RAIN)',stormdata$EVTYPE,perl = T)]<-"HEAVY SNOW"

##Grouping Events For Rain
stormdata$EVTYPE[grep('(^(HEAVY RAIN))|((^(?!.*/)*(?=.*RAIN))^(?!.*FLOOD)(?!.*SLEET)(?!.*LIGHT))',stormdata$EVTYPE,perl = T)]<-"HEAVY RAIN"

##Grouping Events For Hurricane
stormdata$EVTYPE[grep('(?=.*HURRI)|(?=.*TYPHOON)',stormdata$EVTYPE,perl = T)]<-"HURRICANE (TYPHOON)"
stormdata$EVTYPE[grep('(?=.*FIRE)',stormdata$EVTYPE,perl = T)]<-"WILDFIRE"

Analysing Effect On Population Health

This section analyse the effect of different Event Types on Population health.

To Analyse the effect of different type of Storm Events on Public Health we can evaluate the fatalities and injuries caused by different events. For this we will focus on only the EVTTYPE, FATALITIES and INJURIES columns of the data.

Extract those.

health_data<-stormdata[,c("EVTYPE","FATALITIES","INJURIES")]
grouped_by_eventtype<-group_by(health_data,EVTYPE)

Analysing Effect On Fatalities

Once we have health data first we evaluate the effect of different event types on fatalities by group this by “EVTYPE” and summarise against the total FATALITIES.

summarised_health_data_fats<-as.data.frame(summarise_each(grouped_by_eventtype,funs(sum),-INJURIES))

Here are top 10 events that have caused maximum fatalities.

sorted_health_data_fats<-arrange(summarised_health_data_fats,desc(FATALITIES))
kable(head(sorted_health_data_fats,10),format = "markdown")

EVTYPE	FATALITIES
EXCESSIVE HEAT	1797
TORNADO	1511
FLASH FLOOD	887
THUNDERSTORM WIND	737
LIGHTNING	651
FLOOD	414
RIP CURRENT	340
HEAT	237
HIGH WIND	235
AVALANCHE	223

Analysing Effect On Injuries

Similarly we evaluate the effect of different event types on Injuries.

summarised_health_data_injs<-as.data.frame(summarise_each(grouped_by_eventtype,funs(sum),-FATALITIES))

Here are top 10 events that have caused maximum injuries.

sorted_health_data_injs<-arrange(summarised_health_data_injs,desc(INJURIES))
kable(head(sorted_health_data_injs,10))

EVTYPE	INJURIES
TORNADO	20667
THUNDERSTORM WIND	7497
FLOOD	6758
EXCESSIVE HEAT	6391
LIGHTNING	4141
FLASH FLOOD	1674
WILDFIRE	1458
HURRICANE (TYPHOON)	1328
HEAT	1222
HIGH WIND	1083

Combing the data to get a composite view

To get a combined view of different Event Types on Public Population we can analyse its effect using features Fatalities and Injuries.

Prior to that we will have to normalize our data because the two columns have data in different ranges. Hence the final plot may not be as clear and descriptive as we want. For Normalization we will use the standard min/max normalization to normalize the values within range of 0 to 1.

##This is a simple function that normalizes all rows of data-frame within value from 0 to 1 except if row name is "EVTYPE"
normalize_col<-function(col_name,df)
{
        if(! col_name %in% "EVTYPE")
        {
                (df[col_name]-min(df[col_name]))/(max(df[col_name])-min(df[col_name]))
        }
        else
        {
                df[col_name]       
        }
}

##Normalizing the data to arrive at better relative values and to get a more clear plot.
listoutput<-lapply(names(sorted_health_data_fats),function(x){normalize_col(x,sorted_health_data_fats)})
norm_fats_health_data<-do.call(cbind,listoutput)

listoutput<-lapply(names(sorted_health_data_injs),function(x){normalize_col(x,sorted_health_data_injs)})
norm_injs_health_data<-do.call(cbind,listoutput)

##Converting Both the data to long format so that they can be Row binded and we can arrive at common columns
top10_fats_health_data_long<-melt(head(norm_fats_health_data,10),id=c("EVTYPE"),variable.name="FEATURE",value.name = "COUNT")

top10_injs_health_data_long<-melt(head(norm_injs_health_data,10),id=c("EVTYPE"),variable.name="FEATURE",value.name = "COUNT")

combined_health_data<-rbind(top10_fats_health_data_long,top10_injs_health_data_long)

Plotting the data for better understanding and viewing

For better understanding lets plot the data using ggplot2 graphic system.

## Creating plot for top 10 event types when combined all parameters
ggplot(combined_health_data,aes(x=EVTYPE,y=COUNT,fill=FEATURE))+
        geom_bar(stat="identity",position = "dodge")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1,colour = "black",size=15,face="bold"))+
    labs(list(title="Top Event Types That Have Caused Maximum Fatalities And Injuries",x="Events",y="Total Count"))

Analysing Effect On Economy

This section analyse the effect of different Event Types on Economy.

Now effect on economy as per the data present in our data base can be evaluated on two parameters Property Damage and Crop Damage.

For this we will focus on PROPDMG,PROPDMGEXP for Property Damage and CROPDMG,CROPDMGEXP for Crop Damage.

Analysing Effect On Property Damage

For analysing effect on Property Damage extract the required data.

prop_dmg_data<-stormdata[,c("EVTYPE","PROPDMG","PROPDMGEXP")]

Here column PROPDMGEXP contains the single letter identifier for the exponentiation as Hundred (H), Thousand (K), Million (M), Billion (B) To calculate total damage we will have to multiply actual “PROPDMG” and its exponent present in “PROPDMGEX” to get column “PROPCASH”

##Creating Multiplier
mulitplier_replace<-c("H"=100,"K"=1000,"M"=10000000,"B"=1000000000)
##Converting Exponent column to character and upper case as for some of the data we have exponents present as lower case.
prop_dmg_data$PROPDMGEXP<-toupper(as.character(prop_dmg_data$PROPDMGEXP))
##Subsetting only those data for which we have multiplier available
prop_dmg_data<-prop_dmg_data[prop_dmg_data$PROPDMGEXP %in% c("K","H","M","B"),]
prop_dmg_data$PROPCASH<-prop_dmg_data$PROPDMG*mulitplier_replace[prop_dmg_data$PROPDMGEXP]

Grouping and Summarising the data

prop_dmg_data_grouped_by<-group_by(prop_dmg_data,EVTYPE)
summarised_prop_dmg_data<-as.data.frame(summarise_each(prop_dmg_data_grouped_by,funs(sum),-PROPDMG,-PROPDMGEXP))

Here are top 10 events that have caused incurred maximum Property Loss.

sorted_summ_prop_dmg_data<-arrange(summarised_prop_dmg_data,desc(PROPCASH))
kable(head(sorted_summ_prop_dmg_data,10),format = "markdown")

EVTYPE	PROPCASH
FLOOD	329710773550
THUNDERSTORM WIND	198262081490
TORNADO	187941925710
HURRICANE (TYPHOON)	176284049010
FLASH FLOOD	132110873910
HAIL	124683773420
WILDFIRE	53689609500
HIGH WIND	37975978360
DROUGHT	10433551000
HEAVY RAIN	5566394440

Analysing Effect On Crop Damage

For analysing effect on Crop Damage extract the required data.

crop_dmg_data<-stormdata[,c("EVTYPE","CROPDMG","CROPDMGEXP")]

Here column CROPDMGEXP contains the single letter identifier for the exponentiation as Hundred (H), Thousand (K), Million (M), Billion (B) To calculate total damage we will have to multiply actual “CROPDMG” and its exponent present in “CROPDMGEX” to get column “CROPCASH”

##Converting Exponent column to character and upper case as for some of the data we have exponents present as lower case.
crop_dmg_data$CROPDMGEXP<-toupper(as.character(crop_dmg_data$CROPDMGEXP))
##Subsetting only those data for which we have multiplier available
crop_dmg_data<-crop_dmg_data[crop_dmg_data$CROPDMGEXP %in% c("K","H","M","B"),]
crop_dmg_data$CROPCASH<-crop_dmg_data$CROPDMG*mulitplier_replace[crop_dmg_data$CROPDMGEXP]

Grouping and Summarising the data

crop_dmg_data_grouped_by<-group_by(crop_dmg_data,EVTYPE)
summarised_crop_dmg_data<-as.data.frame(summarise_each(crop_dmg_data_grouped_by,funs(sum),-CROPDMG,-CROPDMGEXP))

Here are top 10 events that have caused incurred maximum Crop Loss.

sorted_summ_crop_dmg_data<-arrange(summarised_crop_dmg_data,desc(CROPCASH))
kable(head(sorted_summ_crop_dmg_data,10),format = "markdown")

EVTYPE	CROPCASH
DROUGHT	124487146000
FLOOD	48424798400
HURRICANE (TYPHOON)	39851427800
HAIL	20485572450
THUNDERSTORM WIND	15565602600
EXTREME COLD	13050733000
FLASH FLOOD	11909991700
FROST/FREEZE	10887446000
HEAVY RAIN	7304609800
HIGH WIND	6185751300

Combing the data to get a composite view

We are combining the top10 events of both damages type and are showing them together.

This will also show us that there are some events that have caused only one type of damage alone but to a significant extreme.

##Normalizing the data to arrive at better relative values and to get a more clear plot.
listoutput<-lapply(names(sorted_summ_prop_dmg_data),function(x){normalize_col(x,sorted_summ_prop_dmg_data)})
norm_prop_dmg_data<-do.call(cbind,listoutput)

listoutput<-lapply(names(sorted_summ_crop_dmg_data),function(x){normalize_col(x,sorted_summ_crop_dmg_data)})
norm_crop_dmg_data<-do.call(cbind,listoutput)

##Converting Both the data to long format so that they can be Row binded and we can arrive at common columns
top10_prop_dmg_data_long<-melt(head(norm_prop_dmg_data,10),id=c("EVTYPE"),variable.name="DAMAGETYPE",value.name = "AMOUNT")

top10_crop_dmg_data_long<-melt(head(norm_crop_dmg_data,10),id=c("EVTYPE"),variable.name="DAMAGETYPE",value.name = "AMOUNT")

combined_dmg_data<-rbind(top10_crop_dmg_data_long,top10_prop_dmg_data_long)

Plotting the Data For Better Understanding and Viewing

Plotting the combined data to get a better understanding of top 10 Events that are harmful To Property and Crop individually.

ggplot(combined_dmg_data,aes(x=EVTYPE,y=AMOUNT,fill=DAMAGETYPE))+
        geom_bar(stat="identity",position = "dodge")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1,colour = "black",size=15,face="bold"))+
    labs(list(title="Top Events That Have Incurred Maximum Damage to Both Property And Crop",x="Events",y="Total Damage"))

Results

Some Inferences From the 1st Graph

Top Events Causing Maximum Fatalities

Top Events that have caused maximum fatalities in descending order are

Excessive Heat > Tornado > Flash Flood > Thunderstorm Wind > Lightning

Top Events Causing Maximum Injuries

Top Events that have caused maximum injuries in descending order are

Tornado > Thunderstorm Wind > Flood > Excessive Heat > Lightning

Some Other Inferences

Excessive heat is the deadliest event causing maximum fatalities.
Tornado ranks next contributing maximum in Injuries and then second maximum in Fatalities.
Wildfire, Hurricane (Typhoon) ranks lowest in top 10 events and are least concerning.
High Wind and Heat ranks down in the bottom of top 10 events that causes both but least fatalities and injuries.
Flash Flood and Thunderstorm Wind amongst other maximum contributors.

Some Inferences From the 2nd Graph

Top Events Incurring Maximum Property Damage

Top Events that have incurred maximum property damage in descending order are

Flood > Thunderstorm Wind > Tornado > Hurricane (Typhoon) > Flash Flood

Top Events Incurring Maximum Crop Damage

Top Events that have incurred maximum crop damage in descending order are

Drought > Flood > Hurricane (Typhoon) > Hail > Thunderstorm Wind

Some other inferences

Flood deadliest with top contributor in the Property Damage and Second in Crop Damage.
Drought is contributing in maximum crop damage and very little in property damage.
Tornado though causing only property damage is one of the maxium contributor to it.
Least concerning events are High Wind and Heavy Rain.

But for best results it would be better if we combine both the data i.e. Health and Property Damage together and see the top event types that have caused maximum effect across these two sectors.

##Setting the common column names
colnames(combined_health_data)<-c("EVTYPE","TYPE","RELATIVEVALUE")
colnames(combined_dmg_data)<-c("EVTYPE","TYPE","RELATIVEVALUE")

combined_top10_data<-rbind(combined_health_data,combined_dmg_data)

After we have combined the top10 data from all the sources. Let’s plot them together

gplot<-ggplot(combined_top10_data,aes(x=EVTYPE,y=RELATIVEVALUE,fill=TYPE))
gplot<-gplot+geom_bar(stat="identity")
gplot<-gplot+theme(axis.text.x = element_text(angle = 90, hjust = 1,colour = "black",size = 15,face="bold"))
gplot<-gplot+labs(list(title="Top Events That Have Caused Maximum Effect In Both Population Health And Economy"))
gplot

Inferences From The Combined Graph

From the last combined graph we can infer the effect of top events across population health and economy. As inferring from the graph.

Tornado is the deadliest event causing maximum destruction in Fatalities, Injuries and Property damage.
Flood lesser than Tornado but has effect in all the four sectors with more Property Damage and Crop Damage as compared to Fatalities and Injuries.
Avlanche, RIP Current, Heavy Rain, Frost/Freeze and Extreme Cold sits in the bottom of top 10 events across all sectors.
Drought Causes Maximum Crop Damage than any other event.
Excessive Heat Caused Maximum Fatalities than any other event.

Rest of the Top Events and their effect across all sectors can also be clearly understood from the graph.

Storm Data - Analysis On Public Health And Economy

Dhawal Kapil

February 14, 2016