This report contains the analysis of Storm Data and its impact on population health and economy. Storm Database contains detailed information about Storms and other similar events that have occurred in United States and have affected public health and had great consequences on the economy.
This analysis contains code distributed across R chunks for downloading and reading the data, cleaning the data, exlporing the data by some basic heuristics and by creating some plots we will finally be inferring about which type of events are most harmful with respect to population health and economy.
The analysis was done by analysing first the population health and then economic data.
In each analysis data was grouped by Event Types (EVTYPE) column and aggregated against different type of parameters available to measure a particular effect. Like Fatalities and Injuries for population health and Crop and Property damage for economy.
Data in different column were also normalized to fit all the data in a range of 0-1 that gives us better understanding both in reading and viewing it as a plot.
Note:
The Storm Data has been downloaded from this location whereas documentation and FAQs were used for support.
This section covers the entire analysis part. This has sub-sections that covers data processing, cleaning, analysis and plotting part.
require(dplyr)
require(knitr)
require(reshape2)
require(ggplot2)
require(grid)
require(gridExtra)
require(lubridate)
## Downloading and Extracting file. This checks if file is already present in the directory.
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2";
fileName<-"repdata-data-StormData.csv.bz2";
## Checking If Storm Data is present in the current workspace
if(!(file.exists(fileName))) {
print("Required Storm Data Not Present..Downloading..")
download.file(url = fileUrl,destfile = fileName)
} else {
print("Storm Data Present")
}
## [1] "Storm Data Present"
stormdata<-read.csv(fileName)
There are two things that needs to be fixed in this data.
This is needed because prior to 1996 only data for Tornadoes (references here), Thunderstorm Winds and Hail were recorded. If we will not filter out these years this will give undue bias to these events as compared to other events.
stormdata$BGN_DATE<-as.Date(stormdata$BGN_DATE,"%m/%d/%Y")
stormdata<-stormdata[year(stormdata$BGN_DATE)>=1996,]
The next cleaning the we can see in this data is regarding grouping of event types that are same but are recorded differently.
Some Example includes
| ACTUAL EVENT | DIFFERENT SYNONYMS PRESENT IN DATA |
|---|---|
| WIND | WIND,HIGH WIND, HIGH WIND/BLIZZARD |
| TORNADO | TORNADO F0, TORNADOS, WATERSPROUT/TORNADO |
These event types which are actually same but are document different with little variations should be collapsed as one. The known event types in which they should be collapsed can be found in Page 6 of faq page
This cleaning is important because we may group the events which are exactly same but are recorded differently into one and can avoid redundant events to show up when we are analysing top events.
Grouping of these similar event types can be achieved by replacing all similar variations by their actual Event Name.
Finding similar events are done with the help of regex search.
##First changing the case of all event types to uppercase
stormdata$EVTYPE<-toupper(stormdata$EVTYPE)
##For Tornado Only One Event Type 'TORNADO' should be there
stormdata$EVTYPE[grep(pattern = "TORNADO",x=stormdata$EVTYPE)]<-"TORNADO"
##For Wind there are six unique type of wind events
##High Wind, Marine High Wind, Marine Strong Wind, Marine Thunderstorm Wind, Strong Wind, Thunderstorm Wind
##Grouping Events For Thunderstorm Wind First
stormdata$EVTYPE[grep('(((?=.*TSTM)|(?=.*T(.*)ORM)|(?=.*T(.*)ROM)))^(?!.*MARINE)(?!.*NON)',stormdata$EVTYPE,perl = T)]<-"THUNDERSTORM WIND"
stormdata$EVTYPE[grep('(?=.*WIND)(?=.*HIGH)(?!.*MARINE)',stormdata$EVTYPE,perl = T)]<-"HIGH WIND"
stormdata$EVTYPE[grep('((?=.*TSTM)|(?=.*THUNDERSTORM))(?=.*MARINE)',stormdata$EVTYPE,perl = T)]<-"MARINE THUNDERSTORM WIND"
stormdata$EVTYPE[grep('(?=.*STRONG)(?=.*WIND)(?!.*MARINE)',stormdata$EVTYPE,perl = T)]<-"STRONG WIND"
##Grouping Events For Hail Type
stormdata$EVTYPE[grep('(?=.*HAIL)^(?!.*MARINE)',stormdata$EVTYPE,perl = T)]<-"HAIL"
##Grouping Events For cold/Wind Chill and Extreme Cold/Wind Chill
stormdata$EVTYPE[grep('(?=.*WIND)(?=.*CHILL)^(?!.*EXTREME)',stormdata$EVTYPE,perl = T)]<-"COLD/WIND CHILL"
stormdata$EVTYPE[grep('(?=.*WIND)(?=.*CHILL)(?=.*EXTREME)',stormdata$EVTYPE,perl = T)]<-"EXTREME COLD/WIND CHILL"
##Grouping Events For Heavy Snow
stormdata$EVTYPE[grep('(?=.*SNOW)(?=.*HEAVY)^(?!.*RAIN)',stormdata$EVTYPE,perl = T)]<-"HEAVY SNOW"
##Grouping Events For Rain
stormdata$EVTYPE[grep('(^(HEAVY RAIN))|((^(?!.*/)*(?=.*RAIN))^(?!.*FLOOD)(?!.*SLEET)(?!.*LIGHT))',stormdata$EVTYPE,perl = T)]<-"HEAVY RAIN"
##Grouping Events For Hurricane
stormdata$EVTYPE[grep('(?=.*HURRI)|(?=.*TYPHOON)',stormdata$EVTYPE,perl = T)]<-"HURRICANE (TYPHOON)"
stormdata$EVTYPE[grep('(?=.*FIRE)',stormdata$EVTYPE,perl = T)]<-"WILDFIRE"
This section analyse the effect of different Event Types on Population health.
To Analyse the effect of different type of Storm Events on Public Health we can evaluate the fatalities and injuries caused by different events. For this we will focus on only the EVTTYPE, FATALITIES and INJURIES columns of the data.
Extract those.
health_data<-stormdata[,c("EVTYPE","FATALITIES","INJURIES")]
grouped_by_eventtype<-group_by(health_data,EVTYPE)
Once we have health data first we evaluate the effect of different event types on fatalities by group this by “EVTYPE” and summarise against the total FATALITIES.
summarised_health_data_fats<-as.data.frame(summarise_each(grouped_by_eventtype,funs(sum),-INJURIES))
Here are top 10 events that have caused maximum fatalities.
sorted_health_data_fats<-arrange(summarised_health_data_fats,desc(FATALITIES))
kable(head(sorted_health_data_fats,10),format = "markdown")
| EVTYPE | FATALITIES |
|---|---|
| EXCESSIVE HEAT | 1797 |
| TORNADO | 1511 |
| FLASH FLOOD | 887 |
| THUNDERSTORM WIND | 737 |
| LIGHTNING | 651 |
| FLOOD | 414 |
| RIP CURRENT | 340 |
| HEAT | 237 |
| HIGH WIND | 235 |
| AVALANCHE | 223 |
Similarly we evaluate the effect of different event types on Injuries.
summarised_health_data_injs<-as.data.frame(summarise_each(grouped_by_eventtype,funs(sum),-FATALITIES))
Here are top 10 events that have caused maximum injuries.
sorted_health_data_injs<-arrange(summarised_health_data_injs,desc(INJURIES))
kable(head(sorted_health_data_injs,10))
| EVTYPE | INJURIES |
|---|---|
| TORNADO | 20667 |
| THUNDERSTORM WIND | 7497 |
| FLOOD | 6758 |
| EXCESSIVE HEAT | 6391 |
| LIGHTNING | 4141 |
| FLASH FLOOD | 1674 |
| WILDFIRE | 1458 |
| HURRICANE (TYPHOON) | 1328 |
| HEAT | 1222 |
| HIGH WIND | 1083 |
To get a combined view of different Event Types on Public Population we can analyse its effect using features Fatalities and Injuries.
Prior to that we will have to normalize our data because the two columns have data in different ranges. Hence the final plot may not be as clear and descriptive as we want. For Normalization we will use the standard min/max normalization to normalize the values within range of 0 to 1.
##This is a simple function that normalizes all rows of data-frame within value from 0 to 1 except if row name is "EVTYPE"
normalize_col<-function(col_name,df)
{
if(! col_name %in% "EVTYPE")
{
(df[col_name]-min(df[col_name]))/(max(df[col_name])-min(df[col_name]))
}
else
{
df[col_name]
}
}
##Normalizing the data to arrive at better relative values and to get a more clear plot.
listoutput<-lapply(names(sorted_health_data_fats),function(x){normalize_col(x,sorted_health_data_fats)})
norm_fats_health_data<-do.call(cbind,listoutput)
listoutput<-lapply(names(sorted_health_data_injs),function(x){normalize_col(x,sorted_health_data_injs)})
norm_injs_health_data<-do.call(cbind,listoutput)
##Converting Both the data to long format so that they can be Row binded and we can arrive at common columns
top10_fats_health_data_long<-melt(head(norm_fats_health_data,10),id=c("EVTYPE"),variable.name="FEATURE",value.name = "COUNT")
top10_injs_health_data_long<-melt(head(norm_injs_health_data,10),id=c("EVTYPE"),variable.name="FEATURE",value.name = "COUNT")
combined_health_data<-rbind(top10_fats_health_data_long,top10_injs_health_data_long)
For better understanding lets plot the data using ggplot2 graphic system.
## Creating plot for top 10 event types when combined all parameters
ggplot(combined_health_data,aes(x=EVTYPE,y=COUNT,fill=FEATURE))+
geom_bar(stat="identity",position = "dodge")+
theme(axis.text.x = element_text(angle = 90, hjust = 1,colour = "black",size=15,face="bold"))+
labs(list(title="Top Event Types That Have Caused Maximum Fatalities And Injuries",x="Events",y="Total Count"))
This section analyse the effect of different Event Types on Economy.
Now effect on economy as per the data present in our data base can be evaluated on two parameters Property Damage and Crop Damage.
For this we will focus on PROPDMG,PROPDMGEXP for Property Damage and CROPDMG,CROPDMGEXP for Crop Damage.
For analysing effect on Property Damage extract the required data.
prop_dmg_data<-stormdata[,c("EVTYPE","PROPDMG","PROPDMGEXP")]
Here column PROPDMGEXP contains the single letter identifier for the exponentiation as Hundred (H), Thousand (K), Million (M), Billion (B) To calculate total damage we will have to multiply actual “PROPDMG” and its exponent present in “PROPDMGEX” to get column “PROPCASH”
##Creating Multiplier
mulitplier_replace<-c("H"=100,"K"=1000,"M"=10000000,"B"=1000000000)
##Converting Exponent column to character and upper case as for some of the data we have exponents present as lower case.
prop_dmg_data$PROPDMGEXP<-toupper(as.character(prop_dmg_data$PROPDMGEXP))
##Subsetting only those data for which we have multiplier available
prop_dmg_data<-prop_dmg_data[prop_dmg_data$PROPDMGEXP %in% c("K","H","M","B"),]
prop_dmg_data$PROPCASH<-prop_dmg_data$PROPDMG*mulitplier_replace[prop_dmg_data$PROPDMGEXP]
Grouping and Summarising the data
prop_dmg_data_grouped_by<-group_by(prop_dmg_data,EVTYPE)
summarised_prop_dmg_data<-as.data.frame(summarise_each(prop_dmg_data_grouped_by,funs(sum),-PROPDMG,-PROPDMGEXP))
Here are top 10 events that have caused incurred maximum Property Loss.
sorted_summ_prop_dmg_data<-arrange(summarised_prop_dmg_data,desc(PROPCASH))
kable(head(sorted_summ_prop_dmg_data,10),format = "markdown")
| EVTYPE | PROPCASH |
|---|---|
| FLOOD | 329710773550 |
| THUNDERSTORM WIND | 198262081490 |
| TORNADO | 187941925710 |
| HURRICANE (TYPHOON) | 176284049010 |
| FLASH FLOOD | 132110873910 |
| HAIL | 124683773420 |
| WILDFIRE | 53689609500 |
| HIGH WIND | 37975978360 |
| DROUGHT | 10433551000 |
| HEAVY RAIN | 5566394440 |
For analysing effect on Crop Damage extract the required data.
crop_dmg_data<-stormdata[,c("EVTYPE","CROPDMG","CROPDMGEXP")]
Here column CROPDMGEXP contains the single letter identifier for the exponentiation as Hundred (H), Thousand (K), Million (M), Billion (B) To calculate total damage we will have to multiply actual “CROPDMG” and its exponent present in “CROPDMGEX” to get column “CROPCASH”
##Converting Exponent column to character and upper case as for some of the data we have exponents present as lower case.
crop_dmg_data$CROPDMGEXP<-toupper(as.character(crop_dmg_data$CROPDMGEXP))
##Subsetting only those data for which we have multiplier available
crop_dmg_data<-crop_dmg_data[crop_dmg_data$CROPDMGEXP %in% c("K","H","M","B"),]
crop_dmg_data$CROPCASH<-crop_dmg_data$CROPDMG*mulitplier_replace[crop_dmg_data$CROPDMGEXP]
Grouping and Summarising the data
crop_dmg_data_grouped_by<-group_by(crop_dmg_data,EVTYPE)
summarised_crop_dmg_data<-as.data.frame(summarise_each(crop_dmg_data_grouped_by,funs(sum),-CROPDMG,-CROPDMGEXP))
Here are top 10 events that have caused incurred maximum Crop Loss.
sorted_summ_crop_dmg_data<-arrange(summarised_crop_dmg_data,desc(CROPCASH))
kable(head(sorted_summ_crop_dmg_data,10),format = "markdown")
| EVTYPE | CROPCASH |
|---|---|
| DROUGHT | 124487146000 |
| FLOOD | 48424798400 |
| HURRICANE (TYPHOON) | 39851427800 |
| HAIL | 20485572450 |
| THUNDERSTORM WIND | 15565602600 |
| EXTREME COLD | 13050733000 |
| FLASH FLOOD | 11909991700 |
| FROST/FREEZE | 10887446000 |
| HEAVY RAIN | 7304609800 |
| HIGH WIND | 6185751300 |
Combing the data to get a composite view
We are combining the top10 events of both damages type and are showing them together.
This will also show us that there are some events that have caused only one type of damage alone but to a significant extreme.
##Normalizing the data to arrive at better relative values and to get a more clear plot.
listoutput<-lapply(names(sorted_summ_prop_dmg_data),function(x){normalize_col(x,sorted_summ_prop_dmg_data)})
norm_prop_dmg_data<-do.call(cbind,listoutput)
listoutput<-lapply(names(sorted_summ_crop_dmg_data),function(x){normalize_col(x,sorted_summ_crop_dmg_data)})
norm_crop_dmg_data<-do.call(cbind,listoutput)
##Converting Both the data to long format so that they can be Row binded and we can arrive at common columns
top10_prop_dmg_data_long<-melt(head(norm_prop_dmg_data,10),id=c("EVTYPE"),variable.name="DAMAGETYPE",value.name = "AMOUNT")
top10_crop_dmg_data_long<-melt(head(norm_crop_dmg_data,10),id=c("EVTYPE"),variable.name="DAMAGETYPE",value.name = "AMOUNT")
combined_dmg_data<-rbind(top10_crop_dmg_data_long,top10_prop_dmg_data_long)
Plotting the combined data to get a better understanding of top 10 Events that are harmful To Property and Crop individually.
ggplot(combined_dmg_data,aes(x=EVTYPE,y=AMOUNT,fill=DAMAGETYPE))+
geom_bar(stat="identity",position = "dodge")+
theme(axis.text.x = element_text(angle = 90, hjust = 1,colour = "black",size=15,face="bold"))+
labs(list(title="Top Events That Have Incurred Maximum Damage to Both Property And Crop",x="Events",y="Total Damage"))
Top Events that have caused maximum fatalities in descending order are
Excessive Heat > Tornado > Flash Flood > Thunderstorm Wind > Lightning
Top Events that have caused maximum injuries in descending order are
Tornado > Thunderstorm Wind > Flood > Excessive Heat > Lightning
Top Events that have incurred maximum property damage in descending order are
Flood > Thunderstorm Wind > Tornado > Hurricane (Typhoon) > Flash Flood
Top Events that have incurred maximum crop damage in descending order are
Drought > Flood > Hurricane (Typhoon) > Hail > Thunderstorm Wind
But for best results it would be better if we combine both the data i.e. Health and Property Damage together and see the top event types that have caused maximum effect across these two sectors.
##Setting the common column names
colnames(combined_health_data)<-c("EVTYPE","TYPE","RELATIVEVALUE")
colnames(combined_dmg_data)<-c("EVTYPE","TYPE","RELATIVEVALUE")
combined_top10_data<-rbind(combined_health_data,combined_dmg_data)
After we have combined the top10 data from all the sources. Let’s plot them together
gplot<-ggplot(combined_top10_data,aes(x=EVTYPE,y=RELATIVEVALUE,fill=TYPE))
gplot<-gplot+geom_bar(stat="identity")
gplot<-gplot+theme(axis.text.x = element_text(angle = 90, hjust = 1,colour = "black",size = 15,face="bold"))
gplot<-gplot+labs(list(title="Top Events That Have Caused Maximum Effect In Both Population Health And Economy"))
gplot
From the last combined graph we can infer the effect of top events across population health and economy. As inferring from the graph.
Rest of the Top Events and their effect across all sectors can also be clearly understood from the graph.