This data analysis tries to answer the following questions with the help of the U.S. NOAA storm database.
Across the United States, which types of weather events are most harmful with respect to population health?
Across the United States, which types of weather events have the greatest economic consequences?
Data is provided by U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data set used in this analysis was downloaded from here. Documentation for this data set is available here.
In this section, we download the data set, which is a comma-separated value (CSV) text file compressed with the bzip2 (bz2) algorithm. R's read.csv function can read the compressed file directly.
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2";
download.file(url,"storm_data.csv.bz2");
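## stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0); the levels() calls below rely on it
raw_data<-read.csv("storm_data.csv.bz2",stringsAsFactors=TRUE);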
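As a quick check that read.csv handles bz2-compressed files without a separate decompression step, here is a minimal sketch. The toy data frame and the temporary file are made up for illustration and are not part of the storm data.

```r
# Toy data frame (hypothetical values, not taken from the storm database)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write it as a bz2-compressed CSV to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv detects the compression and decompresses while reading
toy_back <- read.csv(tmp)
print(toy_back)
```

The same mechanism is what lets the full storm file be read in one call above.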
Let’s take a glance at the loaded data set named raw_data.
print(dim(raw_data));
## [1] 902297 37
print(names(raw_data));
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
There are 37 columns in the data set. We do not need all of them to answer our questions; each question calls for a different set of variables.
To answer the first question, we consider “FATALITIES” and “INJURIES” to weigh the harmful effect of each “EVTYPE” (tornado, flood, etc.) on population health.
## Create new data frame filtering unwanted variables
first<-data.frame(raw_data$EVTYPE,raw_data$FATALITIES,raw_data$INJURIES);
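## Name the event column "Event" so that first$Event below matches exactly instead of relying on partial matching
names(first)<-c("Event","Fatalities","Injuries");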
## Aggregating total fatalities by Event type
fatalities<-aggregate(first$Fatalities,by=list(first$Event),FUN=sum);
names(fatalities)<-c("Event","Total_Fatalities");
## Order the fatalities in descending order
fatalities<-fatalities[order(fatalities$Total_Fatalities,decreasing=T),];
## Aggregating total injuries by Event type
injuries<-aggregate(first$Injuries,by=list(first$Event),FUN=sum);
names(injuries)<-c("Event","Total_Injuries");
## Order the injuries in descending order
injuries<-injuries[order(injuries$Total_Injuries,decreasing=T),];
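The two aggregate calls above can also be written as one call with the formula interface. The sketch below shows this on a made-up toy data frame; the values are illustrative only, and this is an alternative form rather than the code used for the results in this report.

```r
# Toy data (hypothetical values) with the same column names as 'first'
toy <- data.frame(
  Event      = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  Fatalities = c(2, 1, 5, 3),
  Injuries   = c(10, 0, 20, 4)
)

# Sum both measures per event in a single call
health <- aggregate(cbind(Fatalities, Injuries) ~ Event, data = toy, FUN = sum)

# Sort by total fatalities, largest first
health <- health[order(health$Fatalities, decreasing = TRUE), ]
print(health)
```

Keeping both totals in one data frame avoids a later merge, at the cost of sorting by only one measure at a time.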
To answer the second question, we consider “EVTYPE”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”, because together they represent the economic impact.
second<-data.frame(raw_data$EVTYPE,raw_data$PROPDMG,raw_data$PROPDMGEXP,raw_data$CROPDMG,raw_data$CROPDMGEXP);
names(second)<-c("Event","PropertyDmg","PropertyDmgExp","CropDmg","CropDmgExp");
“PropertyDmgExp” and “CropDmgExp” represent the exponents (powers of ten) of “PropertyDmg” and “CropDmg” respectively. Let’s look at these exponent values.
print(levels(second$PropertyDmgExp));
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
print(levels(second$CropDmgExp));
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
According to the documentation provided by NOAA, each non-numeric level should be interpreted as follows:
| Symbol | Interpretation |
|---|---|
| “” “?” “-” “+” | 0 |
| “h” “H” | 2 |
| “k” “K” | 3 |
| “m” “M” | 6 |
| “b” “B” | 9 |
Let’s convert these non-numerical levels. First update the ‘PropertyDmgExp’ column.
## Assign zero for miscellaneous values
levels(second$PropertyDmgExp)[levels(second$PropertyDmgExp)=="" | levels(second$PropertyDmgExp)=="-" | levels(second$PropertyDmgExp)=="?" | levels(second$PropertyDmgExp)=="+"]<-"0";
## Assign 9 for Billion
levels(second$PropertyDmgExp)[levels(second$PropertyDmgExp)=="B"]<-"9";
## Assign 6 for Million
levels(second$PropertyDmgExp)[levels(second$PropertyDmgExp)=="m" | levels(second$PropertyDmgExp)=="M"]<-"6";
## Assign 3 for thousands
levels(second$PropertyDmgExp)[levels(second$PropertyDmgExp)=="K"]<-"3";
## Assign 2 for hundreds
levels(second$PropertyDmgExp)[levels(second$PropertyDmgExp)=="h" | levels(second$PropertyDmgExp)=="H"]<-"2";
Now, update ‘CropDmgExp’ column.
## Assign zero for miscellaneous values
levels(second$CropDmgExp)[levels(second$CropDmgExp)=="?" | levels(second$CropDmgExp)==""]<-"0";
## Assign 3 for thousands
levels(second$CropDmgExp)[levels(second$CropDmgExp)=="k" | levels(second$CropDmgExp)=="K"]<-"3";
## Assign 6 for Million
levels(second$CropDmgExp)[levels(second$CropDmgExp)=="m" | levels(second$CropDmgExp)=="M"]<-"6";
## Assign 9 for Billion
levels(second$CropDmgExp)[levels(second$CropDmgExp)=="B"]<-"9";
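The level-by-level recoding above could also be expressed as a single lookup. The sketch below is one possible way to do this, not the code used for the results in this report: `exp_lookup` and `to_exponent` are hypothetical names introduced here, and the mapping simply restates the interpretation table.

```r
# Map each documented symbol (lower-cased) to its power of ten
exp_lookup <- c("h" = 2, "k" = 3, "m" = 6, "b" = 9)

# Hypothetical helper: convert a vector of exponent codes to numeric exponents
to_exponent <- function(codes) {
  codes <- tolower(as.character(codes))
  out <- unname(exp_lookup[codes])               # NA for anything not in the lookup
  is_digit <- grepl("^[0-9]$", codes)
  out[is_digit] <- as.numeric(codes[is_digit])   # single digits are kept as they are
  out[is.na(out)] <- 0                           # "", "?", "-", "+" become 0
  out
}

print(to_exponent(c("K", "m", "B", "", "?", "5", "h")))
```

Because the helper lower-cases its input, the same function would serve both the property and the crop exponent columns.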
Now, to get the actual ‘Property’ and ‘Crop’ damage, we multiply ‘PropertyDmg’ and ‘CropDmg’ by 10 raised to the power of ‘PropertyDmgExp’ and ‘CropDmgExp’ respectively.
## Handling Property data
prop<-data.frame(second$Event,second$PropertyDmg,second$PropertyDmgExp);
names(prop)<-c("Event","PropertyDmg","PropertyDmgExp");
prop$total<-prop$PropertyDmg*10^as.numeric(as.character(prop$PropertyDmgExp));
prop<-prop[order(prop$total,decreasing=T),];
## Handling Crop data
crop<-data.frame(second$Event,second$CropDmg,second$CropDmgExp);
names(crop)<-c("Event","CropDmg","CropDmgExp");
crop$total<-crop$CropDmg*10^as.numeric(as.character(crop$CropDmgExp));
crop<-crop[order(crop$total,decreasing=T),];
## Aggregate crop and property data
crop_agg<-aggregate(crop$total,by=list(crop$Event),FUN=sum);
prop_agg<-aggregate(prop$total,by=list(prop$Event),FUN=sum);
names(crop_agg)<-c("Event","Total_Crop");
names(prop_agg)<-c("Event","Total_Prop");
## Re-order in decreasing order of total damage
crop_agg<-crop_agg[order(crop_agg$Total_Crop,decreasing = T),];
prop_agg<-prop_agg[order(prop_agg$Total_Prop,decreasing = T),];
## Merge crop and property data by "Event"
comm<-merge(prop_agg,crop_agg,by="Event");
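To make the damage arithmetic concrete, here is a small worked sketch on made-up rows. The values are illustrative only and assume the exponent column has already been recoded to numbers.

```r
# Toy rows (hypothetical values): damage mantissa and its power-of-ten exponent
toy <- data.frame(
  Event          = c("FLOOD", "HAIL", "FLOOD"),
  PropertyDmg    = c(25, 1.5, 2),
  PropertyDmgExp = c(3, 6, 9)
)

# Actual damage = mantissa * 10^exponent, e.g. 25 with exponent 3 is 25,000
toy$total <- toy$PropertyDmg * 10^toy$PropertyDmgExp

# Sum the damage per event
totals <- aggregate(total ~ Event, data = toy, FUN = sum)
print(totals)
```

In this toy case the single billion-scale FLOOD row dominates the FLOOD total, so a few very large records can drive the ranking.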
The data set is now transformed as required to address our questions.
Let’s look at the FATALITIES and INJURIES individually.
head(cbind(fatalities,injuries));
## Event Total_Fatalities Event Total_Injuries
## 834 TORNADO 5633 TORNADO 91346
## 130 EXCESSIVE HEAT 1903 TSTM WIND 6957
## 153 FLASH FLOOD 978 FLOOD 6789
## 275 HEAT 937 EXCESSIVE HEAT 6525
## 464 LIGHTNING 816 LIGHTNING 5230
## 856 TSTM WIND 504 HEAT 2100
The table above shows that TORNADO is the top event, with the highest number of both fatalities and injuries. However, fatalities and injuries together determine the impact on population health, so let’s merge and re-order them.
## Load 'ggplot' for plotting
library(ggplot2);
## Merge fatalities and injuries
first<-merge(injuries,fatalities);
## Add up total fatalities and injuries in to 'Total_Health_Damage'
first$Total_Health_Damage<-first$Total_Fatalities+first$Total_Injuries;
## Order the data set in the descending order of 'Total_Health_Damage'
first<-first[order(first$Total_Health_Damage,decreasing=T),];
## Plot top 5 harmful events and their impact
first_plot<-ggplot(head(first,5),aes(x=reorder(Event,Total_Health_Damage),y=Total_Health_Damage,fill=Total_Health_Damage))+geom_bar(stat="identity")+labs(title="Total Health Damage by Event",y="Total number of fatalities and injuries",x="Weather Events")+coord_flip();
print(first_plot);
Not surprisingly, the tornado is the weather event with the largest impact on population health, resulting in 5,633 fatalities and 91,346 injuries.
Now let’s individually look at the crop and property damage.
## Individually look at the crop and property damages
head(cbind(crop_agg,prop_agg));
## Event Total_Crop Event Total_Prop
## 95 DROUGHT 13972566000 FLOOD 144657709807
## 170 FLOOD 5661968450 HURRICANE/TYPHOON 69305840000
## 590 RIVER FLOOD 5029459000 TORNADO 56947380677
## 427 ICE STORM 5022113500 STORM SURGE 43323536000
## 244 HAIL 3025954473 FLASH FLOOD 16822673979
## 402 HURRICANE 2741910000 HAIL 15735267513
From the table above, drought is the leading event in terms of crop damage, and flood is the leading event in terms of property damage. However, crop and property damage both contribute to economic loss.
## Add up crop damage and property damage
comm$Grand_Total<-comm$Total_Prop+comm$Total_Crop;
comm<-comm[order(comm$Grand_Total,decreasing = T),];
library(ggplot2);
second_plot<-ggplot(head(comm,5),aes(x=reorder(Event,Grand_Total),y=Grand_Total,fill=Grand_Total))+geom_bar(stat="identity")+labs(title="Total Economic Damage by Event",x="Weather Events",y="Total crop and property damage")+coord_flip();
print(second_plot);
Flood appears to be the weather event that caused the greatest economic loss. However, if we consider crop loss alone, the event responsible for the greatest loss is drought.