Data was read in from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which event types are the most harmful to population health and the economy. For health effects, injuries and fatalities were summed by event type, which showed that tornadoes cause the most harm. For economic effects, property and crop damage were added together, which showed that floods have caused the most damage.
Checks whether the required packages used below are installed and, if not, installs them.
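# install.packages() takes the package names as a single character vector;
# a second positional argument would be treated as the library path.
if(!all(c("ggplot2","plyr") %in% rownames(installed.packages())))
install.packages(c("ggplot2","plyr"))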
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.2
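A variation on the check above is to find only the packages that are actually missing and install just those. The sketch below is one way to do that; the helper name `missing_packages` is hypothetical and not part of the original analysis, and the install call is left commented out so the snippet has no side effects.

```r
# Hypothetical helper: return the subset of `pkgs` whose namespace cannot be loaded.
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

to_install <- missing_packages(c("ggplot2", "plyr"))
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```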
Downloads the file if it is not already present in the working directory, then loads the data.
if(!(file.exists(paste(getwd(),"/repdata-data-StormData.csv.bz2",sep=""))))
{
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
paste(getwd(),"/repdata-data-StormData.csv.bz2",sep=""))
}
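# stringsAsFactors = TRUE keeps the factor columns that the levels() calls below
# rely on (the default changed to FALSE in R 4.0.0).
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors = TRUE)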
The data is then subset to the health-related columns, and the injuries and fatalities are added together:
healthdata<-data[,c("EVTYPE","FATALITIES","INJURIES")]
healthdata <- ddply(healthdata, .(EVTYPE),mutate,fatinj = FATALITIES + INJURIES)
healthdata <- ddply(healthdata,.(EVTYPE),summarize,sum.fatinj=sum(fatinj))
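The two `ddply` calls above first add fatalities and injuries row by row and then total the result per event type. As a rough illustration of the same idea without `plyr`, the sketch below uses base R's `aggregate()` on a small made-up data frame; the name `toy` and its values are assumptions for illustration only, not storm data.

```r
# Toy data standing in for the storm data; the values are made up.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 0, 3),
  INJURIES   = c(10, 0, 5, 1)
)

# Add fatalities and injuries per row, then total them by event type.
toy$fatinj <- toy$FATALITIES + toy$INJURIES
toy_health <- aggregate(fatinj ~ EVTYPE, data = toy, FUN = sum)
toy_health
```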
The economic data required more work because each damage amount is stored alongside an exponent code that has to be multiplied in to obtain the total damage. A look at the exponent variable for both crops and property shows some blank and inconsistent labels:
summary(data$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994 
summary(data$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330 
According to the Storm Data Documentation, “Alphabetical characters used to signify magnitude include ‘K’ for thousands, ‘M’ for millions, and ‘B’ for billions.” I replaced each label with its numeric multiplier so that it can be multiplied by the damage amount later.
ecodata <- data[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)=="k"] <-"1000"
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)=="K"] <-"1000"
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)=="m"] <-"1000000"
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)=="M"] <-"1000000"
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)=="B"] <-"1000000000"
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)=="2"] <-"100"
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)=="?"] <-"1"
Very few observations with a blank exponent have non-zero damage, so I set the blank exponent to a multiplier of 1.
data[(data$CROPDMGEXP==""&data$CROPDMG > 0),c("EVTYPE","CROPDMGEXP","CROPDMG")]
## EVTYPE CROPDMGEXP CROPDMG
## 221857 HAIL 3
## 238757 THUNDERSTORM WINDS 4
## 240397 THUNDERSTORM WINDS 4
levels(ecodata$CROPDMGEXP)[levels(ecodata$CROPDMGEXP)==""] <-"1"
The property labels are more varied. Looking at the blank exponents shows that relatively few observations with a blank label have non-zero damage.
count(data[(data$PROPDMGEXP==""&data$PROPDMG > 0),c("EVTYPE","PROPDMGEXP")])
## EVTYPE PROPDMGEXP freq
## 1 FLASH FLOOD 20
## 2 FLASH FLOOD WINDS 1
## 3 FLASH FLOOD/FLOOD 1
## 4 FLASH FLOODING 1
## 5 FLOOD 2
## 6 HAIL 7
## 7 HEAVY SNOW SQUALLS 1
## 8 HIGH WINDS 1
## 9 LIGHTNING 1
## 10 THUNDERSTORM WIND 1
## 11 THUNDERSTORM WINDS 35
## 12 THUNDERSTORM WINDS HAIL 4
## 13 TORNADO 1
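For the numeric labels I assumed the digit was the exponent and used the corresponding power of 10. For all other non-standard labels I used a multiplier of 1. The replacement values are assigned positionally, in the order of the levels shown in the summary above.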
levels(ecodata$PROPDMGEXP) <- c('1','1','1','1','1','10','100','1000','10000','100000','1000000','10000000','100000000','1000000000','1','1','1000','1000000','1000000')
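Because the assignment above depends on the order of the factor levels, a lookup keyed by label can be easier to audit. The sketch below is one possible way to express the same property mapping (letters to thousands, millions and billions; digits 1 to 8 to powers of 10; everything else to 1). The function name `exp_multiplier` is hypothetical and is not used in the original analysis.

```r
# Hypothetical helper: map exponent labels to numeric multipliers.
# Labels that are not K/M/B (any case) or a digit from 1 to 8 fall back to 1.
exp_multiplier <- function(labels) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  labels <- toupper(as.character(labels))

  out <- rep(1, length(labels))

  # Digits are treated as powers of 10, as assumed in the text.
  is_digit <- grepl("^[1-8]$", labels)
  out[is_digit] <- 10 ^ as.numeric(labels[is_digit])

  # Letters use the magnitudes given in the Storm Data Documentation.
  is_letter <- labels %in% names(lookup)
  out[is_letter] <- unname(lookup[labels[is_letter]])

  out
}

exp_multiplier(c("K", "m", "B", "3", "", "?", "0", "h"))
```

Applied to a column, this would look like `ecodata$PROPDMG * exp_multiplier(ecodata$PROPDMGEXP)`, with no dependence on how the levels happen to be sorted.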
Finally, the factors were converted to numbers and multiplied by the damage amounts to give the total damage for each event type.
ecodata$PROPDMGEXP <- as.numeric(as.character(ecodata$PROPDMGEXP))
ecodata$CROPDMGEXP <- as.numeric(as.character(ecodata$CROPDMGEXP))
ecodata <- ddply(ecodata, .(EVTYPE),mutate,Tdmg = (CROPDMGEXP * CROPDMG)+(PROPDMGEXP*PROPDMG))
ecodata <- ddply(ecodata,.(EVTYPE),summarize,Tdmg=sum(Tdmg))
The cleaned-up data now holds the total impact by event type over the entire dataset, but it is unsorted. Ordering the results shows:
healthdataresult <- healthdata[order(healthdata$sum.fatinj,decreasing = TRUE),]
knitr::kable(healthdataresult[1:10,])
| | EVTYPE | sum.fatinj |
|---|---|---|
| 834 | TORNADO | 96979 |
| 130 | EXCESSIVE HEAT | 8428 |
| 856 | TSTM WIND | 7461 |
| 170 | FLOOD | 7259 |
| 464 | LIGHTNING | 6046 |
| 275 | HEAT | 3037 |
| 153 | FLASH FLOOD | 2755 |
| 427 | ICE STORM | 2064 |
| 760 | THUNDERSTORM WIND | 1621 |
| 972 | WINTER STORM | 1527 |
ggplot(healthdataresult[1:5,],aes(x=factor(EVTYPE),y=sum.fatinj))+
geom_bar(stat="identity")+xlab("Event Type")+ylab("Sum of Injuries and Fatalities")+
ggtitle("Top 5 Injuries Plus Fatalities by Storm Event Type")
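With `factor(EVTYPE)` on the x axis, the bars are drawn in alphabetical order of the event type. If bars sorted by height are preferred, one option is base R's `reorder()`, sketched below using the top-five totals from the table above. The data frame name `top5` is an assumption for illustration, and the plot is only built when `ggplot2` is available.

```r
# Top-five totals copied from the table above.
top5 <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING"),
  sum.fatinj = c(96979, 8428, 7461, 7259, 6046)
)

# reorder() sorts the factor levels by a summary of the second argument;
# negating the totals puts the largest value first.
top5$EVTYPE <- reorder(top5$EVTYPE, -top5$sum.fatinj)
levels(top5$EVTYPE)

# Build the plot only if ggplot2 is installed; print(p) would draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(top5, ggplot2::aes(x = EVTYPE, y = sum.fatinj)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::xlab("Event Type") +
    ggplot2::ylab("Sum of Injuries and Fatalities")
}
```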
Tornadoes have clearly caused the most harm overall by combined injuries and fatalities. The graph makes it clear that they caused more harm than the next four event types combined.
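The comparison with the next four event types can be checked directly from the table. The short snippet below does the arithmetic; the vector name `totals` is just a label chosen for this illustration.

```r
# Totals copied from the first five rows of the health table.
totals <- c(96979, 8428, 7461, 7259, 6046)
names(totals) <- c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING")

next_four <- sum(totals[-1])        # 29194
totals[["TORNADO"]] > next_four     # TRUE
totals[["TORNADO"]] / next_four     # about 3.3
```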
Next the economic data is ordered to show:
ecodataresult <- ecodata[order(ecodata$Tdmg,decreasing = TRUE),]
knitr::kable(ecodataresult[1:10,])
| | EVTYPE | Tdmg |
|---|---|---|
| 170 | FLOOD | 150319678257 |
| 411 | HURRICANE/TYPHOON | 71913712800 |
| 834 | TORNADO | 57362333787 |
| 670 | STORM SURGE | 43323541000 |
| 244 | HAIL | 18761221471 |
| 153 | FLASH FLOOD | 18243991079 |
| 95 | DROUGHT | 15018672000 |
| 402 | HURRICANE | 14610229010 |
| 590 | RIVER FLOOD | 10148404500 |
| 427 | ICE STORM | 8967041360 |
ggplot(ecodataresult[1:5,],aes(x=factor(EVTYPE),y=Tdmg))+
geom_bar(stat="identity")+xlab("Event Type")+ylab("Sum of Economic Damage ($)")+
ggtitle("Top 5 Economic Costs by Storm Event Type")
Flooding has caused the most economic damage overall, although the runners-up are closer here than in the health data. Tornadoes are also the third most economically damaging event type.
Further analysis might look at the average impact per event rather than the total, and break the results down by region and year to identify trends.
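As a rough idea of what an average-impact analysis could look like, the sketch below computes the mean and the number of events per event type with base R. The data frame `toy` and its values are made up for illustration; on the real data the same calls would take the per-row injury and fatality sum and `EVTYPE`.

```r
# Toy data; the values are made up for illustration only.
toy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "FLOOD", "FLOOD", "FLOOD"),
  fatinj = c(10, 30, 1, 2, 3)
)

# Mean impact and event count per type, so that rare but severe events
# can be told apart from frequent but mild ones.
means  <- tapply(toy$fatinj, toy$EVTYPE, mean)
counts <- tapply(toy$fatinj, toy$EVTYPE, length)

per_event <- data.frame(
  EVTYPE      = names(means),
  mean_fatinj = as.vector(means),
  n_events    = as.vector(counts)
)
per_event[order(per_event$mean_fatinj, decreasing = TRUE), ]
```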