Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The purpose of this analysis is to:
Determine the events that are most harmful with respect to population health
Determine the events that have the greatest economic consequences
This section gathers the raw data from the source and transforms it so that the two questions above can be answered.
First, the data has to be downloaded, extracted, and read into R.
url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url,"Data.bz2",method = "curl")
raw.data=read.csv(bzfile("Data.bz2"),stringsAsFactors = F)
head(raw.data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
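As written, the download step runs every time the analysis is executed. A minimal sketch of a guard that skips the download when the file already exists locally; `fetch_once` is a hypothetical helper name and is not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): download only if missing.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" writes the compressed archive in binary mode
    download.file(url, dest, mode = "wb")
  }
  dest
}

# Possible usage, reusing the url and file name from the code above:
# raw.data <- read.csv(bzfile(fetch_once(url, "Data.bz2")), stringsAsFactors = FALSE)
```

Returning the destination path lets the call be nested directly inside `bzfile()`.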
As the head() output shows, only some of the columns in raw.data are needed. The relevant ones are selected here.
harmful.data=data.frame(raw.data$EVTYPE,raw.data$FATALITIES,raw.data$INJURIES,
raw.data$PROPDMG,raw.data$PROPDMGEXP,raw.data$CROPDMG,raw.data$CROPDMGEXP, stringsAsFactors = F)
names(harmful.data)=c("Event","Fatalities","Injuries","PropDmg","PropDmgExp","CropDmg","CropDmgExp")
head(harmful.data)
## Event Fatalities Injuries PropDmg PropDmgExp CropDmg CropDmgExp
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
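The same selection can be written by indexing the data frame with a vector of column names, which avoids repeating `raw.data$` for every field. A small sketch on a toy data frame; `toy.raw` and its values are made up for illustration and only stand in for raw.data.

```r
# Toy stand-in for raw.data with one column that is not needed (made-up values)
toy.raw <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(0, 1),
  INJURIES = c(15, 0),
  REFNUM = 1:2,
  stringsAsFactors = FALSE
)

# Keep only the columns of interest, then give them readable names
keep <- c("EVTYPE", "FATALITIES", "INJURIES")
toy.harmful <- toy.raw[, keep]
names(toy.harmful) <- c("Event", "Fatalities", "Injuries")
```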
It is also important to note that each property damage figure is stored in two fields: a numeric value (PropDmg) and an exponent code (PropDmgExp), such as K, M, or B, which has to be transformed into a numeric multiplier.
harmful.data$PropDmgExp[harmful.data$PropDmgExp==""]=0
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="0"]=0
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="+"]=0
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="-"]=0
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="?"]=0
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="1"]=10
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="2"]=100
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="3"]=1000
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="4"]=1e+04
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="5"]=1e+05
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="6"]=1e+06
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="7"]=1e+07
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="8"]=1e+08
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="B"]=1e+09
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="H"]=100
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="h"]=100
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="K"]=1000
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="M"]=1e+06
harmful.data$PropDmgExp[harmful.data$PropDmgExp=="m"]=1e+06
harmful.data$PropDmgExp=as.numeric(harmful.data$PropDmgExp)
harmful.data$Property=harmful.data$PropDmg*harmful.data$PropDmgExp
The same process has to be done with the crop damage information.
harmful.data$CropDmgExp[harmful.data$CropDmgExp==""]=0
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="0"]=0
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="?"]=0
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="2"]=100
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="k"]=1000
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="K"]=1000
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="m"]=1e+06
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="M"]=1e+06
harmful.data$CropDmgExp[harmful.data$CropDmgExp=="B"]=1e+09
harmful.data$CropDmgExp=as.numeric(harmful.data$CropDmgExp)
harmful.data$Crop=harmful.data$CropDmg*harmful.data$CropDmgExp
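The symbol-by-symbol replacements above can also be expressed as one lookup that serves both the property and the crop fields. A minimal sketch, assuming the same multipliers as in the code above and assuming that any unrecognised symbol (including a blank) counts as 0; `exp_multiplier` and `exp_to_multiplier` are hypothetical names.

```r
# Hypothetical lookup table: letter codes and their multipliers
exp_multiplier <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)

# Hypothetical helper: convert a character vector of exponent codes to numbers
exp_to_multiplier <- function(symbols) {
  # toupper() folds "h", "k", "m" onto their upper-case forms;
  # codes that are not in the table come back as NA
  mult <- unname(exp_multiplier[toupper(symbols)])
  # Digits 1-8 are read as powers of ten, as in the code above
  digits <- grepl("^[1-8]$", symbols)
  mult[digits] <- 10^as.numeric(symbols[digits])
  # Assumption: blank, "0", "+", "-", "?" and anything else map to 0
  mult[is.na(mult)] <- 0
  mult
}

exp_to_multiplier(c("K", "m", "B", "", "?", "2"))
```

Keeping the mapping in a single place means a value such as the multiplier for B only has to be written once.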
Since the database contains a very large number of event types, it is useful to work only with the top 10. This code selects the 10 event types that harmed the most people, counting fatalities plus injuries.
harmful.data$Event=as.factor(harmful.data$Event)
harmful.data$TotPop=harmful.data$Fatalities+harmful.data$Injuries
top.population=aggregate(TotPop~Event,harmful.data,"sum")
top.population=top.population[order(top.population$TotPop, decreasing = T), ]
top.population=top.population[1:10,1]
Once the top 10 events are known, the data has to be aggregated so that each row corresponds to one event type. The new table still keeps fatalities and injuries separate for each event.
fatalities=aggregate(Fatalities~Event,harmful.data,FUN = "sum")
names(fatalities)=c("Event","Amount")
fatalities$Type="Fatality"
fatalities=fatalities[fatalities$Event%in%top.population,]
injuries=aggregate(Injuries~Event,harmful.data,FUN = "sum")
names(injuries)=c("Event","Amount")
injuries$Type="Injury"
injuries=injuries[injuries$Event%in%top.population,]
population=rbind(injuries,fatalities)
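To make the shape of the result easier to follow, here is the same rank, aggregate, and stack sequence on a toy data frame. The records are made up for illustration, and the sketch keeps the top 2 events instead of the top 10.

```r
# Toy records (made-up numbers, for illustration only)
toy <- data.frame(
  Event = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  Fatalities = c(1, 2, 1, 0),
  Injuries = c(10, 5, 2, 1)
)
toy$TotPop <- toy$Fatalities + toy$Injuries

# Rank events by total affected people and keep the top 2
totals <- aggregate(TotPop ~ Event, toy, sum)
top.events <- totals$Event[order(totals$TotPop, decreasing = TRUE)][1:2]

# One row per event and harm type, restricted to the top events
fat <- aggregate(Fatalities ~ Event, toy, sum)
inj <- aggregate(Injuries ~ Event, toy, sum)
names(fat) <- names(inj) <- c("Event", "Amount")
fat$Type <- "Fatality"
inj$Type <- "Injury"
long <- rbind(inj, fat)
long <- long[long$Event %in% top.events, ]
long
```

Each top event appears twice in `long`, once per harm type, which is the layout that a stacked bar chart with `fill=Type` expects.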
The same top 10 filter is applied to economic damage. This code selects the 10 event types with the greatest combined property and crop damage.
harmful.data$TotProp=harmful.data$Property+harmful.data$Crop
top.property=aggregate(TotProp~Event,harmful.data,"sum")
top.property=top.property[order(top.property$TotProp, decreasing = T), ]
top.property=top.property[1:10,1]
Once the top 10 events are known, the data has to be aggregated so that each row corresponds to one event type. The new table still keeps crop damage and property damage separate for each event.
prop=aggregate(Property~Event,harmful.data,FUN = "sum")
names(prop)=c("Event","Amount")
prop$Type="Property"
prop=prop[prop$Event%in%top.property,]
prop=prop[complete.cases(prop),]
crop=aggregate(Crop~Event,harmful.data,FUN = "sum")
names(crop)=c("Event","Amount")
crop$Type="Crop"
crop=crop[crop$Event%in%top.property,]
crop=crop[complete.cases(crop),]
property=rbind(prop,crop)
Two graphs will be created. The first identifies the events that have harmed the most people, and the second identifies the events that have caused the greatest economic damage.
A bar plot of the top 10 events will be created, with each bar split between injuries and fatalities.
library(ggplot2)
ggplot(population, aes(x=Event, y=Amount, fill=Type))+
geom_bar(stat = "identity")+coord_flip()+labs(y="Affected people",
title="Top 10 Harmful Events to Population")+
scale_fill_manual(values = c("green","blue"))
A bar plot of the top 10 events will be created, with each bar split between crop and property damage.
ggplot(property, aes(x=Event, y=Amount, fill=Type))+
geom_bar(stat = "identity")+coord_flip()+labs(y="Damage Cost",
title="Top 10 Events by Property Damage Cost")+
scale_fill_manual(values = c("green","blue"))
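In both charts the bars follow the level order of Event, which is alphabetical unless the levels are set explicitly. One possible refinement, shown here as a sketch on made-up numbers, is to order the levels by each event's total with `reorder()` before plotting; `toy` is a hypothetical table shaped like `population`.

```r
# Toy long-format table shaped like `population` (made-up numbers)
toy <- data.frame(
  Event = c("FLOOD", "TORNADO", "HAIL", "FLOOD", "TORNADO", "HAIL"),
  Amount = c(2, 15, 1, 1, 3, 0),
  Type = rep(c("Injury", "Fatality"), each = 3)
)

# reorder() sets the factor levels by per-event total, smallest first;
# with coord_flip() the event with the largest total is drawn at the top
toy$Event <- reorder(toy$Event, toy$Amount, FUN = sum)
levels(toy$Event)
```

Applying the same call to `population$Event` or `property$Event` before the `ggplot()` call would sort the bars by size.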
From these graphs it is possible to conclude that:
Tornadoes are the most harmful events with respect to both fatalities and injuries.
Floods are the events with the greatest damage cost, taking property and crop damage together.