In this report we aim to identify (1) which type of events are the most harmful with respect to population health and (2) which types of events have the greatest economic consequences, both across the United States. To investigate this, we obtained the data for the years between 1950 and 2001 from the NOAA storm Database. From these data, we found that, across the United States, Tornado are the most harmful events for population health, followed by Excessive heat events. On the other hand, Flood events are have the greates economic consequences.
From the NOAA storm database we obtained data on severe events across the U.S. during the time period 1950-2011. The data includes not only the type of event registered and its characteristics but also the social and economic impact
First of all, we set the working directory and after that we download and unzip the file. After that, we read the unzip data with fread command
setwd("C:/Users/Willem/Documents/DATA_SCIENCE/000_Coursera/Course_5/Assignment2")
if (!file.exists("Data/Stormdata.zip"))
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "Data/Stormdata.zip")
filename <- "./Data/Stormdata"
md0 <- fread(filename)
After reading the data, we check the structure and the first rows of the data. We observe that the data set contain 902297 observation of 37 variables
str(md0)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : chr "3" "2" "2" "2" ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(md0)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1: 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2: 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3: 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4: 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5: 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6: 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1: TORNADO 0 0
## 2: TORNADO 0 0
## 3: TORNADO 0 0
## 4: TORNADO 0 0
## 5: TORNADO 0 0
## 6: TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1: NA 0 14.0 100 3 0 0
## 2: NA 0 2.0 150 2 0 0
## 3: NA 0 0.1 123 2 0 0
## 4: NA 0 0.0 100 2 0 0
## 5: NA 0 0.0 150 2 0 0
## 6: NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1: 15 25.0 K 0
## 2: 0 2.5 K 0
## 3: 2 25.0 K 0
## 4: 2 2.5 K 0
## 5: 2 2.5 K 0
## 6: 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1: 3040 8812 3051 8806 1
## 2: 3042 8755 0 0 2
## 3: 3340 8742 0 0 3
## 4: 3458 8626 0 0 4
## 5: 3412 8642 0 0 5
## 6: 3450 8748 0 0 6
In order to ask the know the which type of events are the most harmful with respect to population health, we need to sum the number of fatalities (and injuries) per type of event. So we construct two new variables aggregating these numbers per event. We also rename the columns in order to have perfeclty labeled the columns
injur <- aggregate(md0$INJURIES, by=list(Category=md0$EVTYPE), FUN=sum)
colnames(injur) <- c("Event", "Injuries")
fatal <- aggregate(md0$FATALITIES, by=list(Category=md0$EVTYPE), FUN=sum)
colnames(fatal) <- c("Event", "Fatalities")
After that, we merge these two variables into a single data set and sort the resulting data set in decreasing order. In this way, the first type of events of the sorted data set will be the most harmful. Please note that here we have considered that the most harmful event is the one that cause more fatalities. Other approaches could be also possible (such as sum of fatalities and injuries, etc)
md1 <- merge(injur, fatal, by="Event")
md1 <- md1[order(-md1$Fatalities),]
The economic consequences are given by the variable “PROPDMG” that expresses the magnitude of the number (quantity) and the variable “PROPDMGEXP” that give us the units of these quantities (“B” for billions dollars, “M” for millions dollars, “K” for thousands dollars and “” for dollars). Let’s check how many different values have the variable “PROPDMGEXP”
unique(md0$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
As we can see, there are other values for this variable, apart from “”,“K”, “M” and “B”. We could consider these observations as invalid. Let’s see how many “invalid” observations are and the percentage that it represents
x <- sum(md0$PROPDMGEXP == "") + sum(md0$PROPDMGEXP == "K") + sum(md0$PROPDMGEXP == "B")+
sum(md0$PROPDMGEXP == "M")
(1-x/nrow(md0))*100
## [1] 0.03635167
Less than 0.04% of the values are invalid. This percentage is small enough so it should not be a problem. Therefore, we are going to subset a dataset with only the variables we are interested in (md2), and after that construct a new variable called “mult” to allow us to convert the character variable “PROPDMGEXP” to a numeric value. Finally, we are going to multiply the quantity “PROPDMG” variable by this new created mult variable. The resulting variable “Econom” quantifies the total cost of the observation Please, note that the observations where the “PROPDMGEXP” variable is different from “”,“K”, “M” or “B”, the variable “mult” is equal to 0, so the total cost (“Econom”) is equal to zero.
md2 <- md0[, c("EVTYPE", "PROPDMG", "PROPDMGEXP")]
md2 = as.data.table(md2)
md2$mult <- 0
md2[PROPDMGEXP=="K", mult:=1000]
md2[PROPDMGEXP=="M", mult:=1000000]
md2[PROPDMGEXP=="B", mult:=1000000000]
md2[PROPDMGEXP=="", mult:=1]
md2$Econom <- md2$PROPDMG*md2$mult
Now, we could aggregate the economic consequences per event, in the same way we have done before for the harmful events. Once aggregated, we sort the subset
cost <- aggregate(md2$Econom, by=list(Category=md0$EVTYPE), FUN=sum)
colnames(cost) <- c("Event", "Cost")
cost <- cost[order(-cost$Cost),]
Once we have done all the previous transformation and have created the required variable, we could answer the questions
The top 10 harmulf events (remember that we have considered that the most harmful event is the one that cause more fatalities) are the followings:
head(md1, 10)
## Event Injuries Fatalities
## 834 TORNADO 91346 5633
## 130 EXCESSIVE HEAT 6525 1903
## 153 FLASH FLOOD 1777 978
## 275 HEAT 2100 937
## 464 LIGHTNING 5230 816
## 856 TSTM WIND 6957 504
## 170 FLOOD 6789 470
## 585 RIP CURRENT 232 368
## 359 HIGH WIND 1137 248
## 19 AVALANCHE 170 224
As we can see in the above table, Tornado events are the most harmful events with 5633 deaths and 91346 people injured
Let’s see visually the results, plotting the top ten events with a barplot. We are going to plot both the number of fatalities and the number of injuries by event
md10 <- md1[1:10,]
windows()
par(mar=c(6,6,3,1), mgp=c(4,0.5,0), mfrow=c(2,1))
with(md10,barplot(md10$Fatalities, names.arg = md10$Event, las=2, cex.names=0.7, ylab = "People dead", main = "Nº of Fatalities per Event"))
with(md10,barplot(md10$Injuries, names.arg = md10$Event, las=2, cex.names=0.7, ylab = "People injured", main = "Nº of Injuries per Event"))
In the same way, in the following table we list the top ten events that caouses the worts economic consequences. The Cost is expressed in dollars
head(cost, 10)
## Event Cost
## 170 FLOOD 144657709807
## 411 HURRICANE/TYPHOON 69305840000
## 834 TORNADO 56925660483
## 670 STORM SURGE 43323536000
## 153 FLASH FLOOD 16140811717
## 244 HAIL 15727366777
## 402 HURRICANE 11868319010
## 848 TROPICAL STORM 7703890550
## 972 WINTER STORM 6688497250
## 359 HIGH WIND 5270046260
We observe that Flood are the events that cause the worst economic consequences resulting in losses of more than 144 billion dollars, followed by hurricane/typhoon events with losses of 69 billion dollars. We could plot the results for the top ten events
cost10 <- cost[1:10,]
windows()
par(mar=c(8,6,3,1))
with(cost10,barplot(cost10$Cost/1000000000, names.arg = cost10$Event, las=2, cex.names=0.7, ylab = "Economic cost (in Billion $)", main = "Top 10 most costly events"))