The analysis address two questions about the impact of different event types in the NOAA storm database. First, the health impact. Then, the economic impact. The data es pre-processed to drop unused variables, clean inconsistent explanation characters, and group & summarize the impact by each event type. The health impact is measured with the total amount of fatalities and the total amount of injuries. The economic impact is measured with the total estimated monetary cost in terms of property cost plus crop cost. The final recommendation is to prioritize tornados as the worst event by all means.
The 2 questions to be answered in this analysis are:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
To complete the following analysis it is required to load these libraries in R.
library(ggplot2)
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
This is the starting point for the analysis. The data used for this analysis is the NOAA Storm Database available in the course site in the following link Storm Data.
In order to read the data it is supposed that the file is already unzipped in the working directory. To read the table it is used the read.csv function. Also, by using the str function, it is shown the primary details of the database.
storm <- read.csv("./repdata_data_StormData.csv")
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
As the goal of the analysis is to indicate which types of events are more harmful to the health and economy across the United States, there is no need about specific location details like state, county, latitude, longitude, etc.
To complete the whole analysis it is going to be used 8 variables:
BGN_DATE: variable indicating the date of occurrence.
EVTYPE: variable indicating the event type.
FATALITIES: variable indicating the amount of fatalities caused by the event.
INJURIES: variable indicating the amount of injuries caused by the event.
PROPDMG: variable indicating the 3 significant digits for the estimated property damage, in USD, caused by the event.
PROPDMGEXP: variable indicating the magnitude of the damage. “K” for thousands, “M” for millions, and “B” for billions.
CROPDMG: variable indicating the 3 significant digits for the estimated crop damage, in USD, caused by the event.
CROPDMGEXP: variable indicating the magnitude of the damage. “K” for thousands, “M” for millions, and “B” for billions.
The first step to clean the data is to remove unused columns. This will be performed by sub-setting the storm data frame. Then, we convert the BGN_DATE variable to a date R format. Also, a summary of the resulting data frame is presented.
storm <- storm[,c(2, 8, 23:28)]
storm <- mutate(storm, BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y"))
summary(storm)
## BGN_DATE EVTYPE FATALITIES
## Min. :1950-01-03 HAIL :288661 Min. : 0.0000
## 1st Qu.:1995-04-20 TSTM WIND :219940 1st Qu.: 0.0000
## Median :2002-03-18 THUNDERSTORM WIND: 82563 Median : 0.0000
## Mean :1998-12-27 TORNADO : 60652 Mean : 0.0168
## 3rd Qu.:2007-07-28 FLASH FLOOD : 54277 3rd Qu.: 0.0000
## Max. :2011-11-30 FLOOD : 25326 Max. :583.0000
## (Other) :170878
## INJURIES PROPDMG PROPDMGEXP CROPDMG
## Min. : 0.0000 Min. : 0.00 :465934 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 K :424665 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 M : 11330 Median : 0.000
## Mean : 0.1557 Mean : 12.06 0 : 216 Mean : 1.527
## 3rd Qu.: 0.0000 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000
## Max. :1700.0000 Max. :5000.00 5 : 28 Max. :990.000
## (Other): 84
## CROPDMGEXP
## :618413
## K :281832
## M : 1994
## k : 21
## 0 : 19
## B : 9
## (Other): 9
As seen in the summary, the PROPDMGEXP and CROPDMGEXP variables contain other characters than “K”, “M”, and “B”. This can be seen in the summary of each variable.
summary(storm$PROPDMGEXP)
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
summary(storm$CROPDMGEXP)
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
In the following analysis “k” and “m” are considered as “K” and “M”. Events containing other character will not be used for now on. In this step the lower case characters are converted to upper case. Then, three data frames are created, one for the observations of the health damage, one for the valid observations in property damage, and one for the valid observations in crop damage..
for(i in 1:length(storm$PROPDMGEXP)) {
if(storm[i,6] == "k") {
storm[i,6] = "K"
} else if(storm[i,6] == "m") {
storm[i,6] = "M"
}
if(storm[i,8] == "k"){
storm[i,8] = "K"
} else if(storm[i,8] == "m"){
storm[i,8] = "M"
}
}
valid_EXP <- c("K", "M", "B")
health <- storm[, 1:4]
prop <- storm[storm$PROPDMGEXP %in% valid_EXP, c(1:2, 5:6)]
crop <- storm[storm$CROPDMGEXP %in% valid_EXP, c(1:2, 7:8)]
prop$PROPDMGEXP <- droplevels(prop$PROPDMGEXP)
crop$CROPDMGEXP <- droplevels(crop$CROPDMGEXP)
summary(health)
## BGN_DATE EVTYPE FATALITIES
## Min. :1950-01-03 HAIL :288661 Min. : 0.0000
## 1st Qu.:1995-04-20 TSTM WIND :219940 1st Qu.: 0.0000
## Median :2002-03-18 THUNDERSTORM WIND: 82563 Median : 0.0000
## Mean :1998-12-27 TORNADO : 60652 Mean : 0.0168
## 3rd Qu.:2007-07-28 FLASH FLOOD : 54277 3rd Qu.: 0.0000
## Max. :2011-11-30 FLOOD : 25326 Max. :583.0000
## (Other) :170878
## INJURIES
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.1557
## 3rd Qu.: 0.0000
## Max. :1700.0000
##
summary(prop)
## BGN_DATE EVTYPE PROPDMG PROPDMGEXP
## Min. :1950-01-03 HAIL :91966 Min. : 0.00 B: 40
## 1st Qu.:2000-07-20 THUNDERSTORM WIND:81941 1st Qu.: 0.00 K:424665
## Median :2007-09-12 TSTM WIND :62844 Median : 1.00 M: 11337
## Mean :2003-09-21 TORNADO :51827 Mean : 24.94
## 3rd Qu.:2010-01-21 FLASH FLOOD :32942 3rd Qu.: 10.00
## Max. :2011-11-30 FLOOD :17394 Max. :5000.00
## (Other) :97128
summary(crop)
## BGN_DATE EVTYPE CROPDMG CROPDMGEXP
## Min. :1993-01-04 HAIL :82305 Min. : 0.000 B: 9
## 1st Qu.:2007-12-15 THUNDERSTORM WIND:81425 1st Qu.: 0.000 K:281853
## Median :2009-05-08 FLASH FLOOD :21679 Median : 0.000 M: 1995
## Mean :2008-09-23 FLOOD :13622 Mean : 4.853
## 3rd Qu.:2010-09-03 HIGH WIND :11501 3rd Qu.: 0.000
## Max. :2011-11-30 TORNADO : 9593 Max. :990.000
## (Other) :63732
The next step in data processing is to multiply the significant digits with the appropriate number to find the total damage in USD.
for(i in 1:length(prop$PROPDMGEXP)){
if(prop[i,4] == "K") {prop$PROPDMG_T <- prop$PROPDMG * 1000}
if(prop[i,4] == "M") {prop$PROPDMG_T <- prop$PROPDMG * 1000000}
if(prop[i,4] == "B") {prop$PROPDMG_T <- prop$PROPDMG * 1000000000}
}
for(i in 1:length(crop$CROPDMGEXP)){
if(crop[i,4] == "K") {crop$CROPDMG_T <- crop$CROPDMG * 1000}
if(crop[i,4] == "M") {crop$CROPDMG_T <- crop$CROPDMG * 1000000}
if(crop[i,4] == "B") {crop$CROPDMG_T <- crop$CROPDMG * 1000000000}
}
summary(prop)
## BGN_DATE EVTYPE PROPDMG PROPDMGEXP
## Min. :1950-01-03 HAIL :91966 Min. : 0.00 B: 40
## 1st Qu.:2000-07-20 THUNDERSTORM WIND:81941 1st Qu.: 0.00 K:424665
## Median :2007-09-12 TSTM WIND :62844 Median : 1.00 M: 11337
## Mean :2003-09-21 TORNADO :51827 Mean : 24.94
## 3rd Qu.:2010-01-21 FLASH FLOOD :32942 3rd Qu.: 10.00
## Max. :2011-11-30 FLOOD :17394 Max. :5000.00
## (Other) :97128
## PROPDMG_T
## Min. : 0
## 1st Qu.: 0
## Median : 1000
## Mean : 24943
## 3rd Qu.: 10000
## Max. :5000000
##
summary(crop)
## BGN_DATE EVTYPE CROPDMG CROPDMGEXP
## Min. :1993-01-04 HAIL :82305 Min. : 0.000 B: 9
## 1st Qu.:2007-12-15 THUNDERSTORM WIND:81425 1st Qu.: 0.000 K:281853
## Median :2009-05-08 FLASH FLOOD :21679 Median : 0.000 M: 1995
## Mean :2008-09-23 FLOOD :13622 Mean : 4.853
## 3rd Qu.:2010-09-03 HIGH WIND :11501 3rd Qu.: 0.000
## Max. :2011-11-30 TORNADO : 9593 Max. :990.000
## (Other) :63732
## CROPDMG_T
## Min. : 0
## 1st Qu.: 0
## Median : 0
## Mean : 4853
## 3rd Qu.: 0
## Max. :990000
##
As the analysis questions address the total damage by event type, the next step to process data is to summarize the total damage by each event type. This will be done by using the group_by and summarize functions. In the case of property and crop damage, both data frames will be merged together to obtain the total damage of both.
health <- group_by(health, EVTYPE)
health_summ <- summarize(health, T_FAT_EVT = sum(FATALITIES),
T_INJ_EVT = sum(INJURIES))
summary(health_summ$T_FAT_EVT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 15.38 0.00 5633.00
summary(health_summ$T_INJ_EVT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 142.7 0.0 91346.0
prop <- group_by(prop, EVTYPE)
prop_summ <- summarize(prop, T_DMG_EVT = sum(PROPDMG_T))
summary(prop_summ$T_DMG_EVT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 5.875e+03 6.000e+04 2.692e+07 5.750e+05 3.212e+09
crop <- group_by(crop, EVTYPE)
crop_summ <- summarize(crop, T_DMG_EVT = sum(CROPDMG_T))
summary(crop_summ$T_DMG_EVT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 5000 33500 8609727 684525 579573280
P_C_dmg <- rbind(crop_summ, prop_summ)
P_C_dmg <- group_by(P_C_dmg,EVTYPE)
P_C_dmg <- summarize(P_C_dmg, total_dmg = sum(T_DMG_EVT))
summary(P_C_dmg)
## EVTYPE total_dmg
## HIGH SURF ADVISORY: 1 Min. :0.000e+00
## FLASH FLOOD : 1 1st Qu.:1.000e+04
## TSTM WIND : 1 Median :5.900e+04
## TSTM WIND (G45) : 1 Mean :2.856e+07
## ? : 1 3rd Qu.:6.010e+05
## AGRICULTURAL FREEZE : 1 Max. :3.312e+09
## (Other) :423
The health impact will be measured using two different metrics. The first is the amount of fatalities and the second is the amount on injured people.
The following table presents the top 10 event types by impact to health measuring the total amount of fatalities.
health_fat <- arrange(health_summ, desc(T_FAT_EVT))
head(health_fat, 10)
## # A tibble: 10 x 3
## EVTYPE T_FAT_EVT T_INJ_EVT
## <fct> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
{ggplot(health_fat[1:10,], aes(EVTYPE, weight = T_FAT_EVT)) +
geom_bar() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Top-10 Health Impact Event Types - Fatalities",
x = "Event Type",
y = "Total Fatalities",
caption = "(Top-ten health impact event types measuring the total amount of fatalities as result)")}
The following table presents the top 10 event types by impact to health measuring the total amount of injuries.
health_inj <- arrange(health_summ, desc(T_INJ_EVT))
head(health_inj, 10)
## # A tibble: 10 x 3
## EVTYPE T_FAT_EVT T_INJ_EVT
## <fct> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 TSTM WIND 504 6957
## 3 FLOOD 470 6789
## 4 EXCESSIVE HEAT 1903 6525
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
## 7 ICE STORM 89 1975
## 8 FLASH FLOOD 978 1777
## 9 THUNDERSTORM WIND 133 1488
## 10 HAIL 15 1361
{ggplot(health_inj[1:10,], aes(EVTYPE, weight = T_INJ_EVT)) +
geom_bar() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Top-10 Health Impact Event Types - Injuries",
x = "Event Type",
y = "Total Injuries",
caption = "(Top-ten health impact event types measuring the total amount of injuries as result)")}
The following event types appear in the top-ten of both measurements, the list is in order of total fatalities:
Tornado.
Excessive heat.
Flash flood.
Heat.
Lightning.
TSTM wind.
Flood.
The following table presents the top 10 event types by impact to economy measuring the total estimated monetary damage.
P_C_dmg <- arrange(P_C_dmg, desc(total_dmg))
head(P_C_dmg, 10)
## # A tibble: 10 x 2
## EVTYPE total_dmg
## <fct> <dbl>
## 1 TORNADO 3311817980
## 2 FLASH FLOOD 1598767950
## 3 TSTM WIND 1445113210
## 4 HAIL 1267938960
## 5 FLOOD 1067969360
## 6 THUNDERSTORM WIND 943491620
## 7 LIGHTNING 606765690
## 8 THUNDERSTORM WINDS 458716910
## 9 HIGH WIND 341979770
## 10 WINTER STORM 134698580
{ggplot(P_C_dmg[1:10,], aes(EVTYPE, weight = total_dmg)) +
geom_bar() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Top-10 Economy Impact Event Types",
x = "Event Type",
y = "Total Estimated Monetary Cost",
caption = "(Top-ten economy impact events type measured by total estimates monetary cost)")}
As seen on both, health and economic, analyses, the worst event type is Tornado. There are other event types that should be considered as priority such as:
Flash flood.
TSTM wind.
Flood.