The purpose of this study is to analyze the health and economic impact of different weather events. The dataset that was used, Storm Data, is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents significant weather phenomena. Events documented in the dataset start in 1950 and end in 2011. The accuracy of the data is greater for later records.
The analysis identified floods as the most harmful weather event in terms of economic impact. In terms of negative health effect, tornadoes are the most harmful, causing the greatest number of injuries and fatalities.
We start by downloading and processing the data.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2")
#accesed on 09.02.2018
bunzip2("StormData.csv.bz2")
#using fread from data.table to save time (compared to read.csv)
sd <- fread("StormData.csv")
##
Read 30.0% of 967216 rows
Read 51.7% of 967216 rows
Read 70.3% of 967216 rows
Read 82.7% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:06
str(sd)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : chr "3" "2" "2" "2" ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
There are some important variables, that need recoding. The PROPDMGEXP and CROPDMGEXP variables should be exponents for the appropriate damage variables. However, as seen below, it is not clear. To determine the meaning of each exponent, I have used this publication
table(sd$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(sd$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
#recoding the PROPDMG_EXP variable
sd$PROPDMGEXP_n <- recode(tolower(sd$PROPDMGEXP),
'h' = 10^2,
'k' = 10^3,
'm' = 10^6,
'b' = 10^9,
'+' = 1,
'1' = 10,
'2' = 10,
'3' = 10,
'4' = 10,
'5' = 10,
'6' = 10,
'7' = 10,
'8' = 10,
.default = 0)
#creating a variable with real damages (in million USD)
sd$PROPDMG_real <- sd$PROPDMG * sd$PROPDMGEXP_n / 10^6
#recoding the CROPDMG_EXP variable
sd$CROPDMGEXP_n <- recode(tolower(sd$CROPDMGEXP),
'h' = 10^2,
'k' = 10^3,
'm' = 10^6,
'b' = 10^9,
'+' = 1,
'1' = 10,
'2' = 10,
'3' = 10,
'4' = 10,
'5' = 10,
'6' = 10,
'7' = 10,
'8' = 10,
.default = 0)
sd$CROPDMG_real <- sd$CROPDMG * sd$CROPDMGEXP_n / 10^6
#to keep the data tidy, we keep only the important variables
sd <- sd %>% select(EVTYPE, PROPDMG_real, CROPDMG_real, INJURIES, FATALITIES)
First, lets prepare a summary of the economic and health effect by event types. Here the economic effect is defined as sum of property and crop damage, and health as sum of injuries and death related to the event.
sum_sd <- sd %>%
group_by(EVTYPE) %>%
summarise(econ_effect=sum(PROPDMG_real, CROPDMG_real),
health_effect=sum(INJURIES, FATALITIES))
str(sum_sd)
## Classes 'tbl_df', 'tbl' and 'data.frame': 985 obs. of 3 variables:
## $ EVTYPE : chr " HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
## $ econ_effect : num 0.2 0 0.05 0 8.1 0.008 0 0 0.005 0 ...
## $ health_effect: num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
It can be observed, that the event type variable is very messy - it contains many more values, than the 48 types listed in the documentation. Let’s take a look at the 50 event types with the highest economic impact and 50 with the highest impact on health
highest_econ <- sum_sd[order(sum_sd$econ_effect, decreasing = TRUE),][1:50,1]
highest_health <- sum_sd[order(sum_sd$health_effect, decreasing = TRUE),][1:50,1]
unique(rbind(highest_econ, highest_health))
## # A tibble: 73 x 1
## EVTYPE
## <chr>
## 1 FLOOD
## 2 HURRICANE/TYPHOON
## 3 TORNADO
## 4 STORM SURGE
## 5 HAIL
## 6 FLASH FLOOD
## 7 DROUGHT
## 8 HURRICANE
## 9 RIVER FLOOD
## 10 ICE STORM
## # ... with 63 more rows
Now, let’s see which of them are not on the list from the documentation and recode them. Full analysis would require to check and recode all the event types, however it would be a tedious manual task, so we will only concentrate on the most harmful event types.
sum_sd$EVTYPE <- recode(sum_sd$EVTYPE,
"STORM SURGE" = "STORM SURGE/TIDE",
"HURRICANE" = "HURRICANE/TYPHOON",
"RIVER FLOOD" = "FLOOD",
"TSTM WIND" = "THUNDERSTORM WIND",
"HURRICANE OPAL" = "HURRICANE/TYPHOON",
"WILD/FOREST FIRE" = "WILDFIRE",
"HEAVY RAIN/SEVERE WEATHER" = "HEAVY RAIN",
"THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
"TORNADOES, TSTM WIND, HAIL" = "TORNADO",
"EXTREME COLD" = "EXTREME COLD/WIND CHILL",
"SEVERE THUNDERSTORM" = "THUNDERSTORM WIND",
"HIGH WINDS" = "HIGH WIND",
"WILD FIRES" = "WILDFIRE",
"TYPHOON" = "HURRICANE/TYPHOON",
"FREEZE" = "FROST/FREEZE",
"HEAT" = "EXCESSIVE HEAT",
"HURRICANE ERIN" = "HURRICANE/TYPHOON",
"LANDSLIDE" = "DEBRIS FLOW",
"FLASH FLOODING" = "FLASH FLOOD",
"FLASH FLOOD/FLOOD" = "FLASH FLOOD",
"DAMAGING FREEZE" = "FROST/FREEZE",
"FLOOD/FLASH FLOOD" = "FLASH FLOOD",
"HAILSTORM" = "HAIL",
"EXCESSIVE WETNESS" = "FLOOD",
"River Flooding" = "FLOOD",
"COASTAL FLOODING" = "COASTAL FLOOD",
"HIGH WINDS/COLD" = "EXTREME COLD/WIND CHILL",
"FLOODING" = "FLOOD",
"FOG" = "DENSE FOG",
"RIP CURRENTS" = "RIP CURRENT",
"HEAT WAVE" = "EXCESSIVE HEAT",
"EXTREME HEAT" = "EXCESSIVE HEAT",
"GLAZE" = "FROST/FREEZE",
"ICE" = "FROST/FREEZE",
"WIND" = "STRONG WIND",
"COLD/WIND CHILL" = "EXTREME COLD/WIND CHILL",
"URBAN/SML STREAM FLD" = "FLOOD",
"TSTM WIND/HAIL" = "THUNDERSTORM WIND",
"WINTER WEATHER/MIX" = "WINTER WEATHER",
"HEAVY SURF/HIGH SURF" = "HIGH SURF",
"COLD" = "EXTREME COLD/WIND CHILL",
"WINTRY MIX" = "WINTER WEATHER",
"Heat Wave" = "EXCESSIVE HEAT",
"WINTER WEATHER MIX" = "WINTER WEATHER")
Now, that the event types were recoded, we need to summarize the data again.
sum_sd <- sum_sd %>%
group_by(EVTYPE) %>%
summarise(econ_effect=sum(econ_effect),
health_effect=sum(health_effect))
str(sum_sd)
## Classes 'tbl_df', 'tbl' and 'data.frame': 942 obs. of 3 variables:
## $ EVTYPE : chr " HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
## $ econ_effect : num 0.2 0 0.05 0 8.1 0.008 0 0 0.005 0 ...
## $ health_effect: num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
Now that the data is clean, an analysis can be performed.
Let’s start by plotting the events, that are most economically harmful.
e <- ggplot(sum_sd[order(sum_sd$econ_effect, decreasing = TRUE),][1:10,], aes(EVTYPE, econ_effect))
e + geom_bar(stat = 'identity') +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Event Type") +
ylab("Damage (in millions USD)") +
ggtitle("Economic effect")
It is clearly visible, that floods are the most economically harmful weather events. In the analysed period (1950-2011) floods have caused over 150 billion USD damages in property and crops.
We can plot the health impact (number of injuries and fatalities) of each event type.
h <- ggplot(sum_sd[order(sum_sd$health_effect, decreasing = TRUE),][1:10,], aes(EVTYPE, health_effect))
h + geom_bar(stat = 'identity') +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Event Type") +
ylab("Number of injuries and fatalities") +
ggtitle("Health effect")
It is clear, that tornadoes are the most harmful weather events in terms of fatalities and injuries.