In this analysis, we’re going to get the Storm Data database from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) website. Then we’ll conduct some exploratory analysis to understand the data better. We’ll then modify the data set to better suit our analysis needs. Finally, we’ll proceed to answer the questions that were posed by the assignment:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The first step is to actually get the data, to do so, we’re going to download it from the NOAA website.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("./data/stormdata.csv.bz2")){
download.file (url, "./data/stormdata.csv.bz2")
}
if(!exists("stormdata")){
stormdata <- read.csv("./data/stormdata.csv.bz2")
}
Let’s now look at some properties of the data set.
library(tidyverse) # loading libraries that will
library(lubridate) # be useful during the analysis
dim(stormdata)
## [1] 902297 37
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
as_tibble(stormdata)
## # A tibble: 902,297 × 37
## STATE__ BGN_DATE BGN_T…¹ TIME_…² COUNTY COUNT…³ STATE EVTYPE BGN_R…⁴ BGN_AZI
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1 4/18/195… 0130 CST 97 MOBILE AL TORNA… 0 ""
## 2 1 4/18/195… 0145 CST 3 BALDWIN AL TORNA… 0 ""
## 3 1 2/20/195… 1600 CST 57 FAYETTE AL TORNA… 0 ""
## 4 1 6/8/1951… 0900 CST 89 MADISON AL TORNA… 0 ""
## 5 1 11/15/19… 1500 CST 43 CULLMAN AL TORNA… 0 ""
## 6 1 11/15/19… 2000 CST 77 LAUDER… AL TORNA… 0 ""
## 7 1 11/16/19… 0100 CST 9 BLOUNT AL TORNA… 0 ""
## 8 1 1/22/195… 0900 CST 123 TALLAP… AL TORNA… 0 ""
## 9 1 2/13/195… 2000 CST 125 TUSCAL… AL TORNA… 0 ""
## 10 1 2/13/195… 2000 CST 57 FAYETTE AL TORNA… 0 ""
## # … with 902,287 more rows, 27 more variables: BGN_LOCATI <chr>,
## # END_DATE <chr>, END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>,
## # END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>,
## # WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>,
## # PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>,
## # WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>,
## # LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, …
There are 902297 observations across 37 variables. The variable names are described above.
BGN_DATE could be a variable of interest if we wanted to compare between different periods or if we wanted to look at a specific range of time. By now, we’re content to look at the data as a whole, but it is important to understand the time frame that we’re looking at.
stormdata %>% mutate (BGN_DATE = mdy_hms(BGN_DATE)) ->
stormdata
range(stormdata$BGN_DATE)
## [1] "1950-01-03 UTC" "2011-11-30 UTC"
The above sets the time-frame of the observations between January of 1950 and November of 2011.
Based on the following question raised by the assignment:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
The variables of interest are: EVTYPE, FATALITIES and INJURIES
We can group the data by the event type while summarizing with the sum of the fatalities and injuries for each event type. We can then order the observations by FATALITIES first and then INJURIES to resolve ties.
stormdata %>% as_tibble() %>% group_by(EVTYPE) %>% summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES)) %>% arrange(desc(FATALITIES), desc(INJURIES)) ->
healthdmg
healthdmg
## # A tibble: 985 × 3
## EVTYPE FATALITIES INJURIES
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
## # … with 975 more rows
The table above summarizes the data in the descending order by FATALITIES, and breaks ties by using INJURIES. The tie-breaking becomes most useful in the observations that didn’t lead to any FATALITIES. We can see only those observations by filtering.
healthdmg %>% filter(FATALITIES == 0) %>% select (EVTYPE,INJURIES) -> injuries
injuries
## # A tibble: 817 × 2
## EVTYPE INJURIES
## <chr> <dbl>
## 1 Heat Wave 70
## 2 WINTER WEATHER MIX 68
## 3 SNOW/HIGH WINDS 36
## 4 THUNDERSTORMW 27
## 5 TORNADO F2 16
## 6 GLAZE/ICE STORM 15
## 7 HEAVY SNOW/ICE 10
## 8 SMALL HAIL 10
## 9 THUNDERSTORM WINDS 10
## 10 NON-SEVERE WIND DAMAGE 7
## # … with 807 more rows
The table above summarizes the events that didn’t lead to any FATALITIES by descending order of INJURIES.
Based on the following question raised by the assignment
Across the United States, which types of events have the greatest economic consequences?
The variables of interest are: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
Let’s look at that segment of the data.
stormdata %>% select (EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP) %>% as_tibble() -> economicdmg
economicdmg
## # A tibble: 902,297 × 5
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## <chr> <dbl> <chr> <dbl> <chr>
## 1 TORNADO 25 K 0 ""
## 2 TORNADO 2.5 K 0 ""
## 3 TORNADO 25 K 0 ""
## 4 TORNADO 2.5 K 0 ""
## 5 TORNADO 2.5 K 0 ""
## 6 TORNADO 2.5 K 0 ""
## 7 TORNADO 2.5 K 0 ""
## 8 TORNADO 2.5 K 0 ""
## 9 TORNADO 25 K 0 ""
## 10 TORNADO 25 K 0 ""
## # … with 902,287 more rows
The table above summarizes the economic damages. There’s a problem with it, though, the numbers in PROPDMG are absolute values that need to take into account the multiplicand in PROPDMGEXP. We can use the case_when function in dplyr to deal with this. But first, we need to see which cases exist.
unique(economicdmg$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
length(unique(economicdmg$PROPDMGEXP))
## [1] 19
From the above, we can see that there are many different cases. Some of them are quite obvious, such as “K” meaning thousands, “B” meaning billions and “M” meaning millions. The others are a little bit obscure.
After a quick online search, a report by Eddie Song reveals the meaning of each observation.
These are possible values of CROPDMGEXP and PROPDMGEXP:
H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character
H,h = hundreds = 100
K,k = kilos = thousands = 1,000
M,m = millions = 1,000,000
B,b = billions = 1,000,000,000
(+) = 1
(-) = 0
(?) = 0
black/empty character = 0
numeric 0..8 = 10
Now we multiply the values in PROPDMG to the according exponent as per the report above.
economicdmg %>% mutate (PROPDMG = case_when(
tolower(PROPDMGEXP) == "k" ~ PROPDMG*1000,
tolower(PROPDMGEXP) == "m" ~ PROPDMG*1000000,
tolower(PROPDMGEXP) == "b" ~ PROPDMG*1000000000,
tolower(PROPDMGEXP) == "h" ~ PROPDMG*100,
PROPDMGEXP %in% 0:8 ~ PROPDMG*10,
PROPDMGEXP == "-" ~ PROPDMG*0,
PROPDMGEXP == "" ~ PROPDMG*0,
PROPDMGEXP == "+" ~ PROPDMG*1,
PROPDMGEXP == "?" ~ PROPDMG*0)) %>% select(-PROPDMGEXP) -> economicdmg
economicdmg
## # A tibble: 902,297 × 4
## EVTYPE PROPDMG CROPDMG CROPDMGEXP
## <chr> <dbl> <dbl> <chr>
## 1 TORNADO 25000 0 ""
## 2 TORNADO 2500 0 ""
## 3 TORNADO 25000 0 ""
## 4 TORNADO 2500 0 ""
## 5 TORNADO 2500 0 ""
## 6 TORNADO 2500 0 ""
## 7 TORNADO 2500 0 ""
## 8 TORNADO 2500 0 ""
## 9 TORNADO 25000 0 ""
## 10 TORNADO 25000 0 ""
## # … with 902,287 more rows
Now we need to do the same for CROPDMG and CROPDMGEXP.
## # A tibble: 902,297 × 3
## EVTYPE PROPDMG CROPDMG
## <chr> <dbl> <dbl>
## 1 TORNADO 25000 0
## 2 TORNADO 2500 0
## 3 TORNADO 25000 0
## 4 TORNADO 2500 0
## 5 TORNADO 2500 0
## 6 TORNADO 2500 0
## 7 TORNADO 2500 0
## 8 TORNADO 2500 0
## 9 TORNADO 25000 0
## 10 TORNADO 25000 0
## # … with 902,287 more rows
Now that the data is tidy, we can group it by the EVTYPE and take the sum of the damages. We will then order by the TOTALDMG in descending order to answer the proposed question.
economicdmg %>% group_by(EVTYPE) %>% summarise(TOTALDMG = sum(PROPDMG,CROPDMG)) %>% arrange(desc(TOTALDMG)) -> totaleconomicdmg
totaleconomicdmg
## # A tibble: 985 × 2
## EVTYPE TOTALDMG
## <chr> <dbl>
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352117607
## 4 STORM SURGE 43323541000
## 5 HAIL 18758224527
## 6 FLASH FLOOD 17562132111
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041810
## # … with 975 more rows
We are going to plot some plots to answer the proposed questions. Let’s start with the first question.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
First, let’s demonstrate the correlation between these two variables across the different event types.
model <- lm(INJURIES ~ FATALITIES, data = healthdmg)
reg_summary <- summary(model)
healthdmg %>%
ggplot(aes(x = FATALITIES, y = INJURIES)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
labs(caption = paste0("Regression equation: y = ",
round(reg_summary$coefficients[2, 1], 2),
"x + ",
round(reg_summary$coefficients[1, 1], 2),
", R^2 = ",
round(reg_summary$r.squared, 2)))
With the strong correlation demonstrated, the plot that represents the FATALITIES per EVTYPE is enough to answer the question above, as FATALITIES are much more important than INJURIES with respect to the health of population, and due to the fact that both these variables are so closely correlated.Let’s filter only the observations which have more than 500 fatalities, so that the plot is more readable.
healthdmg %>% filter (FATALITIES>500) %>%
ggplot(aes(x = reorder(EVTYPE,-FATALITIES), y = FATALITIES)) +
geom_col() +
theme(axis.text.x = element_text(size = 8,
angle = 45,
hjust = 1)) +
labs (x = "Event Type", y = "Fatalities") +
ggtitle ("Fatalities due to Severe Weather Events in the US",
subtitle = "From 1950 to 2011")
The plot shows, in descending order the types of event that are most harmful to health of US citizens.
Across the United States, which types of events have the greatest economic consequences?
To answer this question, we will plot as above, but considering the economic damage. We will filter only the causes that have surpassed 10 billion dollars in TOTALDMG during the period, so that the plot is more readable.
totaleconomicdmg %>% filter(TOTALDMG>10000000000) %>%
ggplot(aes(x = reorder(EVTYPE,-TOTALDMG), y = TOTALDMG)) +
geom_col() +
theme(axis.text.x = element_text(size = 8,
angle = 45,
hjust = 1)) +
labs (x = "Event Type", y = "Total Economic Damage (US Dollars)") +
ggtitle ("Economic Damage due to Severe Weather Events in the US",
subtitle = "From 1950 to 2011")
The plot above shows the types of event, in descending order, that have caused the most economic damage in the studied period.