This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Most of the variables in the dataset are not needed for this analysis, so I first dropped the irrelevant ones.
These data were gathered from 1950 to 2011, with earlier years being annotated less strictly than recent years, when more precise rules were established to record natural events properly. This means a thorough cleaning process is needed before any analysis can be performed on this dataset. The main goal of the analysis is to find the event types causing the most human or economic damage, so I fixed the event type values to be as compliant as possible with NOAA specifications.
I calculated the total number of fatalities and injuries for each event type, and in the same way I obtained the event type causing the greatest economic consequences. I assumed “economic consequences” refers to both property and crop damage, since these two can affect the economy to more or less the same extent.
The data can be downloaded from the URL used below; there is no need to decompress the file, since the read_csv function can read bzip2-compressed data directly.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"storm_data.csv.bz2")
library(tidyverse)
library(skimr)
library(stringr)
library(magrittr)
library(gridExtra)
Let’s load the data into R and have a quick look.
df <- read_csv("storm_data.csv.bz2")
head(df)
## # A tibble: 6 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 1 4/18/19… 0130 CST 97 MOBILE AL TORNA… 0 <NA> <NA> <NA> <NA> 0
## 2 1 4/18/19… 0145 CST 3 BALDWIN AL TORNA… 0 <NA> <NA> <NA> <NA> 0
## 3 1 2/20/19… 1600 CST 57 FAYETTE AL TORNA… 0 <NA> <NA> <NA> <NA> 0
## 4 1 6/8/195… 0900 CST 89 MADISON AL TORNA… 0 <NA> <NA> <NA> <NA> 0
## 5 1 11/15/1… 1500 CST 43 CULLMAN AL TORNA… 0 <NA> <NA> <NA> <NA> 0
## 6 1 11/15/1… 2000 CST 77 LAUDERDALE AL TORNA… 0 <NA> <NA> <NA> <NA> 0
## # ... with 23 more variables: COUNTYENDN <chr>, END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>,
## # F <int>, MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>,
## # WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## # REMARKS <chr>, REFNUM <dbl>
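On the point that the compressed file can be read without unpacking it first, here is a minimal sketch using base R only. The toy data frame and the temporary file are made up for the example, and `read.csv()` with a `bzfile()` connection stands in for `read_csv()`, which handles `.bz2` files on its own.

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write the CSV through a bzip2 connection, producing a compressed file
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# bzfile() gives a decompressing connection, so no manual unzip step is needed
toy_back <- read.csv(bzfile(path), stringsAsFactors = FALSE)
print(toy_back)
```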
Let’s restrict the data to the variables relevant to our analysis.
sub_df <- df %>%
select(-c(STATE__, BGN_DATE, BGN_TIME, TIME_ZONE, COUNTY, COUNTYNAME, STATE, BGN_RANGE, BGN_AZI, BGN_LOCATI,
END_DATE, END_TIME, COUNTY_END, COUNTYENDN, END_RANGE, END_AZI, END_LOCATI, LENGTH, WIDTH,
STATEOFFIC, ZONENAMES, LATITUDE, LONGITUDE, LATITUDE_E, LONGITUDE_, REMARKS))
head(sub_df)
## # A tibble: 6 x 11
## EVTYPE F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO REFNUM
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 TORNADO 3 0 0 15 25 K 0 <NA> <NA> 1
## 2 TORNADO 2 0 0 0 2.5 K 0 <NA> <NA> 2
## 3 TORNADO 2 0 0 2 25 K 0 <NA> <NA> 3
## 4 TORNADO 2 0 0 2 2.5 K 0 <NA> <NA> 4
## 5 TORNADO 2 0 0 2 2.5 K 0 <NA> <NA> 5
## 6 TORNADO 2 0 0 6 2.5 K 0 <NA> <NA> 6
Now let’s check for missing values and other issues with these data.
skim(sub_df)
## Skim summary statistics
## n obs: 902297
## n variables: 11
##
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## CROPDMGEXP 618413 283884 902297 1 1 0 8
## EVTYPE 0 902297 902297 1 30 0 977
## PROPDMGEXP 465934 436363 902297 1 1 0 18
## WFO 142069 760228 902297 1 3 0 541
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## F 843563 58734 902297 0.91 1 0 0 1 1 5 ▇▆▁▃▁▁▁▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## CROPDMG 0 902297 902297 1.53 22.17 0 0 0 0 990 ▇▁▁▁▁▁▁▁
## FATALITIES 0 902297 902297 0.017 0.77 0 0 0 0 583 ▇▁▁▁▁▁▁▁
## INJURIES 0 902297 902297 0.16 5.43 0 0 0 0 1700 ▇▁▁▁▁▁▁▁
## MAG 0 902297 902297 46.9 61.91 0 0 50 75 22000 ▇▁▁▁▁▁▁▁
## PROPDMG 0 902297 902297 12.06 59.48 0 0 0 0.5 5000 ▇▁▁▁▁▁▁▁
## REFNUM 0 902297 902297 451149 260470.85 1 225575 451149 676723 9e+05 ▇▇▇▇▇▇▇▇
There is something interesting here: the EVTYPE variable shows 977 unique values, although in the data documentation only 48 event types are reported.
We may think that something’s wrong, so let’s have a look at some of these values.
sub_df %>%
count(EVTYPE) %>%
arrange(desc(n))
## # A tibble: 977 x 2
## EVTYPE n
## <chr> <int>
## 1 HAIL 288661
## 2 TSTM WIND 219944
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54278
## 6 FLOOD 25326
## 7 THUNDERSTORM WINDS 20843
## 8 HIGH WIND 20212
## 9 LIGHTNING 15755
## 10 HEAVY SNOW 15708
## # ... with 967 more rows
As we can see, a number of values are misspelled or inconsistently encoded. This means we need some cleaning here.
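Variants that differ only in letter case or spacing can be collapsed mechanically. The sketch below is not the approach taken in this analysis; it uses made-up labels and a hypothetical `normalize()` helper, and it would not catch genuine synonyms or typos.

```r
# Hypothetical labels mimicking the kinds of variants found in EVTYPE
labels <- c("TSTM WIND", "tstm wind", " TSTM WIND", "TSTM  WIND", "HAIL", "Hail")

# Upper-case, trim both ends, and collapse repeated internal whitespace
normalize <- function(x) gsub("\\s+", " ", trimws(toupper(x)))

n_raw <- length(unique(labels))               # distinct raw labels
n_clean <- length(unique(normalize(labels)))  # distinct labels after normalization
c(raw = n_raw, clean = n_clean)
```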
A closer inspection of the EVTYPE variable shows so many distinct values that fixing all of them would require a very elaborate strategy. Instead, I selected the 200 most common values and mapped the incorrect entries in this list to the original 48 event types; all the other entries are left out of the analysis.
top_200 <- sub_df %>%
count(EVTYPE) %>%
arrange(desc(n)) %>%
top_n(200, n)
top_df <- sub_df %>%
filter(EVTYPE %in% top_200$EVTYPE) %>%
mutate(EVTYPE = toupper(EVTYPE))
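One detail worth keeping in mind is that `top_n()` keeps ties, so the list can hold more than 200 entries if several event types share the cut-off count. A minimal sketch with made-up counts:

```r
library(dplyr)

# Hypothetical counts with a tie for second place
counts <- data.frame(EVTYPE = c("A", "B", "C", "D"),
                     n = c(5, 3, 3, 1),
                     stringsAsFactors = FALSE)

# Asking for the top 2 returns 3 rows, because B and C are tied
top2 <- top_n(counts, 2, n)
nrow(top2)
```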
Now I will perform a (quite boring) replacement of the messy event labels with their proper counterparts from the data documentation.
top_df %<>%
mutate(event = case_when(EVTYPE %in% c("THUNDERSTORM WIND", "TSTM WIND", "THUNDERSTORM WINDS", "TSTM WIND/HAIL",
"THUNDERSTORM WINDS HAIL", "THUNDERSTORM WINDSS", "THUNDERSTORM",
"TSTM WIND (G45)", "THUNDERSTORM WINDS/HAIL", "SEVERE THUNDERSTORMS",
"THUNDERSTORMS WINDS", "SEVERE THUNDERSTORM",
"TSTM WIND (G40)") ~ "THUNDERSTORM WIND",
EVTYPE %in% c("MARINE THUNDERSTORM WIND", "MARINE TSTM WIND",
"COASTAL STORM") ~ "MARINE THUNDERSTORM WIND",
EVTYPE %in% c("FLOOD", "URBAN/SML STREAM FLD", "FLOOD/FLASH FLOOD", "URBAN FLOOD",
"RIVER FLOOD", "FLOODING", "URBAN FLOODING", "URBAN/SMALL STREAM FLOOD",
"RIVER FLOODING", "SMALL STREAM FLOOD") ~ "FLOOD",
EVTYPE %in% c("HIGH WIND", "HIGH WINDS", "WIND", "DRY MICROBURST", "WINDS",
"WIND ADVISORY") ~ "HIGH WIND",
EVTYPE %in% c("WILDFIRE", "WILD/FOREST FIRE", "WILDFIRES") ~ "WILDFIRE",
EVTYPE %in% c("WINTER WEATHER", "WINTER WEATHER/MIX",
"LOW TEMPERATURE") ~ "WINTER WEATHER",
EVTYPE %in% c("FLASH FLOOD", "FLASH FLOODING", "FLASH FLOODS",
"FLASH FLOOD/FLOOD", "FLASH FLOODING/FLOOD") ~ "FLASH FLOOD",
EVTYPE %in% c("EXTREME COLD/WIND CHILL", "EXTREME COLD", "EXTREME WINDCHILL",
"RECORD COLD", "UNSEASONABLY COLD",
"EXTREME WINDCHILL TEMPERATURES") ~ "EXTREME COLD/WIND CHILL",
EVTYPE %in% c("AVALANCHE", "LANDSLIDE", "MUDSLIDE", "LANDSLIDES",
"MUD SLIDE") ~ "AVALANCHE",
EVTYPE %in% c("HEAVY SNOW", "SNOW", "HEAVY SNOW SQUALLS", "EXCESSIVE SNOW",
"SNOW SQUALL", "SNOW SQUALLS", "HEAVY SNOW-SQUALLS", "BLOWING SNOW",
"RECORD SNOW", "SNOW/BLOWING SNOW", "SNOW/ICE") ~ "HEAVY SNOW",
EVTYPE %in% c("DENSE FOG", "FOG", "GLAZE") ~ "DENSE FOG",
EVTYPE %in% c("RIP CURRENT", "RIP CURRENTS") ~ "RIP CURRENT",
EVTYPE %in% c("STORM SURGE/TIDE", "STORM SURGE", "HIGH SEAS") ~ "STORM SURGE/TIDE",
EVTYPE %in% c("HAIL", "FREEZING RAIN", "SMALL HAIL", "HAIL 75", "HAIL 0.75",
"HAIL 100", "HAIL 175", "SNOW FREEZING RAIN",
"NON SEVERE HAIL") ~ "HAIL",
EVTYPE %in% c("HIGH SURF", "HEAVY SURF/HIGH SURF", "HEAVY SURF") ~ "HIGH SURF",
EVTYPE %in% c("STRONG WIND", "STRONG WINDS", "GUSTY WINDS", "WIND DAMAGE",
"GUSTY WIND", "GRADIENT WINDS") ~ "STRONG WIND",
EVTYPE %in% c("HURRICANE (TYPHOON)", "HURRICANE", "HURRICANE/TYPHOON",
"TYPHOON", "HURRICANE OPAL",
"HURRICANE ERIN") ~ "HURRICANE (TYPHOON)",
EVTYPE %in% c("SLEET", "LIGHT SNOW", "MODERATE SNOWFALL", "WINTRY MIX",
"LIGHT FREEZING RAIN", "FREEZING DRIZZLE", "SLEET STORM",
"SNOW/SLEET", "FREEZING RAIN/SLEET") ~ "SLEET",
EVTYPE %in% c("HEAT", "RECORD WARMTH", "UNSEASONABLY HOT", "UNUSUAL WARMTH", "DRY",
"UNSEASONABLY WARM", "UNSEASONABLY DRY",
"UNSEASONABLY WARM AND DRY") ~ "HEAT",
EVTYPE %in% c("COASTAL FLOOD", "COASTAL FLOODING", "TIDAL FLOODING") ~ "COASTAL FLOOD",
EVTYPE %in% c("EXCESSIVE HEAT", "RECORD HEAT", "HEAT WAVE",
"EXTREME HEAT") ~ "EXCESSIVE HEAT",
# not very sure about the following one, but since in the data
# documentation only an Astronomical Low Tide entry
# is reported, I'm assuming this conversion is appropriate
EVTYPE %in% c("ASTRONOMICAL LOW TIDE",
"ASTRONOMICAL HIGH TIDE") ~ "ASTRONOMICAL LOW TIDE",
EVTYPE %in% c("FUNNEL CLOUD", "FUNNEL CLOUDS", "FUNNEL") ~ "FUNNEL CLOUD",
EVTYPE %in% c("FROST/FREEZE", "FREEZE", "ICE", "FROST", "ICY ROADS", "BLACK ICE",
"HARD FREEZE") ~ "FROST/FREEZE",
EVTYPE %in% c("COLD/WIND CHILL", "COLD", "WIND CHILL", "PROLONG COLD",
"UNSEASONABLY COOL", "UNUSUALLY COLD") ~ "COLD/WIND CHILL",
EVTYPE %in% c("WATERSPOUT", "WATERSPOUTS", "WATERSPOUT-",
"WATERSPOUT/TORNADO") ~ "WATERSPOUT",
EVTYPE %in% c("HEAVY RAIN", "HEAVY RAINS", "UNSEASONABLY WET",
"RAIN", "RECORD RAINFALL", "HEAVY RAINS/FLOODING") ~ "HEAVY RAIN",
EVTYPE %in% c("ICE STORM", "SNOW AND ICE", "SNOW/ICE STORM") ~ "ICE STORM",
EVTYPE %in% c("LAKE-EFFECT SNOW", "HEAVY LAKE SNOW",
"LAKE EFFECT SNOW") ~ "LAKE-EFFECT SNOW",
EVTYPE %in% c("TORNADO", "TORNADO F0") ~ "TORNADO",
EVTYPE %in% c("DROUGHT", "DROUGHT/EXCESSIVE HEAT", "SNOW DROUGHT") ~ "DROUGHT",
EVTYPE %in% c("DENSE SMOKE", "SMOKE") ~ "DENSE SMOKE",
EVTYPE %in% c("LIGHTNING", "THUNDERSTORM WINDS LIGHTNING") ~ "LIGHTNING",
EVTYPE == "BLIZZARD" ~ "BLIZZARD",
EVTYPE == "DEBRIS FLOW" ~ "DEBRIS FLOW",
EVTYPE == "DUST DEVIL" ~ "DUST DEVIL",
EVTYPE == "DUST STORM" ~ "DUST STORM",
EVTYPE == "FREEZING FOG" ~ "FREEZING FOG",
EVTYPE == "LAKESHORE FLOOD" ~ "LAKESHORE FLOOD",
EVTYPE == "MARINE HAIL" ~ "MARINE HAIL",
EVTYPE == "MARINE HIGH WIND" ~ "MARINE HIGH WIND",
EVTYPE == "MARINE STRONG WIND" ~ "MARINE STRONG WIND",
EVTYPE == "SEICHE" ~ "SEICHE",
EVTYPE == "TROPICAL DEPRESSION" ~ "TROPICAL DEPRESSION",
EVTYPE == "TROPICAL STORM" ~ "TROPICAL STORM",
EVTYPE == "TSUNAMI" ~ "TSUNAMI",
EVTYPE == "VOLCANIC ASH" ~ "VOLCANIC ASH",
EVTYPE == "WINTER STORM" ~ "WINTER STORM",
TRUE ~ "OTHER"))
Although this simplification may bias the results, it should be enough to get an idea of the data (and to show how to achieve the main aims of the analysis).
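The same kind of recoding can also be written as a lookup table, which may be easier to maintain than a long `case_when()`. This is only a sketch: the `lookup` vector is a small hypothetical excerpt of the mapping, and `raw` is a made-up input.

```r
# A small, hypothetical excerpt of the mapping (raw label -> official type)
lookup <- c("TSTM WIND"          = "THUNDERSTORM WIND",
            "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
            "FLASH FLOODING"     = "FLASH FLOOD",
            "HAIL"               = "HAIL")

raw <- c("TSTM WIND", "HAIL", "FLASH FLOODING", "SOMETHING ELSE")

# Index the named vector by raw label; labels without a match come back as NA
event <- unname(lookup[raw])

# Mirror the catch-all branch: anything unmapped becomes "OTHER"
event[is.na(event)] <- "OTHER"
event
```

With the mapping stored as data, it could also live in a separate CSV file and be reviewed independently of the code.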
The clean dataset now looks like this, and is ready to help us with our questions.
head(top_df)
## # A tibble: 6 x 12
## EVTYPE F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO REFNUM event
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 TORNADO 3 0 0 15 25 K 0 <NA> <NA> 1 TORNADO
## 2 TORNADO 2 0 0 0 2.5 K 0 <NA> <NA> 2 TORNADO
## 3 TORNADO 2 0 0 2 25 K 0 <NA> <NA> 3 TORNADO
## 4 TORNADO 2 0 0 2 2.5 K 0 <NA> <NA> 4 TORNADO
## 5 TORNADO 2 0 0 2 2.5 K 0 <NA> <NA> 5 TORNADO
## 6 TORNADO 2 0 0 6 2.5 K 0 <NA> <NA> 6 TORNADO
To find which event types are most harmful to population health, we want to consider both the FATALITIES and INJURIES variables.
top_fatal <- top_df %>%
group_by(event) %>%
summarise(tot_fatal = sum(FATALITIES)) %>%
arrange(desc(tot_fatal)) %>%
top_n(5, tot_fatal)
top_fatal
## # A tibble: 5 x 2
## event tot_fatal
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 2173
## 3 FLASH FLOOD 1018
## 4 HEAT 977
## 5 LIGHTNING 816
top_injur <- top_df %>%
group_by(event) %>%
summarise(tot_injur = sum(INJURIES)) %>%
arrange(desc(tot_injur)) %>%
top_n(5, tot_injur)
top_injur
## # A tibble: 5 x 2
## event tot_injur
## <chr> <dbl>
## 1 TORNADO 91346
## 2 THUNDERSTORM WIND 9480
## 3 EXCESSIVE HEAT 7039
## 4 FLOOD 6887
## 5 LIGHTNING 5230
fatal_plot <- top_fatal %>%
ggplot(aes(x = reorder(event, tot_fatal), y = tot_fatal, fill = event)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
guides(fill = "none") +
theme_bw() +
labs(x = "Event", y = "Fatalities", title = "Top 5 events by fatalities",
subtitle = "Tornadoes account for the highest number of fatalities.")
injur_plot <- top_injur %>%
ggplot(aes(x = reorder(event, tot_injur), y = tot_injur, fill = event)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Paired") +
guides(fill = "none") +
theme_bw() +
labs(x = "Event", y = "Injuries", title = "Top 5 events by injuries",
subtitle = "Tornadoes account for the highest number of injuries.")
grid.arrange(fatal_plot, injur_plot, nrow = 2)
From this analysis, we can see that tornadoes are the most harmful event type in terms of population health, with 5633 fatalities and 91346 injuries; they rank first on both measures.
These plots also highlight how dangerous tornadoes are: their totals far exceed those of all other event types, particularly for injuries.
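The two rankings are computed separately here; as a sketch, both totals can also be obtained in a single pass and inspected side by side. The `toy` data frame below is made up for the example.

```r
library(dplyr)

# Hypothetical events with their casualty counts
toy <- data.frame(event = c("TORNADO", "TORNADO", "HEAT", "FLOOD"),
                  FATALITIES = c(2, 1, 4, 0),
                  INJURIES = c(10, 5, 1, 3),
                  stringsAsFactors = FALSE)

# One grouped summary yields both totals per event type
health <- toy %>%
  group_by(event) %>%
  summarise(tot_fatal = sum(FATALITIES),
            tot_injur = sum(INJURIES)) %>%
  arrange(desc(tot_fatal))
health
```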
To find which event types have the greatest economic consequences, we have to look at the PROPDMG and CROPDMG variables. Note that each has a companion variable, PROPDMGEXP and CROPDMGEXP respectively, which specifies the magnitude of the given value. From the data documentation:
[Damage] Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
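To make the encoding concrete, here is a small sketch of the conversion for the three documented codes only. The `to_dollars()` helper and the `multiplier` vector are hypothetical names introduced for illustration; any undocumented code would yield `NA` with this helper.

```r
# Multipliers for the three documented magnitude codes
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a value and its magnitude code to dollars
to_dollars <- function(value, code) value * unname(multiplier[toupper(code)])

documented_example <- to_dollars(1.55, "B")  # the documentation's own example, 1.55B
thousands_example <- to_dollars(25, "K")
format(c(documented_example, thousands_example), big.mark = ",", scientific = FALSE)
```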
Let’s check that everything’s alright with the PROPDMGEXP and CROPDMGEXP variables.
top_df %>%
count(PROPDMGEXP)
## # A tibble: 18 x 2
## PROPDMGEXP n
## <chr> <int>
## 1 - 1
## 2 ? 8
## 3 + 3
## 4 0 215
## 5 1 25
## 6 2 13
## 7 3 4
## 8 4 4
## 9 5 27
## 10 6 4
## 11 7 5
## 12 8 1
## 13 B 37
## 14 H 6
## 15 K 424390
## 16 m 7
## 17 M 11269
## 18 <NA> 464889
top_df %>%
count(CROPDMGEXP)
## # A tibble: 9 x 2
## CROPDMGEXP n
## <chr> <int>
## 1 ? 6
## 2 0 18
## 3 2 1
## 4 B 9
## 5 k 21
## 6 K 281783
## 7 m 1
## 8 M 1963
## 9 <NA> 617106
I honestly have no idea what the other values, such as H, + or the digits, might mean, so I will not try to interpret them: every value other than M or B (including missing ones) is treated as thousands (i.e. K), since this is the lowest documented unit.
We should first convert each pair of variables into prop_dmg and crop_dmg columns holding the actual dollar amounts. Since the question does not focus on one particular type of damage, I assume both contribute equally to the economic consequences, so I sum them to obtain a total_dmg column.
econ_df <- top_df %>%
mutate(prop_dmg = case_when(toupper(PROPDMGEXP) == "M" ~ PROPDMG * 1000000,
toupper(PROPDMGEXP) == "B" ~ PROPDMG * 1000000000,
TRUE ~ PROPDMG * 1000),
crop_dmg = case_when(toupper(CROPDMGEXP) == "M" ~ CROPDMG * 1000000,
toupper(CROPDMGEXP) == "B" ~ CROPDMG * 1000000000,
TRUE ~ CROPDMG * 1000),
total_dmg = prop_dmg + crop_dmg) %>%
select(event, prop_dmg, crop_dmg, total_dmg)
econ_df
## # A tibble: 900,908 x 4
## event prop_dmg crop_dmg total_dmg
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 25000 0 25000
## 2 TORNADO 2500 0 2500
## 3 TORNADO 25000 0 25000
## 4 TORNADO 2500 0 2500
## 5 TORNADO 2500 0 2500
## 6 TORNADO 2500 0 2500
## 7 TORNADO 2500 0 2500
## 8 TORNADO 2500 0 2500
## 9 TORNADO 25000 0 25000
## 10 TORNADO 25000 0 25000
## # ... with 900,898 more rows
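It is worth checking how the fallback branch behaves for missing or unexpected magnitude codes. In the sketch below (made-up rows, same `case_when()` logic for property damage), the comparison with `NA` evaluates to `NA` rather than `TRUE`, so those rows reach the final `TRUE` branch and are scaled as thousands.

```r
library(dplyr)

# Hypothetical rows: a documented code, a lower-case code, a missing code, a digit
toy <- data.frame(PROPDMG = c(2.5, 1, 3, 4),
                  PROPDMGEXP = c("K", "m", NA, "5"),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(prop_dmg = case_when(toupper(PROPDMGEXP) == "M" ~ PROPDMG * 1e6,
                              toupper(PROPDMGEXP) == "B" ~ PROPDMG * 1e9,
                              TRUE ~ PROPDMG * 1e3))  # K, NA and anything else
toy
```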
Now we can finally look for the event type with the greatest economic impact.
top_damage <- econ_df %>%
group_by(event) %>%
summarise(event_dmg = sum(total_dmg)) %>%
arrange(desc(event_dmg)) %>%
top_n(10, event_dmg)
top_damage
## # A tibble: 10 x 2
## event event_dmg
## <chr> <dbl>
## 1 FLOOD 160956469850
## 2 HURRICANE (TYPHOON) 90710952810
## 3 TORNADO 57352655690
## 4 STORM SURGE/TIDE 47965594500
## 5 HAIL 18789023370
## 6 FLASH FLOOD 18169283860
## 7 DROUGHT 15018677780
## 8 THUNDERSTORM WIND 12217487730
## 9 ICE STORM 8967191360
## 10 TROPICAL STORM 8382236550
top_damage %>%
ggplot(aes(x = reorder(event, event_dmg), y = event_dmg, fill = event)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Set3") +
scale_y_continuous(breaks = c(0, 50e9, 100e9, 150e9), labels = c("0", "50B", "100B", "150B")) +
guides(fill = "none") +
theme_bw() +
labs(x = "Event", y = "Estimated damage ($)", title = "Top 10 events by economic damage",
subtitle = "Floods are responsible for the greatest economic consequences.")
From our analysis, floods had the greatest economic consequences, causing about $161 billion in damage to property and crops, followed by hurricanes and tornadoes.
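The axis labels in the plot are written by hand. As a sketch, a small formatting function can produce them from the values instead; `fmt_billions()` is a hypothetical helper, not part of any package, and it labels zero as "0B" rather than "0".

```r
# Hypothetical helper: express dollar amounts in whole billions
fmt_billions <- function(x) paste0(round(x / 1e9), "B")

axis_labels <- fmt_billions(c(0, 50e9, 100e9, 150e9))
flood_label <- fmt_billions(160956469850)  # the flood total from the table
axis_labels
flood_label
```

Since the `labels` argument of `scale_y_continuous()` also accepts a function, a helper like this could be passed directly as `labels = fmt_billions`.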
If we’re curious about separating property damage from crop damage, we can easily split our analysis.
top_split_damage <- econ_df %>%
group_by(event) %>%
summarise(tot_prop = sum(prop_dmg),
tot_crop = sum(crop_dmg))
prop_plot <- top_split_damage %>%
arrange(desc(tot_prop)) %>%
top_n(5, tot_prop) %>%
ggplot(aes(x = reorder(event, tot_prop), y = tot_prop, fill = event)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(breaks = c(0, 50e9, 100e9, 150e9), labels = c("0", "50B", "100B", "150B")) +
guides(fill = "none") +
theme_bw() +
labs(x = "Event", y = "Estimated damage ($)", title = "Top 5 events by property damage",
subtitle = "Floods account for the highest amount of damage caused to properties.")
crop_plot <- top_split_damage %>%
arrange(desc(tot_crop)) %>%
top_n(5, tot_crop) %>%
ggplot(aes(x = reorder(event, tot_crop), y = tot_crop, fill = event)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Paired") +
scale_y_continuous(breaks = c(0, 5e9, 10e9, 15e9), labels = c("0", "5B", "10B", "15B")) +
guides(fill = "none") +
theme_bw() +
labs(x = "Event", y = "Estimated damage ($)", title = "Top 5 events by crop damage",
subtitle = "Droughts account for the highest amount of damage caused to crops.")
grid.arrange(prop_plot, crop_plot, nrow = 2)
Interestingly, droughts cause the most damage to crops, although the amounts are about ten times lower than the property damage caused by floods.
This data analysis was performed as an assignment for the Reproducible Research course, which is part of the Data Science Specialization on Coursera. It is meant to be a toy analysis for practicing key concepts of reproducible data analysis.
No conclusions should be drawn from what is reported here.
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 gridExtra_2.3 magrittr_1.5 skimr_1.0.3 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6
## [8] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.4 haven_1.1.2 lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 yaml_2.2.0
## [7] utf8_1.1.4 rlang_0.2.2 pillar_1.3.0 glue_1.3.0 withr_2.1.2 RColorBrewer_1.1-2
## [13] modelr_0.1.2 readxl_1.1.0 bindr_0.1.1 plyr_1.8.4 munsell_0.5.0 gtable_0.2.0
## [19] cellranger_1.1.0 rvest_0.3.2 evaluate_0.11 labeling_0.3 knitr_1.20 fansi_0.3.0
## [25] broom_0.5.0 Rcpp_0.12.18 scales_1.0.0 backports_1.1.2 jsonlite_1.5 hms_0.4.2
## [31] digest_0.6.17 stringi_1.2.4 grid_3.5.1 rprojroot_1.3-2 cli_1.0.0 tools_3.5.1
## [37] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.2 xml2_1.2.0 lubridate_1.7.4 assertthat_0.2.0
## [43] rmarkdown_1.10 httr_1.3.1 rstudioapi_0.7 R6_2.2.2 nlme_3.1-137 compiler_3.5.1