The basic goal of this analysis is to explore the NOAA Storm Database and answer some basic questions about severe weather events. There are two main questions we address in the analysis. First of all, we explore the type of events that are the most harmful with respect to population health, across the United States. Then, we mesure the economic consequences of these events.
Before doing any analysis, we start by checking our workspace and loading the necessary packages for the analysis.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.24 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
## [5] purrr_0.3.2 readr_1.3.1 tidyr_0.8.3 tibble_2.1.3
## [9] ggplot2_3.2.1 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2 compiler_3.6.0
## [5] tools_3.6.0 zeallot_0.1.0 digest_0.6.20 lubridate_1.7.4
## [9] jsonlite_1.6 evaluate_0.14 nlme_3.1-139 gtable_0.3.0
## [13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0
## [17] rstudioapi_0.10 yaml_2.2.0 haven_2.1.1 xfun_0.9
## [21] withr_2.1.2 xml2_1.2.2 httr_1.4.1 vctrs_0.2.0
## [25] hms_0.5.1 generics_0.0.2 grid_3.6.0 tidyselect_0.2.5
## [29] glue_1.3.1 R6_2.4.0 readxl_1.3.1 rmarkdown_1.15.1
## [33] modelr_0.1.5 magrittr_1.5 backports_1.1.4 scales_1.0.0
## [37] htmltools_0.3.6 rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1
## [41] stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0 broom_0.5.2
## [45] crayon_1.3.4
ls(); rm(list = ls()) # see if there's any object in the workspace and delete it if any.
## character(0)
library(tidyverse) # for data wrangling and analysis
library(rmarkdown) # create markdown file
library(knitr) # to compile the R markdown document and convert it to HTML
We may now download the raw data.
if(!file.exists("data")) {
dir.create("data")
}
fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "data//storm.csv.bz2"
download.file(fileurl, destfile)
dateDownloaded <- date()
Before reading the data into our software, we perform a rough estimate of the memory it will take. We use the following formula : (ncolnrow8) / 2^20 Since our data has 37 variables and 902,297 observations, we can compute the estimatte memory required.
(37*902297*8) / 2^20
## [1] 254.7073
It takes around 255Mb of memory and we have enough memory to read it.
Now that the data has been downloaded and we assured we have enough memory, we can load it.
storm <- read.csv("data//storm.csv.bz2") # read the data
str(storm); head(storm) # display some information about the dataset
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
One of the first things we can do with our dataframe is check if there’s any missing values. Depending on where there are located in the dataframe, we may keep them or simply delete them.
We compute the total missing values in our dataset.
missing_values <- sum(is.na(storm))
missing_values
## [1] 1745947
Once we have the total number of missing values, we may want to print the variables containing them.
for (Var in names(storm)) {
missing <- sum(is.na(storm[,Var]))
if (missing > 0) {
print(c(Var,missing))
}
}
## [1] "COUNTYENDN" "902297"
## [1] "F" "843563"
## [1] "LATITUDE" "47"
## [1] "LATITUDE_E" "40"
All the missing values are located in 4 of the 37 variables. I have first tried to simply delete all the rows with missing values in the dataset but it returned an empty dataframe, which means there is not a single row without missing values.
This is confirmed by the following code which returns an empty vector.
storm_clean <- storm %>% drop_na()
storm_clean
## [1] STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## [7] STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## [13] END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## [19] LENGTH WIDTH F MAG FATALITIES INJURIES
## [25] PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## [31] ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## [37] REFNUM
## <0 rows> (or 0-length row.names)
The 4 variables containing the missing values are : COUNTYENDN, LATITUDE_E, LATITUDE and F.
The COUNTYENDN variable has only missing values and represents 52% of all missing values. This is confirmed by the following.
length(storm$COUNTYENDN) == sum(is.na(storm$COUNTYENDN)) # returns "TRUE"
## [1] TRUE
sum(is.na(storm$COUNTYENDN)) / missing_values # ratio of COUNTYENDN missing values
## [1] 0.5167952
Therefore, we can simply delete it.
storm$COUNTYENDN <- NULL
The F variable describes the strength of the tornado based on the amount and type of damage caused by the tornado. It contains 48% of all missing values.
sum(is.na(storm$F)) / missing_values # ratio of "F" missing values
## [1] 0.483155
The F variable provides information about only the tornado variable and the information provided is not useful for our anaylis. We will then also delete it.
storm$F <- NULL
dim(storm)
## [1] 902297 35
We still have 2 variables with missing values, LATITUDE and LATITUTE_E. Those two variables have nearly the same values. We can keep one the them and delete the other. We decide to keep the LATITUDE variable.
While exploring the LATITUDE variable, we note that LONGITUDE and LONGITUDE_ variables also have nearly the same values. We drop one of them and keep the LONGITUDE variable.
table(near(storm$LATITUDE, storm$LATITUDE_E))
##
## FALSE TRUE
## 401118 501132
table(near(storm$LONGITUDE, storm$LONGITUDE_))
##
## FALSE TRUE
## 409854 492443
storm$LATITUDE_E <- NULL
storm$LONGITUDE_ <- NULL
4 variables have been deleted. We now have 33 variables, with only one of them containing missing values
We check the amount of missing malues remaining in our dataframe.
missing <- sum(is.na(storm))
missing
## [1] 47
We have 47 missing values, and want to print the variables with missing values.
for (Var in names(storm)) {
missing <- sum(is.na(storm[,Var]))
if (missing > 0) {
print(c(Var,missing))
}
}
## [1] "LATITUDE" "47"
We can compute the ratio of missing values in our dataframe.
missing / nrow(storm)
## [1] 0
47/902297
## [1] 5.208928e-05
Since the value is closed to zero, we don’t need to delete it.
After processing the data, we can start analysing it and answering the different questions.
QUESTION 1 : ACROSS THE UNITED STATES, WHICH TYPES OF EVENTS (AS INDICATED IN THE EVTYPE VARIABLE) ARE MOST HARMFUL WITH RESPECT TO POPULATION HEALTH?
The data has now 33 variables, but we want to select only a subset of the variables we’ll need for the analysis. We had to look up on the web to find out what the variables in the dataset meant. We found a codebook describing each variable and decided to download it and include it in the analysis’ files.
codebook <- "https://adata.site.wesleyan.edu/files/2017/08/Storm_Event_Data_Codebook.pdf"
download.file(codebook, destfile = "data//codebook.pdf")
downloaded_pdf <- date()
After reading the codebook, We decided that the variables useful for answering the first question are EVTYPE, INJURIES and FATALITIES. We group the dataframe by the types of events, then calculate the total of injuries and fatalities associated with each event. After that, We create a new variable which will sum the total of fatalities and injuries. Since we are only interested by the most harmful events, we keep the values greater than zero. Finally we order the data by the most harmful events.
harmful <-storm %>%
group_by(EVTYPE) %>%
summarise(injuries = sum(INJURIES), fatalities = sum(FATALITIES), total_harm = sum(injuries, fatalities)) %>%
filter(injuries > 0, fatalities > 0) %>%
arrange(desc(total_harm))
str(harmful); head(harmful)
## Classes 'tbl_df', 'tbl' and 'data.frame': 106 obs. of 4 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 130 856 170 464 275 153 427 760 972 ...
## $ injuries : num 91346 6525 6957 6789 5230 ...
## $ fatalities: num 5633 1903 504 470 816 ...
## $ total_harm: num 96979 8428 7461 7259 6046 ...
## # A tibble: 6 x 4
## EVTYPE injuries fatalities total_harm
## <fct> <dbl> <dbl> <dbl>
## 1 TORNADO 91346 5633 96979
## 2 EXCESSIVE HEAT 6525 1903 8428
## 3 TSTM WIND 6957 504 7461
## 4 FLOOD 6789 470 7259
## 5 LIGHTNING 5230 816 6046
## 6 HEAT 2100 937 3037
To have a better sense of the results we just got, we decide to visualise it with a bar graph. We use ggplot2 package loaded with tidyverse at the beginning of the analysis.
There are 106 events in the dataframe, and We’ll only plot a small fraction of them, ie the top 6.
most_harmful <- head(harmful)
most_harmful
## # A tibble: 6 x 4
## EVTYPE injuries fatalities total_harm
## <fct> <dbl> <dbl> <dbl>
## 1 TORNADO 91346 5633 96979
## 2 EXCESSIVE HEAT 6525 1903 8428
## 3 TSTM WIND 6957 504 7461
## 4 FLOOD 6789 470 7259
## 5 LIGHTNING 5230 816 6046
## 6 HEAT 2100 937 3037
We plot the figure.
ggplot(most_harmful, aes(EVTYPE, total_harm)) + geom_col(aes(fill = EVTYPE)) +
xlab("Event Type") +
ylab("Injuries and Fatalities") +
ggtitle("Most Harmful Events")
As the plot shows, tornado is by far the most harmful event to population health. This information can also be obtained by computing the percentage of the total harm due to tornado
torn_harm_percent <- harmful[1,4] / sum(harmful$total_harm)
torn_harm_percent
## total_harm
## 1 0.6252071
QUESTION 2 : ACROSS THE UNITED STATES, WHICH TYPES OF EVENTS HAVE THE GREATEST ECONOMIC CONSEQUENCES?
We create a subset of the dataframe with the variables relative to economic impact. We will refer to the codebook in order to do that.
We start by printing the names of the dataframe.
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "END_RANGE"
## [16] "END_AZI" "END_LOCATI" "LENGTH" "WIDTH" "MAG"
## [21] "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP" "CROPDMG"
## [26] "CROPDMGEXP" "WFO" "STATEOFFIC" "ZONENAMES" "LATITUDE"
## [31] "LONGITUDE" "REMARKS" "REFNUM"
There are 4 variables relative to economic impact : PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. We will create a new dataframe with these four variables plus the EVTYPE variable. Then we’ll filter it by damages greater than 0.
economic <- storm %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
filter(PROPDMG > 0 | CROPDMG > 0)
str(economic); summary(economic)
## 'data.frame': 245031 obs. of 5 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## EVTYPE PROPDMG PROPDMGEXP
## TSTM WIND :61475 Min. : 0.00 K :229057
## THUNDERSTORM WIND :43465 1st Qu.: 2.00 M : 11319
## TORNADO :39361 Median : 8.00 : 4357
## HAIL :25969 Mean : 44.42 0 : 209
## FLASH FLOOD :20659 3rd Qu.: 25.00 B : 40
## THUNDERSTORM WINDS:12005 Max. :5000.00 5 : 18
## (Other) :42097 (Other): 31
## CROPDMG CROPDMGEXP
## Min. : 0.000 :145037
## 1st Qu.: 0.000 K : 97960
## Median : 0.000 M : 1982
## Mean : 5.623 k : 21
## 3rd Qu.: 0.000 0 : 17
## Max. :990.000 B : 7
## (Other): 7
We will create two dataframes with the two economic consequences. First, property damages, second crop damages.
In order to comprehend the variables in our newly created dataframe, we refer to the documentation of the database. We download the documentation and include it in the files analysis
storm_doc <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(storm_doc, destfile = "data//documentation.pdf")
downloaded_doc <- date()
In the Storm Data Documentation (page 12), there’s found some useful information about the notation associated with the damage variables (property and crop).
From the documentation : “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions."
Therefore, we will filter the rows in the DMGEXP variables according to this information.
We start by the analysis of property damages
property <- economic %>%
select(EVTYPE, PROPDMG, PROPDMGEXP) %>%
filter(PROPDMGEXP %in% c ("K", "k", "M", "m", "B", "b")) %>%
filter(PROPDMG > 0)
str(property); summary(property)
## 'data.frame': 238847 obs. of 3 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## EVTYPE PROPDMG PROPDMGEXP
## TSTM WIND :60486 Min. : 0.01 K :227481
## THUNDERSTORM WIND :43265 1st Qu.: 2.50 M : 11319
## TORNADO :39024 Median : 10.00 B : 40
## HAIL :23027 Mean : 45.54 m : 7
## FLASH FLOOD :20535 3rd Qu.: 25.00 : 0
## THUNDERSTORM WINDS:11692 Max. :5000.00 - : 0
## (Other) :40818 (Other): 0
After filtering the dataframe, we reshape the values in the PROPDMG variable taking in account the notation in the PROPDMGEXP variable.
property <- property %>%
mutate(propdmg = ifelse(PROPDMGEXP == "B" | PROPDMGEXP == "b", PROPDMG * 1000000000, PROPDMG)) %>%
mutate(propdmg1 = ifelse(PROPDMGEXP == "M" | PROPDMGEXP == "m", propdmg * 1000000, propdmg)) %>%
mutate(propdmg2 = ifelse(PROPDMGEXP == "K" | PROPDMGEXP == "k", propdmg1 * 1000, propdmg1)) %>%
rename(PROPDMG = propdmg2)
We drop the redundant columns.
property$PROPDMG <- NULL
property$propdmg <- NULL
property$propdmg1 <- NULL
Now we group the data by the types of events and compute the sum of damages associated with each event.
property_damage <- property %>%
group_by(EVTYPE) %>%
summarise(damages = sum(PROPDMG)) %>%
arrange(desc(damages))
head(property_damage)
## # A tibble: 6 x 2
## EVTYPE damages
## <fct> <dbl>
## 1 FLOOD 144657709800
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56937160480
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16140811510
## 6 HAIL 15732266720
We substract the 6 events with the greatest impact and plot the properties wich suffered the greatest economic consequences. We store the plot in a PNG file.
sub_property <- head(property_damage)
ggplot(sub_property, aes(EVTYPE, damages)) + geom_col(aes(fill = EVTYPE)) +
xlab("Event Type") +
ylab("Property Damages") +
ggtitle("Property Damages")
The graphic shows that flood is event having the biggest impact on properties, followed by hurricanes/typhoons and tornadoes.
After analysing the property damages, we can now do the same thing for the crop damages. I’ll follow the same steps that for the property analysis.
crop <- economic %>%
select(EVTYPE, CROPDMG, CROPDMGEXP) %>%
filter(CROPDMGEXP %in% c ("K", "k", "M", "m", "B", "b")) %>%
filter(CROPDMG > 0)
str(crop); summary(crop)
## 'data.frame': 22084 obs. of 3 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 410 786 406 409 409 786 786 834 834 812 ...
## $ CROPDMG : num 10 500 1 4 10 50 50 5 50 15 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 9 7 9 9 8 7 7 7 7 7 ...
## EVTYPE CROPDMG CROPDMGEXP
## HAIL :9373 Min. : 0.01 K :20137
## TSTM WIND :3447 1st Qu.: 5.00 M : 1918
## FLASH FLOOD :2112 Median : 10.00 k : 21
## FLOOD :1759 Mean : 62.38 B : 7
## TORNADO :1493 3rd Qu.: 50.00 m : 1
## THUNDERSTORM WIND: 946 Max. :990.00 : 0
## (Other) :2954 (Other): 0
After filtering the dataframe, we reshape the values in the CROPDMG variable taking in account the notation in the CROPDMGEXP variable.
crop <- crop %>%
mutate(cropdmg = ifelse(CROPDMGEXP == "B" | CROPDMGEXP == "b", CROPDMG * 1000000000, CROPDMG)) %>%
mutate(cropdmg1 = ifelse(CROPDMGEXP == "M" | CROPDMGEXP == "m", cropdmg * 1000000, cropdmg)) %>%
mutate(cropdmg2 = ifelse(CROPDMGEXP == "K" | CROPDMGEXP == "k", cropdmg1 * 1000, cropdmg1)) %>%
rename(CROPDMG = cropdmg2)
We drop the redundant columns.
crop$CROPDMG <- NULL
crop$cropdmg <- NULL
crop$cropdmg1 <- NULL
We group the data by the types of events and compute the sum of damages associated with each event.
cropdamage <- crop %>%
group_by(EVTYPE) %>%
summarise(damage = sum(CROPDMG)) %>%
arrange(desc(damage))
head(cropdamage)
## # A tibble: 6 x 2
## EVTYPE damage
## <fct> <dbl>
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954450
## 6 HURRICANE 2741910000
We substract the 6 events with the greatest impact and plot the crops wich suffered the greatest economic consequences.
sub2 <- head(cropdamage)
ggplot(sub2, aes(EVTYPE, damage)) + geom_col(aes(fill = EVTYPE)) +
xlab("Event Type") +
ylab("Crop Damages") +
ggtitle("Crop Damages")
Drought is by the far the most impactful event on the crops as it’s showed in the plot.