The basic goal of this analysis is to explore the NOAA Storm Database and answer some basic questions about severe weather events. There are two main questions we address in the analysis. First of all, we explore the type of events that are the most harmful with respect to population health, across the United States. Then, we mesure the economic consequences of these events.

DATA PROCESSING

Before doing any analysis, we start by checking our workspace and loading the necessary packages for the analysis.

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
## [5] LC_TIME=French_France.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.24      forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
##  [5] purrr_0.3.2     readr_1.3.1     tidyr_0.8.3     tibble_2.1.3   
##  [9] ggplot2_3.2.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       cellranger_1.1.0 pillar_1.4.2     compiler_3.6.0  
##  [5] tools_3.6.0      zeallot_0.1.0    digest_0.6.20    lubridate_1.7.4 
##  [9] jsonlite_1.6     evaluate_0.14    nlme_3.1-139     gtable_0.3.0    
## [13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.4.0      cli_1.1.0       
## [17] rstudioapi_0.10  yaml_2.2.0       haven_2.1.1      xfun_0.9        
## [21] withr_2.1.2      xml2_1.2.2       httr_1.4.1       vctrs_0.2.0     
## [25] hms_0.5.1        generics_0.0.2   grid_3.6.0       tidyselect_0.2.5
## [29] glue_1.3.1       R6_2.4.0         readxl_1.3.1     rmarkdown_1.15.1
## [33] modelr_0.1.5     magrittr_1.5     backports_1.1.4  scales_1.0.0    
## [37] htmltools_0.3.6  rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1
## [41] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2     
## [45] crayon_1.3.4
ls(); rm(list = ls()) # see if there's any object in the workspace and delete it if any.
## character(0)
library(tidyverse) # for data wrangling and analysis
library(rmarkdown) # create markdown file
library(knitr) # to compile the R markdown document and convert it to HTML

We may now download the raw data.

if(!file.exists("data")) {
  dir.create("data")
}

fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "data//storm.csv.bz2"
download.file(fileurl, destfile)
dateDownloaded <- date()

Before reading the data into our software, we perform a rough estimate of the memory it will take. We use the following formula : (ncolnrow8) / 2^20 Since our data has 37 variables and 902,297 observations, we can compute the estimatte memory required.

(37*902297*8) / 2^20
## [1] 254.7073

It takes around 255Mb of memory and we have enough memory to read it.

Now that the data has been downloaded and we assured we have enough memory, we can load it.

storm <- read.csv("data//storm.csv.bz2") # read the data
str(storm); head(storm) # display some information about the dataset
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

One of the first things we can do with our dataframe is check if there’s any missing values. Depending on where there are located in the dataframe, we may keep them or simply delete them.

We compute the total missing values in our dataset.

missing_values <- sum(is.na(storm))
missing_values
## [1] 1745947

Once we have the total number of missing values, we may want to print the variables containing them.

for (Var in names(storm)) {
    missing <- sum(is.na(storm[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}
## [1] "COUNTYENDN" "902297"    
## [1] "F"      "843563"
## [1] "LATITUDE" "47"      
## [1] "LATITUDE_E" "40"

All the missing values are located in 4 of the 37 variables. I have first tried to simply delete all the rows with missing values in the dataset but it returned an empty dataframe, which means there is not a single row without missing values.

This is confirmed by the following code which returns an empty vector.

storm_clean <- storm  %>% drop_na()
storm_clean
##  [1] STATE__    BGN_DATE   BGN_TIME   TIME_ZONE  COUNTY     COUNTYNAME
##  [7] STATE      EVTYPE     BGN_RANGE  BGN_AZI    BGN_LOCATI END_DATE  
## [13] END_TIME   COUNTY_END COUNTYENDN END_RANGE  END_AZI    END_LOCATI
## [19] LENGTH     WIDTH      F          MAG        FATALITIES INJURIES  
## [25] PROPDMG    PROPDMGEXP CROPDMG    CROPDMGEXP WFO        STATEOFFIC
## [31] ZONENAMES  LATITUDE   LONGITUDE  LATITUDE_E LONGITUDE_ REMARKS   
## [37] REFNUM    
## <0 rows> (or 0-length row.names)

The 4 variables containing the missing values are : COUNTYENDN, LATITUDE_E, LATITUDE and F.

The COUNTYENDN variable has only missing values and represents 52% of all missing values. This is confirmed by the following.

length(storm$COUNTYENDN) == sum(is.na(storm$COUNTYENDN)) # returns "TRUE"
## [1] TRUE
sum(is.na(storm$COUNTYENDN)) / missing_values # ratio of COUNTYENDN missing values
## [1] 0.5167952

Therefore, we can simply delete it.

storm$COUNTYENDN <- NULL

The F variable describes the strength of the tornado based on the amount and type of damage caused by the tornado. It contains 48% of all missing values.

sum(is.na(storm$F)) / missing_values # ratio of "F" missing values
## [1] 0.483155

The F variable provides information about only the tornado variable and the information provided is not useful for our anaylis. We will then also delete it.

storm$F <- NULL
dim(storm)
## [1] 902297     35

We still have 2 variables with missing values, LATITUDE and LATITUTE_E. Those two variables have nearly the same values. We can keep one the them and delete the other. We decide to keep the LATITUDE variable.

While exploring the LATITUDE variable, we note that LONGITUDE and LONGITUDE_ variables also have nearly the same values. We drop one of them and keep the LONGITUDE variable.

table(near(storm$LATITUDE, storm$LATITUDE_E))
## 
##  FALSE   TRUE 
## 401118 501132
table(near(storm$LONGITUDE, storm$LONGITUDE_))
## 
##  FALSE   TRUE 
## 409854 492443
storm$LATITUDE_E <- NULL
storm$LONGITUDE_ <- NULL

4 variables have been deleted. We now have 33 variables, with only one of them containing missing values

We check the amount of missing malues remaining in our dataframe.

missing <- sum(is.na(storm))
missing
## [1] 47

We have 47 missing values, and want to print the variables with missing values.

for (Var in names(storm)) {
  missing <- sum(is.na(storm[,Var]))
  if (missing > 0) {
    print(c(Var,missing))
  }
}
## [1] "LATITUDE" "47"

We can compute the ratio of missing values in our dataframe.

missing / nrow(storm) 
## [1] 0
47/902297
## [1] 5.208928e-05

Since the value is closed to zero, we don’t need to delete it.

RESULTS

After processing the data, we can start analysing it and answering the different questions.

QUESTION 1 : ACROSS THE UNITED STATES, WHICH TYPES OF EVENTS (AS INDICATED IN THE EVTYPE VARIABLE) ARE MOST HARMFUL WITH RESPECT TO POPULATION HEALTH?

The data has now 33 variables, but we want to select only a subset of the variables we’ll need for the analysis. We had to look up on the web to find out what the variables in the dataset meant. We found a codebook describing each variable and decided to download it and include it in the analysis’ files.

codebook <- "https://adata.site.wesleyan.edu/files/2017/08/Storm_Event_Data_Codebook.pdf"
download.file(codebook, destfile = "data//codebook.pdf")
downloaded_pdf <- date()

After reading the codebook, We decided that the variables useful for answering the first question are EVTYPE, INJURIES and FATALITIES. We group the dataframe by the types of events, then calculate the total of injuries and fatalities associated with each event. After that, We create a new variable which will sum the total of fatalities and injuries. Since we are only interested by the most harmful events, we keep the values greater than zero. Finally we order the data by the most harmful events.

harmful <-storm %>%
  group_by(EVTYPE) %>%
  summarise(injuries = sum(INJURIES), fatalities = sum(FATALITIES), total_harm = sum(injuries, fatalities)) %>%
  filter(injuries > 0, fatalities > 0) %>%
  arrange(desc(total_harm)) 

str(harmful); head(harmful)
## Classes 'tbl_df', 'tbl' and 'data.frame':    106 obs. of  4 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 130 856 170 464 275 153 427 760 972 ...
##  $ injuries  : num  91346 6525 6957 6789 5230 ...
##  $ fatalities: num  5633 1903 504 470 816 ...
##  $ total_harm: num  96979 8428 7461 7259 6046 ...
## # A tibble: 6 x 4
##   EVTYPE         injuries fatalities total_harm
##   <fct>             <dbl>      <dbl>      <dbl>
## 1 TORNADO           91346       5633      96979
## 2 EXCESSIVE HEAT     6525       1903       8428
## 3 TSTM WIND          6957        504       7461
## 4 FLOOD              6789        470       7259
## 5 LIGHTNING          5230        816       6046
## 6 HEAT               2100        937       3037

To have a better sense of the results we just got, we decide to visualise it with a bar graph. We use ggplot2 package loaded with tidyverse at the beginning of the analysis.

There are 106 events in the dataframe, and We’ll only plot a small fraction of them, ie the top 6.

most_harmful <- head(harmful)
most_harmful
## # A tibble: 6 x 4
##   EVTYPE         injuries fatalities total_harm
##   <fct>             <dbl>      <dbl>      <dbl>
## 1 TORNADO           91346       5633      96979
## 2 EXCESSIVE HEAT     6525       1903       8428
## 3 TSTM WIND          6957        504       7461
## 4 FLOOD              6789        470       7259
## 5 LIGHTNING          5230        816       6046
## 6 HEAT               2100        937       3037

We plot the figure.

ggplot(most_harmful, aes(EVTYPE, total_harm)) + geom_col(aes(fill = EVTYPE)) +
  xlab("Event Type") +
  ylab("Injuries and Fatalities") +
  ggtitle("Most Harmful Events")

As the plot shows, tornado is by far the most harmful event to population health. This information can also be obtained by computing the percentage of the total harm due to tornado

torn_harm_percent <- harmful[1,4] / sum(harmful$total_harm)
torn_harm_percent
##   total_harm
## 1  0.6252071

QUESTION 2 : ACROSS THE UNITED STATES, WHICH TYPES OF EVENTS HAVE THE GREATEST ECONOMIC CONSEQUENCES?

We create a subset of the dataframe with the variables relative to economic impact. We will refer to the codebook in order to do that.

We start by printing the names of the dataframe.

names(storm)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "END_RANGE" 
## [16] "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"      "MAG"       
## [21] "FATALITIES" "INJURIES"   "PROPDMG"    "PROPDMGEXP" "CROPDMG"   
## [26] "CROPDMGEXP" "WFO"        "STATEOFFIC" "ZONENAMES"  "LATITUDE"  
## [31] "LONGITUDE"  "REMARKS"    "REFNUM"

There are 4 variables relative to economic impact : PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. We will create a new dataframe with these four variables plus the EVTYPE variable. Then we’ll filter it by damages greater than 0.

economic <- storm %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  filter(PROPDMG > 0 | CROPDMG > 0)
  
str(economic); summary(economic)
## 'data.frame':    245031 obs. of  5 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##                 EVTYPE         PROPDMG          PROPDMGEXP    
##  TSTM WIND         :61475   Min.   :   0.00   K      :229057  
##  THUNDERSTORM WIND :43465   1st Qu.:   2.00   M      : 11319  
##  TORNADO           :39361   Median :   8.00          :  4357  
##  HAIL              :25969   Mean   :  44.42   0      :   209  
##  FLASH FLOOD       :20659   3rd Qu.:  25.00   B      :    40  
##  THUNDERSTORM WINDS:12005   Max.   :5000.00   5      :    18  
##  (Other)           :42097                     (Other):    31  
##     CROPDMG          CROPDMGEXP    
##  Min.   :  0.000          :145037  
##  1st Qu.:  0.000   K      : 97960  
##  Median :  0.000   M      :  1982  
##  Mean   :  5.623   k      :    21  
##  3rd Qu.:  0.000   0      :    17  
##  Max.   :990.000   B      :     7  
##                    (Other):     7

We will create two dataframes with the two economic consequences. First, property damages, second crop damages.

In order to comprehend the variables in our newly created dataframe, we refer to the documentation of the database. We download the documentation and include it in the files analysis

storm_doc <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(storm_doc, destfile = "data//documentation.pdf")
downloaded_doc <- date()

In the Storm Data Documentation (page 12), there’s found some useful information about the notation associated with the damage variables (property and crop).

From the documentation : “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions."

Therefore, we will filter the rows in the DMGEXP variables according to this information.

PROPERTY DAMAGES

We start by the analysis of property damages

property <- economic %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP) %>%
  filter(PROPDMGEXP %in% c ("K", "k", "M", "m", "B", "b")) %>%
  filter(PROPDMG > 0)

str(property); summary(property)
## 'data.frame':    238847 obs. of  3 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##                 EVTYPE         PROPDMG          PROPDMGEXP    
##  TSTM WIND         :60486   Min.   :   0.01   K      :227481  
##  THUNDERSTORM WIND :43265   1st Qu.:   2.50   M      : 11319  
##  TORNADO           :39024   Median :  10.00   B      :    40  
##  HAIL              :23027   Mean   :  45.54   m      :     7  
##  FLASH FLOOD       :20535   3rd Qu.:  25.00          :     0  
##  THUNDERSTORM WINDS:11692   Max.   :5000.00   -      :     0  
##  (Other)           :40818                     (Other):     0

After filtering the dataframe, we reshape the values in the PROPDMG variable taking in account the notation in the PROPDMGEXP variable.

property <- property %>%
  mutate(propdmg = ifelse(PROPDMGEXP == "B"  | PROPDMGEXP == "b", PROPDMG * 1000000000, PROPDMG)) %>%
  mutate(propdmg1 = ifelse(PROPDMGEXP == "M" | PROPDMGEXP == "m", propdmg * 1000000, propdmg)) %>%
  mutate(propdmg2 = ifelse(PROPDMGEXP == "K" | PROPDMGEXP == "k", propdmg1 * 1000, propdmg1)) %>%
  rename(PROPDMG = propdmg2)

We drop the redundant columns.

property$PROPDMG <- NULL
property$propdmg <- NULL
property$propdmg1 <- NULL

Now we group the data by the types of events and compute the sum of damages associated with each event.

property_damage <- property %>%
  group_by(EVTYPE) %>%
  summarise(damages = sum(PROPDMG)) %>%
  arrange(desc(damages))

head(property_damage)
## # A tibble: 6 x 2
##   EVTYPE                 damages
##   <fct>                    <dbl>
## 1 FLOOD             144657709800
## 2 HURRICANE/TYPHOON  69305840000
## 3 TORNADO            56937160480
## 4 STORM SURGE        43323536000
## 5 FLASH FLOOD        16140811510
## 6 HAIL               15732266720

We substract the 6 events with the greatest impact and plot the properties wich suffered the greatest economic consequences. We store the plot in a PNG file.

sub_property <- head(property_damage)

ggplot(sub_property, aes(EVTYPE, damages)) + geom_col(aes(fill = EVTYPE)) +
  xlab("Event Type") +
  ylab("Property Damages") +
  ggtitle("Property Damages")

The graphic shows that flood is event having the biggest impact on properties, followed by hurricanes/typhoons and tornadoes.

CROP DAMAGES

After analysing the property damages, we can now do the same thing for the crop damages. I’ll follow the same steps that for the property analysis.

crop <- economic %>%
  select(EVTYPE, CROPDMG, CROPDMGEXP) %>%
  filter(CROPDMGEXP %in% c ("K", "k", "M", "m", "B", "b")) %>%
  filter(CROPDMG > 0)

str(crop); summary(crop)
## 'data.frame':    22084 obs. of  3 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 410 786 406 409 409 786 786 834 834 812 ...
##  $ CROPDMG   : num  10 500 1 4 10 50 50 5 50 15 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 9 7 9 9 8 7 7 7 7 7 ...
##                EVTYPE        CROPDMG         CROPDMGEXP   
##  HAIL             :9373   Min.   :  0.01   K      :20137  
##  TSTM WIND        :3447   1st Qu.:  5.00   M      : 1918  
##  FLASH FLOOD      :2112   Median : 10.00   k      :   21  
##  FLOOD            :1759   Mean   : 62.38   B      :    7  
##  TORNADO          :1493   3rd Qu.: 50.00   m      :    1  
##  THUNDERSTORM WIND: 946   Max.   :990.00          :    0  
##  (Other)          :2954                    (Other):    0

After filtering the dataframe, we reshape the values in the CROPDMG variable taking in account the notation in the CROPDMGEXP variable.

crop <- crop %>%
  mutate(cropdmg = ifelse(CROPDMGEXP == "B"  | CROPDMGEXP == "b", CROPDMG * 1000000000, CROPDMG)) %>%
  mutate(cropdmg1 = ifelse(CROPDMGEXP == "M" | CROPDMGEXP == "m", cropdmg * 1000000, cropdmg)) %>%
  mutate(cropdmg2 = ifelse(CROPDMGEXP == "K" | CROPDMGEXP == "k", cropdmg1 * 1000, cropdmg1)) %>%
  rename(CROPDMG = cropdmg2)

We drop the redundant columns.

crop$CROPDMG <- NULL
crop$cropdmg <- NULL
crop$cropdmg1 <- NULL

We group the data by the types of events and compute the sum of damages associated with each event.

cropdamage <- crop %>%
  group_by(EVTYPE) %>%
  summarise(damage = sum(CROPDMG)) %>%
  arrange(desc(damage))

head(cropdamage)
## # A tibble: 6 x 2
##   EVTYPE           damage
##   <fct>             <dbl>
## 1 DROUGHT     13972566000
## 2 FLOOD        5661968450
## 3 RIVER FLOOD  5029459000
## 4 ICE STORM    5022113500
## 5 HAIL         3025954450
## 6 HURRICANE    2741910000

We substract the 6 events with the greatest impact and plot the crops wich suffered the greatest economic consequences.

sub2 <- head(cropdamage)

ggplot(sub2, aes(EVTYPE, damage)) + geom_col(aes(fill = EVTYPE)) +
  xlab("Event Type") +
  ylab("Crop Damages") +
  ggtitle("Crop Damages")

Drought is by the far the most impactful event on the crops as it’s showed in the plot.