Reproducible Research, Course Project 2: NOOA Storm Database - Analysis of Severe Weather (1950

Synopsis

The analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database contains major weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage, the main point in the study is to simply awser 2 questions; which weather event impacted the population health and which event had the most significant enonomic impact, as such, the analysis involves cleannind the dataset in order to be able to awnsert this questions

Loading and Processing Raw Data

Thr first step will be to load all the libraries we need

library(stringr)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

Reading the Data from 1950 to the end of November 2011

The first step is to import the data from 1950 to 2011 in R, the data is a CSV

#Read CSV can import a zip file
Raw_Data <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, sep = "," )

After generating a variable Raw_Data, the next step should be to explore the dataset in order to understand the variables

dim(Raw_Data)

## [1] 902297     37

In order to awnser the question regarding the type of events we must first explore the EVENTYPE varible, and the fatalities and injueries

head(Raw_Data[,1:12])

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 1         0                            
## 2         0                            
## 3         0                            
## 4         0                            
## 5         0                            
## 6         0

head(Raw_Data[,13:24])

##   END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F
## 1                   0         NA         0                      14.0   100 3
## 2                   0         NA         0                       2.0   150 2
## 3                   0         NA         0                       0.1   123 2
## 4                   0         NA         0                       0.0   100 2
## 5                   0         NA         0                       0.0   150 2
## 6                   0         NA         0                       1.5   177 2
##   MAG FATALITIES INJURIES
## 1   0          0       15
## 2   0          0        0
## 3   0          0        2
## 4   0          0        2
## 5   0          0        2
## 6   0          0        6

head(Raw_Data[,25:37])

##   PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 1    25.0          K       0                                         3040
## 2     2.5          K       0                                         3042
## 3    25.0          K       0                                         3340
## 4     2.5          K       0                                         3458
## 5     2.5          K       0                                         3412
## 6     2.5          K       0                                         3450
##   LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1      8812       3051       8806              1
## 2      8755          0          0              2
## 3      8742          0          0              3
## 4      8626          0          0              4
## 5      8642          0          0              5
## 6      8748          0          0              6

From the inspection above, we are clearly we are interested in the Eventype and also the number of fatalities, injuries, property damages and crop damages in order to determine the types of events, which are most hamrful with respect to population health and the type of events that have the greates economic consequences.

The first step for cleanning is to remove the additional 00:00 in the date format

Raw_Data$BGN_DATE <- str_remove_all(Raw_Data$BGN_DATE,"0:00:00") # Removes Zero's from the Date Begining 
Raw_Data$BGN_DATE <- format(as.Date(Raw_Data$BGN_DATE, format="%m/%d/%Y"),"%Y") #Converts the string in to a Date Format
Raw_Data$BGN_DATE <- as.numeric(Raw_Data$BGN_DATE)

First lest have a look at the data in particular the event types graph

hist(Raw_Data$BGN_DATE, breaks = 40, main = "Histogram Events Per Year", xlab = "Years 1950 to November 2011")

According to NOAA the data recording start from Jan. 1950. At that time they recorded one event type, tornado. They add more events gradually and only from Jan. 1996 they start recording all events type. Since our objective is comparing the effects of different weather events

We are going to have a look at the event types prior 1996 and after 1996

# Let's have a look at the unique events for the whole data set

summary <- as.data.frame(unique(Raw_Data$EVTYPE))
dim(summary)

## [1] 985   1

# We can clearly see that we have 985 unique names, the problem is repetations

# Let's look at the unique event types after 1996, and we can see how many unique 

Raw_Data96_11 <- filter(Raw_Data, BGN_DATE >= 1996)
summary_96_11 <- as.data.frame(unique(Raw_Data96_11$EVTYPE)) 

# As we can see the original data contains more than 985 different types of events, whereby we are only interested in 48 ofrficial events

Based on the analysis above, the best approach is to use the data from 1996 to 2011 for this analysis as more than 70% of the data was collected during the aformentioned period, so it is more significative to use this sample.

Processing the Dataset from 1996 to 2011

The first step will be to fix all the type of events in order to have 48 events, it is a combination of typos and adding information

Raw_Data96_11$EVTYPE <- as.character(Raw_Data96_11$EVTYPE)
List_Events <- count(Raw_Data96_11, EVTYPE) #First we count the types of events in order to get an idea of the events that are repeated the most
List_Events <- arrange(List_Events,desc(n)) 
List_Events <- mutate(List_Events, percentage = trunc((100*(n/653530))))
Top_10_Events <- List_Events[1:10,2:3]

#As we can see the top 10 events represents 83% of the sample, as such, it is logical to assume that outside those 10 events the sample is not significant enought to alter the results
print(colSums(Top_10_Events))

##          n percentage 
##     574798         83

#To do more diligence, I decided to quantify the effect of the events that only appeared 10 times

# As we can see there are 399 type of events which only occured once, which are clearly not significative in terms of frequency 
Typos <- filter(List_Events, n <= 10) 

dim(Typos)

## [1] 399   3

In order to be completely sure, we will be doing further processing later but at the moment we can assume that the besta approach is to use the events from 1996 to 2011, and it is almost certain that it is difficult that the events that ocurred less than 6000 times could impact the study, however we need to look this in depth, but first we need to fix the columns

Cleanning Columns - Removing Unncessary Variables

In order to determine which event impacted the most in United States, both in terms of harming the population and economic consequences, we simply need the following variables

1.- Date = BGN_DATE 2.- Sate = STATE 3.- Event Type = EVTYPE 4.- # of People Killed = FATALITIES 5.- # Number of people injured = INJURIES 6.- Property Damage = PROPDMG 7.- Property Damage Exponent = PROPDMGEXP 8.- Crop Damage = CROPDMG 9.- Crop Damage Exponent = CROPDMGEXP

The 2 questions that we need to asks can be fully awnser by using the data above

Property Damage & Crop Damage

The next step is to fix the exponent columns and merge the value in one column so we can check the information

##     BGN_DATE        STATE           EVTYPE            FATALITIES       
##  Min.   :1996   TX     : 51335   Length:653530      Min.   :  0.00000  
##  1st Qu.:2000   KS     : 38649   Class :character   1st Qu.:  0.00000  
##  Median :2005   OK     : 26980   Mode  :character   Median :  0.00000  
##  Mean   :2004   MO     : 25802                      Mean   :  0.01336  
##  3rd Qu.:2008   IA     : 22712                      3rd Qu.:  0.00000  
##  Max.   :2011   IL     : 21215                      Max.   :158.00000  
##                 (Other):466837                                         
##     INJURIES           PROPDMG          PROPDMGEXP        CROPDMG       
##  Min.   :0.00e+00   Min.   :   0.00   K      :369938   Min.   :  0.000  
##  1st Qu.:0.00e+00   1st Qu.:   0.00          :276185   1st Qu.:  0.000  
##  Median :0.00e+00   Median :   0.00   M      :  7374   Median :  0.000  
##  Mean   :8.87e-02   Mean   :  11.69   B      :    32   Mean   :  1.839  
##  3rd Qu.:0.00e+00   3rd Qu.:   1.00   0      :     1   3rd Qu.:  0.000  
##  Max.   :1.15e+03   Max.   :5000.00   -      :     0   Max.   :990.000  
##                                       (Other):     0                    
##    CROPDMGEXP    
##         :373069  
##  K      :278686  
##  M      :  1771  
##  B      :     4  
##  ?      :     0  
##  0      :     0  
##  (Other):     0

## # A tibble: 4 x 1
##   CROPDMGEXP
##   <fct>     
## 1 ""        
## 2 "B"       
## 3 "K"       
## 4 "M"

## # A tibble: 4 x 1
##   CROPDMGEXP
##        <dbl>
## 1          0
## 2       1000
## 3    1000000
## 4 1000000000

## # A tibble: 5 x 1
##   PROPDMGEXP
##   <fct>     
## 1 ""        
## 2 "0"       
## 3 "B"       
## 4 "K"       
## 5 "M"

## # A tibble: 5 x 1
##   PROPDMGEXP
##   <fct>     
## 1 ""        
## 2 "0"       
## 3 "B"       
## 4 "K"       
## 5 "M"

Fatalities and Injuries

In this section we are going to look at the fatalities and injuries, in order to determine if we can use only the 10 top events, logically, if we are trying to awnser the question - Which event caused more damage in terms of fatalities, we need to ensure that there is no bias in the data. We have analysed the amount of events (frequency). but that doesn’t necessarly means that those events caused the most amount of fatalities.

Fatalities <- count(Var_Raw_Data96_11,FATALITIES)

tail(Fatalities,10)

## # A tibble: 10 x 2
##    FATALITIES     n
##         <dbl> <int>
##  1         30     1
##  2         32     3
##  3         33     1
##  4         42     2
##  5         44     1
##  6         46     1
##  7         49     1
##  8         74     1
##  9         99     1
## 10        158     1

# As we can see there were only 1 instance in which the fatality was above 100

colSums(Fatalities)

## FATALITIES          n 
##        959     653530

# The total number of fatalities from 1950 to 2011 is 959, which is only 15 deaths per year

summary(Fatalities)

##    FATALITIES           n         
##  Min.   :  0.00   Min.   :     1  
##  1st Qu.:  9.00   1st Qu.:     1  
##  Median : 18.00   Median :     3  
##  Mean   : 25.92   Mean   : 17663  
##  3rd Qu.: 30.00   3rd Qu.:    13  
##  Max.   :158.00   Max.   :648572

# As such statistically speaking, we need to see what type of events are the ones that caused fatalities, in order to determine if there is 1 event outside the top 10 events that could introduce an error to our study, eventhough 99% of the cases are 0 fatalities, as we can see below

Fatalities <- mutate(Fatalities, percentage = trunc((100*(n/653530))))

Fatalities_Top <- filter(Fatalities, FATALITIES >= 10)

# The next step is to simply look for those particular fatalities and correlate to the events


Fatalities_Top_events <- select(Var_Raw_Data96_11, for(i in Fatalities_Top$FATALITIES){ FATALITIES = i})


Fatal_Events <- data.frame(FATALITIES = integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities


for(i in Fatalities_Top$FATALITIES){
    
    p <- grep(i,Var_Raw_Data96_11$FATALITIES)
    l <- data.frame(FATALITIES = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[p])
    Fatal_Events <- rbind(Fatal_Events,l)
    
    #df <- rbind(l,m)
    
}

Fatal_Events <- arrange(Fatal_Events, desc(FATALITIES))

Fatal_Events_Summary <- arrange(aggregate(FATALITIES ~ EVENT.TYPE,Fatal_Events,sum),desc(FATALITIES))

print(Fatal_Events_Summary)

##                 EVENT.TYPE FATALITIES
## 1           EXCESSIVE HEAT        798
## 2                  TORNADO        631
## 3                  TSUNAMI         32
## 4              FLASH FLOOD         31
## 5                     HEAT         26
## 6                 WILDFIRE         26
## 7                LANDSLIDE         24
## 8           TROPICAL STORM         22
## 9                    FLOOD         21
## 10              HEAVY RAIN         19
## 11       HURRICANE/TYPHOON         15
## 12           COLD AND SNOW         14
## 13               HURRICANE         13
## 14                     FOG         11
## 15        STORM SURGE/TIDE         11
## 16 EXTREME COLD/WIND CHILL         10

#The event that caused the majority of the fatalities was Excessive Heat and Tornado
# We are going to run a few commands in order to check that the events that caused the majority of the casualties are actually the events that were more frequent and therefore, we can assume that 
# any event that ocurred less than 10 times, is simply a typo as there are only 48 events which are significant

FATAL_EVENTS_LIST <- filter(List_Events,List_Events$EVTYPE == "TORNADO" | List_Events$EVTYPE == "TSUNAMI" | List_Events$EVTYPE == "EXCESSIVE HEAT" |  List_Events$EVTYPE == "FLASH FLOOD" | List_Events$EVTYPE == "HEAT")

print(FATAL_EVENTS_LIST)

## # A tibble: 5 x 3
##   EVTYPE             n percentage
##   <chr>          <int>      <dbl>
## 1 FLASH FLOOD    50999          7
## 2 TORNADO        23154          3
## 3 EXCESSIVE HEAT  1656          0
## 4 HEAT             716          0
## 5 TSUNAMI           20          0

#As we can there is not an event, which happened only one that also caused a lot fatalities

Injuries <- count(Var_Raw_Data96_11,INJURIES)

tail(Injuries,10)

## # A tibble: 10 x 2
##    INJURIES     n
##       <dbl> <int>
##  1      437     1
##  2      500     4
##  3      519     1
##  4      550     1
##  5      600     1
##  6      700     1
##  7      750     1
##  8      780     1
##  9      800     2
## 10     1150     1

# As we can see the number of fatalities is high, and there were particular events that caused quite a lot of damage

colSums(Injuries)

## INJURIES        n 
##    17150   653530

# The total number of fatalities from 1950 to 2011 is 17150

summary(Injuries)

##     INJURIES            n         
##  Min.   :   0.0   Min.   :     1  
##  1st Qu.:  33.0   1st Qu.:     1  
##  Median :  67.0   Median :     2  
##  Mean   : 128.9   Mean   :  4914  
##  3rd Qu.: 136.0   3rd Qu.:     8  
##  Max.   :1150.0   Max.   :644339

# As such statistically speaking, we need to see what type of events are the ones that caused injuries, in order to determine if there is 1 event outside the top 10 events that could introduce an error to our study, technically only 2% of the events are responsible for 17150 casualties, so it is trickky, as we need to ensure that these isolated events are included in our sample

Injuries <- mutate(Injuries, percentage = trunc((100*(n/653530))))

Injuries_Top <- filter(Injuries, INJURIES >= 1)

# The next step is to simply look for those particular fatalities and correlate to the events


Injuries_Events <- data.frame(INJURIES = integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities


for(i in Injuries_Top$INJURIES){
    
    inju <- grep(i,Var_Raw_Data96_11$INJURIES)
    d_Inju <- data.frame(INJURIES = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[inju])
    Injuries_Events <- rbind(Injuries_Events,d_Inju)
    
    #df <- rbind(l,m)
    
}


Injuries_Events_Summary <- arrange(aggregate(INJURIES ~ EVENT.TYPE,Injuries_Events,sum),desc(INJURIES))

head(Injuries_Events_Summary,10)

##           EVENT.TYPE INJURIES
## 1            TORNADO    24267
## 2     EXCESSIVE HEAT     7981
## 3              FLOOD     7616
## 4          LIGHTNING     4284
## 5          TSTM WIND     3808
## 6        FLASH FLOOD     1935
## 7       WINTER STORM     1551
## 8  HURRICANE/TYPHOON     1537
## 9               HEAT     1472
## 10 THUNDERSTORM WIND     1463

# It should not be a surprise that the events with the majority of the injuries are the same as the events the fatalities, andthis events are within the first 100 events

INJURIES_EVENTS_LIST <- filter(List_Events,List_Events$EVTYPE == "TORNADO" | List_Events$EVTYPE == "EXCESSIVE HEAT"| List_Events$EVTYPE == "LIGHTNING" |List_Events$EVTYPE == "FLOOD" | List_Events$EVTYPE == "TSTM WIND")

print(INJURIES_EVENTS_LIST)

## # A tibble: 5 x 3
##   EVTYPE              n percentage
##   <chr>           <int>      <dbl>
## 1 TSTM WIND      128662         19
## 2 FLOOD           24247          3
## 3 TORNADO         23154          3
## 4 LIGHTNING       13203          2
## 5 EXCESSIVE HEAT   1656          0

Crop Damage and Property Damage

We are going to ensure that we our top events are not excluding events that contributes to property damage and crop damage, as a data analysts, assumptions can be detrimental in a study.

Crops <- count(Var_Raw_Data96_11,CROP.DAMAGE)

tail(Crops,10)

## # A tibble: 10 x 2
##    CROP.DAMAGE     n
##          <dbl> <int>
##  1   423000000     1
##  2   450000000     1
##  3   480000000     1
##  4   492400000     1
##  5   500000000     5
##  6   515000000     1
##  7   578850000     1
##  8   596000000     1
##  9  1000000000     1
## 10  1510000000     1

# # There are a lot of single events causing crop damage, therefore, we must ensure that we identify the crop damage and ensure that we include the correct samples

colSums(Crops)

## CROP.DAMAGE           n 
## 19931083310      653530

# The amount of damage in USD is 19,931,083,310 (19 Billion)


summary(Crops)

##   CROP.DAMAGE              n         
##  Min.   :0.000e+00   Min.   :     1  
##  1st Qu.:1.498e+05   1st Qu.:     1  
##  Median :2.050e+06   Median :     1  
##  Mean   :3.970e+07   Mean   :  1302  
##  3rd Qu.:2.186e+07   3rd Qu.:     4  
##  Max.   :1.510e+09   Max.   :634839

# As such statistically speaking, we need to see what type of events are the ones that caused more damage in USD, from the code below, we can see that 97% of the events caused no damage at all, so again we have single events, which were extremely fatal. In this case we are going to do something different we are going to select events that caused more than 1,000 USD in damages

Crops <- mutate(Crops, percentage = trunc((100*(n/653530))))

Crops_Top <- filter(Crops, CROP.DAMAGE >= 1000)

# The next step is to simply look for those particular Crop Damages and understand the variations


Crops_Events <- data.frame(CROP_DAMAGE= integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities


for(i in Crops_Top$CROP.DAMAGE){
    
    crop_i <- grep(i,Var_Raw_Data96_11$CROP.DAMAGE)
    d_crop <- data.frame(CROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[crop_i])
    Crops_Events <- rbind(Crops_Events,d_crop)
    
    
    
}

## Error in data.frame(CROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[crop_i]): arguments imply differing number of rows: 1, 0

Crops_Events_Summary <- arrange(aggregate(CROP_DAMAGE ~ EVENT.TYPE,Crops_Events,sum),desc(CROP_DAMAGE))

head(Crops_Events_Summary,5)

##    EVENT.TYPE CROP_DAMAGE
## 1        HAIL   153871400
## 2   TSTM WIND    50125450
## 3 FLASH FLOOD    40230750
## 4       FLOOD    37799750
## 5     TORNADO    22037650

#As we can see the top 5 events contributing to crop damage, are within the 53% more frequent events


CROPS_EVENT_LIST <- filter(List_Events,List_Events$EVTYPE == "HAIL" | List_Events$EVTYPE  == "TSTM WIND" | List_Events$EVTYPE  == "FLASH FLOOD" | List_Events$EVTYPE  == "FLOOD" | List_Events$EVTYPE  == "TORNADO" )

print(CROPS_EVENT_LIST)

## # A tibble: 5 x 3
##   EVTYPE           n percentage
##   <chr>        <int>      <dbl>
## 1 HAIL        207715         31
## 2 TSTM WIND   128662         19
## 3 FLASH FLOOD  50999          7
## 4 FLOOD        24247          3
## 5 TORNADO      23154          3

Property <- count(Var_Raw_Data96_11,PROPERTY.DAMAGE)

tail(Property ,10)

## # A tibble: 10 x 2
##    PROPERTY.DAMAGE     n
##              <dbl> <int>
##  1      4830000000     1
##  2      5150000000     1
##  3      5420000000     1
##  4      5880000000     1
##  5      7350000000     1
##  6     10000000000     1
##  7     11260000000     1
##  8     16930000000     1
##  9     31300000000     1
## 10    115000000000     1

# # There are a lot of single events causing the most amount of crop damage, therefore, we must ensure that we identify the crop damage and ensure that we include the correct samples

colSums(Property)

## PROPERTY.DAMAGE               n 
##    276051909570          653530

# The amount of damage in USD is  276,051,909,570  (276 Billion)


summary(Property)

##  PROPERTY.DAMAGE           n           
##  Min.   :0.000e+00   Min.   :     1.0  
##  1st Qu.:1.502e+05   1st Qu.:     1.0  
##  Median :8.682e+05   Median :     1.0  
##  Mean   :1.687e+08   Mean   :   399.5  
##  3rd Qu.:1.041e+07   3rd Qu.:     4.0  
##  Max.   :1.150e+11   Max.   :464262.0

# As such statistically speaking, we need to see what type of events are the ones that caused more damage in USD, from the code below, we can see that 71% of the events caused significant damage, the rest are isolated events, we are going to actually use everythiong more than 10 USD, basically this is going to take a look at most of the events 29%

Property <- mutate(Property, percentage = trunc((100*(n/653530))))
head(Property,10)

## # A tibble: 10 x 3
##    PROPERTY.DAMAGE      n percentage
##              <dbl>  <int>      <dbl>
##  1               0 464262         71
##  2              10    850          0
##  3              20     46          0
##  4              30      5          0
##  5              40      2          0
##  6              50    217          0
##  7              60     21          0
##  8              70      3          0
##  9             100   1283          0
## 10             110     27          0

Property_Top <- filter(Property, PROPERTY.DAMAGE >= 10)

# The next step is to simply look for those particular Crop Damages and understand the variations


PROP_Events <- data.frame(PROP_DAMAGE= integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities


for(i in Property_Top$PROPERTY.DAMAGE){
    
    prop_i <- grep(i,Var_Raw_Data96_11$PROPERTY.DAMAGE)
    d_prop <- data.frame(PROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[prop_i])
    PROP_Events <- rbind(PROP_Events,d_prop)
    
  }

## Error in data.frame(PROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[prop_i]): arguments imply differing number of rows: 1, 0

PROP_Events_Summary <- arrange(aggregate(PROP_DAMAGE ~ EVENT.TYPE,PROP_Events,sum),desc(PROP_DAMAGE))

head(PROP_Events_Summary,5)

##          EVENT.TYPE PROP_DAMAGE
## 1         TSTM WIND   848919900
## 2 THUNDERSTORM WIND   575331100
## 3       FLASH FLOOD   441619710
## 4           TORNADO   359727540
## 5              HAIL   287225850

#I think this is important, let's ensure that the events that caused more than 1 B USD in damage are included in the events


PROPERTY_EVENTS_LIST <- filter(List_Events,List_Events$EVTYPE  == "TSTM WIND" | List_Events$EVTYPE  == "THUNDERSTORM WIND" | List_Events$EVTYPE  == "FLASH FLOOD"| List_Events$EVTYPE  == "TORNADO" | List_Events$EVTYPE  == "HAIL")

print(PROPERTY_EVENTS_LIST)

## # A tibble: 5 x 3
##   EVTYPE                 n percentage
##   <chr>              <int>      <dbl>
## 1 HAIL              207715         31
## 2 TSTM WIND         128662         19
## 3 THUNDERSTORM WIND  81402         12
## 4 FLASH FLOOD        50999          7
## 5 TORNADO            23154          3

# From the list below, it is clear that the events contributing to property damage are within the 72% more frequent events

Finalizing the Data

The processing effort of the data thus far could seem excessive, but actually, one can assume that we are not making wrong assumptions, and certainly the pre-processing helped to understand the dataset, so now we should finalize the data in order to have a final dataset that can be used by others.

# The first step will be to fix the even type names in order to consolidate our list, ensure that there are no doble entries and more importantly NOOA only recognizes 48 types 

#Let's look at the top 5 events, logically we need to concentrate on this events 

Fatal_Events_Summary$index ="Fatalities"
PROP_Events_Summary$index = "Property"
Injuries_Events_Summary$index = "Injuries"
Crops_Events_Summary$index = "Crops"

#Let's map the events in order to now if they repeat 

List_Top5_events <- rbind(Fatal_Events_Summary[1:5,c(-2)], PROP_Events_Summary[1:5,c(-2)], Injuries_Events_Summary[1:5,c(-2)], Crops_Events_Summary[1:5,c(-2)])

#As we can see there is only one event that is present in all of the dataset and we can see that as long as we have this 10 events included in our list we should be on safe side, and we won't be making wrong assumptions

List_Top5_events <- arrange(count(List_Top5_events,List_Top5_events$EVENT.TYPE),desc(n))
List_Top5_events <- rename(List_Top5_events, Events = `List_Top5_events$EVENT.TYPE` )
List_Top5_events$Events <- as.character(List_Top5_events$Events)
print(List_Top5_events)

## # A tibble: 10 x 2
##    Events                n
##    <chr>             <int>
##  1 TORNADO               4
##  2 FLASH FLOOD           3
##  3 TSTM WIND             3
##  4 EXCESSIVE HEAT        2
##  5 FLOOD                 2
##  6 HAIL                  2
##  7 HEAT                  1
##  8 TSUNAMI               1
##  9 LIGHTNING             1
## 10 THUNDERSTORM WIND     1

#I have imported the 48 events from the NOAA list, and the events are contain in the file below 

NOOA_Events <- read.table("NOOA_Events.txt", header = FALSE, fill = TRUE, sep = "\t")
NOOA_Events$V1 <- str_remove_all(NOOA_Events$V1,"Z")
NOOA_Events$V1 <- str_remove_all(NOOA_Events$V1,"C")
NOOA_Events <- rename(NOOA_Events, Events = V1)
NOOA_Events$Events <- trimws(NOOA_Events$Events, which = c("right"))
NOOA_Events$Events <- toupper(NOOA_Events$Events)

# We need to ensure that the list with 10 events are all included in the 48 event types

ALL_EVENTS <- NOOA_Events[(NOOA_Events$Events %in% List_Top5_events$Events),]

List_Top5_events[List_Top5_events$Events %in% NOOA_Events$Events == FALSE,]

## # A tibble: 1 x 2
##   Events        n
##   <chr>     <int>
## 1 TSTM WIND     3

# Please note that this was perhaps a very winded road in order to obtain something that was extremely easy to visualize,and there is one event type that we have to fix in the dataset and that is TSTM Wind.

# We need to use the information contained in the pdf - https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf - The information is very easy to digest and it is clear that TSMT Wind is a typo of Marine Thunderstorm Wind - Winds, associated with thunderstorms.

# As such, let's replace TMSTM Wind with Marine Thunderstorm Wind

Var_Raw_Data96_11$EVTYPE <- str_replace_all(Var_Raw_Data96_11$EVTYPE,"TSTM WIND", "MARINE THUNDERSTORM WIND" )


head(filter(Var_Raw_Data96_11,EVTYPE == "MARINE THUNDERSTORM WIND" ),10) # This is just to check that the change was made

##    BGN_DATE STATE                   EVTYPE FATALITIES INJURIES PROPERTY.DAMAGE
## 1      1996    AL MARINE THUNDERSTORM WIND          0        0            3000
## 2      1996    AL MARINE THUNDERSTORM WIND          0        0            5000
## 3      1996    AL MARINE THUNDERSTORM WIND          0        0            2000
## 4      1996    AL MARINE THUNDERSTORM WIND          0        0           12000
## 5      1996    AL MARINE THUNDERSTORM WIND          0        0            8000
## 6      1996    AL MARINE THUNDERSTORM WIND          0        0           12000
## 7      1996    AL MARINE THUNDERSTORM WIND          0        0            2000
## 8      1996    AL MARINE THUNDERSTORM WIND          0        0            5000
## 9      1996    AL MARINE THUNDERSTORM WIND          0        0            2000
## 10     1996    AL MARINE THUNDERSTORM WIND          0        0            2000
##    CROP.DAMAGE
## 1            0
## 2            0
## 3            0
## 4            0
## 5            0
## 6            0
## 7            0
## 8            0
## 9            0
## 10           0

#We are also going to filter only the events contained in the varibale List_Top5_events, basically 10 evenst

Processed_Data_96_11 <- filter(Var_Raw_Data96_11, EVTYPE == "TORNADO" | EVTYPE == "FLASH FLOOD"| EVTYPE == "MARINE THUNDERSTORM WIND" |EVTYPE == "EXCESSIVE HEAT" | EVTYPE == "FLOOD" | EVTYPE == "HAIL" | EVTYPE == "HEAT" | EVTYPE == "TSUNAMI" | EVTYPE == "LIGHTNING" | EVTYPE == "THUNDERSTORM WIND" )
                               
# We have finalized processing our dataset and we have a more manageable dataset

Results

We have 2 awnser 2 questions:

###1.- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

library(ggplot2)
# The question can be awnser by quantifying the impact of the different events to human lifes (Injueries and Fatalities), let's create 2 variables so we can study each variable independently


  Proc_Data_Injuries <-  arrange(aggregate(INJURIES ~ EVTYPE,Processed_Data_96_11,sum), desc(INJURIES))
  print(Proc_Data_Injuries)

##                      EVTYPE INJURIES
## 1                   TORNADO    20667
## 2                     FLOOD     6758
## 3            EXCESSIVE HEAT     6391
## 4                 LIGHTNING     4141
## 5  MARINE THUNDERSTORM WIND     3655
## 6               FLASH FLOOD     1674
## 7         THUNDERSTORM WIND     1400
## 8                      HEAT     1222
## 9                      HAIL      713
## 10                  TSUNAMI      129

  Proc_Data_Fatalities <- arrange(aggregate(FATALITIES ~ EVTYPE,Processed_Data_96_11,sum), desc(FATALITIES))
  print(Proc_Data_Fatalities)

##                      EVTYPE FATALITIES
## 1            EXCESSIVE HEAT       1797
## 2                   TORNADO       1511
## 3               FLASH FLOOD        887
## 4                 LIGHTNING        651
## 5                     FLOOD        414
## 6  MARINE THUNDERSTORM WIND        251
## 7                      HEAT        237
## 8         THUNDERSTORM WIND        130
## 9                   TSUNAMI         33
## 10                     HAIL          7

  Data_Plot_q1 <- merge(Proc_Data_Injuries,Proc_Data_Fatalities, by = "EVTYPE", all.x = TRUE)
  print(Data_Plot_q1)

##                      EVTYPE INJURIES FATALITIES
## 1            EXCESSIVE HEAT     6391       1797
## 2               FLASH FLOOD     1674        887
## 3                     FLOOD     6758        414
## 4                      HAIL      713          7
## 5                      HEAT     1222        237
## 6                 LIGHTNING     4141        651
## 7  MARINE THUNDERSTORM WIND     3655        251
## 8         THUNDERSTORM WIND     1400        130
## 9                   TORNADO    20667       1511
## 10                  TSUNAMI      129         33

  Plot_Q1 <- ggplot(data = Data_Plot_q1, aes(x = reorder(EVTYPE,FATALITIES),FATALITIES, fill = EVTYPE )) + geom_bar( stat = "identity") + ggtitle("FATALITIES BY EVENT") + ylab ("Fatalities") + xlab("Weather Storm Data Events") + theme(axis.text.x = element_text(angle = -335, hjust = 1, size = 7)) 
  print(Plot_Q1)

  Plot_Q1_In <- ggplot(data = Data_Plot_q1, aes(x = reorder(EVTYPE,INJURIES),INJURIES, fill = EVTYPE )) + geom_bar( stat = "identity") + ggtitle("POPULATION HEALTH EVENTS IN US (1996-2011)") + ylab ("INJURIES") + xlab("Weather Storm Data Events") + theme(axis.text.x = element_text(angle = -335, hjust = 1, size = 7)) 
  print(Plot_Q1_In)

  # As we can see the events that caused the majority of the damage in terms of human safety were TORNADOS, EXCESSIVE HEAT & FlOODS

###2.- Across the United States, which types of events have the greatest economic consequences?

# The question can be awnser by quantifying the impact of the different events to property and crops, let's create 2 variables so we can study each variable independently


  Proc_Data_Property <-  arrange(aggregate(PROPERTY.DAMAGE ~ EVTYPE,Processed_Data_96_11,sum), desc(PROPERTY.DAMAGE))
  Proc_Data_Property$PROPERTY.DAMAGE <- round(Proc_Data_Property$PROPERTY.DAMAGE/1000000000,digits =2) # This is Already in Billions
  Proc_Data_Property <- rename(Proc_Data_Property,DAMAGE = PROPERTY.DAMAGE)
  Proc_Data_Property$TYPE.OF.DAMAGE <- "PROPERTY"
  
  Proc_Data_Crop <- arrange(aggregate(CROP.DAMAGE ~ EVTYPE,Processed_Data_96_11,sum), desc(CROP.DAMAGE))
  Proc_Data_Crop$CROP.DAMAGE <- round(Proc_Data_Crop$CROP.DAMAGE/1000000000,digits =2)
  Proc_Data_Crop <- rename(Proc_Data_Crop ,DAMAGE = CROP.DAMAGE)
  Proc_Data_Crop$TYPE.OF.DAMAGE <- "CROP"
  
  Data_Plot_q2 <- arrange(rbind(Proc_Data_Property,Proc_Data_Crop), EVTYPE)
  
  # We are interested in the total damage, however we will be plotting both economical damages next to each other, so that is why I arrrange the data with a group label (DAMAGE)
  print(Data_Plot_q2)

##                      EVTYPE DAMAGE TYPE.OF.DAMAGE
## 1            EXCESSIVE HEAT   0.01       PROPERTY
## 2            EXCESSIVE HEAT   0.49           CROP
## 3               FLASH FLOOD  15.22       PROPERTY
## 4               FLASH FLOOD   1.33           CROP
## 5                     FLOOD 143.94       PROPERTY
## 6                     FLOOD   4.97           CROP
## 7                      HAIL  14.60       PROPERTY
## 8                      HAIL   2.48           CROP
## 9                      HEAT   0.00       PROPERTY
## 10                     HEAT   0.00           CROP
## 11                LIGHTNING   0.74       PROPERTY
## 12                LIGHTNING   0.01           CROP
## 13 MARINE THUNDERSTORM WIND   4.48       PROPERTY
## 14 MARINE THUNDERSTORM WIND   0.55           CROP
## 15        THUNDERSTORM WIND   3.38       PROPERTY
## 16        THUNDERSTORM WIND   0.40           CROP
## 17                  TORNADO  24.62       PROPERTY
## 18                  TORNADO   0.28           CROP
## 19                  TSUNAMI   0.14       PROPERTY
## 20                  TSUNAMI   0.00           CROP

  Plot_Q2 <- ggplot(Data_Plot_q2, aes(x =reorder(EVTYPE,DAMAGE), y = DAMAGE, fill = TYPE.OF.DAMAGE)) + geom_bar(stat = "identity") + theme_bw()  + facet_grid(.~EVTYPE,scales = "free", space="free_x") + labs(x="Weather Storm Data Events", y=expression("Damage in USD (Billions)")) + labs(title=expression("ECONOMICAL DAMAGE WEATHER EVENTS IN US (1996-2011)" )) + theme(axis.text.x = element_text(angle = -335, hjust = 1, size = 7)) + theme(text = element_text(size=12)) + guides(size = 4)
  

  print(Plot_Q2)

  # As we can see the events from the graph the Property damage is substantially more than the crop damage, so in terms of economic consequences, Flooding is the leading factor for both crop and property damage, in a very distant second, Tornado in USD had a significant impact, and then we can see that flood and hail had similar consequences in number 3

Reproducible Research, Course Project 2: NOOA Storm Database - Analysis of Severe Weather (1950 - 2011)

Francisco Colina Sanchez

5/29/2020