The analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database contains major weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage, the main point in the study is to simply awser 2 questions; which weather event impacted the population health and which event had the most significant enonomic impact, as such, the analysis involves cleannind the dataset in order to be able to awnsert this questions
Thr first step will be to load all the libraries we need
library(stringr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
The first step is to import the data from 1950 to 2011 in R, the data is a CSV
#Read CSV can import a zip file
Raw_Data <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, sep = "," )
After generating a variable Raw_Data, the next step should be to explore the dataset in order to understand the variables
dim(Raw_Data)
## [1] 902297 37
In order to awnser the question regarding the type of events we must first explore the EVENTYPE varible, and the fatalities and injueries
head(Raw_Data[,1:12])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
head(Raw_Data[,13:24])
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F
## 1 0 NA 0 14.0 100 3
## 2 0 NA 0 2.0 150 2
## 3 0 NA 0 0.1 123 2
## 4 0 NA 0 0.0 100 2
## 5 0 NA 0 0.0 150 2
## 6 0 NA 0 1.5 177 2
## MAG FATALITIES INJURIES
## 1 0 0 15
## 2 0 0 0
## 3 0 0 2
## 4 0 0 2
## 5 0 0 2
## 6 0 0 6
head(Raw_Data[,25:37])
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 1 25.0 K 0 3040
## 2 2.5 K 0 3042
## 3 25.0 K 0 3340
## 4 2.5 K 0 3458
## 5 2.5 K 0 3412
## 6 2.5 K 0 3450
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 8812 3051 8806 1
## 2 8755 0 0 2
## 3 8742 0 0 3
## 4 8626 0 0 4
## 5 8642 0 0 5
## 6 8748 0 0 6
From the inspection above, we are clearly we are interested in the Eventype and also the number of fatalities, injuries, property damages and crop damages in order to determine the types of events, which are most hamrful with respect to population health and the type of events that have the greates economic consequences.
The first step for cleanning is to remove the additional 00:00 in the date format
Raw_Data$BGN_DATE <- str_remove_all(Raw_Data$BGN_DATE,"0:00:00") # Removes Zero's from the Date Begining
Raw_Data$BGN_DATE <- format(as.Date(Raw_Data$BGN_DATE, format="%m/%d/%Y"),"%Y") #Converts the string in to a Date Format
Raw_Data$BGN_DATE <- as.numeric(Raw_Data$BGN_DATE)
First lest have a look at the data in particular the event types graph
hist(Raw_Data$BGN_DATE, breaks = 40, main = "Histogram Events Per Year", xlab = "Years 1950 to November 2011")
According to NOAA the data recording start from Jan. 1950. At that time they recorded one event type, tornado. They add more events gradually and only from Jan. 1996 they start recording all events type. Since our objective is comparing the effects of different weather events
We are going to have a look at the event types prior 1996 and after 1996
# Let's have a look at the unique events for the whole data set
summary <- as.data.frame(unique(Raw_Data$EVTYPE))
dim(summary)
## [1] 985 1
# We can clearly see that we have 985 unique names, the problem is repetations
# Let's look at the unique event types after 1996, and we can see how many unique
Raw_Data96_11 <- filter(Raw_Data, BGN_DATE >= 1996)
summary_96_11 <- as.data.frame(unique(Raw_Data96_11$EVTYPE))
# As we can see the original data contains more than 985 different types of events, whereby we are only interested in 48 ofrficial events
Based on the analysis above, the best approach is to use the data from 1996 to 2011 for this analysis as more than 70% of the data was collected during the aformentioned period, so it is more significative to use this sample.
The first step will be to fix all the type of events in order to have 48 events, it is a combination of typos and adding information
Raw_Data96_11$EVTYPE <- as.character(Raw_Data96_11$EVTYPE)
List_Events <- count(Raw_Data96_11, EVTYPE) #First we count the types of events in order to get an idea of the events that are repeated the most
List_Events <- arrange(List_Events,desc(n))
List_Events <- mutate(List_Events, percentage = trunc((100*(n/653530))))
Top_10_Events <- List_Events[1:10,2:3]
#As we can see the top 10 events represents 83% of the sample, as such, it is logical to assume that outside those 10 events the sample is not significant enought to alter the results
print(colSums(Top_10_Events))
## n percentage
## 574798 83
#To do more diligence, I decided to quantify the effect of the events that only appeared 10 times
# As we can see there are 399 type of events which only occured once, which are clearly not significative in terms of frequency
Typos <- filter(List_Events, n <= 10)
dim(Typos)
## [1] 399 3
In order to be completely sure, we will be doing further processing later but at the moment we can assume that the besta approach is to use the events from 1996 to 2011, and it is almost certain that it is difficult that the events that ocurred less than 6000 times could impact the study, however we need to look this in depth, but first we need to fix the columns
In order to determine which event impacted the most in United States, both in terms of harming the population and economic consequences, we simply need the following variables
1.- Date = BGN_DATE 2.- Sate = STATE 3.- Event Type = EVTYPE 4.- # of People Killed = FATALITIES 5.- # Number of people injured = INJURIES 6.- Property Damage = PROPDMG 7.- Property Damage Exponent = PROPDMGEXP 8.- Crop Damage = CROPDMG 9.- Crop Damage Exponent = CROPDMGEXP
The 2 questions that we need to asks can be fully awnser by using the data above
The next step is to fix the exponent columns and merge the value in one column so we can check the information
## BGN_DATE STATE EVTYPE FATALITIES
## Min. :1996 TX : 51335 Length:653530 Min. : 0.00000
## 1st Qu.:2000 KS : 38649 Class :character 1st Qu.: 0.00000
## Median :2005 OK : 26980 Mode :character Median : 0.00000
## Mean :2004 MO : 25802 Mean : 0.01336
## 3rd Qu.:2008 IA : 22712 3rd Qu.: 0.00000
## Max. :2011 IL : 21215 Max. :158.00000
## (Other):466837
## INJURIES PROPDMG PROPDMGEXP CROPDMG
## Min. :0.00e+00 Min. : 0.00 K :369938 Min. : 0.000
## 1st Qu.:0.00e+00 1st Qu.: 0.00 :276185 1st Qu.: 0.000
## Median :0.00e+00 Median : 0.00 M : 7374 Median : 0.000
## Mean :8.87e-02 Mean : 11.69 B : 32 Mean : 1.839
## 3rd Qu.:0.00e+00 3rd Qu.: 1.00 0 : 1 3rd Qu.: 0.000
## Max. :1.15e+03 Max. :5000.00 - : 0 Max. :990.000
## (Other): 0
## CROPDMGEXP
## :373069
## K :278686
## M : 1771
## B : 4
## ? : 0
## 0 : 0
## (Other): 0
## # A tibble: 4 x 1
## CROPDMGEXP
## <fct>
## 1 ""
## 2 "B"
## 3 "K"
## 4 "M"
## # A tibble: 4 x 1
## CROPDMGEXP
## <dbl>
## 1 0
## 2 1000
## 3 1000000
## 4 1000000000
## # A tibble: 5 x 1
## PROPDMGEXP
## <fct>
## 1 ""
## 2 "0"
## 3 "B"
## 4 "K"
## 5 "M"
## # A tibble: 5 x 1
## PROPDMGEXP
## <fct>
## 1 ""
## 2 "0"
## 3 "B"
## 4 "K"
## 5 "M"
In this section we are going to look at the fatalities and injuries, in order to determine if we can use only the 10 top events, logically, if we are trying to awnser the question - Which event caused more damage in terms of fatalities, we need to ensure that there is no bias in the data. We have analysed the amount of events (frequency). but that doesn’t necessarly means that those events caused the most amount of fatalities.
Fatalities <- count(Var_Raw_Data96_11,FATALITIES)
tail(Fatalities,10)
## # A tibble: 10 x 2
## FATALITIES n
## <dbl> <int>
## 1 30 1
## 2 32 3
## 3 33 1
## 4 42 2
## 5 44 1
## 6 46 1
## 7 49 1
## 8 74 1
## 9 99 1
## 10 158 1
# As we can see there were only 1 instance in which the fatality was above 100
colSums(Fatalities)
## FATALITIES n
## 959 653530
# The total number of fatalities from 1950 to 2011 is 959, which is only 15 deaths per year
summary(Fatalities)
## FATALITIES n
## Min. : 0.00 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 1
## Median : 18.00 Median : 3
## Mean : 25.92 Mean : 17663
## 3rd Qu.: 30.00 3rd Qu.: 13
## Max. :158.00 Max. :648572
# As such statistically speaking, we need to see what type of events are the ones that caused fatalities, in order to determine if there is 1 event outside the top 10 events that could introduce an error to our study, eventhough 99% of the cases are 0 fatalities, as we can see below
Fatalities <- mutate(Fatalities, percentage = trunc((100*(n/653530))))
Fatalities_Top <- filter(Fatalities, FATALITIES >= 10)
# The next step is to simply look for those particular fatalities and correlate to the events
Fatalities_Top_events <- select(Var_Raw_Data96_11, for(i in Fatalities_Top$FATALITIES){ FATALITIES = i})
Fatal_Events <- data.frame(FATALITIES = integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities
for(i in Fatalities_Top$FATALITIES){
p <- grep(i,Var_Raw_Data96_11$FATALITIES)
l <- data.frame(FATALITIES = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[p])
Fatal_Events <- rbind(Fatal_Events,l)
#df <- rbind(l,m)
}
Fatal_Events <- arrange(Fatal_Events, desc(FATALITIES))
Fatal_Events_Summary <- arrange(aggregate(FATALITIES ~ EVENT.TYPE,Fatal_Events,sum),desc(FATALITIES))
print(Fatal_Events_Summary)
## EVENT.TYPE FATALITIES
## 1 EXCESSIVE HEAT 798
## 2 TORNADO 631
## 3 TSUNAMI 32
## 4 FLASH FLOOD 31
## 5 HEAT 26
## 6 WILDFIRE 26
## 7 LANDSLIDE 24
## 8 TROPICAL STORM 22
## 9 FLOOD 21
## 10 HEAVY RAIN 19
## 11 HURRICANE/TYPHOON 15
## 12 COLD AND SNOW 14
## 13 HURRICANE 13
## 14 FOG 11
## 15 STORM SURGE/TIDE 11
## 16 EXTREME COLD/WIND CHILL 10
#The event that caused the majority of the fatalities was Excessive Heat and Tornado
# We are going to run a few commands in order to check that the events that caused the majority of the casualties are actually the events that were more frequent and therefore, we can assume that
# any event that ocurred less than 10 times, is simply a typo as there are only 48 events which are significant
FATAL_EVENTS_LIST <- filter(List_Events,List_Events$EVTYPE == "TORNADO" | List_Events$EVTYPE == "TSUNAMI" | List_Events$EVTYPE == "EXCESSIVE HEAT" | List_Events$EVTYPE == "FLASH FLOOD" | List_Events$EVTYPE == "HEAT")
print(FATAL_EVENTS_LIST)
## # A tibble: 5 x 3
## EVTYPE n percentage
## <chr> <int> <dbl>
## 1 FLASH FLOOD 50999 7
## 2 TORNADO 23154 3
## 3 EXCESSIVE HEAT 1656 0
## 4 HEAT 716 0
## 5 TSUNAMI 20 0
#As we can there is not an event, which happened only one that also caused a lot fatalities
Injuries <- count(Var_Raw_Data96_11,INJURIES)
tail(Injuries,10)
## # A tibble: 10 x 2
## INJURIES n
## <dbl> <int>
## 1 437 1
## 2 500 4
## 3 519 1
## 4 550 1
## 5 600 1
## 6 700 1
## 7 750 1
## 8 780 1
## 9 800 2
## 10 1150 1
# As we can see the number of fatalities is high, and there were particular events that caused quite a lot of damage
colSums(Injuries)
## INJURIES n
## 17150 653530
# The total number of fatalities from 1950 to 2011 is 17150
summary(Injuries)
## INJURIES n
## Min. : 0.0 Min. : 1
## 1st Qu.: 33.0 1st Qu.: 1
## Median : 67.0 Median : 2
## Mean : 128.9 Mean : 4914
## 3rd Qu.: 136.0 3rd Qu.: 8
## Max. :1150.0 Max. :644339
# As such statistically speaking, we need to see what type of events are the ones that caused injuries, in order to determine if there is 1 event outside the top 10 events that could introduce an error to our study, technically only 2% of the events are responsible for 17150 casualties, so it is trickky, as we need to ensure that these isolated events are included in our sample
Injuries <- mutate(Injuries, percentage = trunc((100*(n/653530))))
Injuries_Top <- filter(Injuries, INJURIES >= 1)
# The next step is to simply look for those particular fatalities and correlate to the events
Injuries_Events <- data.frame(INJURIES = integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities
for(i in Injuries_Top$INJURIES){
inju <- grep(i,Var_Raw_Data96_11$INJURIES)
d_Inju <- data.frame(INJURIES = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[inju])
Injuries_Events <- rbind(Injuries_Events,d_Inju)
#df <- rbind(l,m)
}
Injuries_Events_Summary <- arrange(aggregate(INJURIES ~ EVENT.TYPE,Injuries_Events,sum),desc(INJURIES))
head(Injuries_Events_Summary,10)
## EVENT.TYPE INJURIES
## 1 TORNADO 24267
## 2 EXCESSIVE HEAT 7981
## 3 FLOOD 7616
## 4 LIGHTNING 4284
## 5 TSTM WIND 3808
## 6 FLASH FLOOD 1935
## 7 WINTER STORM 1551
## 8 HURRICANE/TYPHOON 1537
## 9 HEAT 1472
## 10 THUNDERSTORM WIND 1463
# It should not be a surprise that the events with the majority of the injuries are the same as the events the fatalities, andthis events are within the first 100 events
INJURIES_EVENTS_LIST <- filter(List_Events,List_Events$EVTYPE == "TORNADO" | List_Events$EVTYPE == "EXCESSIVE HEAT"| List_Events$EVTYPE == "LIGHTNING" |List_Events$EVTYPE == "FLOOD" | List_Events$EVTYPE == "TSTM WIND")
print(INJURIES_EVENTS_LIST)
## # A tibble: 5 x 3
## EVTYPE n percentage
## <chr> <int> <dbl>
## 1 TSTM WIND 128662 19
## 2 FLOOD 24247 3
## 3 TORNADO 23154 3
## 4 LIGHTNING 13203 2
## 5 EXCESSIVE HEAT 1656 0
We are going to ensure that we our top events are not excluding events that contributes to property damage and crop damage, as a data analysts, assumptions can be detrimental in a study.
Crops <- count(Var_Raw_Data96_11,CROP.DAMAGE)
tail(Crops,10)
## # A tibble: 10 x 2
## CROP.DAMAGE n
## <dbl> <int>
## 1 423000000 1
## 2 450000000 1
## 3 480000000 1
## 4 492400000 1
## 5 500000000 5
## 6 515000000 1
## 7 578850000 1
## 8 596000000 1
## 9 1000000000 1
## 10 1510000000 1
# # There are a lot of single events causing crop damage, therefore, we must ensure that we identify the crop damage and ensure that we include the correct samples
colSums(Crops)
## CROP.DAMAGE n
## 19931083310 653530
# The amount of damage in USD is 19,931,083,310 (19 Billion)
summary(Crops)
## CROP.DAMAGE n
## Min. :0.000e+00 Min. : 1
## 1st Qu.:1.498e+05 1st Qu.: 1
## Median :2.050e+06 Median : 1
## Mean :3.970e+07 Mean : 1302
## 3rd Qu.:2.186e+07 3rd Qu.: 4
## Max. :1.510e+09 Max. :634839
# As such statistically speaking, we need to see what type of events are the ones that caused more damage in USD, from the code below, we can see that 97% of the events caused no damage at all, so again we have single events, which were extremely fatal. In this case we are going to do something different we are going to select events that caused more than 1,000 USD in damages
Crops <- mutate(Crops, percentage = trunc((100*(n/653530))))
Crops_Top <- filter(Crops, CROP.DAMAGE >= 1000)
# The next step is to simply look for those particular Crop Damages and understand the variations
Crops_Events <- data.frame(CROP_DAMAGE= integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities
for(i in Crops_Top$CROP.DAMAGE){
crop_i <- grep(i,Var_Raw_Data96_11$CROP.DAMAGE)
d_crop <- data.frame(CROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[crop_i])
Crops_Events <- rbind(Crops_Events,d_crop)
}
## Error in data.frame(CROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[crop_i]): arguments imply differing number of rows: 1, 0
Crops_Events_Summary <- arrange(aggregate(CROP_DAMAGE ~ EVENT.TYPE,Crops_Events,sum),desc(CROP_DAMAGE))
head(Crops_Events_Summary,5)
## EVENT.TYPE CROP_DAMAGE
## 1 HAIL 153871400
## 2 TSTM WIND 50125450
## 3 FLASH FLOOD 40230750
## 4 FLOOD 37799750
## 5 TORNADO 22037650
#As we can see the top 5 events contributing to crop damage, are within the 53% more frequent events
CROPS_EVENT_LIST <- filter(List_Events,List_Events$EVTYPE == "HAIL" | List_Events$EVTYPE == "TSTM WIND" | List_Events$EVTYPE == "FLASH FLOOD" | List_Events$EVTYPE == "FLOOD" | List_Events$EVTYPE == "TORNADO" )
print(CROPS_EVENT_LIST)
## # A tibble: 5 x 3
## EVTYPE n percentage
## <chr> <int> <dbl>
## 1 HAIL 207715 31
## 2 TSTM WIND 128662 19
## 3 FLASH FLOOD 50999 7
## 4 FLOOD 24247 3
## 5 TORNADO 23154 3
Property <- count(Var_Raw_Data96_11,PROPERTY.DAMAGE)
tail(Property ,10)
## # A tibble: 10 x 2
## PROPERTY.DAMAGE n
## <dbl> <int>
## 1 4830000000 1
## 2 5150000000 1
## 3 5420000000 1
## 4 5880000000 1
## 5 7350000000 1
## 6 10000000000 1
## 7 11260000000 1
## 8 16930000000 1
## 9 31300000000 1
## 10 115000000000 1
# # There are a lot of single events causing the most amount of crop damage, therefore, we must ensure that we identify the crop damage and ensure that we include the correct samples
colSums(Property)
## PROPERTY.DAMAGE n
## 276051909570 653530
# The amount of damage in USD is 276,051,909,570 (276 Billion)
summary(Property)
## PROPERTY.DAMAGE n
## Min. :0.000e+00 Min. : 1.0
## 1st Qu.:1.502e+05 1st Qu.: 1.0
## Median :8.682e+05 Median : 1.0
## Mean :1.687e+08 Mean : 399.5
## 3rd Qu.:1.041e+07 3rd Qu.: 4.0
## Max. :1.150e+11 Max. :464262.0
# As such statistically speaking, we need to see what type of events are the ones that caused more damage in USD, from the code below, we can see that 71% of the events caused significant damage, the rest are isolated events, we are going to actually use everythiong more than 10 USD, basically this is going to take a look at most of the events 29%
Property <- mutate(Property, percentage = trunc((100*(n/653530))))
head(Property,10)
## # A tibble: 10 x 3
## PROPERTY.DAMAGE n percentage
## <dbl> <int> <dbl>
## 1 0 464262 71
## 2 10 850 0
## 3 20 46 0
## 4 30 5 0
## 5 40 2 0
## 6 50 217 0
## 7 60 21 0
## 8 70 3 0
## 9 100 1283 0
## 10 110 27 0
Property_Top <- filter(Property, PROPERTY.DAMAGE >= 10)
# The next step is to simply look for those particular Crop Damages and understand the variations
PROP_Events <- data.frame(PROP_DAMAGE= integer(), EVENT.TYPE = character()) # Creating an empty table(Data frame) so we can populate with the event and fatalities
for(i in Property_Top$PROPERTY.DAMAGE){
prop_i <- grep(i,Var_Raw_Data96_11$PROPERTY.DAMAGE)
d_prop <- data.frame(PROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[prop_i])
PROP_Events <- rbind(PROP_Events,d_prop)
}
## Error in data.frame(PROP_DAMAGE = i, EVENT.TYPE = Var_Raw_Data96_11$EVTYPE[prop_i]): arguments imply differing number of rows: 1, 0
PROP_Events_Summary <- arrange(aggregate(PROP_DAMAGE ~ EVENT.TYPE,PROP_Events,sum),desc(PROP_DAMAGE))
head(PROP_Events_Summary,5)
## EVENT.TYPE PROP_DAMAGE
## 1 TSTM WIND 848919900
## 2 THUNDERSTORM WIND 575331100
## 3 FLASH FLOOD 441619710
## 4 TORNADO 359727540
## 5 HAIL 287225850
#I think this is important, let's ensure that the events that caused more than 1 B USD in damage are included in the events
PROPERTY_EVENTS_LIST <- filter(List_Events,List_Events$EVTYPE == "TSTM WIND" | List_Events$EVTYPE == "THUNDERSTORM WIND" | List_Events$EVTYPE == "FLASH FLOOD"| List_Events$EVTYPE == "TORNADO" | List_Events$EVTYPE == "HAIL")
print(PROPERTY_EVENTS_LIST)
## # A tibble: 5 x 3
## EVTYPE n percentage
## <chr> <int> <dbl>
## 1 HAIL 207715 31
## 2 TSTM WIND 128662 19
## 3 THUNDERSTORM WIND 81402 12
## 4 FLASH FLOOD 50999 7
## 5 TORNADO 23154 3
# From the list below, it is clear that the events contributing to property damage are within the 72% more frequent events
The processing effort of the data thus far could seem excessive, but actually, one can assume that we are not making wrong assumptions, and certainly the pre-processing helped to understand the dataset, so now we should finalize the data in order to have a final dataset that can be used by others.
# The first step will be to fix the even type names in order to consolidate our list, ensure that there are no doble entries and more importantly NOOA only recognizes 48 types
#Let's look at the top 5 events, logically we need to concentrate on this events
Fatal_Events_Summary$index ="Fatalities"
PROP_Events_Summary$index = "Property"
Injuries_Events_Summary$index = "Injuries"
Crops_Events_Summary$index = "Crops"
#Let's map the events in order to now if they repeat
List_Top5_events <- rbind(Fatal_Events_Summary[1:5,c(-2)], PROP_Events_Summary[1:5,c(-2)], Injuries_Events_Summary[1:5,c(-2)], Crops_Events_Summary[1:5,c(-2)])
#As we can see there is only one event that is present in all of the dataset and we can see that as long as we have this 10 events included in our list we should be on safe side, and we won't be making wrong assumptions
List_Top5_events <- arrange(count(List_Top5_events,List_Top5_events$EVENT.TYPE),desc(n))
List_Top5_events <- rename(List_Top5_events, Events = `List_Top5_events$EVENT.TYPE` )
List_Top5_events$Events <- as.character(List_Top5_events$Events)
print(List_Top5_events)
## # A tibble: 10 x 2
## Events n
## <chr> <int>
## 1 TORNADO 4
## 2 FLASH FLOOD 3
## 3 TSTM WIND 3
## 4 EXCESSIVE HEAT 2
## 5 FLOOD 2
## 6 HAIL 2
## 7 HEAT 1
## 8 TSUNAMI 1
## 9 LIGHTNING 1
## 10 THUNDERSTORM WIND 1
#I have imported the 48 events from the NOAA list, and the events are contain in the file below
NOOA_Events <- read.table("NOOA_Events.txt", header = FALSE, fill = TRUE, sep = "\t")
NOOA_Events$V1 <- str_remove_all(NOOA_Events$V1,"Z")
NOOA_Events$V1 <- str_remove_all(NOOA_Events$V1,"C")
NOOA_Events <- rename(NOOA_Events, Events = V1)
NOOA_Events$Events <- trimws(NOOA_Events$Events, which = c("right"))
NOOA_Events$Events <- toupper(NOOA_Events$Events)
# We need to ensure that the list with 10 events are all included in the 48 event types
ALL_EVENTS <- NOOA_Events[(NOOA_Events$Events %in% List_Top5_events$Events),]
List_Top5_events[List_Top5_events$Events %in% NOOA_Events$Events == FALSE,]
## # A tibble: 1 x 2
## Events n
## <chr> <int>
## 1 TSTM WIND 3
# Please note that this was perhaps a very winded road in order to obtain something that was extremely easy to visualize,and there is one event type that we have to fix in the dataset and that is TSTM Wind.
# We need to use the information contained in the pdf - https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf - The information is very easy to digest and it is clear that TSMT Wind is a typo of Marine Thunderstorm Wind - Winds, associated with thunderstorms.
# As such, let's replace TMSTM Wind with Marine Thunderstorm Wind
Var_Raw_Data96_11$EVTYPE <- str_replace_all(Var_Raw_Data96_11$EVTYPE,"TSTM WIND", "MARINE THUNDERSTORM WIND" )
head(filter(Var_Raw_Data96_11,EVTYPE == "MARINE THUNDERSTORM WIND" ),10) # This is just to check that the change was made
## BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPERTY.DAMAGE
## 1 1996 AL MARINE THUNDERSTORM WIND 0 0 3000
## 2 1996 AL MARINE THUNDERSTORM WIND 0 0 5000
## 3 1996 AL MARINE THUNDERSTORM WIND 0 0 2000
## 4 1996 AL MARINE THUNDERSTORM WIND 0 0 12000
## 5 1996 AL MARINE THUNDERSTORM WIND 0 0 8000
## 6 1996 AL MARINE THUNDERSTORM WIND 0 0 12000
## 7 1996 AL MARINE THUNDERSTORM WIND 0 0 2000
## 8 1996 AL MARINE THUNDERSTORM WIND 0 0 5000
## 9 1996 AL MARINE THUNDERSTORM WIND 0 0 2000
## 10 1996 AL MARINE THUNDERSTORM WIND 0 0 2000
## CROP.DAMAGE
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
#We are also going to filter only the events contained in the varibale List_Top5_events, basically 10 evenst
Processed_Data_96_11 <- filter(Var_Raw_Data96_11, EVTYPE == "TORNADO" | EVTYPE == "FLASH FLOOD"| EVTYPE == "MARINE THUNDERSTORM WIND" |EVTYPE == "EXCESSIVE HEAT" | EVTYPE == "FLOOD" | EVTYPE == "HAIL" | EVTYPE == "HEAT" | EVTYPE == "TSUNAMI" | EVTYPE == "LIGHTNING" | EVTYPE == "THUNDERSTORM WIND" )
# We have finalized processing our dataset and we have a more manageable dataset
We have 2 awnser 2 questions:
###1.- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
library(ggplot2)
# The question can be awnser by quantifying the impact of the different events to human lifes (Injueries and Fatalities), let's create 2 variables so we can study each variable independently
Proc_Data_Injuries <- arrange(aggregate(INJURIES ~ EVTYPE,Processed_Data_96_11,sum), desc(INJURIES))
print(Proc_Data_Injuries)
## EVTYPE INJURIES
## 1 TORNADO 20667
## 2 FLOOD 6758
## 3 EXCESSIVE HEAT 6391
## 4 LIGHTNING 4141
## 5 MARINE THUNDERSTORM WIND 3655
## 6 FLASH FLOOD 1674
## 7 THUNDERSTORM WIND 1400
## 8 HEAT 1222
## 9 HAIL 713
## 10 TSUNAMI 129
Proc_Data_Fatalities <- arrange(aggregate(FATALITIES ~ EVTYPE,Processed_Data_96_11,sum), desc(FATALITIES))
print(Proc_Data_Fatalities)
## EVTYPE FATALITIES
## 1 EXCESSIVE HEAT 1797
## 2 TORNADO 1511
## 3 FLASH FLOOD 887
## 4 LIGHTNING 651
## 5 FLOOD 414
## 6 MARINE THUNDERSTORM WIND 251
## 7 HEAT 237
## 8 THUNDERSTORM WIND 130
## 9 TSUNAMI 33
## 10 HAIL 7
Data_Plot_q1 <- merge(Proc_Data_Injuries,Proc_Data_Fatalities, by = "EVTYPE", all.x = TRUE)
print(Data_Plot_q1)
## EVTYPE INJURIES FATALITIES
## 1 EXCESSIVE HEAT 6391 1797
## 2 FLASH FLOOD 1674 887
## 3 FLOOD 6758 414
## 4 HAIL 713 7
## 5 HEAT 1222 237
## 6 LIGHTNING 4141 651
## 7 MARINE THUNDERSTORM WIND 3655 251
## 8 THUNDERSTORM WIND 1400 130
## 9 TORNADO 20667 1511
## 10 TSUNAMI 129 33
Plot_Q1 <- ggplot(data = Data_Plot_q1, aes(x = reorder(EVTYPE,FATALITIES),FATALITIES, fill = EVTYPE )) + geom_bar( stat = "identity") + ggtitle("FATALITIES BY EVENT") + ylab ("Fatalities") + xlab("Weather Storm Data Events") + theme(axis.text.x = element_text(angle = -335, hjust = 1, size = 7))
print(Plot_Q1)
Plot_Q1_In <- ggplot(data = Data_Plot_q1, aes(x = reorder(EVTYPE,INJURIES),INJURIES, fill = EVTYPE )) + geom_bar( stat = "identity") + ggtitle("POPULATION HEALTH EVENTS IN US (1996-2011)") + ylab ("INJURIES") + xlab("Weather Storm Data Events") + theme(axis.text.x = element_text(angle = -335, hjust = 1, size = 7))
print(Plot_Q1_In)
# As we can see the events that caused the majority of the damage in terms of human safety were TORNADOS, EXCESSIVE HEAT & FlOODS
###2.- Across the United States, which types of events have the greatest economic consequences?
# The question can be awnser by quantifying the impact of the different events to property and crops, let's create 2 variables so we can study each variable independently
Proc_Data_Property <- arrange(aggregate(PROPERTY.DAMAGE ~ EVTYPE,Processed_Data_96_11,sum), desc(PROPERTY.DAMAGE))
Proc_Data_Property$PROPERTY.DAMAGE <- round(Proc_Data_Property$PROPERTY.DAMAGE/1000000000,digits =2) # This is Already in Billions
Proc_Data_Property <- rename(Proc_Data_Property,DAMAGE = PROPERTY.DAMAGE)
Proc_Data_Property$TYPE.OF.DAMAGE <- "PROPERTY"
Proc_Data_Crop <- arrange(aggregate(CROP.DAMAGE ~ EVTYPE,Processed_Data_96_11,sum), desc(CROP.DAMAGE))
Proc_Data_Crop$CROP.DAMAGE <- round(Proc_Data_Crop$CROP.DAMAGE/1000000000,digits =2)
Proc_Data_Crop <- rename(Proc_Data_Crop ,DAMAGE = CROP.DAMAGE)
Proc_Data_Crop$TYPE.OF.DAMAGE <- "CROP"
Data_Plot_q2 <- arrange(rbind(Proc_Data_Property,Proc_Data_Crop), EVTYPE)
# We are interested in the total damage, however we will be plotting both economical damages next to each other, so that is why I arrrange the data with a group label (DAMAGE)
print(Data_Plot_q2)
## EVTYPE DAMAGE TYPE.OF.DAMAGE
## 1 EXCESSIVE HEAT 0.01 PROPERTY
## 2 EXCESSIVE HEAT 0.49 CROP
## 3 FLASH FLOOD 15.22 PROPERTY
## 4 FLASH FLOOD 1.33 CROP
## 5 FLOOD 143.94 PROPERTY
## 6 FLOOD 4.97 CROP
## 7 HAIL 14.60 PROPERTY
## 8 HAIL 2.48 CROP
## 9 HEAT 0.00 PROPERTY
## 10 HEAT 0.00 CROP
## 11 LIGHTNING 0.74 PROPERTY
## 12 LIGHTNING 0.01 CROP
## 13 MARINE THUNDERSTORM WIND 4.48 PROPERTY
## 14 MARINE THUNDERSTORM WIND 0.55 CROP
## 15 THUNDERSTORM WIND 3.38 PROPERTY
## 16 THUNDERSTORM WIND 0.40 CROP
## 17 TORNADO 24.62 PROPERTY
## 18 TORNADO 0.28 CROP
## 19 TSUNAMI 0.14 PROPERTY
## 20 TSUNAMI 0.00 CROP
Plot_Q2 <- ggplot(Data_Plot_q2, aes(x =reorder(EVTYPE,DAMAGE), y = DAMAGE, fill = TYPE.OF.DAMAGE)) + geom_bar(stat = "identity") + theme_bw() + facet_grid(.~EVTYPE,scales = "free", space="free_x") + labs(x="Weather Storm Data Events", y=expression("Damage in USD (Billions)")) + labs(title=expression("ECONOMICAL DAMAGE WEATHER EVENTS IN US (1996-2011)" )) + theme(axis.text.x = element_text(angle = -335, hjust = 1, size = 7)) + theme(text = element_text(size=12)) + guides(size = 4)
print(Plot_Q2)
# As we can see the events from the graph the Property damage is substantially more than the crop damage, so in terms of economic consequences, Flooding is the leading factor for both crop and property damage, in a very distant second, Tornado in USD had a significant impact, and then we can see that flood and hail had similar consequences in number 3