The analysis exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database presented in this document which tracks characteristics of major storms and weather events in the United states, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage will make us answer two major questions :
Which types of events are more harmful to population health ?
Which types of events have the greatest economic consequences ?
And overall, tornado is the most affecting event on the population, while flood is the event that causes the most properties damage, with drought being the N°1 enemy of crops
Here is the link to download the data : https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Here is the link to download the documentation of the data : https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
The first thing to do is downloading the data from the course website and reading the file to retrieve the data
data <- read.csv("repdata-data-StormData.csv.bz2")
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
Before having the data we need,
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
In this section, we’ll retrieve the only variables that intereevents us (The variables that have impact on the felware of the population which is the fatalities and injuries and on the economy (Crop damage and properties damage))
event<-data[, c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP","CROPDMG", "CROPDMGEXP")]
The first thing to do is to examine the events types that are existing in the united states After examining different types of events, I have seen that it is necessary to change some of names due to some typing errors or some names that have been written two times in a different way First, we’ll change to names to capital letters to mix some names that appear multiple times such as (Wintery Mix)
event$EVTYPE <- toupper(event$EVTYPE)
That didn’t change a lot, but we’ll now reconsider the events types on the data documentation by changing every mispelled name to the original name but before we’ll remove the extra spaces
event$EVTYPE<-gsub("(^[[:space:]]+|[[:space:]]+$)", "", event$EVTYPE)
event[grepl("BLIZZARD", event$EVTYPE), ]$EVTYPE <- "BLIZZARD"
event[grepl("AVALANCE", event$EVTYPE), ]$EVTYPE <- "AVALANCHE"
event[grepl("BRUSH FIRE", event$EVTYPE), ]$EVTYPE <- "BRUSH FIRE"
event[grepl("BLOW-OUT TIDE", event$EVTYPE), ]$EVTYPE <- "BLOW-OUT TIDE"
event[grepl("DRY", event$EVTYPE), ]$EVTYPE <- "DRY"
event[grepl("DROUGHT", event$EVTYPE), ]$EVTYPE <- "DROUGHT"
event[(grepl("COLD", event$EVTYPE) & !(grepl("EXTREME COLD", event$EVTYPE)) & !(grepl("EXTREME/RECORD COLD", event$EVTYPE))),]$EVTYPE <- "COLD"
event[grepl("EXTREME/RECORD COLD", event$EVTYPE), ]$EVTYPE <- "EXTREME COLD"
event[grepl("FLASH FLOOD", event$EVTYPE), ]$EVTYPE <- "FLASH FLOOD"
event[grepl("FLOOD FLASH", event$EVTYPE), ]$EVTYPE <- "FLASH FLOOD"
event[grepl("FLOOD/FLASH", event$EVTYPE), ]$EVTYPE <- "FLASH FLOOD"
event[(grepl("FLOOD", event$EVTYPE) & !(grepl("FLASH FLOOD", event$EVTYPE)) & !(grepl("LAKESHORE FLOOD", event$EVTYPE)) & !(grepl("COAeventAL FLOOD", event$EVTYPE))),]$EVTYPE <- "FLOOD"
event[grepl("FUNNEL", event$EVTYPE), ]$EVTYPE <- "FUNNEL CLOUD"
event[grepl("GLAZE", event$EVTYPE), ]$EVTYPE <- "GLAZE"
event[grepl("HAIL", event$EVTYPE) & !(grepl("MARINE HAIL", event$EVTYPE)), ]$EVTYPE <- "HAIL"
event[grepl("HEAT", event$EVTYPE) & !(grepl("EXCESSIVE HEAT", event$EVTYPE)), ]$EVTYPE <- "HEAT"
event[grepl("HEAVY RAIN", event$EVTYPE), ]$EVTYPE <- "HEAVY RAIN"
event[grepl("HEAVY PRECIPATATION", event$EVTYPE), ]$EVTYPE <- "HEAVY PRECIPITATION"
event[grepl("HIGH SWELLS", event$EVTYPE), ]$EVTYPE <- "HIGH SWELLS"
event[grepl("HEAVY SHOWERS", event$EVTYPE), ]$EVTYPE <- "HEAVY SHOWER"
event[grepl("HVY RAIN", event$EVTYPE), ]$EVTYPE <- "HEAVY RAIN"
event[grepl("HEAVY SNOW", event$EVTYPE), ]$EVTYPE <- "HEAVY SNOW"
event[grepl("HIGH SURF", event$EVTYPE), ]$EVTYPE <- "HIGH SURF"
event[grepl("HIGH WIND", event$EVTYPE) & !(grepl("MARINE HIGH WIND", event$EVTYPE)), ]$EVTYPE <- "HIGH WIND"
event[grepl("HURRICANE", event$EVTYPE), ]$EVTYPE <- "HURRICANE"
event[grepl("HYPOTHERMIA/EXPOSURE", event$EVTYPE), ]$EVTYPE <- "HYPOTHERMIA"
event[grepl("ICE ON ROAD", event$EVTYPE), ]$EVTYPE <- "ICE ROADS"
event[grepl("ICY ROADS", event$EVTYPE), ]$EVTYPE <- "ICE ROADS"
event[grepl("LANDSLIDES", event$EVTYPE), ]$EVTYPE <- "LANDSLIDE"
event[grepl("LAKE EFFECT SNOW", event$EVTYPE), ]$EVTYPE <- "LAKE-EFFECT SNOW"
event[grepl("LIGHTNING", event$EVTYPE), ]$EVTYPE <- "LIGHTNING"
event[grepl("LIGHTING", event$EVTYPE), ]$EVTYPE <- "LIGHTNING"
event[grepl("LIGNTNING", event$EVTYPE), ]$EVTYPE <- "LIGHTNING"
event[grepl("LOW TEMPERATURE RECORD", event$EVTYPE), ]$EVTYPE <- "LOW TEMPERATURE"
event[grepl("MUD SLIDE", event$EVTYPE), ]$EVTYPE <- "MUDSLIDE"
event[grepl("MUDSLIDE/LANDSLIDE", event$EVTYPE), ]$EVTYPE <- "MUDSLIDE"
event[grepl("MUD/ROCK SLIDE", event$EVTYPE), ]$EVTYPE <- "MUDSLIDE"
event[grepl("RIP CURRENT", event$EVTYPE), ]$EVTYPE <- "RIP CURRENT"
event[grepl("RECORD/EXCESSIVE HEAT", event$EVTYPE), ]$EVTYPE <- "EXCESSIVE HEAT"
event[grepl("SEVERE THUNDERSTORMS", event$EVTYPE), ]$EVTYPE <- "SEVERE THUNDERSTORM"
event[grepl("SMALL STREAM AND", event$EVTYPE), ]$EVTYPE <- "SMALL STREAM"
event[grepl("SML STREAM FLD", event$EVTYPE), ]$EVTYPE <- "SMALL STREAM"
event[grepl("TORRENTIAL RAINFALL", event$EVTYPE), ]$EVTYPE <- "TORRENTIAL RAIN"
event[grepl("THUNDERTSORM", event$EVTYPE), ]$EVTYPE <- "THUNDEReventORM WIND"
event[grepl("THUNDERTORM", event$EVTYPE), ]$EVTYPE <- "THUNDEReventORM WIND"
event[grepl("SNOW", event$EVTYPE), ]$EVTYPE <- "SNOW"
event[grepl("SLEET", event$EVTYPE), ]$EVTYPE <- "SLEET"
event[grepl("TYPHOON", event$EVTYPE), ]$EVTYPE <- "HURRICANE"
event[grepl("TORNADO", event$EVTYPE), ]$EVTYPE <- "TORNADO"
event[grepl("TORNDAO", event$EVTYPE), ]$EVTYPE <- "TORNADO"
event[grepl("UNSEASONABLY WARM YEAR", event$EVTYPE), ]$EVTYPE <- "UNSEASONABLY WARM"
event[grepl("UNUSUAL/RECORD WARMTH", event$EVTYPE), ]$EVTYPE <- "UNUSUAL WARMTH"
event[grepl("UNUSUALLY WARM", event$EVTYPE), ]$EVTYPE <- "UNUSUAL WARMTH"
event[grepl("URBAN", event$EVTYPE), ]$EVTYPE <- "URBAN SMALL"
event[grepl("UNSEASONABLY WARM & WET", event$EVTYPE), ]$EVTYPE <- "UNSEASONABLY WARM/WET"
event[grepl("VOLCANIC ASH", event$EVTYPE), ]$EVTYPE <- "VOLCANIC ASH"
event[grepl("WATERSPOUT", event$EVTYPE), ]$EVTYPE <- "WATERSPOUT"
event[grepl("WATER SPOUT", event$EVTYPE), ]$EVTYPE <- "WATERSPOUT"
event[grepl("WAYTERSPOUT", event$EVTYPE), ]$EVTYPE <- "WATERSPOUT"
event[grepl("WILDFIRE", event$EVTYPE), ]$EVTYPE <- "WILDFIRE"
event[grepl("WILD FIRES", event$EVTYPE), ]$EVTYPE <- "WILDFIRE"
event[grepl("WINTER WEATHER", event$EVTYPE), ]$EVTYPE <- "WINTER WEATHER"
event[(grepl("WIND", event$EVTYPE) & !(grepl("MARINE THUNDEReventORM WIND", event$EVTYPE)) & !(grepl("THUNDEReventORM WIND", event$EVTYPE))), ]$EVTYPE <- "WIND"
event[grepl("WND", event$EVTYPE), ]$EVTYPE <- "WIND"
event[grepl("WINTER STORMS", event$EVTYPE), ]$EVTYPE <- "WINTER STORM"
event[grepl("WINTERY MIX", event$EVTYPE), ]$EVTYPE <- "WINTER MIX"
event[grepl("WINTRY MIX", event$EVTYPE), ]$EVTYPE <- "WINTER MIX"
This selection ensures that every name is in the correct way
In this section, we’ll start by an aggregation of the frequency of the events and the number of fatalities per event so we can have a much more informed structure
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = event, sum)
table_fatalities<-as.matrix(table(event$EVTYPE))
fatalities<-cbind(fatalities, table_fatalities, rate = fatalities$FATALITIES / table_fatalities)
To have a very strict analysis on the impact of events on fatalities we’ll be having two plots one which will show the number of fatalities per event and the other one is the fatalities rate (Number of fatalities/ Number of occurrences) per event
Plotting the fatalities
order_by_fatalities <- fatalities[order(fatalities$FATALITIES, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_fatalities[order_by_fatalities$FATALITIES>100,], aes(reorder(EVTYPE, -FATALITIES), FATALITIES))
g <- g + geom_bar(stat = "identity", aes(fill = FATALITIES)) +
xlab("Event Type") + ylab("Number of Fatalities") +
ggtitle('Events with death toll more than 100') + theme
plot(g)
In this section, we clearly see that the tornado is the very dangerous
event above all with a number of fatalities more than 3* the number of
fatalities of the top 2 “Excessive Heat”. Thus, more preventive actions
need to be taken considering these events Plotting the fatality rate
order_by_fatalities_rate <- fatalities[order(fatalities$rate, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_fatalities[order_by_fatalities$rate>0.5,], aes(reorder(EVTYPE, -rate), rate))
g <- g + geom_bar(stat = "identity", aes(fill = rate)) +
xlab("Event Type") + ylab("Fatality Rate") +
ggtitle('Events with Fatality rate larger than 0.5') + theme
plot(g)
The reason why we plot the fatality rate is to see the danger zone of an
event, because the fatalities rate removes the redundancy of the event.
So, we clearly see here that for each tropical storm gordon, 8 people
die while for the tornado seen before is less than 0.5. This will have
an impact on the decision making of the government, for example it is
necessary to have more detection material for tornados in the whole
country but it is also necessary to make sure people already know
preventive actions against the tropical storm gordon in the tropical
areas…
In this section, we’ll start by an aggregation of the frequency of the events and the number of injuries per event so we can have a much more informed structure
injuries <- aggregate(INJURIES ~ EVTYPE, data = event, sum)
table_fatalities<-as.matrix(table(event$EVTYPE))
injuries<-cbind(injuries, table_fatalities, rate = injuries$INJURIES / table_fatalities)
To have a very strict analysis on the impact of events on injuries we’ll be having two plots one which will show the number of injuries per event and the other one is the injuries rate (Number of injuries/ Number of occurrences) per event
Plotting the injuries
order_by_injuries <- injuries[order(injuries$INJURIES, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_injuries[order_by_injuries$INJURIES>1000,], aes(reorder(EVTYPE, -INJURIES), INJURIES))
g <- g + geom_bar(stat = "identity", aes(fill = INJURIES)) +
xlab("Event Type") + ylab("Number of Injuries") +
ggtitle('Events with injury toll more than 1000') + theme
plot(g)
We cleary see the same effect of tornados on the injury rate
order_by_injuries_rate <- injuries[order(injuries$rate, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_injuries[order_by_injuries$rate>0.5,], aes(reorder(EVTYPE, -rate), rate))
g <- g + geom_bar(stat = "identity", aes(fill = rate)) +
xlab("Event Type") + ylab("Injury Rate") +
ggtitle('Events with Injury rate larger than 0.5') + theme
plot(g)
From the analysis here, it is clear and obvious that the injury and death rate are correlated which is logical. It is needed to do the analysis only on the injuries or the fatalities.
First thing to do is the examine the exponents of 10 that will gave us the total crops and properties damage which is shown on the PROPDMGEXP and CROPDMGEXP columns
print(table(event$PROPDMGEXP))
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
print(table(event$CROPDMGEXP))
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
As we can observe clearly, that there is different notations, some which are unclear such as ? which means no scaling but unsure, + which is no scaling but it may be higher, and - which is no scaling but it can be lower
id <- c("", "-", "+", "?", 0:9, "h", "H", "k", "K", "m", "M", "b", "B")
exponent <- c(0,0,0,0, 0:9, 2, 2, 3, 3, 6, 6, 9, 9)
matching_table <- data.frame(id, exponent)
event$Total_Properties_damage <- event$PROPDMG * 10 ^ matching_table[match(event$PROPDMGEXP, matching_table$id), 2]
event$Total_Crop_damage <- event$CROPDMG * 10 ^ matching_table[match(event$CROPDMGEXP, matching_table$id), 2]
Properties_damage <- aggregate(Total_Properties_damage ~ EVTYPE, data = event, sum)
Properties_damage_order <- Properties_damage[order(Properties_damage$Total_Properties_damage, decreasing = TRUE), ]
Crop_damage<- aggregate(Total_Crop_damage ~ EVTYPE, data = event, sum)
Crop_damage_order <- Crop_damage[order(Crop_damage$Total_Crop_damage, decreasing = TRUE), ]
Plotting the properties damage
library(ggplot2)
g <- ggplot(Properties_damage_order[Properties_damage_order$Total_Properties_damage>1000000000,], aes(reorder(EVTYPE, -Total_Properties_damage), Total_Properties_damage))
g <- g + geom_bar(stat = "identity", aes(fill = Total_Properties_damage)) +
xlab("Event Type") + ylab("Total_Properties_damage") +
ggtitle('Total properties damage by event more than 1 billion $') + theme
plot(g)
Plotting the crop damage
library(ggplot2)
g <- ggplot(Crop_damage_order[Crop_damage_order$Total_Crop_damage>1000000000,], aes(reorder(EVTYPE, -Total_Crop_damage), Total_Crop_damage))
g <- g + geom_bar(stat = "identity", aes(fill = Total_Crop_damage)) +
xlab("Event Type") + ylab("Total_Crops_damage") +
ggtitle('Total crops damage by event more than 1 billion $') + theme
plot(g)
As we can see here there is less events that cause crop damage than properties damage in the order of billions of dollars. Thus, this analysis is very intriguing because it shows how the change in events have very large effects on billions of dollars which is necessary for governments so they can track changes that need to be done in order to reduce drastically the damage. For example, make the properties much more resilient to floods and ensure that pipes are sufficient to retrieve alone very large amounts of water. And how to make crops and fields resilient to drought but introducing new water technologies such as dessalination. These two approach only can save up to 10^11 dollars if applied efficienly which is a perfect way of talking.