Synopsis

The analysis exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database presented in this document which tracks characteristics of major storms and weather events in the United states, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage will make us answer two major questions :

Which types of events are more harmful to population health ?

Which types of events have the greatest economic consequences ?

And overall, tornado is the most affecting event on the population, while flood is the event that causes the most properties damage, with drought being the N°1 enemy of crops

Here is the link to download the data : https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Here is the link to download the documentation of the data : https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

Data processing

The first thing to do is downloading the data from the course website and reading the file to retrieve the data

data <- read.csv("repdata-data-StormData.csv.bz2")
head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Before having the data we need,

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

In this section, we’ll retrieve the only variables that intereevents us (The variables that have impact on the felware of the population which is the fatalities and injuries and on the economy (Crop damage and properties damage))

event<-data[, c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP","CROPDMG", "CROPDMGEXP")]

The first thing to do is to examine the events types that are existing in the united states After examining different types of events, I have seen that it is necessary to change some of names due to some typing errors or some names that have been written two times in a different way First, we’ll change to names to capital letters to mix some names that appear multiple times such as (Wintery Mix)

event$EVTYPE <- toupper(event$EVTYPE)

That didn’t change a lot, but we’ll now reconsider the events types on the data documentation by changing every mispelled name to the original name but before we’ll remove the extra spaces

event$EVTYPE<-gsub("(^[[:space:]]+|[[:space:]]+$)", "", event$EVTYPE)
event[grepl("BLIZZARD", event$EVTYPE), ]$EVTYPE <- "BLIZZARD"
event[grepl("AVALANCE", event$EVTYPE), ]$EVTYPE <- "AVALANCHE"
event[grepl("BRUSH FIRE", event$EVTYPE), ]$EVTYPE <- "BRUSH FIRE"
event[grepl("BLOW-OUT TIDE", event$EVTYPE), ]$EVTYPE <- "BLOW-OUT TIDE"
event[grepl("DRY", event$EVTYPE), ]$EVTYPE <- "DRY"
event[grepl("DROUGHT", event$EVTYPE), ]$EVTYPE <- "DROUGHT"
event[(grepl("COLD", event$EVTYPE) & !(grepl("EXTREME COLD", event$EVTYPE)) & !(grepl("EXTREME/RECORD COLD", event$EVTYPE))),]$EVTYPE <- "COLD"
event[grepl("EXTREME/RECORD COLD", event$EVTYPE), ]$EVTYPE <- "EXTREME COLD"
event[grepl("FLASH FLOOD", event$EVTYPE), ]$EVTYPE <- "FLASH FLOOD"
event[grepl("FLOOD FLASH", event$EVTYPE), ]$EVTYPE <- "FLASH FLOOD"
event[grepl("FLOOD/FLASH", event$EVTYPE), ]$EVTYPE <- "FLASH FLOOD"
event[(grepl("FLOOD", event$EVTYPE) & !(grepl("FLASH FLOOD", event$EVTYPE)) & !(grepl("LAKESHORE FLOOD", event$EVTYPE)) & !(grepl("COAeventAL FLOOD", event$EVTYPE))),]$EVTYPE <- "FLOOD"
event[grepl("FUNNEL", event$EVTYPE), ]$EVTYPE <- "FUNNEL CLOUD"
event[grepl("GLAZE", event$EVTYPE), ]$EVTYPE <- "GLAZE"
event[grepl("HAIL", event$EVTYPE) & !(grepl("MARINE HAIL", event$EVTYPE)), ]$EVTYPE <- "HAIL"
event[grepl("HEAT", event$EVTYPE) & !(grepl("EXCESSIVE HEAT", event$EVTYPE)), ]$EVTYPE <- "HEAT"
event[grepl("HEAVY RAIN", event$EVTYPE), ]$EVTYPE <- "HEAVY RAIN"
event[grepl("HEAVY PRECIPATATION", event$EVTYPE), ]$EVTYPE <- "HEAVY PRECIPITATION"
event[grepl("HIGH  SWELLS", event$EVTYPE), ]$EVTYPE <- "HIGH SWELLS"
event[grepl("HEAVY SHOWERS", event$EVTYPE), ]$EVTYPE <- "HEAVY SHOWER"
event[grepl("HVY RAIN", event$EVTYPE), ]$EVTYPE <- "HEAVY RAIN"
event[grepl("HEAVY SNOW", event$EVTYPE), ]$EVTYPE <- "HEAVY SNOW"
event[grepl("HIGH SURF", event$EVTYPE), ]$EVTYPE <- "HIGH SURF"
event[grepl("HIGH WIND", event$EVTYPE) & !(grepl("MARINE HIGH WIND", event$EVTYPE)), ]$EVTYPE <- "HIGH WIND"
event[grepl("HURRICANE", event$EVTYPE), ]$EVTYPE <- "HURRICANE"
event[grepl("HYPOTHERMIA/EXPOSURE", event$EVTYPE), ]$EVTYPE <- "HYPOTHERMIA"
event[grepl("ICE ON ROAD", event$EVTYPE), ]$EVTYPE <- "ICE ROADS"
event[grepl("ICY ROADS", event$EVTYPE), ]$EVTYPE <- "ICE ROADS"
event[grepl("LANDSLIDES", event$EVTYPE), ]$EVTYPE <- "LANDSLIDE"
event[grepl("LAKE EFFECT SNOW", event$EVTYPE), ]$EVTYPE <- "LAKE-EFFECT SNOW"
event[grepl("LIGHTNING", event$EVTYPE), ]$EVTYPE <- "LIGHTNING"
event[grepl("LIGHTING", event$EVTYPE), ]$EVTYPE <- "LIGHTNING"
event[grepl("LIGNTNING", event$EVTYPE), ]$EVTYPE <- "LIGHTNING"
event[grepl("LOW TEMPERATURE RECORD", event$EVTYPE), ]$EVTYPE <- "LOW TEMPERATURE"
event[grepl("MUD SLIDE", event$EVTYPE), ]$EVTYPE <- "MUDSLIDE"
event[grepl("MUDSLIDE/LANDSLIDE", event$EVTYPE), ]$EVTYPE <- "MUDSLIDE"
event[grepl("MUD/ROCK SLIDE", event$EVTYPE), ]$EVTYPE <- "MUDSLIDE"
event[grepl("RIP CURRENT", event$EVTYPE), ]$EVTYPE <- "RIP CURRENT"
event[grepl("RECORD/EXCESSIVE HEAT", event$EVTYPE), ]$EVTYPE <- "EXCESSIVE HEAT"
event[grepl("SEVERE THUNDERSTORMS", event$EVTYPE), ]$EVTYPE <- "SEVERE THUNDERSTORM"
event[grepl("SMALL STREAM AND", event$EVTYPE), ]$EVTYPE <- "SMALL STREAM"
event[grepl("SML STREAM FLD", event$EVTYPE), ]$EVTYPE <- "SMALL STREAM"
event[grepl("TORRENTIAL RAINFALL", event$EVTYPE), ]$EVTYPE <- "TORRENTIAL RAIN"
event[grepl("THUNDERTSORM", event$EVTYPE), ]$EVTYPE <- "THUNDEReventORM WIND"
event[grepl("THUNDERTORM", event$EVTYPE), ]$EVTYPE <- "THUNDEReventORM WIND"
event[grepl("SNOW", event$EVTYPE), ]$EVTYPE <- "SNOW"
event[grepl("SLEET", event$EVTYPE), ]$EVTYPE <- "SLEET"
event[grepl("TYPHOON", event$EVTYPE), ]$EVTYPE <- "HURRICANE"
event[grepl("TORNADO", event$EVTYPE), ]$EVTYPE <- "TORNADO"
event[grepl("TORNDAO", event$EVTYPE), ]$EVTYPE <- "TORNADO"
event[grepl("UNSEASONABLY WARM YEAR", event$EVTYPE), ]$EVTYPE <- "UNSEASONABLY WARM"
event[grepl("UNUSUAL/RECORD WARMTH", event$EVTYPE), ]$EVTYPE <- "UNUSUAL WARMTH"
event[grepl("UNUSUALLY WARM", event$EVTYPE), ]$EVTYPE <- "UNUSUAL WARMTH"
event[grepl("URBAN", event$EVTYPE), ]$EVTYPE <- "URBAN SMALL"
event[grepl("UNSEASONABLY WARM & WET", event$EVTYPE), ]$EVTYPE <- "UNSEASONABLY WARM/WET"
event[grepl("VOLCANIC ASH", event$EVTYPE), ]$EVTYPE <- "VOLCANIC ASH"
event[grepl("WATERSPOUT", event$EVTYPE), ]$EVTYPE <- "WATERSPOUT"
event[grepl("WATER SPOUT", event$EVTYPE), ]$EVTYPE <- "WATERSPOUT"
event[grepl("WAYTERSPOUT", event$EVTYPE), ]$EVTYPE <- "WATERSPOUT"
event[grepl("WILDFIRE", event$EVTYPE), ]$EVTYPE <- "WILDFIRE"
event[grepl("WILD FIRES", event$EVTYPE), ]$EVTYPE <- "WILDFIRE"
event[grepl("WINTER WEATHER", event$EVTYPE), ]$EVTYPE <- "WINTER WEATHER"
event[(grepl("WIND", event$EVTYPE) & !(grepl("MARINE THUNDEReventORM WIND", event$EVTYPE)) & !(grepl("THUNDEReventORM WIND", event$EVTYPE))), ]$EVTYPE <- "WIND"
event[grepl("WND", event$EVTYPE), ]$EVTYPE <- "WIND"
event[grepl("WINTER STORMS", event$EVTYPE), ]$EVTYPE <- "WINTER STORM"
event[grepl("WINTERY MIX", event$EVTYPE), ]$EVTYPE <- "WINTER MIX"
event[grepl("WINTRY MIX", event$EVTYPE), ]$EVTYPE <- "WINTER MIX"

This selection ensures that every name is in the correct way

Data Analysis

Impact on the population

Fatalities

In this section, we’ll start by an aggregation of the frequency of the events and the number of fatalities per event so we can have a much more informed structure

fatalities <- aggregate(FATALITIES ~ EVTYPE, data = event, sum)
table_fatalities<-as.matrix(table(event$EVTYPE))
fatalities<-cbind(fatalities, table_fatalities, rate = fatalities$FATALITIES / table_fatalities)

To have a very strict analysis on the impact of events on fatalities we’ll be having two plots one which will show the number of fatalities per event and the other one is the fatalities rate (Number of fatalities/ Number of occurrences) per event

Plotting the fatalities

order_by_fatalities <- fatalities[order(fatalities$FATALITIES, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_fatalities[order_by_fatalities$FATALITIES>100,], aes(reorder(EVTYPE, -FATALITIES), FATALITIES))
g <- g + geom_bar(stat = "identity", aes(fill = FATALITIES)) + 
    xlab("Event Type") + ylab("Number of Fatalities") + 
    ggtitle('Events with death toll more than 100') + theme
plot(g)

In this section, we clearly see that the tornado is the very dangerous event above all with a number of fatalities more than 3* the number of fatalities of the top 2 “Excessive Heat”. Thus, more preventive actions need to be taken considering these events Plotting the fatality rate

order_by_fatalities_rate <- fatalities[order(fatalities$rate, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_fatalities[order_by_fatalities$rate>0.5,], aes(reorder(EVTYPE, -rate), rate))
g <- g + geom_bar(stat = "identity", aes(fill = rate)) + 
    xlab("Event Type") + ylab("Fatality Rate") + 
    ggtitle('Events with Fatality rate larger than 0.5') + theme
plot(g)

The reason why we plot the fatality rate is to see the danger zone of an event, because the fatalities rate removes the redundancy of the event. So, we clearly see here that for each tropical storm gordon, 8 people die while for the tornado seen before is less than 0.5. This will have an impact on the decision making of the government, for example it is necessary to have more detection material for tornados in the whole country but it is also necessary to make sure people already know preventive actions against the tropical storm gordon in the tropical areas…

Injuries

In this section, we’ll start by an aggregation of the frequency of the events and the number of injuries per event so we can have a much more informed structure

injuries <- aggregate(INJURIES ~ EVTYPE, data = event, sum)
table_fatalities<-as.matrix(table(event$EVTYPE))
injuries<-cbind(injuries, table_fatalities, rate = injuries$INJURIES / table_fatalities)

To have a very strict analysis on the impact of events on injuries we’ll be having two plots one which will show the number of injuries per event and the other one is the injuries rate (Number of injuries/ Number of occurrences) per event

Plotting the injuries

order_by_injuries <- injuries[order(injuries$INJURIES, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_injuries[order_by_injuries$INJURIES>1000,], aes(reorder(EVTYPE, -INJURIES), INJURIES))
g <- g + geom_bar(stat = "identity", aes(fill = INJURIES)) + 
    xlab("Event Type") + ylab("Number of Injuries") + 
    ggtitle('Events with injury toll more than 1000') + theme
plot(g)

We cleary see the same effect of tornados on the injury rate

order_by_injuries_rate <- injuries[order(injuries$rate, decreasing = TRUE), ]
library(ggplot2)
g <- ggplot(order_by_injuries[order_by_injuries$rate>0.5,], aes(reorder(EVTYPE, -rate), rate))
g <- g + geom_bar(stat = "identity", aes(fill = rate)) + 
    xlab("Event Type") + ylab("Injury Rate") + 
    ggtitle('Events with Injury rate larger than 0.5') + theme
plot(g)

From the analysis here, it is clear and obvious that the injury and death rate are correlated which is logical. It is needed to do the analysis only on the injuries or the fatalities.

Impact on the economy

Property and Crop Damages

First thing to do is the examine the exponents of 10 that will gave us the total crops and properties damage which is shown on the PROPDMGEXP and CROPDMGEXP columns

print(table(event$PROPDMGEXP))
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
print(table(event$CROPDMGEXP))
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

As we can observe clearly, that there is different notations, some which are unclear such as ? which means no scaling but unsure, + which is no scaling but it may be higher, and - which is no scaling but it can be lower

id <- c("", "-", "+", "?", 0:9, "h", "H", "k", "K", "m", "M", "b", "B")
exponent <- c(0,0,0,0, 0:9, 2, 2, 3, 3, 6, 6, 9, 9)
matching_table <- data.frame(id, exponent)
event$Total_Properties_damage <- event$PROPDMG * 10 ^ matching_table[match(event$PROPDMGEXP, matching_table$id), 2]
event$Total_Crop_damage <- event$CROPDMG * 10 ^ matching_table[match(event$CROPDMGEXP, matching_table$id), 2]
Properties_damage <- aggregate(Total_Properties_damage ~ EVTYPE, data = event, sum)
Properties_damage_order <- Properties_damage[order(Properties_damage$Total_Properties_damage, decreasing = TRUE), ]
Crop_damage<- aggregate(Total_Crop_damage ~ EVTYPE, data = event, sum)
Crop_damage_order <- Crop_damage[order(Crop_damage$Total_Crop_damage, decreasing = TRUE), ]

Plotting the properties damage

library(ggplot2)
g <- ggplot(Properties_damage_order[Properties_damage_order$Total_Properties_damage>1000000000,], aes(reorder(EVTYPE, -Total_Properties_damage), Total_Properties_damage))
g <- g + geom_bar(stat = "identity", aes(fill = Total_Properties_damage)) + 
    xlab("Event Type") + ylab("Total_Properties_damage") + 
    ggtitle('Total properties damage by event more than 1 billion $') + theme
plot(g)

Plotting the crop damage

library(ggplot2)
g <- ggplot(Crop_damage_order[Crop_damage_order$Total_Crop_damage>1000000000,], aes(reorder(EVTYPE, -Total_Crop_damage), Total_Crop_damage))
g <- g + geom_bar(stat = "identity", aes(fill = Total_Crop_damage)) + 
    xlab("Event Type") + ylab("Total_Crops_damage") + 
    ggtitle('Total crops damage by event more than 1 billion $') + theme
plot(g)

As we can see here there is less events that cause crop damage than properties damage in the order of billions of dollars. Thus, this analysis is very intriguing because it shows how the change in events have very large effects on billions of dollars which is necessary for governments so they can track changes that need to be done in order to reduce drastically the damage. For example, make the properties much more resilient to floods and ensure that pipes are sufficient to retrieve alone very large amounts of water. And how to make crops and fields resilient to drought but introducing new water technologies such as dessalination. These two approach only can save up to 10^11 dollars if applied efficienly which is a perfect way of talking.