This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) STORM database, and finding out which climate events are most harmful in terms of human and economic damage.
The NOOA STORM database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The database contains records of major events starting in 1950 to actual date.
The database was firstly analysed in relation to its size, characteristics and variables of interest to the analysis. Data was then processed so that the analysis could be performed. Details of the data processing performed are detailed in this report.
The analysis of the database shows that tornadoes are by far the most harmful weather event, in terms of both fatalities and injuries. This is followed by thunderstorm wind and then by flooding. In relation to their economic impact to both property & crops, flooding is the most harmful weather event, followed by hurricanes and then by tornadoes.
First we load the dabatase into our computer, and load required R programming libraries:
### Load the database ###
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
dest="tmp.bz2",
method="curl")
NOOA_STORM_DB <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
### Load required R language libraries ###
library (dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
We then have a look at the STORM DB structure and contents:
str(NOOA_STORM_DB)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The database has 902,297 weather events, each observation has 37 variables.
Our variables of interest are:
We then proceed to find out if there are any N/A o “NAN” (not a number) values in FATALITIES and INJURIES, for all of the events present/recorded in the DB.
isna <- is.na(NOOA_STORM_DB$FATALITIES)
table(isna)
## isna
## FALSE
## 902297
No missing values for FATALITIES.
isna <- is.na(NOOA_STORM_DB$INJURIES)
table(isna)
## isna
## FALSE
## 902297
And no missing values for INJURIES.
Therefore, we do not need to account or make any data transformations to the DB in relation to this.
Now we proceed to aggregate/sum up all fatalities and injuries, by Event Type (EVTYPE):
totalcasualties <- aggregate(list(NOOA_STORM_DB$FATALITIES, NOOA_STORM_DB$INJURIES), by=list(NOOA_STORM_DB$EVTYPE), FUN=sum)
colnames(totalcasualties) <- c("Event", "Fatalities", "Injuries")
We sort the results in descending order for both FATALITIES and INJURIES
totalcasualties <- arrange(totalcasualties, desc(Fatalities), desc(Injuries))
Moving on to economic impact (damage) data in the database, we firstly analyse values of PROPDMG and CROPDMG variables:
### Review of damage info in database
table(NOOA_STORM_DB$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
print (table(NOOA_STORM_DB$PROPDMGEXP))
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(NOOA_STORM_DB$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
print(table(NOOA_STORM_DB$CROPDMGEXP))
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
As per the NOOA STORM database documentation, exponential/magnitudes for dollar amounts related to damage estimates can have 3 values: K, M and B. We do see here that there are additional values in the DB.
We review some of these cases to try understand what do these non-described values could mean:
oddEXPvalues <- grep("1", NOOA_STORM_DB$PROPDMGEXP)
oddEXPvalues <- NOOA_STORM_DB [oddEXPvalues, ]
print(head(oddEXPvalues, 10))
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 233829 47 5/10/1995 0:00:00 2140 CST 9 BLOUNT
## 233830 47 5/10/1995 0:00:00 2145 CST 9 BLOUNT
## 233833 47 5/18/1995 0:00:00 2105 CST 9 BLOUNT
## 233876 47 5/14/1995 0:00:00 0545 CST 13 CAMPBELL
## 233898 47 5/18/1995 0:00:00 0840 CST 17 CARROLL
## 233955 47 5/18/1995 0:00:00 1730 CST 31 COFFEE
## 233971 47 5/14/1995 0:00:00 0400 CST 35 CUMBERLAND
## 233972 47 5/18/1995 0:00:00 1707 CST 35 CUMBERLAND
## 233974 47 5/18/1995 0:00:00 1930 CST 35 CUMBERLAND
## 234019 47 5/18/1995 0:00:00 1345 CST 39 DECATUR
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI
## 233829 TN THUNDERSTORM WINDS 0 Friendsville
## 233830 TN THUNDERSTORM WINDS 0 Louisville
## 233833 TN THUNDERSTORM WINDS 0 Fountain City
## 233876 TN THUNDERSTORM WINDS 0 Jacksboro
## 233898 TN THUNDERSTORM WINDS 0 Huntingdon
## 233955 TN TORNADO 0 Manchester
## 233971 TN THUNDERSTORM WINDS 0
## 233972 TN HAIL 0 Crossville
## 233974 TN TORNADO 0 Lake Tanzi Village
## 234019 TN HAIL 0 Decaturville
## END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## 233829 0 NA 0
## 233830 2150CST 0 NA 0
## 233833 0 NA 0
## 233876 0 NA 0
## 233898 0 NA 0
## 233955 0 NA 0
## 233971 0500CST 0 NA 0
## 233972 0 NA 0
## 233974 0 NA 0
## 234019 1350CST 0 NA 0
## END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 233829 0 0 NA 0 0 0 0
## 233830 0 0 NA 57 0 0 0
## 233833 0 0 NA 0 0 0 0
## 233876 0 0 NA 0 0 0 0
## 233898 0 0 NA 0 0 0 0
## 233955 1 20 0 0 0 0 0
## 233971 0 0 NA 0 0 0 0
## 233972 0 0 NA 175 0 0 0
## 233974 2 25 1 0 0 0 0
## 234019 0 0 NA 275 0 0 0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 233829 1 0 0
## 233830 1 0 0
## 233833 1 0 0
## 233876 1 0 0
## 233898 1 0 0
## 233955 1 0 3529
## 233971 1 0 0
## 233972 1 0 0
## 233974 1 0 3558
## 234019 1 0 0
## LONGITUDE LATITUDE_E LONGITUDE_
## 233829 0 0 0
## 233830 0 0 0
## 233833 0 0 0
## 233876 0 0 0
## 233898 0 0 0
## 233955 8605 0 0
## 233971 0 0 0
## 233972 0 0 0
## 233974 8459 0 0
## 234019 0 0 0
## REMARKS
## 233829 Numerous trees wwere knocked down.
## 233830 A 65 mph wind gust was measured at the Knoxville National Weather Service Office. Numerous trees and power lines were knocked down in Alcoa.
## 233833 Several trees were uprooted. One tree fell and heavily damaged a house.
## 233876 A roof was blown off of a barn. Several trees and power lines were blown down as well.
## 233898 Several trees were knocked down. One fallen tree damaged a church roof.
## 233955 A house trailer was overturned. Several large trees were blown down.
## 233971 Numerous telephone and power lines were blown down.
## 233972
## 233974 At least 30 homes and house trailers were damaged.
## 234019 Several cars had their windshield broken by hail.
## REFNUM
## 233829 233790
## 233830 233791
## 233833 233794
## 233876 233837
## 233898 233859
## 233955 233942
## 233971 233958
## 233972 233959
## 233974 233961
## 234019 234006
There are not monetary/USD values associated with the non-described PRODMGEXP and CROPDMGEXP exponentials. Maybe if we remove events that do no have DAMAGE COSTs associated we get rid of strange factors in PROPDMEXP & CROPDMGEXP So let´s take out all rows where PROPERTY or CROP DAMAGE = 0
NOOA_fordamage_Est <- filter(NOOA_STORM_DB, PROPDMG != 0 | CROPDMG != 0)
table(NOOA_fordamage_Est$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 4357 1 0 5 209 0 1 1 4 18
## 6 7 8 B h H K m M
## 3 2 0 40 1 6 229057 7 11319
table(NOOA_fordamage_Est$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 145037 6 17 0 7 21 97960 1 1982
No luck. Non-described exponential values are still present in the database.
In reviewing frequency and value of PROPERTY & CROP DAMAGE descriptive statistics values, I assume we can discard the odd exponential values since their contribution to damage costing estimates are negligible.
We continue the processing so that we can analyse damage impact of events. We will use the subsetted database that we generated to extract only weather events that caused damage.
In order to discard the odd exponential values –> anything that is not Billion (B), Million (M) or Hundred (H), is set to “0” so will not add to the sum of costs.
NOOA_fordamage_Est$PROPDMGEXP <- toupper(NOOA_fordamage_Est$PROPDMGEXP)
NOOA_fordamage_Est$CROPDMGEXP <- toupper(NOOA_fordamage_Est$CROPDMGEXP)
for (i in seq_along(NOOA_fordamage_Est$PROPDMGEXP))
{
if (NOOA_fordamage_Est [i,26] == "K")
{NOOA_fordamage_Est [i,26] =1000}
else if (NOOA_fordamage_Est [i,26] == "M")
{NOOA_fordamage_Est [i,26 ] =1e+06}
else if (NOOA_fordamage_Est [i,26] == "B")
{NOOA_fordamage_Est [i,26 ] =1e+09}
else {NOOA_fordamage_Est [i,26] =0}
}
for (i in seq_along(NOOA_fordamage_Est$CROPDMGEXP))
{
if (NOOA_fordamage_Est [i,28] == "K")
{NOOA_fordamage_Est [i,28] =1000}
else if (NOOA_fordamage_Est [i,28] == "M")
{NOOA_fordamage_Est [i,28 ] =1e+06}
else if (NOOA_fordamage_Est [i,28] == "B")
{NOOA_fordamage_Est [i,28 ] =1e+09}
else {NOOA_fordamage_Est [i,28] =0}
}
NOOA_fordamage_Est$PROPDMGEXP <- as.numeric(NOOA_fordamage_Est$PROPDMGEXP)
NOOA_fordamage_Est$CROPDMGEXP <- as.numeric(NOOA_fordamage_Est$CROPDMGEXP)
Now we: * Multiply the dollar amounts by their exponential, so we have Total Property Damge & Total Crop Damage per event * Calculate the TOTAL DAMAGE (sum of property and crop values) of each event * Aggregate/sum up TOTAL DAMAGE, per event (new dataframe called “Damagecosts”)
NOOA_fordamage_Est <- mutate(NOOA_fordamage_Est, Damages_cost = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP)
Damagecosts <- select(NOOA_fordamage_Est, EVTYPE, Damages_cost)
Damagecosts <- aggregate(Damagecosts$Damages_cost, by=list(Damagecosts$EVTYPE), FUN=sum)
colnames(Damagecosts) <- c("Event", "Damages_cost")
Damagecosts <- arrange(Damagecosts, desc(Damages_cost))
Review of the Top 10 most harmful weather events in terms of fatalities and injuries:
head(totalcasualties, 10)
## Event Fatalities Injuries
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
And here is the plot of these results:
### First melt the DF so that ggplot can be used
topcasualties <- melt (totalcasualties, id.vars = "Event")
colnames(topcasualties) <- c("Event", "Type_of_harm", "Value")
topcasualties <- arrange(topcasualties, desc(Value))
Casualties_plot <- ggplot(topcasualties [1:20, ], aes(x=reorder(Event, -Value), y=Value))
Casualties_plot + geom_bar(stat="identity", aes(fill=Type_of_harm), position="dodge") + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("10 Most Harmful Events") + xlab("Type of Event") + ylab("# of occurences")
Tornadoes are by far the most harmful weather event, in terms of both fatalities and injuries. This is followed by thunderstorm wind, and then by flooding.
Review of the Top 10 most harmful weather events in terms of economic impact:
head(Damagecosts, 10)
## Event Damages_cost
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352113590
## 4 STORM SURGE 43323541000
## 5 HAIL 18758221170
## 6 FLASH FLOOD 17562128610
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041310
And here is the corresponding plot of these results:
damageplot <- ggplot(Damagecosts[1:10, ], aes(x=reorder(Event, -Damages_cost), y=Damages_cost))
damageplot + geom_bar(stat = "identity", fill="Blue") + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("Top 10 Events with Highest Economic Impact") + ylab("Damages Cost in USD") + xlab("Event type")
Flooding is responsible for the largest proportion of total economic impact out of all event types, followed by hurricanes and then by tornadoes.