Sree June 20th 2014
1.0 Purpose
This project explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The purpose of this report is to answer two questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
2.0 Data Processing
The data for this assignment come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. The data were downloaded from the Coursera Reproducible Research web site (Storm Data, 47 MB).
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
2.1 Prepare for loading the data
Set the working directory
setwd("C://Users/sreekantha/Documents/data-science/Assignments/Sub5/feed")
Load the required packages
library("data.table")
library("knitr")
2.2 Import the data
Download the data from the Coursera peer assessment location and read it as a CSV file in a single script fragment. The file is unzipped in the working directory defined above. The download link is: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Load the data into a data frame
stormdata <- read.csv("/Users/sreekantha/Documents/data-science/Assignments/Sub5/feed/repdata-data-StormData.csv", sep = ",", stringsAsFactors = FALSE)
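The download step described above is not shown in the script. A minimal sketch of one way to do it, assuming the link above is still reachable: `load_storm_data` and the local file name `StormData.csv.bz2` are hypothetical names introduced here, not part of the original analysis. Because `read.csv` can read from a `bzfile()` connection, a separate unzip step is not strictly required.

```r
# Hypothetical helper (not from the original report): download the compressed
# file once, then read it through a bzip2 connection.
load_storm_data <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the binary archive intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  # read.csv opens and closes the bzfile connection itself
  read.csv(bzfile(dest), stringsAsFactors = FALSE)
}

# Example call (commented out so the sketch does not hit the network when sourced):
# stormdata <- load_storm_data(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "StormData.csv.bz2")
```

Skipping the download when the file already exists keeps repeated runs of the report fast and reproducible.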
2.3 Analyse the data
Analyse the structure of the stormdata using the str command
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Analyse the summary of the data fields using the summary command
summary(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31 Class :character Class :character Class :character
## Median : 75 Mode :character Mode :character Mode :character
## Mean :101
## 3rd Qu.:131
## Max. :873
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0 Class :character Class :character Class :character
## Median : 0 Mode :character Mode :character Mode :character
## Mean : 1
## 3rd Qu.: 1
## Max. :3749
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0
## Mode :character Median :0 Median : 0
## Mean :0 Mean : 1
## 3rd Qu.:0 3rd Qu.: 0
## Max. :0 Max. :925
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0 Min. : 0
## Class :character Class :character 1st Qu.: 0.0 1st Qu.: 0
## Mode :character Mode :character Median : 0.0 Median : 0
## Mean : 0.2 Mean : 8
## 3rd Qu.: 0.0 3rd Qu.: 0
## Max. :2315.0 Max. :4400
##
## F MAG FATALITIES INJURIES
## Min. :0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.:0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0
## Median :1 Median : 50 Median : 0 Median : 0.0
## Mean :1 Mean : 47 Mean : 0 Mean : 0.2
## 3rd Qu.:1 3rd Qu.: 75 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :5 Max. :22000 Max. :583 Max. :1700.0
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0 Length:902297 Min. : 0.0 Length:902297
## 1st Qu.: 0 Class :character 1st Qu.: 0.0 Class :character
## Median : 0 Mode :character Median : 0.0 Mode :character
## Mean : 12 Mean : 1.5
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :5000 Max. :990.0
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
Analyse the first lines of the dataset using the head command
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
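The head output shows BGN_DATE stored as text in month/day/year form. The earlier remark that early years contain fewer records could be checked by extracting the year and counting events per year. A small sketch, using the first three BGN_DATE values from the str output as sample input; `bgn`, `event_year` and `events_per_year` are illustrative names, not part of the original script.

```r
# Sample values copied from the str() output; on the full data use stormdata$BGN_DATE
bgn <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00")

# as.Date() parses the leading month/day/year part and ignores the trailing time text
event_year <- as.integer(format(as.Date(bgn, format = "%m/%d/%Y"), "%Y"))

# Number of recorded events per year
events_per_year <- table(event_year)
print(events_per_year)
```

Run on the whole column, the resulting table could be plotted to see how record counts grow over time.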
2.4 Prepare the data
To aggregate the fatality and injury counts by event type, first ensure that both fields are numeric.
Convert fatalities field to numeric
stormdata$FATALITIES <- as.numeric(stormdata$FATALITIES)
Convert injuries field to numeric
stormdata$INJURIES <- as.numeric(stormdata$INJURIES)
Clean-up the event type
The event type (EVTYPE) contains duplicate categories that differ only in letter case, so all values are converted to upper case.
stormdata$EVTYPE <- toupper(stormdata$EVTYPE)
eventtype <- sort(unique(stormdata$EVTYPE))
Show first 50 event types
eventtype[1:50]
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
## [21] "BEACH EROSION" "BEACH EROSION/COASTAL FLOOD"
## [23] "BEACH FLOOD" "BELOW NORMAL PRECIPITATION"
## [25] "BITTER WIND CHILL" "BITTER WIND CHILL TEMPERATURES"
## [27] "BLACK ICE" "BLIZZARD"
## [29] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"
## [31] "BLIZZARD SUMMARY" "BLIZZARD WEATHER"
## [33] "BLIZZARD/FREEZING RAIN" "BLIZZARD/HEAVY SNOW"
## [35] "BLIZZARD/HIGH WIND" "BLIZZARD/WINTER STORM"
## [37] "BLOW-OUT TIDE" "BLOW-OUT TIDES"
## [39] "BLOWING DUST" "BLOWING SNOW"
## [41] "BLOWING SNOW- EXTREME WIND CHI" "BLOWING SNOW & EXTREME WIND CH"
## [43] "BLOWING SNOW/EXTREME WIND CHIL" "BREAKUP FLOODING"
## [45] "BRUSH FIRE" "BRUSH FIRES"
## [47] "COASTAL FLOODING/EROSION" "COASTAL EROSION"
## [49] "COASTAL FLOOD" "COASTAL FLOODING"
The event types still show many near-duplicates (leading spaces, misspellings such as AVALANCE, singular and plural forms) that ultimately need to be consolidated. Some can be converted automatically; many others need manual correction.
The next step is to convert the event type to a factor
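Part of that clean-up can be automated. The sketch below is an assumption about how one might start, not the method used in this report: `clean_evtype` and `evtype_fixes` are hypothetical names, and the corrections table contains only two misspellings that are visible in the listing above.

```r
# Hypothetical corrections table; only two typos visible in the listing are included
evtype_fixes <- c("AVALANCE" = "AVALANCHE", "BEACH EROSIN" = "BEACH EROSION")

clean_evtype <- function(x) {
  x <- toupper(trimws(x))            # drop leading/trailing spaces, unify case
  x <- gsub("[[:space:]]+", " ", x)  # collapse repeated internal whitespace
  hit <- x %in% names(evtype_fixes)  # entries with a known manual correction
  x[hit] <- unname(evtype_fixes[x[hit]])
  x
}

print(clean_evtype(c(" TSTM WIND", "Avalance", "BEACH EROSIN")))
```

On the real data this would be applied as `stormdata$EVTYPE <- clean_evtype(stormdata$EVTYPE)` while the column is still a character vector; the corrections table would need to be extended by hand for the remaining variants.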
stormdata$EVTYPE <- as.factor(stormdata$EVTYPE)
2.5 Aggregating event data
Consolidate all lethal events.
fatalities <- as.data.table(subset(aggregate(FATALITIES ~ EVTYPE, data = stormdata,
FUN = "sum"), FATALITIES > 0))
fatalities <- fatalities[order(-FATALITIES), ]
Show the first 20 rows
top20 <- fatalities[1:20, ]
library(ggplot2)
ggplot(data = top20, aes(EVTYPE, FATALITIES, fill = FATALITIES)) + geom_bar(stat = "identity") + xlab("Event") + ylab("Fatalities") + ggtitle("Fatalities caused by Events (top 20) ") + coord_flip() + theme(legend.position = "none")
The graph shows that tornadoes are by far the deadliest event type over the years.
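One caveat about the plot: ggplot2 orders a factor axis by factor level, which is alphabetical by default, so the bars are not sorted by fatality count. A minimal sketch of sorting the levels with `reorder()` from base R; the `toy` data frame and its counts are made up purely for illustration and are not results from the storm data.

```r
# Made-up illustration data (NOT values from the storm database)
toy <- data.frame(
  EVTYPE = factor(c("FLOOD", "HEAT", "TORNADO")),
  FATALITIES = c(30, 20, 50)
)

# reorder() sorts the factor levels by the given numeric values (ascending)
toy$EVTYPE <- reorder(toy$EVTYPE, toy$FATALITIES)

print(levels(toy$EVTYPE))
```

On the real data the equivalent call would be `top20$EVTYPE <- reorder(top20$EVTYPE, top20$FATALITIES)` before plotting; with `coord_flip()` the largest bar should then appear at the top.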
Consolidate all events with injuries
injuries <- as.data.table(subset(aggregate(INJURIES ~ EVTYPE, data = stormdata,
FUN = "sum"), INJURIES > 0))
injuries <- injuries[order(-INJURIES), ]
Show again the first 20 rows
top20i <- injuries[1:20, ]
ggplot(data = top20i, aes(EVTYPE, INJURIES, fill = INJURIES)) + geom_bar(stat = "identity") +
xlab("Event") + ylab("Injuries") + ggtitle("Injuries caused by Events (top 20) ") +
coord_flip() + theme(legend.position = "none")
Tornadoes are also by far the leading cause of injuries.
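The fatality and injury steps above repeat the same pattern: sum a column per event type, drop zero totals, sort in descending order, and keep the top rows. A sketch of a small helper that captures this pattern in base R; `top_events` is a hypothetical name and the `toy` input is made up for illustration.

```r
# Hypothetical helper: total `column` per EVTYPE, keep positive totals,
# sort descending and return the first n rows.
top_events <- function(data, column, n = 20) {
  totals <- aggregate(data[[column]], by = list(EVTYPE = data$EVTYPE), FUN = sum)
  names(totals)[2] <- column
  totals <- totals[totals[[column]] > 0, ]
  totals <- totals[order(-totals[[column]]), ]
  head(totals, n)
}

# Made-up illustration data (NOT values from the storm database)
toy <- data.frame(EVTYPE = c("A", "B", "A", "C"), FATALITIES = c(1, 5, 2, 0))
print(top_events(toy, "FATALITIES", n = 2))
```

With such a helper, `top_events(stormdata, "FATALITIES")` and `top_events(stormdata, "INJURIES")` would replace the two near-identical code fragments.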
2.6 Economic impact of events
First check the exponent data, to see which exponents we have
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
This shows that we again have mixed cases, for example h and H, or m and M.
stormdata$PROPDMGEXP <- toupper(stormdata$PROPDMGEXP)
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" "" "B" "+" "0" "5" "6" "?" "4" "2" "3" "H" "7" "-" "1" "8"
table(stormdata$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      H      K      M
##      4      5      1     40      7 424665  11337
Now that we have cleaned the exponents, let's convert them to numeric values.
calcExp <- function(x, exp = "") {
    # Scale x by the multiplier encoded in exp; unknown codes return x unchanged
    switch(exp,
        `-` = x * -1, `?` = x, `+` = x, `1` = x,
        `2` = x * (10^2), `3` = x * (10^3), `4` = x * (10^4), `5` = x * (10^5),
        `6` = x * (10^6), `7` = x * (10^7), `8` = x * (10^8),
        H = x * 100, K = x * 1000, M = x * 1e+06, B = x * 1e+09,
        x)
}
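applyCalcExp <- function(vx, vexp) {
    # Apply calcExp element-wise to a vector of values and a vector of exponent codes
    if (length(vx) != length(vexp))
        stop("Not same size")
    result <- numeric(length(vx))
    # seq_along() is safe for zero-length input, unlike 1:length(vx)
    for (i in seq_along(vx)) {
        result[i] <- calcExp(vx[i], vexp[i])
    }
    result
}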
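Looping over roughly 900,000 rows one element at a time is slow in R. A vectorized alternative is sketched below under the assumption that the same multipliers as in calcExp are wanted; `exp_multiplier` and `apply_exp_lookup` are hypothetical names, not part of the original script. Note that the `-` entry mirrors calcExp and therefore also turns such values negative.

```r
# Lookup table with the same multipliers as calcExp (names are exponent codes)
exp_multiplier <- c(
  "-" = -1, "?" = 1, "+" = 1, "1" = 1,
  "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5,
  "6" = 1e6, "7" = 1e7, "8" = 1e8,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

# Hypothetical vectorized counterpart of applyCalcExp
apply_exp_lookup <- function(vx, vexp) {
  if (length(vx) != length(vexp)) stop("Not same size")
  # Codes missing from the table (including "" and "0") yield NA here
  mult <- unname(exp_multiplier[as.character(vexp)])
  mult[is.na(mult)] <- 1  # same as the default branch of calcExp: leave x as is
  vx * mult
}

print(apply_exp_lookup(c(25, 2.5, 3), c("K", "M", "")))
```

Because the lookup is a single vector index, it avoids one `switch` call per row while keeping the mapping in one place.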
We can now calculate the property damage costs caused by the events; the result is stored in a new column, EconomicCosts.
stormdata$EconomicCosts <- applyCalcExp(as.numeric(stormdata$PROPDMG), stormdata$PROPDMGEXP)
summary(stormdata$EconomicCosts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.50e+01 0.00e+00 0.00e+00 4.75e+05 5.00e+02 1.15e+11
Consolidate the economic costs by event type
costs <- as.data.table(subset(aggregate(EconomicCosts ~ EVTYPE, data = stormdata,
FUN = "sum"), EconomicCosts > 0))
costs <- costs[order(-EconomicCosts), ]
Show again the first 20 rows
library(scales)
top20c <- costs[1:20, ]
ggplot(data = top20c, aes(EVTYPE, EconomicCosts, fill = EconomicCosts)) + geom_bar(stat = "identity") +
scale_y_continuous(labels = comma) + xlab("Event") + ylab("Economic costs in $") +
ggtitle("Economic costs caused by Events (top 20) ") + coord_flip() + theme(legend.position = "none")