The goals of the work were to address the following questions:
As the result of the analysis described below, the following answers are:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this work come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The data can be downloaded from the web site:
Storm Data [47Mb] https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Documentation describing how some of the variables are constructed/defined can be obtained here:
National Weather Service Storm Data Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
National Climatic Data Center Storm Events FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
This part describes (in words and code) how the data were loaded into R and processed for analysis. Analysis starts from the raw CSV file containing the data.
#########################################################
# Obtain name of the file (.bz2 archive with the dataset)
#########################################################
urlStormDataSet <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fileName <- URLdecode(urlStormDataSet) %>% basename()
###########################
# Download the .bz2 archive
###########################
if (!file.exists(fileName)) {
download.file(urlStormDataSet, fileName, method = "curl")
print ("File was downloaded successfully")
} else {
cat("The file [", fileName, "] already exists!")
}
## The file [ StormData.csv.bz2 ] already exists!
Since the data file in the archive after unpacking has size about 550Mb, let’s first read only names of the columns
###################
# Read column names
###################
stormDataHeaders <- read.csv(bzfile(fileName), nrows = 1)
Take a glance at the column names:
headers <- colnames(stormDataHeaders)
headers
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
In order to address the questions posted above, out of 37 columns we need only 7:
So to save memory let’s read the dataframe only with these columns:
requiredColumns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
columnClasses <- ifelse(headers %in% requiredColumns, NA, "NULL")
# Depending on the performance of your system this process may take some time
subsetedDataFrame <- read.csv(bzfile(fileName), colClasses = columnClasses)
stormDataFrame <- subsetedDataFrame
Take a look at the data frame:
head(stormDataFrame)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
Let’s check structure of the data set:
str(stormDataFrame)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
We see plenty of rows: 902297 rows! Also we can notice EVTYPE as a factor has 985 levels!
Take a look at some of them:
head(levels(stormDataFrame$EVTYPE) , 50)
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
## [21] "Beach Erosion" "BEACH EROSION"
## [23] "BEACH EROSION/COASTAL FLOOD" "BEACH FLOOD"
## [25] "BELOW NORMAL PRECIPITATION" "BITTER WIND CHILL"
## [27] "BITTER WIND CHILL TEMPERATURES" "Black Ice"
## [29] "BLACK ICE" "BLIZZARD"
## [31] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"
## [33] "Blizzard Summary" "BLIZZARD WEATHER"
## [35] "BLIZZARD/FREEZING RAIN" "BLIZZARD/HEAVY SNOW"
## [37] "BLIZZARD/HIGH WIND" "BLIZZARD/WINTER STORM"
## [39] "BLOW-OUT TIDE" "BLOW-OUT TIDES"
## [41] "BLOWING DUST" "blowing snow"
## [43] "Blowing Snow" "BLOWING SNOW"
## [45] "BLOWING SNOW- EXTREME WIND CHI" "BLOWING SNOW & EXTREME WIND CH"
## [47] "BLOWING SNOW/EXTREME WIND CHIL" "BREAKUP FLOODING"
## [49] "BRUSH FIRE" "BRUSH FIRES"
Here we see that ENVTYPE column has names of the disasters having differents spelling like ‘BLIZZARD/HIGH WIND’ and ‘BLIZZARD/WINTER STORM’ some names contain digits like ‘TSTM WIND (G45)’. But accirding to the table above we have 47 event types so our task as to this column is to match each of the names to the 48 event types. Obviously it’s very hard to perform manually so we will have to use Cluster Analysis for this purpose.
Let’s look at another factor columns ‘PROPDMGEXP’ and ‘CROPDMGEXP’:
levels(stormDataFrame$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(stormDataFrame$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
The next part will describe process of transformation the data set to tidy form.
In order to reduce 985 levels of the ‘EVTYPE’ column to 48 event types from the book (page 6), we are going to use Clustering technic. Fot this purpose we are to use ‘stringdist’ package which will calculate distance between strings. The idea of the method to create clusters from 985 levels and link it 48 reference event types.
For more information about the package follow: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
For first, we need to cleand the column ‘EVTYPE’ from rows containing:
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) %like% "summary"), ]
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) %like% "none"), ]
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) %like% "other"), ]
stormDataFrame <- stormDataFrame[!(tolower(stormDataFrame$EVTYPE) == '?'), ]
dim(stormDataFrame)
## [1] 902159 7
138 rows were deleted.
For second, we are going to delete rows where values of ‘FATALITIES’, ‘INJURIES’, ‘PROPDMG’, ‘CROPDMG’ are 0 since it will give us nothing during aggregation operation which will be described below.
Let’s delete these rows:
stormDataFrame <- stormDataFrame[!(stormDataFrame$FATALITIES == 0 &
stormDataFrame$INJURIES == 0 &
stormDataFrame$PROPDMG == 0 &
stormDataFrame$CROPDMG == 0), ]
dim(stormDataFrame)
## [1] 254591 7
647568 rows were deleted.
For third, let’s replace values of ‘PROPDMGEXP’ and ‘CROPDMGEXP’ columns with numbers
#Combine all literal exps:
strangeValues <- c(levels(stormDataFrame$PROPDMGEXP), levels(stormDataFrame$CROPDMGEXP))
strangeValues <- unique(strangeValues)
#create dataframe setting appropriate values
expDictionary <- data.frame(value = strangeValues, exp = c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 2, 3, 6, 6, 3))
expDictionary$value <- as.character(expDictionary$value)
stormDataFrame$PROPDMGEXP <- as.character(stormDataFrame$PROPDMGEXP)
stormDataFrame$CROPDMGEXP <- as.character(stormDataFrame$CROPDMGEXP)
Replacement literal exp with numerical one:
#Frequency of exp in PROPDMG
propDmgExpVector <- stormDataFrame$PROPDMGEXP
frequencyProp <- table(propDmgExpVector)
frequencyProp <- as.data.frame(frequencyProp)
frequencyProp <- frequencyProp[order(-frequencyProp$Freq), ]
frequencyProp
## propDmgExpVector Freq
## 14 K 231424
## 1 11547
## 16 M 11320
## 4 0 210
## 11 B 40
## 8 5 18
## 15 m 7
## 13 H 6
## 3 + 5
## 7 4 4
## 9 6 3
## 10 7 3
## 2 - 1
## 5 2 1
## 6 3 1
## 12 h 1
#Frequency of exp in CROPDMG
cropDmgExpVector <- stormDataFrame$CROPDMGEXP
frequencyCrop <- table(cropDmgExpVector)
frequencyCrop <- as.data.frame(frequencyCrop)
frequencyCrop <- frequencyCrop[order(-frequencyCrop$Freq), ]
frequencyCrop
## cropDmgExpVector Freq
## 1 152652
## 6 K 99902
## 8 M 1985
## 5 k 21
## 3 0 17
## 4 B 7
## 2 ? 6
## 7 m 1
We can see that ‘K’, ’’ and ‘M’ are the most popular here
#Replace 'PROPDMGEXP' with the dictionary:
if('0' %in% propDmgExpVector) {
for(index in 1 : length(propDmgExpVector)) {
pde <- propDmgExpVector[index]
if(pde != 'K' & pde != '' & pde != 'M') {
e <- expDictionary[expDictionary$value == pde, ]$exp
propDmgExpVector[index] <- 10^e
}
}
} else {
print("Replacement is already done!")
}
#Replace 'СROPDMGEXP' with the dictionary:
if('0' %in% cropDmgExpVector) {
for(index in 1 : length(cropDmgExpVector)) {
cde <- cropDmgExpVector[index]
if(cde != 'K' & cde != '' & cde != 'M') {
e <- expDictionary[expDictionary$value == cde, ]$exp
cropDmgExpVector[index] <- 10^e
}
}
} else {
print("Replacement is already done!")
}
#replace 'PROPDMGEXP'
propDmgExpVectorReplaced <- replace(propDmgExpVector, propDmgExpVector == 'K', 10^3) %>%
replace(propDmgExpVector == '', 1) %>%
replace(propDmgExpVector == 'M', 10^6)
propDmgExpVectorReplaced <- as.numeric(propDmgExpVectorReplaced)
stormDataFrame$PROPDMG <- stormDataFrame$PROPDMG * propDmgExpVectorReplaced
#replace 'CROPDMGEXP'
cropDmgExpVectorReplaced <- replace(cropDmgExpVector, cropDmgExpVector == 'K', 10^3) %>%
replace(cropDmgExpVector == '', 1) %>%
replace(cropDmgExpVector == 'M', 10^6)
cropDmgExpVectorReplaced <- as.numeric(cropDmgExpVectorReplaced)
stormDataFrame$CROPDMG <- stormDataFrame$CROPDMG * cropDmgExpVectorReplaced
Drop ‘PROPDMGEXP’ and ‘CROPDMGEXP’:
stormDataFrame <- stormDataFrame[-c(5, 7)]
testCluster <- stormDataFrame
Finally, we need to reduce ‘EVTYPE’ to 48 book types from the table above
#Create a vector with the reference event types:
targetEventTypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow","Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout","Wildfire", "Winter Storm","Winter Weather")
To perform Clustering we are going to use ‘stringdist’ function with the Damerau-Levenshtein distance method which is like the optimal string alignment distance method calculating counts the number of deletions, insertions and substitutions necessary to turn b into a, but except that it allows for multiple edits on substrings.
#get factual event names
factualEventTypes <-as.character(testCluster$EVTYPE)
length(factualEventTypes)
## [1] 254591
Now we are going to use cluster analysis in order to reduce factual event types to 48 book reference event types
#install.packages("stringdist")
library("stringdist")
# Depending on your system this operation may take some time
distanceMatrix <- stringdistmatrix(tolower(factualEventTypes), tolower(targetEventTypes), method = "dl", useNames = "strings")
for(rowNumber in 1 : length(factualEventTypes)) {
minInRow <- min(distanceMatrix[rowNumber, ])
indexInTargetVector <- match(minInRow, distanceMatrix[rowNumber, ])
factualEventTypes[rowNumber] <- colnames(distanceMatrix)[indexInTargetVector]
}
#check that all factyal values were reduced ti target event types
length(unique(factualEventTypes))
## [1] 48
The last step in the data prepatation is aggregation
testCluster$EVTYPE <- toupper(factualEventTypes)
#aggregation per event types:
library(dplyr)
aggregatedStormDataFrame <- testCluster %>%
group_by(EVTYPE) %>%
summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG))
Aggregated tidy data set
aggregatedStormDataFrame
## # A tibble: 48 x 5
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ASTRONOMICAL LOW TIDE 0 0 9745000 0
## 2 AVALANCHE 226 194 3721800 5000000
## 3 BLIZZARD 101 805 659333950 112060000
## 4 COASTAL FLOOD 41 89 520273960 43691600
## 5 COLD/WIND CHILL 97 27 2314500 66600000
## 6 DEBRIS FLOW 1 0 56050000 42000000
## 7 DENSE FOG 19 342 9774000 0
## 8 DENSE SMOKE 0 0 100050 0
## 9 DROUGHT 16 52 1053120600 13972631000
## 10 DUST DEVIL 2 43 743130 0
## # ... with 38 more rows
1. Across the United States, which types of events are most harmful with respect to population health?
#health impact
healthImpactDataFrame <- aggregatedStormDataFrame[c(1, 2, 3)]
transponedHealthImpactDataFrame <- t(healthImpactDataFrame)
colnames(transponedHealthImpactDataFrame) <- healthImpactDataFrame$EVTYPE
transponedHealthImpactDataFrame <- transponedHealthImpactDataFrame[-1, ]
transponedHealthImpactDataFrame <- transponedHealthImpactDataFrame[ , order((transponedHealthImpactDataFrame[1, ]), decreasing = TRUE)]
Plot “Health impacts from disasters sorted by fatalities”
barplot(transponedHealthImpactDataFrame,
main = "Injuries and fatalities from disasters sorted by fatalities",
col = c("red","blue"),
ylab = "Casualties, people",
ylim = range(pretty(c(0, 10000))),
cex.names = 0.6,
las = 2
)
legend("topright",
c("Injuries", "Fatalities"),
fill = c("blue" ,"red")
)
2. Across the United States, which types of events have the greatest economic consequences?
#economic impact
economicImpactDataFrame <- aggregatedStormDataFrame[c(1, 4, 5)]
transponedEconomicImpactDataFrame <- t(economicImpactDataFrame)
colnames(transponedEconomicImpactDataFrame) <- healthImpactDataFrame$EVTYPE
transponedEconomicImpactDataFrame <- transponedEconomicImpactDataFrame[-1, ]
transponedEconomicImpactDataFrame <- transponedEconomicImpactDataFrame[ , order((transponedEconomicImpactDataFrame[1, ]), decreasing = TRUE)]
Plot “Property and crop damages from disasters sorted by property damages”
barplot(transponedEconomicImpactDataFrame,
main = "Property and crop damages from disasters sorted by prop damages, USD",
col = c("red","blue"),
cex.names = 0.6,
las = 2
)
legend("topright",
c("Crop damages", "Property damages"),
fill = c("blue" ,"red"),
cex = 0.75
)
As it can be seen from the plots above:
1. Types of events are most harmful with respect to population health:
By total result (Fatalities + Injuries), people:
By Fatalities, people:
By Injuries, people:
2. Types of events have the greatest economic consequences:
By total result (Property damages + Crop damages), billions USD:
By Property damages, billions USD:
By Crop damages, billions USD: