-author: “Georgios Mintzopoulos” -date: “Tuesday, September 16, 2014”
Storms and other severe weather events cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where hey occur, as well as estimates of any fatalities, injuries, and property damage.
We analyze the raw data to produce a tidy data set and perform two steps of exploratory analysis regarding impact of events on health and the costs associated with damages.
The data for this Coursera course assignment, come in the form of a comma-separated-value file named “repdata-data-StormData.csv”. In this analysis we start from the raw CSV file containing the data, which is supposed to be available.
We read the data into R environment.
raw_data=read.csv("repdata-data-StormData.csv",header=T,sep=",")
# the raw data file size is:
format(object.size(raw_data),units="Mb")
## [1] "397.7 Mb"
The raw data have the following structure,
str(raw_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We set the printing of numbers in R as below.
options(scipen = 12) # set numbers printing
Some more processing of the data follows, next. We show the annotated code and some print outs.
library(stringr)
# get the variables names
var.raw_data=names(raw_data)
# number the variables (columns) as a ref for subsetting
names(var.raw_data)=seq(1:length(var.raw_data))
# print the columns
var.raw_data
## 1 2 3 4 5
## "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## 6 7 8 9 10
## "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## 11 12 13 14 15
## "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## 16 17 18 19 20
## "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## 21 22 23 24 25
## "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## 26 27 28 29 30
## "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## 31 32 33 34 35
## "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## 36 37
## "REMARKS" "REFNUM"
# subset the data to keep the columns of interest
data=subset(raw_data,select=c(8,23:28))
rm(raw_data)
# new data set has size:
format(object.size(data),units="Mb")
## [1] "37.9 Mb"
# new data structure
str(data)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
# get the variables names
var.data=names(data)
# number the variables (columns) as a ref for subsetting
names(var.data)=seq(1:length(var.data))
var.data
## 1 2 3 4 5
## "EVTYPE" "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP"
## 6 7
## "CROPDMG" "CROPDMGEXP"
Our data have show many different storm events in the variable EVTYPE. We will use the defined from NOAA in our codebook 48 events We manually define a vector containing these events and we continue our data processing based on these defined events.
# We set the storm data acceptable event types as defined in our codebook
# this will be a vector containing events as strings
# ...we set all values to upper case, and trim whitespaces
# Storm Data Events taken from description
StormDataEventTable=(c("Astronomical Low Tide","Avalanche", "Blizzard", "Coastal Flood","Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke ","Drought ","Dust Devil ","Dust Storm ","Excessive Heat ","Extreme Cold/Wind Chill ","Flash Flood ","Flood ","Frost/Freeze ","Funnel Cloud ","Freezing Fog ","Hail ","Heat ","Heavy Rain ","Heavy Snow ","High Surf ","High Wind ","Hurricane (Typhoon) ","Ice Storm ","Lake-Effect Snow ","Lakeshore Flood ","Lightning ","Marine Hail ","Marine High Wind ","Marine Strong Wind ","Marine Thunderstorm Wind ","Rip Current ","Seiche ","Sleet ","Storm Surge/Tide ", "Strong Wind ","Thunderstorm Wind ","Tornado ","Tropical Depression ","Tropical Storm ","Tsunami ","Volcanic Ash ","Waterspout ","Wildfire ","Winter Storm ","Winter Weather" ))
# let's use this as a vector with the events types
events=as.vector(str_trim(toupper(StormDataEventTable),side="both"))
rm(StormDataEventTable)
# remove all white spaces from the values of data$EVTYPE
data$EVTYPE=as.character(str_trim(data$EVTYPE,side="both"))
# check out which observations in our data fall in the category of valid events and subset the data accordingly, producing a new data set called "valid_data"
valid_data=subset(data,data$EVTYPE%in%events)
rm(data)
We have now produced a subset of the original data set called valid_data containing the events described previously.
# the variables and types of the valid_data data set are:
sapply(valid_data,class)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## "character" "numeric" "numeric" "numeric" "factor" "numeric"
## CROPDMGEXP
## "factor"
We next examine the cost parameters in our data. There are two variables, called “PROPDMG” and “CROPDMG”, for property damages and crop damages respectively. For each of these variables there is a also a column containing a multiplier for its values. The multipliers found in our data are:
# set cash multipliers from property and crop damages
valid_data$PROPDMGEXP=as.character(valid_data$PROPDMGEXP)
valid_data$CROPDMGEXP=as.character(valid_data$CROPDMGEXP)
#property multipliers
PROPDMGEXP=unique(toupper(valid_data$PROPDMGEXP))
PROPDMGEXP
## [1] "K" "M" "" "B" "+" "0" "5" "2" "4" "7" "?" "-" "6" "3" "1" "8" "H"
# crop multipliers
CROPDMGEXP=unique(toupper(valid_data$CROPDMGEXP))
CROPDMGEXP
## [1] "" "K" "M" "B" "0"
# CROPDMGEXP values are a subset of PROPDMGEXP values
CROPDMGEXP%in%PROPDMGEXP
## [1] TRUE TRUE TRUE TRUE TRUE
The multipliers are interpreted as following.. - H: hundred $ (10^2) - K: thousand $ (10^3) - M: million $ (10^6) - B: billion $ (10^9) - any number in {2,3,4,5,6,7,8} is a multiple; eg (DMG x number) - 0,1 and any other symbol is treated as dollars (no multiples)
Next we process the multipliers in R
# set the multipliers values
m0=data.frame(c("","+","?","-","1","0"),1)
names(m0)=c("symbol","multiplier")
m1=data.frame(c("2","3","4","5","6","7","8"),c(2,3,4,5,6,7,8))
names(m1)=c("symbol","multiplier")
m2=data.frame(c("H","K","M","B"),c(10^2,10^3,10^6,10^9))
names(m2)=c("symbol","multiplier")
M=rbind(m0,m1,m2)
M$symbol=as.character(M$symbol)
# The multipliers are
M
## symbol multiplier
## 1 1
## 2 + 1
## 3 ? 1
## 4 - 1
## 5 1 1
## 6 0 1
## 7 2 2
## 8 3 3
## 9 4 4
## 10 5 5
## 11 6 6
## 12 7 7
## 13 8 8
## 14 H 100
## 15 K 1000
## 16 M 1000000
## 17 B 1000000000
rm(m0,m1,m2)
We now calculate our cost of damages as following:
# write a function to calculate the costs
set_costs=function(vectorIn,multipliers)
{
for (i in 1:length(vectorIn))
{
index=match(vectorIn[i],multipliers[,1])
vectorIn[i]=multipliers[index,2]
}
return (vectorIn)
}
# calculate the costs for PROPDMG and CROPDMG
valid_data$PROPDMGEXP=set_costs(valid_data$PROPDMGEXP,M)
valid_data$PROPDMGEXP=as.numeric(valid_data$PROPDMGEXP)
valid_data$CROPDMGEXP=set_costs(valid_data$CROPDMGEXP,M)
valid_data$CROPDMGEXP=as.numeric(valid_data$CROPDMGEXP)
# calculate the two costs
valid_data$PROPDMG=valid_data$PROPDMG*valid_data$PROPDMGEXP
valid_data$CROPDMG=valid_data$CROPDMG*valid_data$CROPDMGEXP
Next we will process our data doing some final steps in order to produce a couple of plots and draw some conclusions
#_________________________ draw some conclusions_________________________
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(ggplot2)
library(scales)
# subset to keep only the variables of interest
valid_data=select(valid_data,1:4,6)
# the data size is now
format(object.size(valid_data),units="Mb")
## [1] "24.2 Mb"
# and has the variables
sapply(valid_data,class)
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## "character" "numeric" "numeric" "numeric" "numeric"
# fatalities and injuries are all complete cases
sum(is.na(valid_data$FATALITIES))
## [1] 0
sum(is.na(valid_data$INJURIES))
## [1] 0
The produced data that we will use in our plots are, “health” containing the health impacts (fatalities & injuries), and “cost” containing the damage impact (property & crop), per event type.
# aggreagated data are as following:
health=group_by(valid_data,EVTYPE)%.%summarize(Total_fatalities=sum(FATALITIES),Total_injuries=sum(INJURIES))
health=arrange(health,-Total_injuries)
health
## Source: local data frame [46 x 3]
##
## EVTYPE Total_fatalities Total_injuries
## 1 TORNADO 5633 91346
## 2 FLOOD 470 6789
## 3 EXCESSIVE HEAT 1903 6525
## 4 LIGHTNING 816 5230
## 5 HEAT 937 2100
## 6 ICE STORM 89 1975
## 7 FLASH FLOOD 978 1777
## 8 THUNDERSTORM WIND 133 1488
## 9 HAIL 15 1361
## 10 WINTER STORM 206 1321
## 11 HIGH WIND 248 1137
## 12 HEAVY SNOW 127 1021
## 13 WILDFIRE 75 911
## 14 BLIZZARD 101 805
## 15 DUST STORM 22 440
## 16 WINTER WEATHER 33 398
## 17 DENSE FOG 18 342
## 18 TROPICAL STORM 58 340
## 19 STRONG WIND 103 280
## 20 HEAVY RAIN 98 251
## 21 RIP CURRENT 368 232
## 22 AVALANCHE 224 170
## 23 HIGH SURF 101 152
## 24 TSUNAMI 33 129
## 25 DUST DEVIL 2 42
## 26 WATERSPOUT 3 29
## 27 MARINE THUNDERSTORM WIND 10 26
## 28 EXTREME COLD/WIND CHILL 125 24
## 29 MARINE STRONG WIND 14 22
## 30 COLD/WIND CHILL 95 12
## 31 STORM SURGE/TIDE 11 5
## 32 DROUGHT 0 4
## 33 FUNNEL CLOUD 0 3
## 34 COASTAL FLOOD 3 2
## 35 MARINE HIGH WIND 1 1
## 36 ASTRONOMICAL LOW TIDE 0 0
## 37 DENSE SMOKE 0 0
## 38 FREEZING FOG 0 0
## 39 FROST/FREEZE 0 0
## 40 LAKE-EFFECT SNOW 0 0
## 41 LAKESHORE FLOOD 0 0
## 42 MARINE HAIL 0 0
## 43 SEICHE 0 0
## 44 SLEET 2 0
## 45 TROPICAL DEPRESSION 0 0
## 46 VOLCANIC ASH 0 0
cost=group_by(valid_data,EVTYPE)%.%summarize(Property_damage=sum(PROPDMG,na.rm=T),Crop_damage=sum(CROPDMG,na.rm=T))
cost=arrange(cost,-Property_damage)
cost
## Source: local data frame [46 x 3]
##
## EVTYPE Property_damage Crop_damage
## 1 FLOOD 144657709807 5661968450
## 2 TORNADO 56925661188 414953270
## 3 FLASH FLOOD 16140862555 1421317100
## 4 HAIL 15727367663 3025537473
## 5 TROPICAL STORM 7703890550 678346000
## 6 WINTER STORM 6688497251 26944000
## 7 HIGH WIND 5270046295 638571300
## 8 WILDFIRE 4765114000 295472800
## 9 STORM SURGE/TIDE 4641188000 850000
## 10 ICE STORM 3944927860 5022113500
## 11 THUNDERSTORM WIND 3483121296 414843050
## 12 DROUGHT 1046106000 13972566000
## 13 HEAVY SNOW 932589149 134653100
## 14 LIGHTNING 928659516 12092090
## 15 HEAVY RAIN 694248090 733399800
## 16 BLIZZARD 659213950 112060000
## 17 COASTAL FLOOD 237665560 0
## 18 STRONG WIND 175241450 64953500
## 19 TSUNAMI 144062000 20000
## 20 HIGH SURF 89575000 0
## 21 LAKE-EFFECT SNOW 40115000 0
## 22 WINTER WEATHER 20866000 15000000
## 23 DENSE FOG 9674000 0
## 24 FROST/FREEZE 9480000 1094086000
## 25 WATERSPOUT 9353700 0
## 26 EXTREME COLD/WIND CHILL 8648000 50000
## 27 EXCESSIVE HEAT 7753700 492402000
## 28 LAKESHORE FLOOD 7540000 0
## 29 DUST STORM 5549000 3100000
## 30 AVALANCHE 3721800 0
## 31 FREEZING FOG 2182000 0
## 32 COLD/WIND CHILL 1990000 600000
## 33 HEAT 1797000 401461500
## 34 TROPICAL DEPRESSION 1737000 0
## 35 MARINE HIGH WIND 1297010 0
## 36 SEICHE 980000 0
## 37 DUST DEVIL 700330 0
## 38 VOLCANIC ASH 500000 0
## 39 MARINE THUNDERSTORM WIND 436400 50000
## 40 MARINE STRONG WIND 418330 0
## 41 ASTRONOMICAL LOW TIDE 320000 0
## 42 FUNNEL CLOUD 194600 0
## 43 DENSE SMOKE 100000 0
## 44 MARINE HAIL 4000 0
## 45 RIP CURRENT 1000 0
## 46 SLEET 0 0
Next we draw our plots
# draw plotA: total fatalities and injuries per event type
# transform EVTYPE back to factor ordered by decreasing total number of Injuries/Property damage
health$EVTYPE=factor(health$EVTYPE,levels=health$EVTYPE[order(-health$Total_injuries)],ordered=T)
cost$EVTYPE=factor(cost$EVTYPE,levels=cost$EVTYPE[order(-cost$Property_damage)],ordered=T)
# use melt function to transform the data frame to long form for correct plotting
health_long=melt(health,id="EVTYPE",measure.vars=c("Total_fatalities","Total_injuries"))
plotA=ggplot(health_long,aes(EVTYPE,value,group=variable,fill=variable))+geom_bar(stat="identity")+xlab("Event type")+ylab("")+scale_y_continuous(labels = comma)+theme(axis.title.x = element_text(face="bold", colour="black", size=10),axis.text.x = element_text(angle=90, vjust=0.5, size=8))
print(plotA)
# draw plot2: total cost per event type
# transform data frame to long form
cost_long=melt(cost,id="EVTYPE",measure.vars=c("Property_damage","Crop_damage"))
plotB=ggplot(cost_long,aes(EVTYPE,value,group=variable,fill=variable))+geom_bar(stat="identity")+xlab("Event type")+ylab("USD")+scale_y_continuous(labels=comma)+theme(axis.title.x = element_text(face="bold", colour="black", size=10),axis.text.x = element_text(angle=90, vjust=0.5, size=8))
print(plotB)
From the two plots we can identify the events that have the highest impact on health and the highest damage costs.