Data file is downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.
There is also some dcumentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
# download the data file if it doesnt already exist
if (!file.exists("StormData.csv.bz2"))
{
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="StormData.csv.bz2", method="curl")
}
# Read the datafile
stormData <- read.csv("StormData.csv.bz2")
# Check the structure
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The event types are stored in EVTYPE
To evaluate health impact we need FATALITIES and INJURIES
To evaluate economic impact we need PROPDMG (property damage), CROPDMG (crop damage). We also need the PROPDMGEXP and CROPDMGEXP which are multipliers on the ###DMG variables. (for example K for thousand, M for million)
# setup libraries required
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# create an extract with just the required fields
stormDataEx <- select(stormData, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)
Each of the fields is checked for its contents to see what needs to be cleaned.
EVTYPE seems to be inconsistent in upper/lower case, as well as has many sub groups. There is 985 EVTYPE factors, we will look at the most common 20 and clean them until they align with the documentation. Additional cleaning is added in this step as found by looking at the top 20 sorted by health and economic data.
The DMG fields are numbers, but need to be multiplied by the EXP fields to make sensible information.
The EXP fields need to be cleaned. Primarily it contains H, K, M, B (hundreds, thousands, millions, billions) But it also contains some direct multipliers like 3,4,5,6,7 (assumed to be the number of zeroes) Some lower case h and m (assumed to be H and M) Also some -, +, ? (assumed to be no multiplier)
FATALITIES and INJURIES are just raw numbers so can be combined directly to make a health impact variable
The 2 DMG variables are multiplied with their exponents and combined to make the ECON variable
# summary
summary(stormDataEx)
## EVTYPE PROPDMG PROPDMGEXP
## HAIL :288661 Min. : 0.00 :465934
## TSTM WIND :219940 1st Qu.: 0.00 K :424665
## THUNDERSTORM WIND: 82563 Median : 0.00 M : 11330
## TORNADO : 60652 Mean : 12.06 0 : 216
## FLASH FLOOD : 54277 3rd Qu.: 0.50 B : 40
## FLOOD : 25326 Max. :5000.00 5 : 28
## (Other) :170878 (Other): 84
## CROPDMG CROPDMGEXP FATALITIES INJURIES
## Min. : 0.000 :618413 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.000 K :281832 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.000 M : 1994 Median : 0.0000 Median : 0.0000
## Mean : 1.527 k : 21 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.: 0.000 0 : 19 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :990.000 B : 9 Max. :583.0000 Max. :1700.0000
## (Other): 9
# showing event types - sorted by top 20 as there is too many
head(sort(table(stormDataEx$EVTYPE), decreasing = TRUE), n=20)
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
# showing EXP field information
table(stormDataEx$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(stormDataEx$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
# upper case the EXP variables
stormDataEx$PROPDMGEXP <- toupper(stormDataEx$PROPDMGEXP)
stormDataEx$CROPDMGEXP <- toupper(stormDataEx$CROPDMGEXP)
# cleaning the EXP variables
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "")] <- 10^0
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "?")] <- 10^0
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "0")] <- 10^0
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "2")] <- 10^2
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "K")] <- 10^3
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "M")] <- 10^6
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "B")] <- 10^9
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "-")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "?")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "+")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "0")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "1")] <- 10^1
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "2")] <- 10^2
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "3")] <- 10^3
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "4")] <- 10^4
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "5")] <- 10^5
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "6")] <- 10^6
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "7")] <- 10^7
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "8")] <- 10^8
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "H")] <- 10^2
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "K")] <- 10^3
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "M")] <- 10^6
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "B")] <- 10^9
stormDataEx$PROPDMGEXP <- as.numeric(stormDataEx$PROPDMGEXP)
stormDataEx$CROPDMGEXP <- as.numeric(stormDataEx$PROPDMGEXP)
# creating the total economic consequences variable
stormDataEx <- mutate(stormDataEx, ECON = (PROPDMG*PROPDMGEXP)+(CROPDMG*CROPDMGEXP) )
# upper case the EVTYPE
stormDataEx$EVTYPE <- toupper(stormDataEx$EVTYPE)
# check EVTYPE again - looking at the top 20
head(sort(table(stormDataEx$EVTYPE), decreasing = TRUE), n=20)
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219942 82564
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25327
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20214 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11742 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7045 6844 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3569
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
# cleaning the variables
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "TSTM WIND")] <- "THUNDERSTORM WIND"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "THUNDERSTORM WINDS")] <- "THUNDERSTORM WIND"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "MARINE TSTM WIND")] <- "MARINE THUNDERSTORM WIND"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "URBAN/SML STREAM FLD")] <- "FLOOD"
# some cleaning as found in the health data
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "WILD/FOREST FIRE")] <- "WILDFIRE"
# some cleaning as found in the economic data
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "HURRICANE")] <- "HURRICANE/TYPHOON"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "STORM SURGE")] <- "STORM SURGE/TIDE"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "RIVER FLOOD")] <- "FLOOD"
# check EVTYPE again - looking at the top 20
head(sort(table(stormDataEx$EVTYPE), decreasing = TRUE), n=20)
##
## THUNDERSTORM WIND HAIL TORNADO
## 323349 288661 60652
## FLASH FLOOD FLOOD HIGH WIND
## 54277 28892 20214
## LIGHTNING HEAVY SNOW MARINE THUNDERSTORM WIND
## 15754 15708 11987
## HEAVY RAIN WINTER STORM WINTER WEATHER
## 11742 11433 7045
## FUNNEL CLOUD WILDFIRE WATERSPOUT
## 6844 4218 3796
## STRONG WIND BLIZZARD DROUGHT
## 3569 2719 2488
## ICE STORM EXCESSIVE HEAT
## 2006 1678
# combining FATALITIES and INJURIES
stormDataEx <- mutate(stormDataEx, HEALTH = FATALITIES + INJURIES)
# summing the health impacts by event
healthdata <- with(stormDataEx, aggregate(HEALTH ~ EVTYPE, FUN = sum))
# arranging by descending number of total health impacts
healthdata <- arrange(healthdata, desc(HEALTH))
# summing the economic cost in dollars by event
econdata <- with(stormDataEx, aggregate(ECON ~ EVTYPE, FUN = sum))
# arranging by descending number of total economic cost
econdata <- arrange(econdata, desc(ECON))
Figure 1 displays the total health impact numbers by event type
# plotting health using ggplot
g_h <- ggplot(healthdata[1:10,], aes(x=reorder(EVTYPE, -HEALTH),y=HEALTH)) +
geom_bar(stat="identity") +
xlab("Event types") + ylab("Total health impacts (fatalities and injuries)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.position="none") +
ggtitle("Total health impact numbers (Injuries + Fatalities) by weather event type")
g_h
Figure 2 displays the total economic cost by event type
# plotting economic cost using ggplot
g_e <- ggplot(econdata[1:10,], aes(x=reorder(EVTYPE, -ECON),y=ECON)) +
geom_bar(stat="identity") +
xlab("Event types") + ylab("Total economic cost ($)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.position="none") +
ggtitle("Total economic cost by weather event type")
g_e