Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
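This report addresses two basic questions: which types of events are most harmful to population health, and which types of events have the greatest economic consequences?
First of all, we download and read the data using download.file and read.csv respectively.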
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "data.csv.bz2")
data <- read.csv("data.csv.bz2")
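Re-running the report repeats the download every time. A minimal sketch of a guard that skips the download when the file is already on disk follows; fetch_once is a hypothetical helper name and is not part of the original analysis.

```r
# Hypothetical helper (not part of the original report): download a file
# only when it is not already present on disk.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage, assuming the `url` object defined above:
# fetch_once(url, "data.csv.bz2")
# data <- read.csv("data.csv.bz2")
```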
Now we take a look at the data using the tbl_df function of the dplyr package.
library(dplyr)
tbl_df(data)
## Source: local data frame [902,297 x 37]
##
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME  STATE
##      (dbl)             (fctr)   (fctr)    (fctr)  (dbl)     (fctr) (fctr)
##  1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE     AL
##  2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN     AL
##  3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE     AL
##  4       1   6/8/1951 0:00:00     0900       CST     89    MADISON     AL
##  5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN     AL
##  6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE     AL
##  7       1 11/16/1951 0:00:00     0100       CST      9     BLOUNT     AL
##  8       1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA     AL
##  9       1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA     AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE     AL
## ..     ...                ...      ...       ...    ...        ...    ...
## Variables not shown: EVTYPE (fctr), BGN_RANGE (dbl), BGN_AZI (fctr),
##   BGN_LOCATI (fctr), END_DATE (fctr), END_TIME (fctr), COUNTY_END (dbl),
##   COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (fctr), END_LOCATI (fctr),
##   LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
##   INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (fctr), CROPDMG (dbl),
##   CROPDMGEXP (fctr), WFO (fctr), STATEOFFIC (fctr), ZONENAMES (fctr),
##   LATITUDE (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl),
##   REMARKS (fctr), REFNUM (dbl)
As part of processing the data, we convert the column names to lower case and select the columns relevant to our research questions. Because we want to explore the relation between the type of event (evtype) and its population health or economic consequences, we select the columns for event type, fatalities, injuries, property damage and crop damage. The propdmgexp and cropdmgexp variables are also needed, as they indicate the scale of the damage values.
names(data) <- tolower(names(data))
aData <- select(data, evtype, fatalities:cropdmgexp)
tbl_df(aData)
## Source: local data frame [902,297 x 7]
##
##     evtype fatalities injuries propdmg propdmgexp cropdmg cropdmgexp
##     (fctr)      (dbl)    (dbl)   (dbl)     (fctr)   (dbl)     (fctr)
##  1 TORNADO          0       15    25.0          K       0
##  2 TORNADO          0        0     2.5          K       0
##  3 TORNADO          0        2    25.0          K       0
##  4 TORNADO          0        2     2.5          K       0
##  5 TORNADO          0        2     2.5          K       0
##  6 TORNADO          0        6     2.5          K       0
##  7 TORNADO          0        1     2.5          K       0
##  8 TORNADO          0        0     2.5          K       0
##  9 TORNADO          1       14    25.0          K       0
## 10 TORNADO          0        0    25.0          K       0
## ..     ...        ...      ...     ...        ...     ...        ...
These are the top rows of our analytic data.
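The column range fatalities:cropdmgexp works because those six columns sit next to each other in the data. As a rough illustration, the base-R sketch below performs the same selection by position on a made-up one-row data frame; toy and toy_small are hypothetical names, and the row only mirrors the first record shown above.

```r
# Toy data frame with a few of the original columns, in their original order
toy <- data.frame(
  EVTYPE = "TORNADO", BGN_DATE = "4/18/1950 0:00:00",
  FATALITIES = 0, INJURIES = 15,
  PROPDMG = 25, PROPDMGEXP = "K",
  CROPDMG = 0, CROPDMGEXP = "", WFO = "",
  stringsAsFactors = FALSE
)

# Lower-case the column names, as in the report
names(toy) <- tolower(names(toy))

# Positions of the first and last column of the contiguous range
first <- match("fatalities", names(toy))
last  <- match("cropdmgexp", names(toy))

# Keep evtype plus every column from fatalities through cropdmgexp
keep <- c("evtype", names(toy)[first:last])
toy_small <- toy[, keep]
names(toy_small)
```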
To obtain the total number of fatalities and injuries per event type, we apply the group_by and summarize functions to our analytic data. We then arrange the new data frame by descending fatalities, so that the most lethal event types appear first.
totalNumbers <- aData %>%
  group_by(evtype) %>%
  summarize(fatalities = sum(fatalities), injuries = sum(injuries)) %>%
  arrange(desc(fatalities))
tbl_df(totalNumbers)
## Source: local data frame [985 x 3]
##
##            evtype fatalities injuries
##            (fctr)      (dbl)    (dbl)
##  1        TORNADO       5633    91346
##  2 EXCESSIVE HEAT       1903     6525
##  3    FLASH FLOOD        978     1777
##  4           HEAT        937     2100
##  5      LIGHTNING        816     5230
##  6      TSTM WIND        504     6957
##  7          FLOOD        470     6789
##  8    RIP CURRENT        368      232
##  9      HIGH WIND        248     1137
## 10      AVALANCHE        224      170
## ..            ...        ...      ...
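The grouped sums can be cross-checked without dplyr. The sketch below uses base R's aggregate on a small made-up data frame; toy and its numbers are invented for illustration and are not taken from the storm data.

```r
# Made-up records: two event types occur more than once
toy <- data.frame(
  evtype     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  fatalities = c(1, 3, 2, 0, 4),
  injuries   = c(14, 0, 6, 2, 1)
)

# Sum both outcome columns within each event type
totals <- aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)

# Sort by descending fatalities, mirroring arrange(desc(fatalities))
totals <- totals[order(-totals$fatalities), ]
totals
```

Here HEAT comes first with 7 fatalities, followed by TORNADO with 3 and FLOOD with 0.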
Now we prepare small subsets of totalNumbers for plotting. The fatal data frame contains the top ten event types by fatalities. Although its rows are now ordered by fatality count, the levels of the evtype factor are not, and ggplot2 draws the bars in level order. To get neatly descending bars in our bar plot, we reorder the levels so that they follow the fatality values. Then we plot using the ggplot2 package.
fatal <- arrange(totalNumbers, desc(fatalities))[1:10,]
fatal$evtype <- factor(fatal$evtype, levels = fatal$evtype[order(fatal$fatalities)])
library(ggplot2)
g <- ggplot(fatal, aes(evtype, fatalities))
g + geom_bar(stat = "identity", fill = "darkred") + coord_flip() +
  xlab("Event Types") + ylab("Number of Fatalities") + ggtitle("Event types with most fatalities")
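The effect of the releveling step can be seen on a three-row example. This is only a sketch: toy is a hypothetical data frame, with fatality counts copied from the table above.

```r
# Three event types with their fatality totals from the table above
toy <- data.frame(
  evtype     = c("TORNADO", "LIGHTNING", "EXCESSIVE HEAT"),
  fatalities = c(5633, 816, 1903),
  stringsAsFactors = TRUE
)

# By default, factor levels are alphabetical
levels(toy$evtype)

# Rebuild the factor so its levels run from fewest to most fatalities
ordered_names <- as.character(toy$evtype[order(toy$fatalities)])
toy$evtype <- factor(toy$evtype, levels = ordered_names)
levels(toy$evtype)
```

After releveling, the levels read LIGHTNING, EXCESSIVE HEAT, TORNADO. With coord_flip the last level is drawn at the top, so the largest bar ends up first.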
Plotting the injury numbers is analogous. The injur data frame contains the top ten event types by injuries. Again, we have to reorder the evtype levels.
injur <- arrange(totalNumbers, desc(injuries))[1:10,]
injur$evtype <- factor(injur$evtype, levels = injur$evtype[order(injur$injuries)])
g <- ggplot(injur, aes(evtype, injuries))
g + geom_bar(stat = "identity", fill = "darkred") + coord_flip() +
  xlab("Event Types") + ylab("Number of Injuries") + ggtitle("Event types with most injuries")
Clearly, tornadoes are by far the most harmful event type in terms of both fatalities and injuries. Excessive heat, flash floods, heat and lightning also cause high numbers of fatalities, whereas tstm wind, floods, excessive heat and lightning are the biggest causes of injuries after tornadoes.
To determine which event types have the biggest economic consequences, we first subset our analytic data. We select only the damage-related columns and drop the observations that report neither property nor crop damage. This reduces the dataset from 902,297 to 245,031 rows.
damage <- aData %>%
  select(evtype, propdmg:cropdmgexp) %>%
  filter(propdmg != 0 | cropdmg != 0)
tbl_df(damage)
## Source: local data frame [245,031 x 5]
##
##     evtype propdmg propdmgexp cropdmg cropdmgexp
##     (fctr)   (dbl)     (fctr)   (dbl)     (fctr)
##  1 TORNADO    25.0          K       0
##  2 TORNADO     2.5          K       0
##  3 TORNADO    25.0          K       0
##  4 TORNADO     2.5          K       0
##  5 TORNADO     2.5          K       0
##  6 TORNADO     2.5          K       0
##  7 TORNADO     2.5          K       0
##  8 TORNADO     2.5          K       0
##  9 TORNADO    25.0          K       0
## 10 TORNADO    25.0          K       0
## ..     ...     ...        ...     ...        ...
The next step is to scale the propdmg and cropdmg values according to their corresponding exp values. “K” multiplies the damage value by 1,000, “M” by 1,000,000 and “B” by 1,000,000,000. To do this, we first create a logical vector for each multiplier, comparing in upper case so that lower-case codes are matched as well. Using these vectors, we multiply the damage values by the correct factor; values with any other code are left unchanged. Note that we have to go through this process twice, once for the property values and once for the crop values.
# Changing the property damage values
kvector <- toupper(damage$propdmgexp)=="K"
mvector <- toupper(damage$propdmgexp)=="M"
bvector <- toupper(damage$propdmgexp)=="B"
damage[kvector,]$propdmg <- damage[kvector,]$propdmg * 10**3
damage[mvector,]$propdmg <- damage[mvector,]$propdmg * 10**6
damage[bvector,]$propdmg <- damage[bvector,]$propdmg * 10**9
# Changing the crop damage values
kvector <- toupper(damage$cropdmgexp)=="K"
mvector <- toupper(damage$cropdmgexp)=="M"
bvector <- toupper(damage$cropdmgexp)=="B"
damage[kvector,]$cropdmg <- damage[kvector,]$cropdmg * 10**3
damage[mvector,]$cropdmg <- damage[mvector,]$cropdmg * 10**6
damage[bvector,]$cropdmg <- damage[bvector,]$cropdmg * 10**9
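The same scaling can also be written with a lookup table, which avoids repeating the three steps for each column. The sketch below is an alternative formulation, not the code used in this report; exp_multiplier and toy are hypothetical names. It assumes, as the code above does, that any code other than K, M or B leaves the value unchanged.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers.
# Codes other than K, M or B (including blanks) map to 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up values covering each case, including a lower-case and a blank code
toy <- data.frame(
  propdmg    = c(25, 2.5, 1.5, 7),
  propdmgexp = c("K", "m", "B", "")
)
toy$propdmg * exp_multiplier(toy$propdmgexp)
```

Applied to the real data, the same helper would be called once for propdmgexp and once for cropdmgexp.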
Now that we have the correct damage numbers, we can calculate the total damage per event type. First we group by event type, then we apply summarize to add up the propdmg and cropdmg values, and finally we arrange by total damage. All of this can be done using the dplyr package.
damage <- damage %>%
  group_by(evtype) %>%
  summarize(totalDamage = sum(propdmg) + sum(cropdmg)) %>%
  arrange(desc(totalDamage))
tbl_df(damage)
## Source: local data frame [431 x 2]
##
##               evtype  totalDamage
##               (fctr)        (dbl)
##  1             FLOOD 150319678257
##  2 HURRICANE/TYPHOON  71913712800
##  3           TORNADO  57352114049
##  4       STORM SURGE  43323541000
##  5              HAIL  18758221521
##  6       FLASH FLOOD  17562129167
##  7           DROUGHT  15018672000
##  8         HURRICANE  14610229010
##  9       RIVER FLOOD  10148404500
## 10         ICE STORM   8967041360
## ..               ...          ...
To plot these numbers, we take the top ten rows. Again, we have to address the issue that the levels of evtype are not ordered by the amount of damage, even though the rows are arranged by totalDamage. The second statement below takes care of this. Only then can we plot.
dam <- damage[1:10,]
dam$evtype <- factor(dam$evtype, levels = dam$evtype[order(dam$totalDamage)])
g <- ggplot(dam, aes(evtype, totalDamage))
g + geom_bar(stat = "identity", fill = "darkblue") + coord_flip() + xlab("Event Types") +
  ylab("Total damage") + ggtitle("Event types with biggest economic consequences")
Floods cause by far the most damage, followed by hurricanes/typhoons, tornadoes and storm surges.
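Totals of this size are hard to read at full precision. As a small sketch, the top three values from the table above can be rescaled to billions before plotting or reporting. The toy data frame and the damageBillions column are hypothetical names, and the unit is assumed to be US dollars.

```r
# Top three totals copied from the table above
toy <- data.frame(
  evtype      = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO"),
  totalDamage = c(150319678257, 71913712800, 57352114049)
)

# Express the totals in billions (assumed to be US dollars), one decimal
toy$damageBillions <- round(toy$totalDamage / 1e9, 1)
toy
```

This gives 150.3 for floods, 71.9 for hurricanes/typhoons and 57.4 for tornadoes.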