Ion Kel
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
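This report addresses two basic questions: which event types cause the most fatalities and injuries, and which event types have the biggest economic consequences.
First, we download and read the data using download.file and read.csv, respectively.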
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists('data.csv.bz2')){download.file(url, destfile = "data.csv.bz2")}
data <- read.csv("data.csv.bz2")
Now we take a look at the data using the tbl_df function of the dplyr package.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tbl_df(data)
## # A tibble: 902,297 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## <dbl> <fctr> <fctr> <fctr> <dbl> <fctr> <fctr>
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## # ... with 902,287 more rows, and 30 more variables: EVTYPE <fctr>,
## # BGN_RANGE <dbl>, BGN_AZI <fctr>, BGN_LOCATI <fctr>, END_DATE <fctr>,
## # END_TIME <fctr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <fctr>, END_LOCATI <fctr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <fctr>, CROPDMG <dbl>, CROPDMGEXP <fctr>, WFO <fctr>,
## # STATEOFFIC <fctr>, ZONENAMES <fctr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <fctr>, REFNUM <dbl>
names(data) <- tolower(names(data))
aData <- select(data, evtype, fatalities:cropdmgexp)
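The select call above relies on a column range: judging by the column listing printed earlier, fatalities:cropdmgexp should cover fatalities, injuries, propdmg, propdmgexp, cropdmg and cropdmgexp. A minimal sketch of the same idea on a toy data frame (the toy values and the names toy and subset_cols are made up for illustration):

```r
library(dplyr)

# Toy data frame (hypothetical values) with columns in a known order
toy <- data.frame(
  id = 1:2,
  evtype = c("TORNADO", "FLOOD"),
  fatalities = c(1, 0),
  injuries = c(3, 2),
  propdmg = c(25, 2.5),
  remarks = c("a", "b"),
  stringsAsFactors = FALSE
)

# `fatalities:propdmg` picks every column from fatalities through propdmg,
# in the order in which they appear in the data frame
subset_cols <- select(toy, evtype, fatalities:propdmg)
names(subset_cols)
```

Because the range depends on column position, it would silently change if the columns were reordered upstream; listing the columns by name is a more defensive alternative.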
To show the total number of fatalities and injuries per event type, we apply the group_by and summarize functions to our analytic data. We then arrange the resulting data frame by descending fatalities so that the most harmful event types appear first.
totalNumbers <- aData %>%
group_by(evtype) %>%
summarize(fatalities = sum(fatalities), injuries = sum(injuries)) %>%
arrange(desc(fatalities))
## Warning: package 'bindrcpp' was built under R version 3.4.2
tbl_df(totalNumbers)
## # A tibble: 985 x 3
## evtype fatalities injuries
## <fctr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
## # ... with 975 more rows
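The group, summarize and arrange pipeline used above can be checked by hand on a few rows. This is only a sketch with invented numbers; toy and toy_totals are illustrative names, not part of the analysis:

```r
library(dplyr)

# Toy data (hypothetical values): several rows per event type
toy <- data.frame(
  evtype = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  fatalities = c(2, 1, 3, 4, 0),
  injuries = c(10, 0, 5, 1, 2),
  stringsAsFactors = FALSE
)

# One row per event type, with totals, sorted by descending fatalities
toy_totals <- toy %>%
  group_by(evtype) %>%
  summarize(fatalities = sum(fatalities), injuries = sum(injuries)) %>%
  arrange(desc(fatalities))

toy_totals
```

Here TORNADO should come first with 5 fatalities and 15 injuries, followed by HEAT (4 and 1) and FLOOD (1 and 2).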
Now we prepare a small subset of totalNumbers for plotting. fatal contains the 15 event types with the most fatalities. Although the rows are now sorted by fatalities, the levels of the evtype factor are not, so we reorder the levels to match before plotting.
fatal <- arrange(totalNumbers, desc(fatalities))[1:15,]
fatal$evtype <- factor(fatal$evtype, levels = fatal$evtype[order(fatal$fatalities, decreasing = TRUE)])
library(ggplot2)
g = ggplot(fatal, aes(evtype, fatalities))
g = g + geom_bar(stat="identity", fill = "blue")
g = g + theme(axis.text.x = element_text(angle=90, hjust=1))
g = g + labs(x = "Event Types", y = "Number of Fatalities", title = "Event types with most fatalities")
g
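The level reordering matters because ggplot2 places the categories of a discrete axis in factor-level order, not row order. A base R sketch of the same reordering step on toy data (values invented for illustration):

```r
# Toy data (hypothetical values), with evtype stored as a factor
toy <- data.frame(
  evtype = factor(c("FLOOD", "HEAT", "TORNADO")),
  fatalities = c(1, 4, 5)
)

# Levels are alphabetical by default
levels_before <- levels(toy$evtype)

# Rebuild the factor so that its levels follow descending fatalities
toy$evtype <- factor(
  toy$evtype,
  levels = toy$evtype[order(toy$fatalities, decreasing = TRUE)]
)

levels_after <- levels(toy$evtype)
levels_before
levels_after
```

Before the reordering the levels are FLOOD, HEAT, TORNADO; afterwards they are TORNADO, HEAT, FLOOD, which is the order in which the bars would be drawn.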
To address the question of which event types have the biggest economic consequences, we first subset our analytic data. We select only the economy-related columns and filter out the observations with neither property nor crop damage. This makes our dataset considerably smaller.
damage <- aData %>%
select(evtype, propdmg:cropdmgexp) %>%
filter(propdmg != 0 | cropdmg != 0)
tbl_df(damage)
## # A tibble: 245,031 x 5
## evtype propdmg propdmgexp cropdmg cropdmgexp
## <fctr> <dbl> <fctr> <dbl> <fctr>
## 1 TORNADO 25.0 K 0
## 2 TORNADO 2.5 K 0
## 3 TORNADO 25.0 K 0
## 4 TORNADO 2.5 K 0
## 5 TORNADO 2.5 K 0
## 6 TORNADO 2.5 K 0
## 7 TORNADO 2.5 K 0
## 8 TORNADO 2.5 K 0
## 9 TORNADO 25.0 K 0
## 10 TORNADO 25.0 K 0
## # ... with 245,021 more rows
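The filter condition above keeps a row when either damage figure is non-zero and drops it only when both are zero. A small sketch on invented rows (toy and kept are illustrative names):

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- data.frame(
  evtype = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  propdmg = c(25, 0, 0, 0),
  cropdmg = c(0, 0, 5, 0),
  stringsAsFactors = FALSE
)

# Keep rows with property damage OR crop damage; rows with neither are dropped
kept <- filter(toy, propdmg != 0 | cropdmg != 0)
kept$evtype
```

Only the TORNADO and FLOOD rows survive here, since HAIL and HEAT have zero in both columns.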
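First we group by event type, then apply summarize to add up the propdmg and cropdmg values, and lastly we arrange by total damage. All of this can be done with the dplyr package. Note that these sums use the raw propdmg and cropdmg figures; the multiplier codes in propdmgexp and cropdmgexp are not applied.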
damage <- damage %>%
group_by(evtype) %>%
summarize(totalDamage = sum(propdmg) + sum(cropdmg)) %>%
arrange(desc(totalDamage))
tbl_df(damage)
## # A tibble: 431 x 2
## evtype totalDamage
## <fctr> <dbl>
## 1 TORNADO 3312276.7
## 2 FLASH FLOOD 1599325.1
## 3 TSTM WIND 1445168.2
## 4 HAIL 1268289.7
## 5 FLOOD 1067976.4
## 6 THUNDERSTORM WIND 943635.6
## 7 LIGHTNING 606932.4
## 8 THUNDERSTORM WINDS 464978.1
## 9 HIGH WIND 342014.8
## 10 WINTER STORM 134699.6
## # ... with 421 more rows
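The totals above add the raw propdmg and cropdmg figures, while the propdmgexp and cropdmgexp columns carry a magnitude code (the K shown earlier). One way the codes could be applied before summing is sketched below. The mapping is an assumption: it treats K, M and B as thousands, millions and billions and every other code, including blank, as a multiplier of 1. The helper exp_multiplier and the toy values are hypothetical, and base R aggregate is used instead of dplyr to keep the sketch dependency-free.

```r
# Assumed mapping of exponent codes to multipliers (K, M, B only);
# any other code, including blank, is treated as 1 in this sketch
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

# Toy data (hypothetical values)
toy <- data.frame(
  evtype = c("TORNADO", "FLOOD", "FLOOD"),
  propdmg = c(25, 2.5, 1),
  propdmgexp = c("K", "M", ""),
  cropdmg = c(0, 10, 0),
  cropdmgexp = c("", "k", ""),
  stringsAsFactors = FALSE
)

# Scale each damage figure by its multiplier, then add property and crop damage
toy$damage_usd <- toy$propdmg * exp_multiplier(toy$propdmgexp) +
  toy$cropdmg * exp_multiplier(toy$cropdmgexp)

# Total per event type, largest first
totals <- aggregate(damage_usd ~ evtype, data = toy, FUN = sum)
totals <- totals[order(totals$damage_usd, decreasing = TRUE), ]
totals
```

With scaling applied, the toy FLOOD rows outrank TORNADO even though their raw figures are smaller, which illustrates how a ranking can shift once the magnitude codes are taken into account.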
To plot these numbers we subset the top 15, and again we have to address the issue that the levels of evtype are not ordered by the amount of damage they cause (even though the rows are arranged by totalDamage). The second statement below reorders the levels. Only then can we plot.
dam <- damage[1:15,]
dam$evtype <- factor(dam$evtype, levels = dam$evtype[order(dam$totalDamage, decreasing = TRUE)])
g = ggplot(dam, aes(evtype, totalDamage))
g = g + geom_bar(stat="identity", fill = "red")
g = g + theme(axis.text.x = element_text(angle=90, hjust=1))
g = g + labs(x = "Event Types", y = "Total damage", title = "Event types with biggest economic consequences")
g