Ion Kel

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This report addresses two basic questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data processing

First, we download the data with download.file (skipping the download if the file already exists) and read it with read.csv, which can read the bz2-compressed file directly.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists('data.csv.bz2')){download.file(url, destfile = "data.csv.bz2")}
data <- read.csv("data.csv.bz2")

Now we take a look at the data using the tbl_df function of the dplyr package.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
tbl_df(data)
## # A tibble: 902,297 x 37
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME  STATE
##      <dbl>             <fctr>   <fctr>    <fctr>  <dbl>     <fctr> <fctr>
##  1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE     AL
##  2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN     AL
##  3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE     AL
##  4       1   6/8/1951 0:00:00     0900       CST     89    MADISON     AL
##  5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN     AL
##  6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE     AL
##  7       1 11/16/1951 0:00:00     0100       CST      9     BLOUNT     AL
##  8       1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA     AL
##  9       1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA     AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE     AL
## # ... with 902,287 more rows, and 30 more variables: EVTYPE <fctr>,
## #   BGN_RANGE <dbl>, BGN_AZI <fctr>, BGN_LOCATI <fctr>, END_DATE <fctr>,
## #   END_TIME <fctr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <fctr>, END_LOCATI <fctr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <fctr>, CROPDMG <dbl>, CROPDMGEXP <fctr>, WFO <fctr>,
## #   STATEOFFIC <fctr>, ZONENAMES <fctr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <fctr>, REFNUM <dbl>

For the analysis we lower-case the column names and keep only the event type together with the health and damage columns (fatalities through cropdmgexp). This smaller data frame, aData, is our analytic data.

names(data) <- tolower(names(data))
aData <- select(data, evtype, fatalities:cropdmgexp)

Relation between event type and population health

To get the total number of fatalities and injuries per event type, we apply the group_by and summarize functions to our analytic data. We then arrange the resulting data frame by fatalities in descending order, so that the most harmful event types appear first.

totalNumbers <- aData %>% 
        group_by(evtype) %>% 
        summarize(fatalities = sum(fatalities), injuries = sum(injuries)) %>%
        arrange(desc(fatalities))
tbl_df(totalNumbers)
## # A tibble: 985 x 3
##            evtype fatalities injuries
##            <fctr>      <dbl>    <dbl>
##  1        TORNADO       5633    91346
##  2 EXCESSIVE HEAT       1903     6525
##  3    FLASH FLOOD        978     1777
##  4           HEAT        937     2100
##  5      LIGHTNING        816     5230
##  6      TSTM WIND        504     6957
##  7          FLOOD        470     6789
##  8    RIP CURRENT        368      232
##  9      HIGH WIND        248     1137
## 10      AVALANCHE        224      170
## # ... with 975 more rows

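Now we prepare a small subset of totalNumbers for plotting. The fatal data frame contains the 15 event types with the most fatalities. Although its rows are sorted by fatalities, the levels of the evtype factor are not, and ggplot2 draws a discrete axis in the order of the factor levels, so we reorder the levels as well.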

fatal <- arrange(totalNumbers, desc(fatalities))[1:15,]
fatal$evtype <- factor(fatal$evtype, levels = fatal$evtype[order(fatal$fatalities, decreasing = TRUE)])
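
To see why the second statement is needed, here is a minimal sketch on a made-up three-row data frame (the toy values and names are illustrative assumptions, not part of the storm data). Sorting the rows does not change the default alphabetical order of the factor levels; only re-creating the factor with explicit levels does.

```r
# Toy data (made-up values), rows already sorted by count in descending order
toy <- data.frame(evtype = c("WIND", "FLOOD", "HEAT"),
                  count  = c(30, 20, 10),
                  stringsAsFactors = TRUE)

# Default factor levels are alphabetical, regardless of the row order
default_levels <- levels(toy$evtype)

# Re-create the factor with its levels ordered by descending count,
# the same pattern used for fatal$evtype above
toy$evtype <- factor(toy$evtype,
                     levels = toy$evtype[order(toy$count, decreasing = TRUE)])
ordered_levels <- levels(toy$evtype)

print(default_levels)
print(ordered_levels)
```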

library(ggplot2)
g <- ggplot(fatal, aes(evtype, fatalities))
g <- g + geom_bar(stat = "identity", fill = "blue")
g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g <- g + labs(x = "Event Types", y = "Number of Fatalities", title = "Event types with most fatalities")
g
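
The first question also concerns injuries, and the same steps could be applied to that column. The sketch below does so using only the ten rows printed in the totalNumbers table above, so it illustrates the approach rather than giving the full ranking by injuries. The names top_fatal and injur are hypothetical.

```r
# The ten rows printed in the totalNumbers table above (NOT the full data set)
top_fatal <- data.frame(
  evtype = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING",
             "TSTM WIND", "FLOOD", "RIP CURRENT", "HIGH WIND", "AVALANCHE"),
  fatalities = c(5633, 1903, 978, 937, 816, 504, 470, 368, 248, 224),
  injuries = c(91346, 6525, 1777, 2100, 5230, 6957, 6789, 232, 1137, 170),
  stringsAsFactors = FALSE
)

# Sort by injuries in descending order
# (on the real data this would start from totalNumbers instead)
injur <- top_fatal[order(top_fatal$injuries, decreasing = TRUE), ]

# Order the factor levels the same way so a bar plot follows the ranking
injur$evtype <- factor(injur$evtype, levels = injur$evtype)

print(injur[, c("evtype", "injuries")])
```

Among these ten rows, the order by injuries differs from the order by fatalities, which is why the two measures are worth ranking separately.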

Relation between event type and economic consequences

To address the question of which event types have the greatest economic consequences, we first subset our analytic data. We select only the damage-related columns and drop the observations with neither property nor crop damage. This reduces the data set from 902,297 to 245,031 observations.

damage <- aData %>% 
        select(evtype, propdmg:cropdmgexp) %>% 
        filter(propdmg != 0 | cropdmg != 0)
tbl_df(damage)
## # A tibble: 245,031 x 5
##     evtype propdmg propdmgexp cropdmg cropdmgexp
##     <fctr>   <dbl>     <fctr>   <dbl>     <fctr>
##  1 TORNADO    25.0          K       0           
##  2 TORNADO     2.5          K       0           
##  3 TORNADO    25.0          K       0           
##  4 TORNADO     2.5          K       0           
##  5 TORNADO     2.5          K       0           
##  6 TORNADO     2.5          K       0           
##  7 TORNADO     2.5          K       0           
##  8 TORNADO     2.5          K       0           
##  9 TORNADO    25.0          K       0           
## 10 TORNADO    25.0          K       0           
## # ... with 245,021 more rows
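
The propdmgexp and cropdmgexp columns in the table above hold magnitude codes such as K. One possible way to turn the raw values into dollar amounts is sketched below on made-up rows. It assumes that K, M and B stand for thousands, millions and billions, and that any other code (including an empty one) means a multiplier of 1; both are assumptions of this sketch, and exp_to_multiplier is a hypothetical helper, not part of the analysis in this report.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: K = 1e3, M = 1e6, B = 1e9 (case-insensitive); anything else -> 1.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1   # unknown or empty codes fall back to 1
  unname(mult)
}

# Made-up rows with the same column layout as the damage table
toy <- data.frame(
  propdmg    = c(25, 2.5, 1.2),
  propdmgexp = c("K", "M", "B"),
  cropdmg    = c(0, 10, 0),
  cropdmgexp = c("", "k", ""),
  stringsAsFactors = FALSE
)

# Scale each value by its own multiplier before adding the two damage types
toy$damage_usd <- toy$propdmg * exp_to_multiplier(toy$propdmgexp) +
                  toy$cropdmg * exp_to_multiplier(toy$cropdmgexp)

print(toy)
```

Scaling has to happen row by row before any summing, because rows of the same event type can carry different codes.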

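First we group by event type, then apply summarize to add up the propdmg and cropdmg values, and finally arrange by total damage. All of this can be done with the dplyr package. Note that these sums use the raw propdmg and cropdmg values and do not apply the magnitude codes in propdmgexp and cropdmgexp, so totalDamage is a relative score rather than a dollar amount.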

damage <- damage %>% 
        group_by(evtype) %>% 
        summarize(totalDamage = sum(propdmg) + sum(cropdmg)) %>%
        arrange(desc(totalDamage))
tbl_df(damage)
## # A tibble: 431 x 2
##                evtype totalDamage
##                <fctr>       <dbl>
##  1            TORNADO   3312276.7
##  2        FLASH FLOOD   1599325.1
##  3          TSTM WIND   1445168.2
##  4               HAIL   1268289.7
##  5              FLOOD   1067976.4
##  6  THUNDERSTORM WIND    943635.6
##  7          LIGHTNING    606932.4
##  8 THUNDERSTORM WINDS    464978.1
##  9          HIGH WIND    342014.8
## 10       WINTER STORM    134699.6
## # ... with 421 more rows
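
The table above lists TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS as separate event types, although they appear to describe the same kind of event. A minimal sketch of how such variants could be merged before summing is shown below, using four rows copied from the table. The helper normalize_evtype is hypothetical, and the choice of which labels to merge is an assumption of this sketch.

```r
# Hypothetical helper: collapse a few spelling variants into one label.
# Which labels count as "the same event" is an assumption of this sketch.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- sub("^TSTM WIND$", "THUNDERSTORM WIND", x)
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)
  x
}

# Four rows copied from the table above (raw, unscaled totals)
toy <- data.frame(
  evtype = c("TSTM WIND", "HAIL", "THUNDERSTORM WIND", "THUNDERSTORM WINDS"),
  totalDamage = c(1445168.2, 1268289.7, 943635.6, 464978.1),
  stringsAsFactors = FALSE
)

# Normalize the labels, then re-aggregate and sort by total damage
toy$evtype <- normalize_evtype(toy$evtype)
merged <- aggregate(totalDamage ~ evtype, data = toy, FUN = sum)
merged <- merged[order(merged$totalDamage, decreasing = TRUE), ]

print(merged)
```

The totals and the plot in this report use the original, unmerged evtype labels.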

To plot the totals per event type we take the top 15 rows. Again, the levels of evtype are not ordered by the amount of damage, even though the rows are arranged by totalDamage, so the second statement reorders the levels. Only then can we plot.

dam <- damage[1:15,]
dam$evtype <- factor(dam$evtype, levels = dam$evtype[order(dam$totalDamage, decreasing = TRUE)])

g <- ggplot(dam, aes(evtype, totalDamage))
g <- g + geom_bar(stat = "identity", fill = "red")
g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g <- g + labs(x = "Event Types", y = "Total damage", title = "Event types with biggest economic consequences")
g