Reproducible Research: Analysis of Storms Data

Synopsis

We would like to answer two questions:

  • Across the US, which types of events are most harmful with respect to population health?
  • Across the US, which types of events have the greatest economic consequences?

To answer these questions, first we look at the dataset to understand the organization of the data, then we arrange some data frames to get the data into a useful (and less cluttered) form, and finally we apply some tools from R language to visualize the desired information.

Data Processing

We start by reading in the raw data. Since the data is in the compressed bz2 format, we must first uncompress it by using the “bunzip2” command, which can be found in the GIT bash shell. Use “bunzip2 repdata_data_StormData.csv.bz2”, and it will create the CSV file in the same folder. Then read the CSV file into R, as follows:

data <- read.csv("repdata_data_StormData.csv")

We’d like to know how the data is organized, so we use the R commands “str” and “names” to look at the data frame.

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

And we can see that there are 4 columns of interest, in addition to the EVTYPE event type.

These are:

  • FATALITIES
  • INJURIES
  • PROPDMG
  • CROPDMG

After looking at the table, it becomes apparent that the dplyr library can help us, so we access that. Also we will use the ggplot2 library to plot some graphs.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

We would like the data grouped by event type EVTYPE.

data_grouped <- group_by(data,data$EVTYPE)

And we would like four summaries, one for each of the columns we’re interested in.

fatalities_summary <- summarize(data_grouped,Fatalities=sum(FATALITIES))
injuries_summary <- summarize(data_grouped,Injuries=sum(INJURIES))
propdmg_summary <- summarize(data_grouped,PropDmg=sum(PROPDMG))
cropdmg_summary <- summarize(data_grouped,CropDmg=sum(CROPDMG))

We could now, for example, look for the event type that caused the greatest number of fatalities.

which.max(fatalities_summary$Fatalities)
## [1] 826
fatalities_summary[826,]
## # A tibble: 1 × 2
##   `data$EVTYPE` Fatalities
##   <chr>              <dbl>
## 1 TORNADO             5633

Clearly, tornadoes are the deadliest event type.

To see what this means in the larger context, we can do a quick plot of the number of fatalities for each event type.

qplot(EVTYPE,FATALITIES,data=data)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We notice there’s a few event types at the top, and we’d like to see what those are and how many fatalities there are. It looks like we can get the top 10 or so by setting the threshold around 80.

However the question asks “most harmful”, and injuries and fatalities are both harmful, so let’s combine those two columns into a single data frame. And since the second question asks for economic consequences, we do the same for property damage and crop damage.

First though, we change the names in the summary tables so we can more conveniently access the data.

After combining the harm and economic consequences, we can then subset each table to get the top 10 or so contributors. We can quickly look at the structure of one of the tables to make sure it’s what we want, and optionally we can do a quick plot to see where we can threshold to get a convenient Top 10-or-so.

We can print out the top contributors to adverse health and to adverse economic consequences.

names(fatalities_summary) <- c("EvType", "Fatalities")
names(injuries_summary) <- c("EvType", "Injuries")
names(propdmg_summary) <- c("EvType", "PropDmg")
names(cropdmg_summary) <- c("EvType", "CropDmg")

harmful_summary <- merge(fatalities_summary,injuries_summary,by="EvType")
economic_summary <- merge(propdmg_summary,cropdmg_summary,by="EvType")

harmful_summary$Total <- harmful_summary$Fatalities + harmful_summary$Injuries
economic_summary$Total <- economic_summary$PropDmg + economic_summary$CropDmg

str(harmful_summary)
## 'data.frame':    985 obs. of  4 variables:
##  $ EvType    : chr  "   HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
##  $ Fatalities: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Injuries  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total     : num  0 0 0 0 0 0 0 0 0 0 ...
top_harmful <- subset(harmful_summary,harmful_summary$Total > 1000)
top_economic <- subset(economic_summary,economic_summary$Total > 100000)

top_harmful
##                EvType Fatalities Injuries Total
## 130    EXCESSIVE HEAT       1903     6525  8428
## 153       FLASH FLOOD        978     1777  2755
## 170             FLOOD        470     6789  7259
## 244              HAIL         15     1361  1376
## 275              HEAT        937     2100  3037
## 310        HEAVY SNOW        127     1021  1148
## 359         HIGH WIND        248     1137  1385
## 411 HURRICANE/TYPHOON         64     1275  1339
## 427         ICE STORM         89     1975  2064
## 464         LIGHTNING        816     5230  6046
## 760 THUNDERSTORM WIND        133     1488  1621
## 834           TORNADO       5633    91346 96979
## 856         TSTM WIND        504     6957  7461
## 972      WINTER STORM        206     1321  1527
top_economic
##                 EvType   PropDmg   CropDmg     Total
## 153        FLASH FLOOD 1420124.6 179200.46 1599325.1
## 170              FLOOD  899938.5 168037.88 1067976.4
## 244               HAIL  688693.4 579596.28 1268289.7
## 310         HEAVY SNOW  122252.0   2165.72  124417.7
## 359          HIGH WIND  324731.6  17283.21  342014.8
## 464          LIGHTNING  603351.8   3580.61  606932.4
## 760  THUNDERSTORM WIND  876844.2  66791.45  943635.6
## 786 THUNDERSTORM WINDS  446293.2  18684.93  464978.1
## 834            TORNADO 3212258.2 100018.52 3312276.7
## 856          TSTM WIND 1335965.6 109202.60 1445168.2
## 972       WINTER STORM  132720.6   1978.99  134699.6

Results

It is useful to view the results in two forms: one as a table, and the other as a graph.

First we sort each table in descending order, so the greatest contributor is on top.

Then we plot each table to see how the contributions look.

harmful_sorted <- arrange(top_harmful,desc(top_harmful$Total))
economic_sorted <- arrange(top_economic,desc(top_economic$Total))

harmful_sorted
##               EvType Fatalities Injuries Total
## 1            TORNADO       5633    91346 96979
## 2     EXCESSIVE HEAT       1903     6525  8428
## 3          TSTM WIND        504     6957  7461
## 4              FLOOD        470     6789  7259
## 5          LIGHTNING        816     5230  6046
## 6               HEAT        937     2100  3037
## 7        FLASH FLOOD        978     1777  2755
## 8          ICE STORM         89     1975  2064
## 9  THUNDERSTORM WIND        133     1488  1621
## 10      WINTER STORM        206     1321  1527
## 11         HIGH WIND        248     1137  1385
## 12              HAIL         15     1361  1376
## 13 HURRICANE/TYPHOON         64     1275  1339
## 14        HEAVY SNOW        127     1021  1148
economic_sorted
##                EvType   PropDmg   CropDmg     Total
## 1             TORNADO 3212258.2 100018.52 3312276.7
## 2         FLASH FLOOD 1420124.6 179200.46 1599325.1
## 3           TSTM WIND 1335965.6 109202.60 1445168.2
## 4                HAIL  688693.4 579596.28 1268289.7
## 5               FLOOD  899938.5 168037.88 1067976.4
## 6   THUNDERSTORM WIND  876844.2  66791.45  943635.6
## 7           LIGHTNING  603351.8   3580.61  606932.4
## 8  THUNDERSTORM WINDS  446293.2  18684.93  464978.1
## 9           HIGH WIND  324731.6  17283.21  342014.8
## 10       WINTER STORM  132720.6   1978.99  134699.6
## 11         HEAVY SNOW  122252.0   2165.72  124417.7
qplot(EvType, Total, data=harmful_sorted, xlab="Event Type", ylab="Fatalities and Injuries", main="Events Causing Greatest Health Damage") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

qplot(EvType, Total, data=economic_sorted, xlab="Event Type", ylab="Property and Crop Damage", main="Events Causing Greatest Economic Damage") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))