Reproducible Research: Course Project 2

Synopsis

This report will, based on the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), study the adverse effects of severe weather events on the US population health, i.e. the number of associated fatalities and injuries, as well as their economic consequences, i.e. the damage caused to property and crops. For each type of event registered in the NOAA storm database, the total number of fatalities and injuries, and the total property and crop damages (as well as total combined damages) are calculated. The 10 most severe events from each of these categories are selected and compared.

Data Processing

In the first step, we will check, whether the CSV.BZ2 file is already in the folder. If it isn’t, we will download it from the link that was provided in the instructions.

if(!("repdata_data_StormData.csv.bz2" %in% dir())) {
        fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileUrl,destfile="repdata_data_StormData.csv.bz2",method="libcurl")
}

Next we read in the data using the read.csv function, and look at the variable names.

storm <- read.csv("repdata_data_StormData.csv.bz2",header=TRUE)
names(storm)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The columns of interest are the ones labeling the event type, EVTYPE, the number or associated fatalities and injuries, FATALITIES and INJURIES, and the columns classifying the property and crop damages, PROPDMG and CROPDMG. Since the damage amounts are supposed to be recorded in numbers between 0 and 999, together with a multiplier (“K” for thousand, “M” for million, “B” for billion), we also need the columns containing these multipliers, PROPDMGEXP and CROPDMGEXP. Starting off by looking at some of the values in EVTYPE, we can see that there are some problems.

head(table(storm$EVTYPE),50)

## 
##             HIGH SURF ADVISORY                  COASTAL FLOOD 
##                              1                              1 
##                    FLASH FLOOD                      LIGHTNING 
##                              1                              1 
##                      TSTM WIND                TSTM WIND (G45) 
##                              4                              1 
##                     WATERSPOUT                           WIND 
##                              1                              1 
##                              ?                ABNORMAL WARMTH 
##                              1                              4 
##                 ABNORMALLY DRY                 ABNORMALLY WET 
##                              2                              1 
##           ACCUMULATED SNOWFALL            AGRICULTURAL FREEZE 
##                              4                              6 
##                  APACHE COUNTY         ASTRONOMICAL HIGH TIDE 
##                              1                            103 
##          ASTRONOMICAL LOW TIDE                       AVALANCE 
##                            174                              1 
##                      AVALANCHE                   BEACH EROSIN 
##                            386                              1 
##                  Beach Erosion                  BEACH EROSION 
##                              1                              3 
##    BEACH EROSION/COASTAL FLOOD                    BEACH FLOOD 
##                              1                              2 
##     BELOW NORMAL PRECIPITATION              BITTER WIND CHILL 
##                              2                              1 
## BITTER WIND CHILL TEMPERATURES                      Black Ice 
##                              3                              3 
##                      BLACK ICE                       BLIZZARD 
##                             14                           2719 
## BLIZZARD AND EXTREME WIND CHIL        BLIZZARD AND HEAVY SNOW 
##                              2                              1 
##               Blizzard Summary               BLIZZARD WEATHER 
##                              1                              1 
##         BLIZZARD/FREEZING RAIN            BLIZZARD/HEAVY SNOW 
##                              1                              2 
##             BLIZZARD/HIGH WIND          BLIZZARD/WINTER STORM 
##                              1                              1 
##                  BLOW-OUT TIDE                 BLOW-OUT TIDES 
##                              1                              1 
##                   BLOWING DUST                   blowing snow 
##                              4                              2 
##                   Blowing Snow                   BLOWING SNOW 
##                              3                             12 
## BLOWING SNOW- EXTREME WIND CHI BLOWING SNOW & EXTREME WIND CH 
##                              1                              2 
## BLOWING SNOW/EXTREME WIND CHIL               BREAKUP FLOODING 
##                              1                              1 
##                     BRUSH FIRE                    BRUSH FIRES 
##                              3                              1

Some entries are clearly misspelled, while others appear multiple times with different capitalizations. Since fixing all typos manually would exceed the scope of this project, and likely not make any useful differences (assuming that the majority of entries are spelled correctly), we will just employ a quick fix to capitalize all entries, which already reduces the number of event types by roughly 10%.

length(unique((storm$EVTYPE)))

## [1] 985

length(unique(toupper(storm$EVTYPE)))

## [1] 898

storm$EVTYPE <- toupper(storm$EVTYPE)

In the next step, we check that there are no NA values in any of the four numeric columns.

sum(is.na(storm$FATALITIES))

## [1] 0

sum(is.na(storm$INJURIES))

## [1] 0

sum(is.na(storm$PROPDMG))

## [1] 0

sum(is.na(storm$CROPDMG))

## [1] 0

Now we take a look at the PROPDMGEXP and CROPDMGEXP columns. These should ideally only be filled with empty fields or the characters “K”, “M” and “B”.

table(storm$PROPDMGEXP)

## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

table(storm$CROPDMGEXP)

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Again, there are some clearly mislabeled entries, as well as cases of different capitalization. Over the next series of steps, we will create new columns for the property and crop damages, which multiply the entries with their correct exponents, i.e. 1000 for “K”, 1000000 for “M” and “1000000000 for”B".

thousands <- toupper(storm$PROPDMGEXP) == "K"
millions  <- toupper(storm$PROPDMGEXP) == "M"
billions  <- toupper(storm$PROPDMGEXP) == "B"

prop <- storm$PROPDMG
prop[thousands] <- 1000*prop[thousands]
prop[millions] <- 1000000*prop[millions]
prop[billions] <- 1000000000*prop[billions]

cthousands <- toupper(storm$CROPDMGEXP) == "K"
cmillions  <- toupper(storm$CROPDMGEXP) == "M"
cbillions  <- toupper(storm$CROPDMGEXP) == "B"

crop <- storm$CROPDMG
crop[cthousands] <- 1000*crop[cthousands]
crop[cmillions] <- 1000000*crop[cmillions]
crop[cbillions] <- 1000000000*crop[cbillions]

All mistaken entries in the two exponential columns are ignored, but different capitalizations are accounted for. The two columns are added to the original data frame.

storm$propd <- prop
storm$cropd <- crop

Results

The analysis is split into two separate questions: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences?

Assessing harm to the population

As harm to the population, we can define both the number of fatalities as well as the number of injuries related to certain types of events. These will be treated separately.

The dplyr package will be used for the analysis. First the data set will be grouped by event type, and the total number of fatalities for each type is calculated.

library(dplyr)
fatal <- storm %>% group_by(EVTYPE) %>% summarise(number=sum(FATALITIES))

head(fatal)

## # A tibble: 6 x 2
##   EVTYPE                  number
##   <chr>                    <dbl>
## 1 "   HIGH SURF ADVISORY"      0
## 2 " COASTAL FLOOD"             0
## 3 " FLASH FLOOD"               0
## 4 " LIGHTNING"                 0
## 5 " TSTM WIND"                 0
## 6 " TSTM WIND (G45)"           0

The deadliest event type and the associated number of fatalities can be calculated.

max(fatal$number)

## [1] 5633

fatal$EVTYPE[fatal$number == max(fatal$number)]

## [1] "TORNADO"

Tornados are responsible for the most fatalities, 5633. The same process is repeated for injuries.

inj <- storm %>% group_by(EVTYPE) %>% summarise(number=sum(INJURIES))
head(inj)

## # A tibble: 6 x 2
##   EVTYPE                  number
##   <chr>                    <dbl>
## 1 "   HIGH SURF ADVISORY"      0
## 2 " COASTAL FLOOD"             0
## 3 " FLASH FLOOD"               0
## 4 " LIGHTNING"                 0
## 5 " TSTM WIND"                 0
## 6 " TSTM WIND (G45)"           0

max(inj$number)

## [1] 91346

inj$EVTYPE[inj$number == max(inj$number)]

## [1] "TORNADO"

Here, too, tornados are the number one source of injuries. In order to have a more comprehensive overview of the most harmful event types, we can look at the top 10 entries in each category. To that purpose, we can sort the earlier output.

fatal_ord <- order(fatal$number,decreasing = TRUE)
fatal10 <- head(fatal[fatal_ord,],10)
fatal10

## # A tibble: 10 x 2
##    EVTYPE         number
##    <chr>           <dbl>
##  1 TORNADO          5633
##  2 EXCESSIVE HEAT   1903
##  3 FLASH FLOOD       978
##  4 HEAT              937
##  5 LIGHTNING         816
##  6 TSTM WIND         504
##  7 FLOOD             470
##  8 RIP CURRENT       368
##  9 HIGH WIND         248
## 10 AVALANCHE         224

inj_ord <- order(inj$number,decreasing = TRUE)
inj10 <- head(inj[inj_ord,],10)
inj10

## # A tibble: 10 x 2
##    EVTYPE            number
##    <chr>              <dbl>
##  1 TORNADO            91346
##  2 TSTM WIND           6957
##  3 FLOOD               6789
##  4 EXCESSIVE HEAT      6525
##  5 LIGHTNING           5230
##  6 HEAT                2100
##  7 ICE STORM           1975
##  8 FLASH FLOOD         1777
##  9 THUNDERSTORM WIND   1488
## 10 HAIL                1361

While tornados are the clear leader in both categories, the following places show some variation. To illustrate these rankings, we combine both lists into one.

fatal10$type <- "Fatalities"
inj10$type <- "Injuries"

top10 <- rbind(fatal10,inj10)

This combined list can then be plotted, using the ggplot2 package.

library(ggplot2)

g <- ggplot(top10,aes(EVTYPE,number,fill=EVTYPE))
g + geom_bar(stat="identity") + facet_grid(type~.,scales="free") +
        theme_bw() + theme(axis.text.x = element_text(angle = 90)) +
        labs(x = "Event Type", y = "Number of Fatalities/Injuries") +
        ggtitle("Top 10 event types responsible for fatalities and injuries")

This figure shows the total number of fatalities and injuries for the ten most harmful event types in each category. In both cases, tornados are the most harmful type of event. We can also see, that some of the categories should possibly be combined, since they might refer to the same event type (i.e. HEAT and EXCESSIVE HEAT, or THUNDERSTORM WIND and TSTM WIND). But even then, they are no match for tornado events. The events which are most harmful to the population health can thus be identified as tornados, excessive heat, (flash) floods and lightning strikes.

Assessing the economic consequences

For the purposes of this report, we assume that the economic consequences can be estimated by the total damages to properties and crops which are associated with each event type. We are interested in looking at these damages both separately and combined. As in the previous section, the original data is grouped according to the respective category, and total damages are calculated for each event type.

prop_sum <- storm %>% group_by(EVTYPE) %>% summarise(cost=sum(propd))
head(prop_sum)

## # A tibble: 6 x 2
##   EVTYPE                     cost
##   <chr>                     <dbl>
## 1 "   HIGH SURF ADVISORY"  200000
## 2 " COASTAL FLOOD"              0
## 3 " FLASH FLOOD"            50000
## 4 " LIGHTNING"                  0
## 5 " TSTM WIND"            8100000
## 6 " TSTM WIND (G45)"         8000

max(prop_sum$cost)

## [1] 144657709807

prop_sum$EVTYPE[prop_sum$cost == max(prop_sum$cost)]

## [1] "FLOOD"

crop_sum <- storm %>% group_by(EVTYPE) %>% summarise(cost=sum(cropd))
head(crop_sum)

## # A tibble: 6 x 2
##   EVTYPE                   cost
##   <chr>                   <dbl>
## 1 "   HIGH SURF ADVISORY"     0
## 2 " COASTAL FLOOD"            0
## 3 " FLASH FLOOD"              0
## 4 " LIGHTNING"                0
## 5 " TSTM WIND"                0
## 6 " TSTM WIND (G45)"          0

max(crop_sum$cost)

## [1] 13972566000

crop_sum$EVTYPE[crop_sum$cost == max(crop_sum$cost)]

## [1] "DROUGHT"

combined <- merge(prop_sum,crop_sum,by="EVTYPE")
combined$total <- combined$cost.x+combined$cost.y

This time, the event type responsible for incurring the largest damage is not the same for the categories. Floods lead to the largest property damage, while droughts are responsible for the largest crop damage (but note that that sum is one order of magnitude smaller). As before, the lists are ordered and the 10 largest damage sources each are extracted:

prop_ord <- order(prop_sum$cost,decreasing = TRUE)
prop10 <- head(prop_sum[prop_ord,],10)
prop10

## # A tibble: 10 x 2
##    EVTYPE                     cost
##    <chr>                     <dbl>
##  1 FLOOD             144657709807 
##  2 HURRICANE/TYPHOON  69305840000 
##  3 TORNADO            56937160779.
##  4 STORM SURGE        43323536000 
##  5 FLASH FLOOD        16140812067.
##  6 HAIL               15732267048.
##  7 HURRICANE          11868319010 
##  8 TROPICAL STORM      7703890550 
##  9 WINTER STORM        6688497251 
## 10 HIGH WIND           5270046295

crop_ord <- order(crop_sum$cost,decreasing = TRUE)
crop10 <- head(crop_sum[crop_ord,],10)
crop10

## # A tibble: 10 x 2
##    EVTYPE                   cost
##    <chr>                   <dbl>
##  1 DROUGHT           13972566000
##  2 FLOOD              5661968450
##  3 RIVER FLOOD        5029459000
##  4 ICE STORM          5022113500
##  5 HAIL               3025954473
##  6 HURRICANE          2741910000
##  7 HURRICANE/TYPHOON  2607872800
##  8 FLASH FLOOD        1421317100
##  9 EXTREME COLD       1312973000
## 10 FROST/FREEZE       1094186000

comb_ord <- order(combined$total,decreasing = TRUE)
comb10 <- head(combined[comb_ord,],10)
comb10

##                EVTYPE       cost.x      cost.y        total
## 154             FLOOD 144657709807  5661968450 150319678257
## 372 HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 758           TORNADO  56937160779   414953270  57352114049
## 599       STORM SURGE  43323536000        5000  43323541000
## 212              HAIL  15732267048  3025954473  18758221521
## 138       FLASH FLOOD  16140812067  1421317100  17562129167
## 84            DROUGHT   1046106000 13972566000  15018672000
## 363         HURRICANE  11868319010  2741910000  14610229010
## 529       RIVER FLOOD   5118945500  5029459000  10148404500
## 387         ICE STORM   3944927860  5022113500   8967041360

The three lists are again combined, in order to create a panel plot.

comb10a <- data.frame(comb10$EVTYPE)
comb10a$cost <- comb10$total
colnames(comb10a) <- c("EVTYPE","cost")

prop10$type <- "Property damage"
crop10$type <- "Crop damage"
comb10a$type <- "Combined damage"

top10dam <- rbind(prop10,crop10,comb10a)

g <- ggplot(top10dam,aes(EVTYPE,cost,fill=EVTYPE))
g + geom_bar(stat="identity") + facet_grid(type~.,scales="free") +
        theme_bw() + theme(axis.text.x = element_text(angle = 90)) +
        labs(x = "Event Type", y = "Estimated damage [USD]") +
        ggtitle("Property and crop damage due to different event types")

This figure shows, the crop damage (middle), property damage (bottom) and total combined damage (top) associated with the top 10 types of severe weather events in each category. As before, some categories should possibly be combined (such as HURRICANE and HURRICANE/TYPHOON), but that does not affect the final results significantly. While droughts are the number one source of crop damages, the event types with the largest total economic consequences are floods, followed by hurricanes/typhoons and tornados.