Course Project in Reproducible Research Class

This is the R markdown file to describe the results of the analysis of NOAA weather data as second Course Project in the Coursera Class ‘Reproducible Research’.

The analysis to be conducted with this report ahall answer following two questions from the available NOAA data:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Following libraries are loaded to analyse the data set:

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## 
## The following objects are masked from 'package:plyr':
## 
##     is.discrete, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

The data is available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and read in and stored as data table “storm_data”.

storm_data <- read.csv("/Users/markjack/data/repdata-data-StormData.csv")

Data Processing and Results:

The data set is quickly reviewed for its column names and a general summary of its content.

colnames(storm_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
str(storm_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We select as key indicators (columns) to address the above two questions following columns from the “storm_data” data set: STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, and CROPDMG.

storm_damage <- select(storm_data, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
head(storm_damage)
##   STATE  EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1    AL TORNADO          0       15    25.0       0
## 2    AL TORNADO          0        0     2.5       0
## 3    AL TORNADO          0        2    25.0       0
## 4    AL TORNADO          0        2     2.5       0
## 5    AL TORNADO          0        2     2.5       0
## 6    AL TORNADO          0        6     2.5       0

Across United States, health impacts by weather: The two columns FATALITIES and INJURIES are individually summed with respect to different event types listed in the column EVTYPE. The summation is done across all the states listed in column STATE to obtain the total fatalities and injuries for each weather event in the Uinted States. Different tools from the R packages dplyr, tidyr and plyr are being used to first melt the existing data, then sum across the new row values “FATALITIES” and “INJURIES” in a new column named “variable”. The new variable “sum” is then cast back into two columns FATALITIES and INJURIES to separate the summed results for those two categories. The data is then sorted with the command ‘arrange’ to first show the total number of fatalities for each weather type in descending order which is then repeated with the total number of injuries for each weather event in descending order. The largest 10 values are printed in each case.

storm_damage <- group_by(storm_damage, STATE, EVTYPE)
storm_damage_health <- melt(storm_damage, id.vars = c("EVTYPE"), measure.vars = c("FATALITIES", "INJURIES"), na.rm = TRUE)
storm_damage_order_health <- group_by(storm_damage_health, EVTYPE, variable)
storm_damage_sum_health <- ddply(storm_damage_order_health, c("EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_damage_sum_health <- arrange(storm_damage_sum_health, desc(sum))
storm_damage_sum_health2 <- dcast(storm_damage_sum_health, EVTYPE ~ variable)
## Using sum as value column: use value.var to override.
storm_damage_sum_health3 <- arrange(storm_damage_sum_health2, desc(FATALITIES), desc(INJURIES))
head(storm_damage_sum_health3, 10)
##            EVTYPE FATALITIES INJURIES
## 1         TORNADO       5633    91346
## 2  EXCESSIVE HEAT       1903     6525
## 3     FLASH FLOOD        978     1777
## 4            HEAT        937     2100
## 5       LIGHTNING        816     5230
## 6       TSTM WIND        504     6957
## 7           FLOOD        470     6789
## 8     RIP CURRENT        368      232
## 9       HIGH WIND        248     1137
## 10      AVALANCHE        224      170
storm_damage_sum_health4 <- arrange(storm_damage_sum_health2, desc(INJURIES), desc(FATALITIES))
head(storm_damage_sum_health4, 10)
##               EVTYPE FATALITIES INJURIES
## 1            TORNADO       5633    91346
## 2          TSTM WIND        504     6957
## 3              FLOOD        470     6789
## 4     EXCESSIVE HEAT       1903     6525
## 5          LIGHTNING        816     5230
## 6               HEAT        937     2100
## 7          ICE STORM         89     1975
## 8        FLASH FLOOD        978     1777
## 9  THUNDERSTORM WIND        133     1488
## 10              HAIL         15     1361

The largest number of fatalities and injuries in the United States can be clearly attributed to tornados as seen from the table. Tornados as a cause of bodily harm to people by far leads in the number of total fatalities listed with excessive heat following as the second most important cause in case of fatalities. For injuries, TSTM winds and floods and lightning strikes are equally important as second leading causes.

This is further visualized with these two bar plots of the total number of fatalities and injuries for each weather event type in descending order:

par(mfrow = c(1,2))
barplot(storm_damage_sum_health3$FATALITIES[1:10], xlab="EVTYPE", col="darkblue")
barplot(storm_damage_sum_health4$INJURIES[1:10], xlab="EVTYPE", col="red")

Across the United States, economic impacts by weather: Analogously to the analysis shown above on the health impacts of weather related incidents, the above procedure is repeated exactly but with columns FATALITIES and INJURIES from the original data set storm_data replaced by the columns PROPDMG and CROPDMG. The two columns summarize any property damage and crop damage inflicted by a weather-related event. Following the exact same procedure as above we also obtain a listing of the 10 most significant weather causes of property damage and crop damage in the United States.

storm_damage <- group_by(storm_damage, STATE, EVTYPE)
storm_damage_econ <- melt(storm_damage, id.vars = c("EVTYPE"), measure.vars = c("PROPDMG", "CROPDMG"), na.rm = TRUE)
storm_damage_order_econ <- group_by(storm_damage_econ, EVTYPE, variable)
storm_damage_sum_econ <- ddply(storm_damage_order_econ, c("EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_damage_sum_econ <- arrange(storm_damage_sum_econ, desc(sum))
storm_damage_sum_econ2 <- dcast(storm_damage_sum_econ, EVTYPE ~ variable)
## Using sum as value column: use value.var to override.
storm_damage_sum_econ3 <- arrange(storm_damage_sum_econ2, desc(PROPDMG), desc(CROPDMG))
head(storm_damage_sum_econ3, 10)
##                EVTYPE   PROPDMG   CROPDMG
## 1             TORNADO 3212258.2 100018.52
## 2         FLASH FLOOD 1420124.6 179200.46
## 3           TSTM WIND 1335965.6 109202.60
## 4               FLOOD  899938.5 168037.88
## 5   THUNDERSTORM WIND  876844.2  66791.45
## 6                HAIL  688693.4 579596.28
## 7           LIGHTNING  603351.8   3580.61
## 8  THUNDERSTORM WINDS  446293.2  18684.93
## 9           HIGH WIND  324731.6  17283.21
## 10       WINTER STORM  132720.6   1978.99
storm_damage_sum_econ4 <- arrange(storm_damage_sum_econ2, desc(CROPDMG), desc(PROPDMG))
head(storm_damage_sum_econ4, 10)
##                EVTYPE    PROPDMG   CROPDMG
## 1                HAIL  688693.38 579596.28
## 2         FLASH FLOOD 1420124.59 179200.46
## 3               FLOOD  899938.48 168037.88
## 4           TSTM WIND 1335965.61 109202.60
## 5             TORNADO 3212258.16 100018.52
## 6   THUNDERSTORM WIND  876844.17  66791.45
## 7             DROUGHT    4099.05  33898.62
## 8  THUNDERSTORM WINDS  446293.18  18684.93
## 9           HIGH WIND  324731.56  17283.21
## 10         HEAVY RAIN   50842.14  11122.80

The results show again tornados as leading cause for weather-inflicted property damage in the US at more than 3 million in damages, while hail is the most important contributor to crop damage at more than 500,000 in damages. the next 10 major weather reasons for property and crop damage are then listed for each table. We have further added bar plots of the total amount in property and crop damages for each weather event type with values listed in descending order in each case:

par(mfrow = c(1,2))
barplot(storm_damage_sum_econ3$PROPDMG[1:10], xlab="EVTYPE", col="darkblue")
barplot(storm_damage_sum_econ4$CROPDMG[1:10], xlab="EVTYPE", col="red")

Health impact and economic impacts by weather for each state: Below we simply repeated the above analysis but for each state individually and combined the analysis for property damage and crop damage while fatalities and injuries were investigated deparately.

storm_damage <- group_by(storm_damage, STATE, EVTYPE)
storm_damage_melt <- melt(storm_damage, id.vars = c("STATE", "EVTYPE"), measure.vars = c("PROPDMG", "CROPDMG"), na.rm = TRUE)
storm_damage_order <- group_by(storm_damage_melt, STATE, EVTYPE, variable)
storm_damage_sum <- ddply(storm_damage_order, c("STATE", "EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_damage_sum <- arrange(storm_damage_sum, desc(sum))
head(storm_damage_sum, 10)
##    STATE    EVTYPE variable      sum
## 1     TX   TORNADO  PROPDMG 283097.2
## 2     NE      HAIL  CROPDMG 201031.1
## 3     MS   TORNADO  PROPDMG 187840.9
## 4     AL   TORNADO  PROPDMG 167816.2
## 5     OK   TORNADO  PROPDMG 165167.9
## 6     FL   TORNADO  PROPDMG 159752.6
## 7     IA   TORNADO  PROPDMG 152142.8
## 8     GA   TORNADO  PROPDMG 151349.5
## 9     TX TSTM WIND  PROPDMG 144959.0
## 10    KS   TORNADO  PROPDMG 143209.9
storm_fat_melt <- melt(storm_damage, id.vars = c("STATE", "EVTYPE"), measure.vars = c("FATALITIES"), na.rm = TRUE)
storm_fat_order <- group_by(storm_fat_melt, STATE, EVTYPE, variable)
storm_fat_sum <- ddply(storm_fat_order, c("STATE", "EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_fat_sum <- arrange(storm_fat_sum, desc(sum))
head(storm_fat_sum, 10)
##    STATE         EVTYPE   variable sum
## 1     IL           HEAT FATALITIES 653
## 2     AL        TORNADO FATALITIES 617
## 3     TX        TORNADO FATALITIES 538
## 4     MS        TORNADO FATALITIES 450
## 5     MO        TORNADO FATALITIES 388
## 6     AR        TORNADO FATALITIES 379
## 7     TN        TORNADO FATALITIES 368
## 8     PA EXCESSIVE HEAT FATALITIES 359
## 9     IL EXCESSIVE HEAT FATALITIES 330
## 10    OK        TORNADO FATALITIES 296
storm_inj_melt <- melt(storm_damage, id.vars = c("STATE", "EVTYPE"), measure.vars = c("INJURIES"), na.rm = TRUE)
storm_inj_order <- group_by(storm_inj_melt, STATE, EVTYPE, variable)
storm_inj_sum <- ddply(storm_inj_order, c("STATE", "EVTYPE", "variable"), summarise, sum = sum(value, na.rm = TRUE))
storm_inj_sum <- arrange(storm_inj_sum, desc(sum))
head(storm_inj_sum, 10)
##    STATE  EVTYPE variable  sum
## 1     TX TORNADO INJURIES 8207
## 2     AL TORNADO INJURIES 7929
## 3     TX   FLOOD INJURIES 6338
## 4     MS TORNADO INJURIES 6244
## 5     AR TORNADO INJURIES 5116
## 6     OK TORNADO INJURIES 4829
## 7     TN TORNADO INJURIES 4748
## 8     OH TORNADO INJURIES 4438
## 9     MO TORNADO INJURIES 4330
## 10    IN TORNADO INJURIES 4224