Reproducible Research - Peer Assessment II - Analysis of NWS Storm Database

Synopsis

This purpose of this report is to load and analyze the National Weather Service Storm Database, and answer some basic questions regarding trends across the United States. It will be a combination of text, graphics, and code (r) in order to enable the reader to reproduce any of the analysis steps (if desired). The main two questions that this report will answer are:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events (as indicated in the EVTYPE variable) have the greatest economic consequences?

Data Processing

This section of the report shows all steps of the data processing necessary in the creation of the NWS Storm Database Analysis.

1. The first step of this analysis is to load the NWS database file “repdata-data-StormData.csv.bz2”. This data file can be downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

This file will be read into memory, and stored into a dataframe called “weather_data”.

#1. Load the data - "repdata-data-StormData.csv.bz2"
weather_data<-read.csv(bzfile("repdata-data-StormData.csv.bz2", "r"), header=TRUE)

2. The next step is to verify that the file has been read in correctly and check the file layout. I’ll do this using the head() function for the first 5 rows. I’ll also store the column names into a character vector called “names” so that I can use it for displaying or selecting columns of interest during subsequent processing steps.

options(width = 125) #sets correct width for viewing
head(weather_data,5)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO         0                            
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO         0                            
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO         0                            
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO         0                            
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO         0                            
##   END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1                   0         NA         0                      14.0   100 3   0          0       15    25.0          K
## 2                   0         NA         0                       2.0   150 2   0          0        0     2.5          K
## 3                   0         NA         0                       0.1   123 2   0          0        2    25.0          K
## 4                   0         NA         0                       0.0   100 2   0          0        2     2.5          K
## 5                   0         NA         0                       0.0   150 2   0          0        2     2.5          K
##   CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       0                                         3040      8812       3051       8806              1
## 2       0                                         3042      8755          0          0              2
## 3       0                                         3340      8742          0          0              3
## 4       0                                         3458      8626          0          0              4
## 5       0                                         3412      8642          0          0              5

names<-names(weather_data)
names

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"     "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE" 
## [10] "BGN_AZI"    "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN" "END_RANGE"  "END_AZI"    "END_LOCATI"
## [19] "LENGTH"     "WIDTH"      "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"    "PROPDMGEXP" "CROPDMG"   
## [28] "CROPDMGEXP" "WFO"        "STATEOFFIC" "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_" "REMARKS"   
## [37] "REFNUM"

3. I’ll be using the “sqldf” library in order to further process the data, so I load it at the start of this next data processing step.
Next, I create 3 new dataframes that are each unique “Top 10” aggregated queries by event type (EVTYPE).

The first is called “agg_fatalities”, and is a “Top 10” aggregated query of fatalities by event type (EVTYPE).
The second is called “agg_injuries”, and is a “Top 10” aggregated query of injuries by event type (EVTYPE).
The third is called “agg_injuries_or_fatalities”, and is a “Top 10” aggregated query of injuries and fatalities by event type (EVTYPE).

After creating these three dataframes I re-factor the event type (just using the list of the top 10), and also re-order the event type factors by the relevent sums. This is done so that when later plotted in the “Results” section the event type plots will be in decending order from largest to smallest.

library(sqldf)
agg_fatalities<-sqldf("select EVTYPE, sum(FATALITIES) as total_fatalities, 
                      count(*) as count from weather_data group by EVTYPE order by total_fatalities desc limit 10")
agg_fatalities$EVTYPE<-factor(agg_fatalities$EVTYPE)
agg_fatalities$EVTYPE <- reorder(agg_fatalities$EVTYPE, -agg_fatalities$total_fatalities)
agg_fatalities

##            EVTYPE total_fatalities  count
## 1         TORNADO             5633  60652
## 2  EXCESSIVE HEAT             1903   1678
## 3     FLASH FLOOD              978  54277
## 4            HEAT              937    767
## 5       LIGHTNING              816  15754
## 6       TSTM WIND              504 219940
## 7           FLOOD              470  25326
## 8     RIP CURRENT              368    470
## 9       HIGH WIND              248  20212
## 10      AVALANCHE              224    386

agg_injuries<-sqldf("select EVTYPE, sum(INJURIES) AS total_injuries, 
                    count(*) as count from weather_data group by EVTYPE order by total_injuries desc limit 10")
agg_injuries$EVTYPE<-factor(agg_injuries$EVTYPE)
agg_injuries$EVTYPE <- reorder(agg_injuries$EVTYPE, -agg_injuries$total_injuries)
agg_injuries

##               EVTYPE total_injuries  count
## 1            TORNADO          91346  60652
## 2          TSTM WIND           6957 219940
## 3              FLOOD           6789  25326
## 4     EXCESSIVE HEAT           6525   1678
## 5          LIGHTNING           5230  15754
## 6               HEAT           2100    767
## 7          ICE STORM           1975   2006
## 8        FLASH FLOOD           1777  54277
## 9  THUNDERSTORM WIND           1488  82563
## 10              HAIL           1361 288661

agg_injuries_or_fatalities<-sqldf("select EVTYPE, sum(FATALITIES) as total_fatalities, sum(INJURIES) AS total_injuries, 
                    sum(FATALITIES) + sum(INJURIES) as total_injuries_or_fatalities, count(*) as count from weather_data 
                    group by EVTYPE order by total_injuries_or_fatalities desc limit 10")

agg_injuries_or_fatalities$EVTYPE<-factor(agg_injuries_or_fatalities$EVTYPE)
agg_injuries_or_fatalities$EVTYPE<- reorder(agg_injuries_or_fatalities$EVTYPE, -agg_injuries_or_fatalities$total_injuries_or_fatalities)
agg_injuries_or_fatalities

##               EVTYPE total_fatalities total_injuries total_injuries_or_fatalities  count
## 1            TORNADO             5633          91346                        96979  60652
## 2     EXCESSIVE HEAT             1903           6525                         8428   1678
## 3          TSTM WIND              504           6957                         7461 219940
## 4              FLOOD              470           6789                         7259  25326
## 5          LIGHTNING              816           5230                         6046  15754
## 6               HEAT              937           2100                         3037    767
## 7        FLASH FLOOD              978           1777                         2755  54277
## 8          ICE STORM               89           1975                         2064   2006
## 9  THUNDERSTORM WIND              133           1488                         1621  82563
## 10      WINTER STORM              206           1321                         1527  11433

4. At this point, I’ve created the data in order to answer the first question for the analysis (and create a plot). However, I still don’t have a dataset that is required to answer the second question (or create a plot). That is going to be the focus of this last data preparation step.

I start by selecting 5 fields from “weather_data” that pertain to event type and damage values (EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP). Using these fields (and understanding the exponent postfix columns) I then create two additional fields (PROPDMG_VALUE, CROPDMG_VALUE). These are calculations for the total property & total crop damages in dollars. The result is stored into an intermediate dataframe that I’m calling temp_a.

The next step is to aggregate all of the property and crop damage from “temp_a” by event type, and then store it into a dataframe for future use. I do this with another sql query and store the result into a dataframe called “agg_dmg_values”, again only showing the “Top 10” event types.

After creating the “agg_dmg_values” dataframe, I then re-factor the event type (just using the list of the “Top 10”), and also re-order the event type factors by the relevent total damage values. This is done so that when later plotted in the “Results” section the event type plots will be in decending order from largest to smallest.

temp_a<-sqldf("select
        EVTYPE,
        PROPDMG, 
        PROPDMGEXP,
        case PROPDMGEXP
                when '' then PROPDMG
                when 'k' then 1000*PROPDMG
                when 'm' then 1000000*PROPDMG
                when 'b' then 1000000000*PROPDMG
                when 'K' then 1000*PROPDMG
                when 'M' then 1000000*PROPDMG
                when 'B' then 1000000000*PROPDMG
                else null end as PROPDMG_VALUE, 
        CROPDMG,
        CROPDMGEXP,
        case CROPDMGEXP
                when '' then CROPDMG
                when 'k' then 1000*CROPDMG
                when 'm' then 1000000*CROPDMG
                when 'b' then 1000000000*CROPDMG
                when 'K' then 1000*CROPDMG
                when 'M' then 1000000*CROPDMG
                when 'B' then 1000000000*CROPDMG
                else null end as CROPDMG_VALUE
        from weather_data")

agg_dmg_values<-sqldf("select EVTYPE, sum(PROPDMG_VALUE + CROPDMG_VALUE) as total_dmg_value 
                      from temp_a 
                      group by EVTYPE
                      order by total_dmg_value desc limit 10")
agg_dmg_values$EVTYPE<-factor(agg_dmg_values$EVTYPE)
agg_dmg_values$EVTYPE <- reorder(agg_dmg_values$EVTYPE, -agg_dmg_values$total_dmg_value)

Results

This section of the report displays the results of the anlysis, and provides answers for the initial two questions. This is going to be fairly easy, now that all of the data is processed!!

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

library(ggplot2)
ggplot(agg_injuries_or_fatalities, aes(x=EVTYPE, y=total_injuries_or_fatalities/1000)) + 
        geom_bar(stat="identity", color="black", fill="orange") + theme(axis.text.x=element_text(angle=45, hjust=1)) + 
        labs(title = "Sum of Injuries and Fatalities [Thousands] by Event Type (EVTYPE)") + 
        labs(x="Event Type (EVTYPE)") + labs(y="Sum of Injuries and Fatalities [Thousands]") +
        geom_text(aes(label=round(total_injuries_or_fatalities/1000,1),vjust=-0.5, size=12)) + scale_size(guide="none")

Looking at this plot it is fairly straight-forward to see that “Tornadoes” are the most harmful to population health. It is almost a factor of 10 greater than the next closest event type (Excessive Heat) when looking at fatalitiy and injury combined counts. The remainder of the “Top 10” event types can be seen in the plot in descending order.

2. Across the United States, which types of events (as indicated in the EVTYPE variable) have the greatest economic consequences?

ggplot(agg_dmg_values, aes(x=EVTYPE, y=total_dmg_value/1000000000)) + 
        geom_bar(stat="identity", color="black", fill="green") + theme(axis.text.x=element_text(angle=45, hjust=1)) + 
        labs(title = "Total Damage Value (Property + Crop) [$ Billions] by Event Type (EVTYPE)") + 
        labs(x="Event Type (EVTYPE)") + labs(y="Total Damage Value (Property + Crop) [$ Billions]") +
        geom_text(aes(label=round(total_dmg_value/1000000000,1),vjust=-0.5, size=12)) + scale_size(guide="none")

Looking at this plot it is fairly straight-forward to see that “Floods” have the greatest econimic consequences. They are more than double the next closest event type (Hurricane/Typhoon) when looking at total damage in dollars. The remainder of the “Top 10” event types can be seen in the plot in descending order.