Load required packages

# load packages
suppressMessages(require("data.table"))
suppressMessages(require("dplyr"))
suppressMessages(require("tidyr"))
suppressMessages(require("ggplot2"))
suppressMessages(require("gridExtra"))

## Warning: package 'gridExtra' was built under R version 3.1.3

suppressMessages(require("R.utils"))

## Warning: package 'R.utils' was built under R version 3.1.3

## Warning: package 'R.oo' was built under R version 3.1.3

## Warning: package 'R.methodsS3' was built under R version 3.1.3

Data Processing

Loading and preprocessing the data

First we will download the database from the course web site. It is in the form of a comma-separated-value file, compressed via the bzip2 algorithm to reduce its size, so we will have to unzip it.

# download file from the URL
if (!file.exists("stormdata.csv.bz2")) {
        fileUrl = "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileUrl, destfile = "./stormdata.csv.bz2")
        
        # unzip stormdata.csv dataset
        bunzip2("./stormdata.csv.bz2", "./stormdata.csv", overwrite = T, remove = F)
}

Next we will read the stormdata.csv data that is now loaded in the working directory.

stormdata <- read.table("stormdata.csv", header = TRUE, sep = ",")

Then we will explore its dataset.

dim(stormdata)

## [1] 902297     37

We can see that the stormdata dataset has 902297 rows and 37 columns.

Next we will look at a sample view of the head of the dataset and the tail of the dataset.

# view rows at the head of the dataset
head(stormdata)[1:3,]

##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3

# view rows at the tail of the dataset
tail(stormdata)[1:3,]

##        STATE__           BGN_DATE    BGN_TIME TIME_ZONE COUNTY
## 902292      47 11/28/2011 0:00:00 03:00:00 PM       CST     21
## 902293      56 11/30/2011 0:00:00 10:30:00 PM       MST      7
## 902294      30 11/10/2011 0:00:00 02:48:00 PM       MST      9
##                                  COUNTYNAME STATE         EVTYPE BGN_RANGE
## 902292 TNZ001>004 - 019>021 - 048>055 - 088    TN WINTER WEATHER         0
## 902293                         WYZ007 - 017    WY      HIGH WIND         0
## 902294                         MTZ009 - 010    MT      HIGH WIND         0
##        BGN_AZI BGN_LOCATI           END_DATE    END_TIME COUNTY_END
## 902292                    11/29/2011 0:00:00 12:00:00 PM          0
## 902293                    11/30/2011 0:00:00 10:30:00 PM          0
## 902294                    11/10/2011 0:00:00 02:48:00 PM          0
##        COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH  F MAG
## 902292         NA         0                         0     0 NA   0
## 902293         NA         0                         0     0 NA  66
## 902294         NA         0                         0     0 NA  52
##        FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO
## 902292          0        0       0          K       0          K MEG
## 902293          0        0       0          K       0          K RIW
## 902294          0        0       0          K       0          K TFX
##                       STATEOFFIC
## 902292           TENNESSEE, West
## 902293 WYOMING, Central and West
## 902294          MONTANA, Central
##                                                                                                                                                            ZONENAMES
## 902292 LAKE - LAKE - OBION - WEAKLEY - HENRY - DYER - GIBSON - CARROLL - LAUDERDALE - TIPTON - HAYWOOD - CROCKETT - MADISON - CHESTER - HENDERSON - DECATUR - SHELBY
## 902293                                                                              OWL CREEK & BRIDGER MOUNTAINS - OWL CREEK & BRIDGER MOUNTAINS - WIND RIVER BASIN
## 902294                                                                                     NORTH ROCKY MOUNTAIN FRONT - NORTH ROCKY MOUNTAIN FRONT - EASTERN GLACIER
##        LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 902292        0         0          0          0
## 902293        0         0          0          0
## 902294        0         0          0          0
##                                                                                                                                                                                                                                                                                                                                                                                                     REMARKS
## 902292 EPISODE NARRATIVE: A powerful upper level low pressure system brought snow to portions of Northeast Arkansas, the Missouri Bootheel, West Tennessee and extreme north Mississippi. Most areas picked up between 1 and 3 inches of with areas of Northeast Arkansas and the Missouri Bootheel receiving between 4 and 6 inches of snow.EVENT NARRATIVE: Around 1 inch of snow fell in Carroll County.
## 902293                                                                                                                                        EPISODE NARRATIVE: A strong cold front moved south through north central Wyoming bringing high wind to the Meeteetse area and along the south slopes of the western Owl Creek Range. Wind gusts to 76 mph were recorded at Madden Reservoir.EVENT NARRATIVE: 
## 902294                                                                                   EPISODE NARRATIVE: A strong westerly flow aloft produced gusty winds at the surface along the Rocky Mountain front and over the plains of Central Montana. Wind gusts in excess of 60 mph were reported.EVENT NARRATIVE: A wind gust to 60 mph was reported at East Glacier Park 1ENE (the Two Medicine DOT site).
##        REFNUM
## 902292 902292
## 902293 902293
## 902294 902294

From this view, We can see that recording of events in the dataset starts around April 18, 1950 and ends in November 30, 2011. In the earlier years however, their seems to be fewer events recorded than in recent years; which is most likely due to a lack of good recording of storm events.

Next we will plot an histogram of recorded storm events between 1950 and November 2011, to further explore this analysis.

if (dim(stormdata)[2] == 37) {
        stormdata$year <- as.numeric(format(as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))     
        }
hist(stormdata$year, breaks = 30, main = "Recorded Storm Events between 1950 and November 2011", xlab = "Year", ylab= "Number of Storm Events", col= "purple")

By looking at the histogram we can see that poor recording of storm events are between the period 1950 and 1998, with a slow but gradual increase of good recordings between the period 1999 to 2010 and followed by a drastic decrease in 2011 which is the end of the recording period.

Now, in order to identify the variables associated with this research we will look at the required dataset below and the information related to this study from the Storm Data Documentation.

names(stormdata)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"     "year"

The variables identified are:

FATALITIES, located at column [23], that affects population health
INJURIES, located at column [24], that affects population health
PROPDMG, located at column [25] , that affects economic consequencies
PROPDMGEXP, located at column [26], that is an exponent data on economic consequencies
CROPDMG, located at column [27], that affects economic consequencies
CROPDMGEXP, located at column [28], that is an exponent data on economic consequencies
EVTYPE, located at column [8], that has the different types of storm events

Next we will extract these columns and print a summary of the required dataset.

ext.col <- subset(stormdata, select = c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG", "CROPDMGEXP"))
# summarise the ext.col dataset
summary(ext.col)

##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9

From the summary, their are no missing values so we can now process the data for analysis.

Processing the property damage data

Looking at the property damage data, we will have to convert the property damage exponent data column PROPDMGEXP into a column that we will call NUMPROPEXP, that has comparible numerical data based on the units of measurement in the code book Storm Data Documentation.

We will also need to compute a new property damage variable for PROPDMG called PROPDMGVAR based on this conversion.

# property damage exponent data
unique(ext.col$PROPDMGEXP)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

# convert the property data exponent data
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "K"] <- 1000
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "M"] <- 1e+06
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == ""] <- 1
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "B"] <- 1e+09
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "m"] <- 1e+06
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "0"] <- 1
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "5"] <- 1e+05
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "6"] <- 1e+06
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "4"] <- 10000
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "2"] <- 100
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "3"] <- 1000
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "h"] <- 100
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "7"] <- 1e+07
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "H"] <- 100
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "1"] <- 10
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "8"] <- 1e+08
# give 0 to invalid exponent data, so they do not count in
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "+"] <- 0
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "-"] <- 0
ext.col$NUMPROPEXP[ext.col$PROPDMGEXP == "?"] <- 0
# compute the new property damage variable 
ext.col$PROPDMGVAR <- ext.col$PROPDMG * ext.col$NUMPROPEXP

Processing the crop damage data

Looking at the crop damage data, we will have to convert the crop damage exponent data column CROPDMGEXP into a column that we will call NUMCROPEXP, that has comparible numerical data based on the units of measurement in the code book Storm Data Documentation.

We will also need to compute a new crop damage variable for CROPDMG called CROPDMGVAR based on this conversion.

# crop damage exponent data
unique(ext.col$CROPDMGEXP)

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

# convert the crop damage exponent data
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "M"] <- 1e+06
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "K"] <- 1000
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "m"] <- 1e+06
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "B"] <- 1e+09
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "0"] <- 1
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "k"] <- 1000
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "2"] <- 100
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == ""] <- 1
# give 0 to invalid exponent data, so they not count in
ext.col$NUMCROPEXP[ext.col$CROPDMGEXP == "?"] <- 0
# compute the new crop damage variable 
ext.col$CROPDMGVAR <- ext.col$CROPDMG * ext.col$NUMCROPEXP

Now that we have completed processing the property damage data as well as the crop damage data, we can aggregate all variables in the ext.col dataset.

Aggregating all storm data by event

# compute the storm data
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = ext.col, sum)
injuries <- aggregate(INJURIES ~ EVTYPE, data = ext.col, sum)
property <- aggregate(PROPDMGVAR ~ EVTYPE, data = ext.col, sum)
crop <- aggregate(CROPDMGVAR ~ EVTYPE, data = ext.col, sum)

Results

Across the United States, which types of events (as indicated in the [EVTYPE] variable) are most harmful with respect to population health?

To answer this question, we will create a dataset of the top 15 total fatalites and injuries caused by storms and plot 2 graphs, one showing total storm fatalites and the other showing total storm injuries between the years 1950 and 2011.

# get top 15 event with highest fatalities
fatal <- fatalities[order(-fatalities$FATALITIES), ][1:15, ]
# get top 15 event with highest injuries
injury <- injuries[order(-injuries$INJURIES), ][1:15, ]

# plot graph showing fatalities
 fatalitiesPlot <- qplot(EVTYPE, data = fatal, weight = FATALITIES, geom = "bar", binwidth = 1) + 
    scale_y_continuous("Number of Fatalities") + 
    theme(axis.text.x = element_text(angle = 45, 
    hjust = 1)) + xlab("Severe Storm Events") + 
    ggtitle("Total Fatalities \n by Severe Storms in the U.S.\n from 1950 - 2011")

# plot graph showing injuries
injuriesPlot <- qplot(EVTYPE, data = injury, weight = INJURIES, geom = "bar", binwidth = 1) + 
    scale_y_continuous("Number of Injuries") + 
    theme(axis.text.x = element_text(angle = 45, 
    hjust = 1)) + xlab("Severe Storm Events") + 
    ggtitle("Total Injuries \n by Severe Storms in the U.S.\n from 1950 - 2011")

 grid.arrange(fatalitiesPlot, injuriesPlot, ncol=2 )

Base on the outcome of the total fatalities graph, we can see that tornadoes are the most harmful storm type, having over 5000 fatalities, followed by execessive heat having over 1500 fatalities then heat and lightning with approximately 1000 fatalities.
Base on the outcome of the total injuries graph we can see that once again tornadoes are the most harmful storm type, having over 7500 injuries, followed by tropical storm wind, flood and execessive heat; all in close proximity to 1250 injuries.

Across the United States, which types of events have the greatest economic consequences?

To answer this question, We will create a dataset of the top 15 property and crop damages caused by storms and plot 2 graphs, one showing total property damages and the other showing total crop damages between the years 1950 and 2011.

# get top 15 event with highest property damages
properties <- property[order(-property$PROPDMGVAR), ][1:15, ]
# get top 15 event with highest crop damages
crops <- crop[order(-crop$CROPDMGVAR), ][1:15, ]

# plot graph showing property damages
properyPlot <- qplot(EVTYPE, data = properties, weight = PROPDMGVAR/(10^9), geom = "bar", binwidth = 1) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    scale_y_continuous("Property Damages ($ billion)")+ 
    xlab("Severe Storm Events") + 
    ggtitle("Total Property Damages by \n Severe Storms in the U.S. \n from 1950 - 2011")

# plot graph showing crop damages
cropPlot <- qplot(EVTYPE, data = crops, weight = CROPDMGVAR/(10^9), geom = "bar", binwidth = 1) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    scale_y_continuous("Crop Damages ($ billion)") + 
    xlab("Severe Storm Events") + 
    ggtitle("Total Crop Damages by \n Severe Storms in the U.S. \n from 1950 - 2011")

grid.arrange(properyPlot, cropPlot, ncol=2)

Base on the outcome of the total properties damage graph, we can see that floods are the most costly, having over US$125 billion in property damages, followed by hurricane and typhoon having close to US$75 billion. Next is tornado having cost of over US$50 billion and storm surge having cost close to US$50 billion.
Base on the outcome of the total crop damage graph we can see that droughts are the most costly, having over US$12.5 billion in crop damages, followed by flood, having cost of over US$5 billion. Next, almost sharing similar costs of US$5 billion are ice storm and river flood .

Conclusion

From this exploratory analysis we were able to see that among the most harmful storm events identifed, tornadoes are the most destructive. This type of storm not only drastically affects population health but the number one major contributing factor to economic consequencies across the United States from the year 1950 until the year 2011.

Reproducible Research: Peer Assessment 2

The Most Harmful Weather Events in the Unitied States Between 1950 and November 2011

Synopsis

Prepare the working evironment