This is the second course project in the Coursera course Reproducible Research.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
A link was provided to download the data, which is approximately 47 MB. The data needs to be processed before we can use it to answer the questions.
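For reproducibility, the raw file can also be fetched directly from R. The sketch below uses a placeholder URL, which must be replaced with the actual link from the assignment page.
# Placeholder URL -- replace with the download link given on the assignment page
data_url <- "https://example.com/repdata_data_StormData.csv.bz2"
if (!file.exists("repdata_data_StormData.csv.bz2")) {
    download.file(data_url, destfile = "repdata_data_StormData.csv.bz2", mode = "wb")
}
# Note: read.csv() can also read the compressed .bz2 file directly if preferred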
Calling some libraries or running certain calculations can produce a lot of warnings, which makes the final Markdown file look cluttered, so we use the opts_chunk$set function to turn off the warnings that might show up.
knitr::opts_chunk$set(warning = FALSE)
The working directory is the same as the directory in which the data has been downloaded and extracted. We list the contents of the directory with the dir() function.
dir()
## [1] "course_project_2.html" "course_project_2.Rmd"
## [3] "course_project_2.txt" "FALSE.html"
## [5] "repdata_data_StormData.csv" "repdata_data_StormData.csv.bz2"
## [7] "rsconnect"
We see that the directory contains the required file with a .csv extension, so we will use the read.csv() function to read the entire file.
Now we declare a new variable, StormData, that will store all of the downloaded storm data.
StormData <- read.csv("repdata_data_StormData.csv")
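Reading a file of this size with read.csv() is slow. If load time is a concern, data.table::fread() (assuming the data.table package is installed) reads the same file much faster; converting the result back to a plain data frame keeps the rest of the analysis unchanged, although column types may differ slightly (character instead of factor).
# Optional faster read; fread() returns a data.table, so convert back to a data.frame
library(data.table)
StormData <- as.data.frame(fread("repdata_data_StormData.csv"))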
It takes some time to read the entire data set, and we only need part of it, but before subsetting we check the structure and dimensions of the data.
dim(StormData)
## [1] 902297 37
str(StormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The dimensions of the data are 902297 by 37, and the structure shows that the data is stored as a data.frame. Now it is time to process the data.
We need only the following columns:
1. EVTYPE
2. FATALITIES
3. INJURIES
4. PROPDMG
5. PROPDMGEXP
6. CROPDMG
7. CROPDMGEXP
As these are the required variables, we declare a new variable reqvar, a vector of the required column names, and make a new data frame named newdata that contains only those columns.
reqvar <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
newdata <- StormData[, reqvar]
We check the dimensions of the newdata variable and look at its head and tail.
dim(newdata)
## [1] 902297 7
head(newdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
tail(newdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER 0 0 0 K 0 K
## 902293 HIGH WIND 0 0 0 K 0 K
## 902294 HIGH WIND 0 0 0 K 0 K
## 902295 HIGH WIND 0 0 0 K 0 K
## 902296 BLIZZARD 0 0 0 K 0 K
## 902297 HEAVY SNOW 0 0 0 K 0 K
The displayed rows show no NA values, but to confirm, we check the sum of is.na() for some of the variables in newdata.
sum(is.na(newdata$FATALITIES))
## [1] 0
sum(is.na(newdata$INJURIES))
## [1] 0
sum(is.na(newdata$PROPDMG))
## [1] 0
The sum of is.na() is zero for each of these variables, so these columns of the newdata data frame contain no NAs.
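A compact way to check all of the retained columns at once is:
# Count of missing values in each retained column
colSums(is.na(newdata))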
First we look at the 10 event types with the highest counts. We do this by tabulating EVTYPE with table(), sorting the result in decreasing order, and viewing only the first 10 entries.
sort(table(newdata$EVTYPE), decreasing = TRUE)[1:10]
##
## HAIL TSTM WIND THUNDERSTORM WIND TORNADO
## 288661 219940 82563 60652
## FLASH FLOOD FLOOD THUNDERSTORM WINDS HIGH WIND
## 54277 25326 20843 20212
## LIGHTNING HEAVY SNOW
## 15754 15708
We see that there are multiple values like HIGH WIND, TSTM WIND, THUNDERSTORM WINDS and many others that all describe wind events, so we group such values together. We do this for nine categories and label everything else as OTHER. The categories are:
1. HAIL
2. HEAT
3. FLOOD
4. WIND
5. SNOW
6. STORM
7. WINTER
8. RAIN
9. TORNADO
10. OTHER
newdata$EVENT <- "OTHER"
# Rows whose EVTYPE contains "HAIL" (ignoring case) are categorised as "HAIL".
newdata$EVENT[grep("HAIL", newdata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
# Rows whose EVTYPE contains "HEAT" (ignoring case) are categorised as "HEAT", and so on for the remaining categories below.
newdata$EVENT[grep("HEAT", newdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
newdata$EVENT[grep("FLOOD", newdata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
newdata$EVENT[grep("WIND", newdata$EVTYPE, ignore.case = TRUE)] <- "WIND"
newdata$EVENT[grep("SNOW", newdata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
newdata$EVENT[grep("STORM", newdata$EVTYPE, ignore.case = TRUE)] <- "STORM"
newdata$EVENT[grep("WINTER", newdata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
newdata$EVENT[grep("RAIN", newdata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
newdata$EVENT[grep("TORNADO", newdata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
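Note that the order of these substitutions matters: a row whose EVTYPE matches several patterns ends up in the category assigned last. As a quick sanity check (not part of the original output), the raw event types that fall through to OTHER can be inspected like this:
# Most frequent raw event types that none of the nine patterns matched
sort(table(newdata$EVTYPE[newdata$EVENT == "OTHER"]), decreasing = TRUE)[1:10]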
Now we check how the regrouped table looks, again viewing the first 10 entries in decreasing order.
sort(table(newdata$EVENT), decreasing = TRUE)[1:10]
##
## HAIL WIND STORM FLOOD TORNADO OTHER WINTER SNOW RAIN HEAT
## 289270 255362 113180 82686 60700 48970 19604 17636 12241 2648
The table has been sorted and the event names are as we required.
Next we look at the first 10 elements of the sorted tables of the property and crop damage exponent columns.
sort(table(newdata$PROPDMGEXP), decreasing = TRUE)[1:10]
##
## K M 0 B 5 1 2 ? m
## 465934 424665 11330 216 40 28 25 13 8 7
sort(table(newdata$CROPDMGEXP), decreasing = TRUE)[1:10]
##
## K M k 0 B ? 2 m <NA>
## 618413 281832 1994 21 19 9 7 1 1
We see that there are several different codes representing multiples of a dollar:
* B or b: billion (10^9)
* M or m: million (10^6)
* K or k: thousand (10^3)
So we convert all of these codes into a single numeric exponent so that the damage amounts are comparable.
First we deal with the property damage and add a new column, property_damage, to the newdata data frame.
newdata$PROPDMGEXP <- as.character(newdata$PROPDMGEXP)
newdata$PROPDMGEXP[is.na(newdata$PROPDMGEXP)] <- 0
newdata$PROPDMGEXP[!grepl("K|M|B", newdata$PROPDMGEXP, ignore.case = TRUE)] <- 0
newdata$PROPDMGEXP[grep("K", newdata$PROPDMGEXP, ignore.case = TRUE)] <- "3"
newdata$PROPDMGEXP[grep("M", newdata$PROPDMGEXP, ignore.case = TRUE)] <- "6"
newdata$PROPDMGEXP[grep("B", newdata$PROPDMGEXP, ignore.case = TRUE)] <- "9"
newdata$PROPDMGEXP <- as.numeric(as.character(newdata$PROPDMGEXP))
newdata$property_damage <- newdata$PROPDMG * 10^newdata$PROPDMGEXP
We tabulate the property_damage variable, sort it in decreasing order, and look at the first 10 entries.
sort(table(newdata$property_damage), decreasing = TRUE)[1:10]
##
## 0 5000 10000 1000 2000 25000 50000 3000 20000 15000
## 663123 31731 21787 17544 17186 17104 13596 10364 9179 8617
The property damage data is now as required; all amounts are in dollars ($).
Now we deal with the crop damage and add a new column, crop_damage, to the newdata data frame.
newdata$CROPDMGEXP <- as.character(newdata$CROPDMGEXP)
newdata$CROPDMGEXP[is.na(newdata$CROPDMGEXP)] <- 0
newdata$CROPDMGEXP[!grepl("K|M|B", newdata$CROPDMGEXP, ignore.case = TRUE)] <- 0
newdata$CROPDMGEXP[grep("K", newdata$CROPDMGEXP, ignore.case = TRUE)] <- "3"
newdata$CROPDMGEXP[grep("M", newdata$CROPDMGEXP, ignore.case = TRUE)] <- "6"
newdata$CROPDMGEXP[grep("B", newdata$CROPDMGEXP, ignore.case = TRUE)] <- "9"
newdata$CROPDMGEXP <- as.numeric(as.character(newdata$CROPDMGEXP))
newdata$crop_damage <- newdata$CROPDMG * 10^newdata$CROPDMGEXP
We tabulate the crop_damage variable, sort it in decreasing order, and look at the first 10 entries.
sort(table(newdata$crop_damage), decreasing = TRUE)[1:10]
##
## 0 5000 10000 50000 1e+05 1000 2000 25000 20000 5e+05
## 880198 4097 2349 1984 1233 956 951 830 758 721
The crop damage data is now as required; all amounts are in dollars ($).
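The property and crop exponent handling above is identical apart from the column names, so it could be factored into a small helper function. This is only a sketch of a possible refactor, not part of the original analysis; it reproduces the behaviour above, treating any code other than K/M/B (blank, "?", "+", digits) as a multiplier of 1.
# Map a damage exponent code to a numeric multiplier (K/M/B, case-insensitive; everything else -> 1)
exp_to_multiplier <- function(code) {
    code <- toupper(as.character(code))
    mult <- rep(1, length(code))
    mult[code %in% "K"] <- 1e3
    mult[code %in% "M"] <- 1e6
    mult[code %in% "B"] <- 1e9
    mult
}
# Equivalent to the two blocks above:
# newdata$property_damage <- newdata$PROPDMG * exp_to_multiplier(newdata$PROPDMGEXP)
# newdata$crop_damage     <- newdata$CROPDMG * exp_to_multiplier(newdata$CROPDMGEXP)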
For the analysis we need three libraries.
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
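The attach messages above could also be silenced, in the same spirit as turning off warnings at the top of the report, for example with suppressPackageStartupMessages():
# Optional: load the packages without printing their start-up messages
suppressPackageStartupMessages({
    library(plyr)
    library(dplyr)
    library(ggplot2)
})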
plyr provides the ddply() and join() functions that we use to aggregate and combine the data, and ggplot2 is used to plot the results.
First we aggregate the fatalities from newdata into a new variable, agg_fatality.
agg_fatality <- ddply(newdata, .(EVENT), summarise, Total = sum(FATALITIES, na.rm = TRUE))
agg_fatality$type <- "fatality"
Now we aggregate the injuries from newdata into a new variable, agg_injury.
agg_injury <- ddply(newdata, .(EVENT), summarize, Total = sum(INJURIES, na.rm = TRUE))
agg_injury$type <- "injury"
Now we make two new data frames:
1. agg_health : the fatality and injury data frames bound together row-wise.
2. health_as_per_event : shows the fatality and injury totals side by side for each event.
agg_health <- rbind(agg_fatality, agg_injury)
health_as_per_event <- join(agg_fatality, agg_injury, by = "EVENT", type = "inner")
health_as_per_event
## EVENT Total type Total type
## 1 FLOOD 1525 fatality 8602 injury
## 2 HAIL 16 fatality 1371 injury
## 3 HEAT 3139 fatality 9224 injury
## 4 OTHER 2627 fatality 12224 injury
## 5 RAIN 115 fatality 305 injury
## 6 SNOW 165 fatality 1164 injury
## 7 STORM 417 fatality 5339 injury
## 8 TORNADO 5662 fatality 91407 injury
## 9 WIND 1210 fatality 9001 injury
## 10 WINTER 279 fatality 1891 injury
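Since dplyr is already loaded, the same totals could also be computed without plyr. This is a sketch of an equivalent aggregation, calling dplyr::summarise() explicitly to avoid the plyr/dplyr masking noted above:
# Equivalent aggregation with dplyr alone: total fatalities and injuries per event group
library(dplyr)
health_by_event_dplyr <- newdata %>%
    group_by(EVENT) %>%
    dplyr::summarise(fatalities = sum(FATALITIES, na.rm = TRUE),
                     injuries   = sum(INJURIES, na.rm = TRUE))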
Similarly, we aggregate the property damage from newdata into a new variable, agg_prop.
agg_prop <- ddply(newdata, .(EVENT), summarise, Total = sum(property_damage, na.rm = TRUE))
agg_prop$type <- "property"
Now we aggregate the crop damage from newdata into a new variable, agg_crop.
agg_crop <- ddply(newdata, .(EVENT), summarize, Total = sum(crop_damage, na.rm = TRUE))
agg_crop$type <- "crop"
Now we make two new data frames:
1. agg_money : the property and crop damage data frames bound together row-wise.
2. money_by_event : shows the property and crop damage totals side by side for each event.
agg_money <- rbind(agg_prop, agg_crop)
money_by_event <- join(agg_prop, agg_crop, by = "EVENT", type = "inner")
money_by_event
## EVENT Total type Total type
## 1 FLOOD 167502193930 property 12266906100 crop
## 2 HAIL 15733043049 property 3046837473 crop
## 3 HEAT 20325751 property 904469280 crop
## 4 OTHER 97246712338 property 23588880870 crop
## 5 RAIN 3270230193 property 919315800 crop
## 6 SNOW 1023569753 property 134683100 crop
## 7 STORM 66305015394 property 6374474888 crop
## 8 TORNADO 58593098030 property 417461520 crop
## 9 WIND 10847166619 property 1403719150 crop
## 10 WINTER 6777295252 property 47444000 crop
Now we plot the results to show which event types have the largest impact on population health and on the economy.
# Stacked bar chart of fatalities and injuries per event category
ggplot(agg_health, aes(x = EVENT, y = Total, fill = type)) + geom_bar(stat = "identity") + coord_flip() +
    xlab("Event Type") + ylab("Total health impact") + ggtitle("Event type vs. health impact")
# Stacked bar chart of property and crop damage (in dollars) per event category
ggplot(agg_money, aes(x = EVENT, y = Total, fill = type)) + geom_bar(stat = "identity") + coord_flip() +
    xlab("Event Type") + ylab("Total damage in dollars") + ggtitle("Event type vs. economic damage")
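As an optional refinement that is not in the original figures, the event categories can be ordered by their combined totals so the most harmful events appear at the top of each flipped bar chart; a sketch for the health plot:
# Order the event categories by combined health impact before plotting (optional tweak)
event_order <- names(sort(tapply(agg_health$Total, agg_health$EVENT, sum)))
agg_health$EVENT <- factor(agg_health$EVENT, levels = event_order)
ggplot(agg_health, aes(x = EVENT, y = Total, fill = type)) + geom_bar(stat = "identity") + coord_flip() +
    xlab("Event Type") + ylab("Total health impact") + ggtitle("Event type vs. health impact")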