Synopsis

This R Markdown document analyzes the NOAA Storm Events data to answer the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

About the NOAA Storm Events data: the data set contains 902297 records from 1950 to 2011. Each record has 37 variables, but not all of them matter for this analysis, so I picked the 7 variables relevant to population health and economic damage. After aggregating the data by storm event type, bar charts are plotted to show the top 6 event types in terms of fatalities, injuries and economic damage. Overall, tornadoes are the most hazardous to human health, with 5633 reported fatalities and 91346 reported injuries, and floods have been responsible for the most economic damage ($150+ billion).

Data Analysis

Data preprocessing

library(data.table)
## Warning: package 'data.table' was built under R version 3.1.3
dat<-read.csv("repdata-data-StormData.csv.bz2")
dat<-as.data.table(dat)
names(dat)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
head(dat)
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1:       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2:       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3:       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4:       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5:       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6:       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1: TORNADO         0                                               0
## 2: TORNADO         0                                               0
## 3: TORNADO         0                                               0
## 4: TORNADO         0                                               0
## 5: TORNADO         0                                               0
## 6: TORNADO         0                                               0
##    COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1:         NA         0                      14.0   100 3   0          0
## 2:         NA         0                       2.0   150 2   0          0
## 3:         NA         0                       0.1   123 2   0          0
## 4:         NA         0                       0.0   100 2   0          0
## 5:         NA         0                       0.0   150 2   0          0
## 6:         NA         0                       1.5   177 2   0          0
##    INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1:       15    25.0          K       0                                    
## 2:        0     2.5          K       0                                    
## 3:        2    25.0          K       0                                    
## 4:        2     2.5          K       0                                    
## 5:        2     2.5          K       0                                    
## 6:        6     2.5          K       0                                    
##    LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1:     3040      8812       3051       8806              1
## 2:     3042      8755          0          0              2
## 3:     3340      8742          0          0              3
## 4:     3458      8626          0          0              4
## 5:     3412      8642          0          0              5
## 6:     3450      8748          0          0              6

The following 7 variables are relevant to population health and economic damage:

  • EVTYPE as a measure of event type (e.g. tornado, flood, etc.)
  • FATALITIES as a measure of harm to human health
  • INJURIES as a measure of harm to human health
  • PROPDMG as a measure of property damage and hence economic damage in USD
  • PROPDMGEXP as a measure of magnitude of property damage (e.g. thousands, millions USD, etc.)
  • CROPDMG as a measure of crop damage and hence economic damage in USD
  • CROPDMGEXP as a measure of magnitude of crop damage (e.g. thousands, millions USD, etc.)

Therefore, a new data frame, data, is subset from dat with just these 7 variables.
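
The subsetting call itself is not shown in this report. A minimal sketch of one way to do it, using a made-up toy table in place of dat (the toy values and the names `toy`, `keep` and `toy_subset` are illustrative assumptions, not the author's code):

```r
# Toy stand-in for `dat`; REFNUM and STATE play the role of the unused columns.
toy <- data.frame(
  REFNUM = 1:2, STATE = c("AL", "AL"),
  EVTYPE = c("TORNADO", "TORNADO"),
  FATALITIES = c(0, 0), INJURIES = c(15, 0),
  PROPDMG = c(25, 2.5), PROPDMGEXP = c("K", "K"),
  CROPDMG = c(0, 0), CROPDMGEXP = c("", ""),
  stringsAsFactors = FALSE
)

# The 7 variables listed above.
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# subset() with a character vector in `select` keeps only those columns.
toy_subset <- subset(toy, select = keep)
names(toy_subset)
```

On the real data the equivalent call would be `data <- subset(dat, select = keep)`.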

Data Cleaning

str(data)
## Classes 'data.table' and 'data.frame':   902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
unique(data$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M
unique(data$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

As the structure of data shows, the two variables PROPDMGEXP and CROPDMGEXP hold a mix of letters, digits and symbols, which would cause problems in further analysis, so these character values are replaced with numeric values.
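
The replacement step is not shown in this report. The sketch below is one possible mapping, not necessarily the one used here: it assumes H/h means hundreds (10^2), K/k thousands (10^3), M/m millions (10^6) and B billions (10^9), that a digit is already a power of ten, and that blank or symbol codes ("", "-", "?", "+") mean no multiplier. The helper name `exp_to_power` is hypothetical.

```r
# Hypothetical helper: convert exponent codes to numeric powers of ten.
exp_to_power <- function(x) {
  x <- toupper(as.character(x))                # factor -> character, fold case
  lookup <- c(H = 2, K = 3, M = 6, B = 9)      # assumed meaning of the letters
  power <- suppressWarnings(as.numeric(x))     # digits convert, the rest become NA
  is_letter <- x %in% names(lookup)
  power[is_letter] <- unname(lookup[x[is_letter]])
  power[is.na(power)] <- 0                     # assumption: "", "-", "?", "+" -> 10^0
  power
}

exp_to_power(c("K", "m", "B", "", "?", "5", "h"))
```

Under these assumptions, the conversion could be applied as `data$PROPDMGEXP <- exp_to_power(data$PROPDMGEXP)` and likewise for CROPDMGEXP.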

Data Processing

Sum the property and crop damage to get the total economic impact per event record.

# Assumes PROPDMGEXP and CROPDMGEXP have already been converted to numeric powers of ten
data$Eco_damage <- (data$PROPDMG * 10^data$PROPDMGEXP) + (data$CROPDMG * 10^data$CROPDMGEXP)
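
As a quick check of the arithmetic, here is the same formula on three made-up records. The column names `PROP_pow` and `CROP_pow` are illustrative and stand for exponent columns that already hold numeric powers of ten.

```r
# Made-up records, not taken from the NOAA data.
toy <- data.frame(
  PROPDMG  = c(25, 2.5, 1),
  PROP_pow = c(3, 3, 6),    # thousands, thousands, millions
  CROPDMG  = c(0, 10, 0),
  CROP_pow = c(0, 3, 0)
)

# Total damage in USD = property damage + crop damage, each scaled by 10^power.
toy$Eco_damage <- toy$PROPDMG * 10^toy$PROP_pow + toy$CROPDMG * 10^toy$CROP_pow
toy$Eco_damage
```

The first record works out to 25 * 10^3 + 0 = 25000 USD, the second to 2500 + 10000 = 12500 USD, and the third to 1 * 10^6 = 1000000 USD.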

Aggregate the fatalities, injuries and economic damage by event type.

# Sum each measure within every event type, then sort from largest to smallest
dat_fa <- sort(tapply(data$FATALITIES, data$EVTYPE, sum), decreasing = TRUE)
dat_in <- sort(tapply(data$INJURIES, data$EVTYPE, sum), decreasing = TRUE)
dat_dm <- sort(tapply(data$Eco_damage, data$EVTYPE, sum), decreasing = TRUE)
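
To make the shape of these objects concrete, here is the same tapply/sort pattern on a small made-up table (the values are illustrative only):

```r
# Six made-up records across three event types.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "TORNADO"),
  FATALITIES = c(3, 1, 2, 4, 0, 1)
)

# tapply() sums FATALITIES within each EVTYPE and returns a named vector;
# sort() then puts the largest totals first.
totals <- sort(tapply(toy$FATALITIES, toy$EVTYPE, sum), decreasing = TRUE)

# head() keeps the leading entries; its default of 6 gives the "top 6" used in the plots.
head(totals, 2)
```

The names of the result are the event types and the values are the totals, which is why the plotting code below extracts them with `names()` and `as.vector()`.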

Results

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3

Fatalities per event type

# head() keeps the first 6 entries of the sorted totals
dat1 <- data.frame(names = names(head(dat_fa)), values = as.vector(head(dat_fa)))
# Refer to columns by bare name inside aes(), not with dat1$
ggplot(data = dat1, aes(x = names, y = values)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Storm Event Type") + ylab("Fatalities") +
  ggtitle("Top 6 event types by fatalities")

Injuries per event type

dat2 <- data.frame(names = names(head(dat_in)), values = as.vector(head(dat_in)))
ggplot(data = dat2, aes(x = names, y = values)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Storm Event Type") + ylab("Injuries") +
  ggtitle("Top 6 event types by injuries")

Economic impact per event type

dat3 <- data.frame(names = names(head(dat_dm)), values = as.vector(head(dat_dm)))
ggplot(data = dat3, aes(x = names, y = values)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Storm Event Type") + ylab("Economic Damage ($)") +
  ggtitle("Top 6 event types by economic damage")

Summary

As the plots above show, tornadoes are the most harmful to population health (both injuries and fatalities), and floods have the greatest economic consequences.