US Storm Event Human Health and Economic Impact Analysis

Synopsis.

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

1. Data Processing.

Reading Data

download.file ("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./Coursera_Data/storm.csv.bz2")


storm <- read.csv("./Coursera_Data/storm.csv.bz2", header = TRUE) 

This operation will take quite time as the size of database is very big.

Now we are interested in structure of the data.

str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Let’s check the names

names(storm)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Data Cleaning

We want to extract only those variables that we need for our analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
storm <- select(storm, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Now, it is useful to summarize our data.

summary(storm)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9

We don’t need all the variables, but 7. Therefore, we are creating a new dataset with only those variables that we require for futher processing.

library(dplyr)
storm1 <- select(storm, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

For the convinience, let’s rename our variables

colnames(storm1) <- c("EventType", "Fatalities", "Injuries", "PropertyDamage", "PropertyDamageExp", "CropDamage", "CropDamageExp")

Dataset on health impact

The data is very messy. We need to clean it. We start from mapping values of damage variables to the numeric ones.

PropDmg <-  c("\"\"" = 10^0, "-" = 10^0, "+" = 10^0, "0" = 10^0, "1" = 10^1, "2" = 10^2, "3" = 10^3, "4" = 10^4, "5" = 10^5, "6" = 10^6, "7" = 10^7, "8" = 10^8, "9" = 10^9, "H" = 10^2,        "K" = 10^3, "M" = 10^6, "B" = 10^9)

storm1$PropertyDamageExp <- PropDmg[as.character(storm1$PropertyDamageExp)]
storm1$PropertyDamageExp[is.na(storm1$PropertyDamageExp)] <- 10^0

The same needs to be done with the variable on crop damage.

CropDmg <-  c("\"\"" = 10^0,
                 "?" = 10^0, 
                 "0" = 10^0,
                 "K" = 10^3,
                 "M" = 10^6,
                 "B" = 10^9)

storm1$CropDamageExp <- CropDmg[as.character(storm1$CropDamageExp)] 
storm1$CropDamageExp[is.na(storm1$CropDamageExp)] <- 10^0

Now, we can merge a new dataset on fatalities and injuries sorted by Event Type.

storm_health <- aggregate(cbind(Fatalities, Injuries) ~ EventType, data=storm1, FUN=sum)

On the next step, we shoul merge the sets on total injuries and fatalities, removing the observations with 0 effect.

storm_health$total <- storm_health$Fatalities + storm_health$Injuries
storm_health <- storm_health[storm_health$total >0, ]
storm_health <- storm_health[order(storm_health$total, decreasing = T), ] #sorts our data in descending order, as we are interested in top-10. 
rownames(storm_health) <- 1:nrow(storm_health)
healthtopten <- storm_health[1:10, ] 

Dataset on economic impact

We start combining sets and calculating new variables

storm1$PropertyLoss <- storm1$PropertyDamage*storm1$PropertyDamageExp
storm1$CropLoss <- storm1$CropDamage*storm1$CropDamageExp
Economic <- aggregate(cbind(PropertyLoss, CropLoss) ~ EventType, data=storm1, FUN = sum)
Economic$Total <- Economic$PropertyLoss+Economic$CropLoss

Actually, we just repeat the same procedures that we used before with the dataset on health impact.

We delete the observations that don’t have any income leaving only those values that count more than 0.

Economic <- Economic[Economic$Total > 0, ]

As we are interested in the most influental factors, we sort the data in the descending order and extract top 10 lines.

Economic <- Economic[order(Economic$Total, decreasing = T), ]
EconomicTop <- Economic[1:10, ]

2. Results.

1. Which types of events are most harmful to population health?

library(pander)
pander(healthtopten)
EventType Fatalities Injuries total
TORNADO 5633 91346 96979
EXCESSIVE HEAT 1903 6525 8428
TSTM WIND 504 6957 7461
FLOOD 470 6789 7259
LIGHTNING 816 5230 6046
HEAT 937 2100 3037
FLASH FLOOD 978 1777 2755
ICE STORM 89 1975 2064
THUNDERSTORM WIND 133 1488 1621
WINTER STORM 206 1321 1527

Let’s plot it.

library(ggplot2)
library(reshape2)

healthtoptenmelt <- melt(healthtopten, id.vars="EventType")

myplot <- ggplot(healthtoptenmelt, aes(x=EventType, y=value)) 
myplot + geom_bar(stat="identity", aes(fill=variable), position="dodge")+
    scale_y_sqrt("Frequency Count")+
    xlab("Event Type")+
    theme(axis.text.x = element_text(angle=45, hjust=1))+
    ggtitle("Pareto Chart. Top 10 Storm Impacts on Health")

As we can see, tornado has the strongest effect on human health.

2. Which types of events have the greatest economic consequences?

We can answer on this question just by creating a table.

pander(EconomicTop)
  EventType PropertyLoss CropLoss Total
170 FLOOD 1.447e+11 5.662e+09 1.503e+11
411 HURRICANE/TYPHOON 6.931e+10 2.608e+09 7.191e+10
834 TORNADO 5.694e+10 4.15e+08 5.735e+10
670 STORM SURGE 4.332e+10 5000 4.332e+10
244 HAIL 1.573e+10 3.026e+09 1.876e+10
153 FLASH FLOOD 1.682e+10 1.421e+09 1.824e+10
95 DROUGHT 1.046e+09 1.397e+10 1.502e+10
402 HURRICANE 1.187e+10 2.742e+09 1.461e+10
590 RIVER FLOOD 5.119e+09 5.029e+09 1.015e+10
427 ICE STORM 3.945e+09 5.022e+09 8.967e+09

As we can see, flood has the greatest effect on the economy. Let’s create a ggplot for this.

EconomyTopMelt <- melt(EconomicTop, id.vars = "EventType")
myplot2 <- ggplot(EconomyTopMelt, aes(x=EventType, y=value))
myplot2 + geom_bar(stat = "identity", aes(fill=variable))+
    xlab("Event Type")+
        theme(axis.text.x = element_text(angle=45, hjust=1))+
    ggtitle("Pareto Chart. Top 10 Storm Impacts on the Economy, USA")