SYNOPSIS:


This is a report on the effects of weather events on public health and the economy in the United States from 1950 to 2011. The basic goal of this report is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to answer basic questions about severe weather events.

This data analysis used the statistical software R x64 3.2.5. The report addresses two questions. 1. Which types of events across the United States were the most harmful with respect to population health? This is assessed by the total number of injuries and fatalities over the time period from 1950 to 2011. 2. Which types of events have the greatest consequences for the U.S. economy? This is assessed by the total amount of damage, measured in dollars.


Load Libraries

### Used to Unzip Data file

library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## R.utils v2.3.0 (2016-04-13) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Download the NOAA Database
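
The download itself is not shown in this report. A minimal sketch of how such a step might look, assuming the compressed file is fetched from a URL supplied by the reader (the `fetch_if_missing` helper and its `url` argument are hypothetical; the demonstration uses a local `file://` URL so that it runs without network access):

```r
# Hypothetical helper: download a file only when it is not already present.
# `url` must be supplied by the reader; this report does not state it.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps a compressed archive byte-for-byte intact
    download.file(url, destfile, mode = "wb", quiet = TRUE)
  }
  invisible(destfile)
}

# Self-contained demonstration with a local file:// URL instead of a real server
src <- tempfile(fileext = ".txt")
writeLines("demo", src)
dest <- tempfile(fileext = ".txt")
fetch_if_missing(paste0("file://", normalizePath(src)), dest)
file.exists(dest)
```

For this report the destination would be "FStormData.csv.bz2", the file name the unzip step expects.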


Unzip NOAA DATA

The data file is unzipped in R with the “bunzip2” command from the “R.utils” package.

if(!file.exists('StormData.csv')){
   bunzip2 ("FStormData.csv.bz2", overwrite=TRUE, remove=FALSE, destname="StormData.csv")
}

Load Raw data


The data set is loaded with the ‘read.csv’ command, with header set to TRUE so that the first row supplies the column names, and the comma defined as the separator.

raw_data <- read.csv("StormData.csv", sep=',', header=TRUE )
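
As an aside, `read.csv` can also read a bzip2-compressed file directly, because R's file connections detect the compression when a file is opened for reading. A small self-contained sketch with made-up data (the `toy` data frame and temporary file are illustrative, not part of the NOAA data):

```r
# Write a tiny made-up CSV through a bzip2 connection
toy <- data.frame(EVTYPE = c("HAIL", "TORNADO"), INJURIES = c(0, 15))
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the bzip2 compression
toy_back <- read.csv(bz_path, stringsAsFactors = FALSE)
toy_back
```

Reading the compressed file directly avoids keeping a second, uncompressed copy on disk, at the cost of decompressing on every load.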

Examine the Data

### Looking at the dimensions and the different kinds of variables.

dim(raw_data)
## [1] 902297     37
str(raw_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Data Processing


We only need the variables for the state names, event types, fatalities, injuries, and the damage variables.

data <- raw_data[,c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP","CROPDMG", "CROPDMGEXP")]

sum (is.na (data))
## [1] 0
head(data)
##   STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1    AL TORNADO          0       15    25.0          K       0           
## 2    AL TORNADO          0        0     2.5          K       0           
## 3    AL TORNADO          0        2    25.0          K       0           
## 4    AL TORNADO          0        2     2.5          K       0           
## 5    AL TORNADO          0        2     2.5          K       0           
## 6    AL TORNADO          0        6     2.5          K       0
summary(data)
##      STATE                      EVTYPE         FATALITIES      
##  TX     : 83728   HAIL             :288661   Min.   :  0.0000  
##  KS     : 53440   TSTM WIND        :219940   1st Qu.:  0.0000  
##  OK     : 46802   THUNDERSTORM WIND: 82563   Median :  0.0000  
##  MO     : 35648   TORNADO          : 60652   Mean   :  0.0168  
##  IA     : 31069   FLASH FLOOD      : 54277   3rd Qu.:  0.0000  
##  NE     : 30271   FLOOD            : 25326   Max.   :583.0000  
##  (Other):621339   (Other)          :170878                     
##     INJURIES            PROPDMG          PROPDMGEXP        CROPDMG       
##  Min.   :   0.0000   Min.   :   0.00          :465934   Min.   :  0.000  
##  1st Qu.:   0.0000   1st Qu.:   0.00   K      :424665   1st Qu.:  0.000  
##  Median :   0.0000   Median :   0.00   M      : 11330   Median :  0.000  
##  Mean   :   0.1557   Mean   :  12.06   0      :   216   Mean   :  1.527  
##  3rd Qu.:   0.0000   3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000  
##  Max.   :1700.0000   Max.   :5000.00   5      :    28   Max.   :990.000  
##                                        (Other):    84                    
##    CROPDMGEXP    
##         :618413  
##  K      :281832  
##  M      :  1994  
##  k      :    21  
##  0      :    19  
##  B      :     9  
##  (Other):     9

Clean the NOAA dataset


After examining the dataset, it is clear that many event types are recorded under erroneous or nearly identical names. For example, “FLASH FLOODING” and “FLOOD/FLASH FLOOD” are the same event type, as documented on page 6 of the storm data documentation. The dataset is therefore cleaned by combining equivalent event types, following the table on page 6 of the documentation.

###Examine the EVTYPE data 
unique_evtype <-summary(data$EVTYPE)

str(unique_evtype)
##  Named int [1:100] 288661 219940 82563 60652 54277 25326 20843 20212 15754 15708 ...
##  - attr(*, "names")= chr [1:100] "HAIL" "TSTM WIND" "THUNDERSTORM WIND" "TORNADO" ...

Clean EVTYPE data labels

# clean EVTYPE, aggregate duplicate labels
data$EVTYPE <- toupper(str_trim(data$EVTYPE))
data$EVTYPE <- gsub("TSTM WIND", "MARINE THUNDERSTORM WIND" , data$EVTYPE)
data$EVTYPE <- gsub("URBAN/SML STREAM FLD", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub("MARINE TSTM WIND","MARINE THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("WILD/FOREST FIRE", "WILDFIRE", data$EVTYPE)
data$EVTYPE <- gsub("marinethunderstormwind/hail", "marinethunderstormwind", data$EVTYPE)
data$EVTYPE <- gsub("TSTM WIND/HAIL","MARINE THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("flashflooding", "flashflood", data$EVTYPE)
data$EVTYPE <- gsub("FLOOD/FLASH FLOOD", "FLASH FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("WINTER data/MIX", "WINTER data", data$EVTYPE)
data$EVTYPE <- gsub("RIP CURRENTS", "RIP CURRENT", data$EVTYPE)
data$EVTYPE <- gsub("DENSEDENSEFOG", "DENSE FOG", data$EVTYPE)
data$EVTYPE <- gsub("STRONG WINDS","STRONG WIND", data$EVTYPE)
data$EVTYPE <- gsub("COASTAL FLOODING", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("RIVER FLOOD", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("RECORD WARMTH", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub("RECORD HEAT", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub("FREEZE", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub("HEATWAVE", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub("HURRICANE/TYPHOON", "HURRICANE", data$EVTYPE)
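
Note that `gsub` matches its pattern as a case-sensitive substring. A rule such as the `FREEZE` one therefore also rewrites labels that merely contain the pattern, and a lower-case pattern never matches labels that have been upper-cased. A hedged alternative sketch using an exact-match lookup on a few made-up labels (the `recode_map` table and `recode_exact` helper are illustrative assumptions, not the mapping that produced the figures in this report):

```r
labels <- c("FREEZE", "FROST/FREEZE", "RIP CURRENTS", "HAIL")

# Substring replacement also touches the label that already was correct
gsub("FREEZE", "FROST/FREEZE", labels)

# Exact-match lookup: names are the old labels, values are the new ones
recode_map <- c("FREEZE"       = "FROST/FREEZE",
                "RIP CURRENTS" = "RIP CURRENT")

recode_exact <- function(x, map) {
  hit <- match(x, names(map))            # NA where no exact match exists
  ifelse(is.na(hit), x, unname(map[hit]))
}

recode_exact(labels, recode_map)
```

With the lookup, a label changes only when it equals an entry in the table as a whole, so labels that were already in their target form stay untouched.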

Results


Effects on Public Health (Injuries and Fatalities)

In the next steps we investigate the types of events that caused the largest total numbers of fatalities and injuries from 1950 to 2011.

#### first aggregate the number of injuries for each type of event in a new dataset "injuries"

injuries <- aggregate(INJURIES~EVTYPE, data=data, sum)

Now exclude all events with zero injuries, using the package “dplyr”, and then sort the injuries dataset by decreasing number of injuries.

injuries_noZero<- injuries %>% filter(INJURIES > 0)

injuries_Ordered <- injuries_noZero[order(-injuries_noZero$INJURIES),]
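
The three steps above (sum per event type, drop zero totals, sort in decreasing order) can be illustrated on a tiny made-up data frame. The values are invented for the sketch, and base R is used so that it runs without extra packages:

```r
toy <- data.frame(
  EVTYPE   = c("HAIL", "TORNADO", "FLOOD", "TORNADO", "FOG", "FLOOD"),
  INJURIES = c(0, 15, 3, 2, 0, 4)
)

# 1. total injuries per event type
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)

# 2. keep only event types with at least one injury
totals <- totals[totals$INJURIES > 0, ]

# 3. sort by decreasing total
totals <- totals[order(-totals$INJURIES), ]
totals
```

Only TORNADO (17) and FLOOD (7) remain, in that order; HAIL and FOG are dropped because their totals are zero.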

Next, plot the 10 most significant events by total number of injuries with the ggplot2 package. EVTYPE is first converted to a factor whose levels follow the sorted order, so that the ggplot output isn’t sorted alphabetically.

injuries_Ordered$EVTYPE <-  factor(injuries_Ordered$EVTYPE , levels = injuries_Ordered$EVTYPE)
ggplot(injuries_Ordered[1:10,], aes(x=EVTYPE, y=INJURIES, fill=INJURIES))+
  geom_bar(stat ="identity") + 
  theme_bw() +  
  theme(plot.title = element_text(color="BLACK", size=20, face="bold"),
        axis.text.x  = element_text(angle=65, vjust=0.5, size=12)
        ) +
  ggtitle("Injuries by Top 10 Events") + 
  xlab("Types of Weather Events") + 
  ylab("Total number of Injuries") + 
  coord_flip()  +
  scale_y_continuous(breaks = seq(0,100000, by = 10000)) + 
  scale_fill_continuous(name="Total Injuries")

FIGURE 1 Plot of Injuries by top 10 Events
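
The factor conversion before the plot matters because a discrete axis in ggplot2 follows the order of the factor levels, and `factor()` sorts levels alphabetically unless told otherwise. A minimal base-R sketch with made-up labels:

```r
ev <- c("TORNADO", "FLOOD", "HAIL")   # assumed to be sorted by some total already

levels(factor(ev))                 # default: alphabetical order
levels(factor(ev, levels = ev))    # explicit levels: keeps the sorted order
```

Passing the sorted vector itself as `levels` is what keeps the bars in order of magnitude.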

Clean the Fatalities data

Now aggregate the fatalities by event type, remove the zeros, and sort in decreasing order.

fatalities <- aggregate(FATALITIES~EVTYPE, data=data, sum)

fatalities_NoZero <- fatalities %>% filter(FATALITIES > 0)

fatalities_Ordered <- fatalities_NoZero[order(-fatalities_NoZero$FATALITIES),]

fatalities_Ordered$EVTYPE <- factor(fatalities_Ordered$EVTYPE, levels = fatalities_Ordered$EVTYPE)
ggplot(fatalities_Ordered[1:10,], aes(x=EVTYPE, y=FATALITIES, fill=FATALITIES)) +
  theme_bw() +  
  theme(plot.title = element_text(color="BLACK", size=20, face="bold"))+ 
  geom_bar(stat ="identity",fill="blue", colour="black") + 
  ggtitle("Fatalities by Top 10 Weather Events") + 
  xlab("Types of Weather Events") + 
  ylab("Total Number of Fatalities") + 
  coord_flip()+scale_y_continuous(breaks = seq(0,6000, by = 500)) +   
  scale_fill_continuous(name="Total Number")

FIGURE 2 Plot of Fatalities by top 10 Events

Tornadoes clearly cause the most injuries in the U.S., with 91,346 cases, followed by the category labelled MARINE THUNDERSTORM WIND (which, after the relabelling step above, contains the original TSTM WIND records) with roughly 7,000 cases and floods with about 6,800 cases. The most fatalities in this time period were caused by tornadoes (5,633), excessive heat (1,903) and flash floods (995).

Weather Effects on Economy in the United States (1950-2011)

To identify the events that caused the largest costs, we compute the total economic damage in dollars for each type of event. Property and crop damages are summed over the years from 1950 to 2011.

Summary Property Damage

summary(data$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00

Summary Crop Damage

summary(data$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.527   0.000 990.000

Summary Crop Damage Exponent

summary(data$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Summary Property Damage Exponent

summary(data$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

Page 12 of the NOAA documentation explains what the symbols in the -DMGEXP variables stand for, e.g. “b” and “B” stand for billions of dollars. The lookup table below maps each symbol to a power of ten.

symbol <- c("", "+", "-", "?", 0:9, "h", "H", "k", "K", "m", "M", "b", "B")
fact <- c(rep(0,4), 0:9, 2, 2, 3, 3, 6, 6, 9, 9)

mult <- data.frame (symbol, fact)

Create new numeric cost variables for property and crop damage by multiplying each damage value by ten raised to the power given by its exponent symbol.

data$damage.prop <- data$PROPDMG*10^mult[match(data$PROPDMGEXP,mult$symbol),2]
data$damage.crop <- data$CROPDMG*10^mult[match(data$CROPDMGEXP,mult$symbol),2]
data$damage <- data$damage.prop + data$damage.crop
damage <- aggregate (damage~EVTYPE, data, sum)
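
As a worked check of the conversion, a record with PROPDMG = 25 and PROPDMGEXP = "K" becomes 25 × 10^3 = 25,000 dollars. A self-contained sketch of the same lookup on a few made-up rows (the `toy` data frame is illustrative; `symbol` and `fact` are the vectors defined above):

```r
symbol <- c("", "+", "-", "?", 0:9, "h", "H", "k", "K", "m", "M", "b", "B")
fact   <- c(rep(0, 4), 0:9, 2, 2, 3, 3, 6, 6, 9, 9)

toy <- data.frame(DMG = c(25, 2.5, 1.2, 10),
                  EXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

# match() finds the position of each exponent symbol; fact gives the power of ten
toy$dollars <- toy$DMG * 10^fact[match(toy$EXP, symbol)]
toy
```

The four rows come out as 25,000, 2,500,000, 1,200,000,000 and 10 dollars. An empty exponent is treated as 10^0, so the value is taken as-is.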

Express the total damage in billions of dollars and plot the ten costliest event types.

damage$billion <- damage$damage / 1e9
damage <- damage [order(damage$billion, decreasing=TRUE),]

damage$EVTYPE <-  factor(damage$EVTYPE , levels = damage$EVTYPE)
ggplot(damage[1:10,], aes(x=EVTYPE, y=billion))+
  theme_bw() +  
  theme(plot.title = element_text(color="BLACK", size=20, face="bold"))+
  geom_bar(stat ="identity", fill = rainbow (10, start=0, end=0.5))+ 
  ggtitle("Total Damage by Top 10 Weather Events") + 
  xlab("Weather Events Type") + 
  ylab("Total damage in billion USD") + 
  coord_flip() +
  scale_y_continuous(breaks = seq(0,200, by = 25))

FIGURE 3 Plot of Total Damage by top 10 Events

The data suggest that floods and hurricanes caused the most economic damage, with about 160 and 86 billion dollars respectively from 1950 to 2011. Tornadoes, which have the greatest effect on human health, are the third most damaging type of weather event in economic terms. These results are shown in Figure 3.


Conclusion


Tornado events have the strongest impact on public health, as indicated by the total number of injuries and fatalities for the time period from 1950 to 2011.

Floods, on the other hand, caused the greatest economic damage in the United States from 1950 to 2011.