This report examines the effects of severe weather events on public health and the economy in the United States from 1950 to 2011. Its goal is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to answer basic questions about severe weather events.
This data analysis used the statistical software R x64 3.2.5. The report addresses two questions:
1. Which types of events across the United States were the most harmful with respect to population health? The report considers the total number of injuries and fatalities from 1950 to 2011.
2. Which types of events had the greatest consequences for the U.S. economy? The total amount of damage, measured in dollars, was used to analyze the effects.
### Used to Unzip Data file
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
## R.utils v2.3.0 (2016-04-13) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
The data file is unzipped in R with the command “bunzip2” from the package “R.utils”.
if(!file.exists('StormData.csv')){
bunzip2 ("FStormData.csv.bz2", overwrite=TRUE, remove=FALSE, destname="StormData.csv")
}
The data set is loaded with the ‘read.csv’ command, with the column headers set to TRUE and the comma defined as the separator.
raw_data <- read.csv("StormData.csv", sep=',', header=TRUE )
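As a side note, base R's `read.csv` can usually read a bzip2-compressed file directly, because `file()` connections detect the compression when opened for reading. The explicit unzip step is therefore optional. A minimal sketch on a small, made-up data frame (the temporary file and the `toy` objects are illustrative assumptions, not part of the analysis):

```r
# Toy data standing in for the storm data (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), INJURIES = c(15, 0))

# Write it as a bzip2-compressed CSV in a temporary location
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file without unzipping it first
toy_back <- read.csv(bz_path, stringsAsFactors = FALSE)
print(toy_back)
```

Keeping only the compressed file saves disk space, at the cost of decompressing it on every read.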
### Looking at the dimensions and the different kinds of variables.
dim(raw_data)
## [1] 902297 37
str(raw_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We only need the variables for the state names, event types, fatalities, injuries, and damage.
data <- raw_data[,c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP","CROPDMG", "CROPDMGEXP")]
sum (is.na (data))
## [1] 0
head(data)
## STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 AL TORNADO 0 15 25.0 K 0
## 2 AL TORNADO 0 0 2.5 K 0
## 3 AL TORNADO 0 2 25.0 K 0
## 4 AL TORNADO 0 2 2.5 K 0
## 5 AL TORNADO 0 2 2.5 K 0
## 6 AL TORNADO 0 6 2.5 K 0
summary(data)
## STATE EVTYPE FATALITIES
## TX : 83728 HAIL :288661 Min. : 0.0000
## KS : 53440 TSTM WIND :219940 1st Qu.: 0.0000
## OK : 46802 THUNDERSTORM WIND: 82563 Median : 0.0000
## MO : 35648 TORNADO : 60652 Mean : 0.0168
## IA : 31069 FLASH FLOOD : 54277 3rd Qu.: 0.0000
## NE : 30271 FLOOD : 25326 Max. :583.0000
## (Other):621339 (Other) :170878
## INJURIES PROPDMG PROPDMGEXP CROPDMG
## Min. : 0.0000 Min. : 0.00 :465934 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 K :424665 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 M : 11330 Median : 0.000
## Mean : 0.1557 Mean : 12.06 0 : 216 Mean : 1.527
## 3rd Qu.: 0.0000 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000
## Max. :1700.0000 Max. :5000.00 5 : 28 Max. :990.000
## (Other): 84
## CROPDMGEXP
## :618413
## K :281832
## M : 1994
## k : 21
## 0 : 19
## B : 9
## (Other): 9
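The `sum(is.na(data))` check above returns 0, yet the summary shows that many exponent codes are blank. An empty string is not `NA`, so the two need to be counted separately. A small sketch on made-up values (the `toy` data frame is hypothetical):

```r
# Toy data with the same kind of blank exponent codes (hypothetical values)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 0),
  PROPDMGEXP = c("K", "", ""),
  stringsAsFactors = FALSE
)

# No NA values are reported per column ...
na_per_column <- colSums(is.na(toy))

# ... although two exponent codes are empty strings
blank_per_column <- colSums(toy == "")

print(na_per_column)
print(blank_per_column)
```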
After examining the dataset, many event types turn out to be recorded under several nearly identical names. For example, “FLASH FLOODING” and “FLOOD/FLASH FLOOD” are the same event type according to the table on page 6 of the storm data documentation. The dataset is therefore cleaned by combining the labels for the same type of event, following that table.
### Examine the EVTYPE data
unique_evtype <-summary(data$EVTYPE)
str(unique_evtype)
## Named int [1:100] 288661 219940 82563 60652 54277 25326 20843 20212 15754 15708 ...
## - attr(*, "names")= chr [1:100] "HAIL" "TSTM WIND" "THUNDERSTORM WIND" "TORNADO" ...
# clean EVTYPE, aggregate duplicate labels
data$EVTYPE <- toupper(str_trim(data$EVTYPE))
data$EVTYPE <- gsub("TSTM WIND", "MARINE THUNDERSTORM WIND" , data$EVTYPE)
data$EVTYPE <- gsub("URBAN/SML STREAM FLD", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub("MARINE TSTM WIND","MARINE THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("WILD/FOREST FIRE", "WILDFIRE", data$EVTYPE)
data$EVTYPE <- gsub("marinethunderstormwind/hail", "marinethunderstormwind", data$EVTYPE)
data$EVTYPE <- gsub("TSTM WIND/HAIL","MARINE THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("flashflooding", "flashflood", data$EVTYPE)
data$EVTYPE <- gsub("FLOOD/FLASH FLOOD", "FLASH FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("WINTER STORM/MIX", "WINTER STORM", data$EVTYPE)
data$EVTYPE <- gsub("RIP CURRENTS", "RIP CURRENT", data$EVTYPE)
data$EVTYPE <- gsub("DENSEDENSEFOG", "DENSE FOG", data$EVTYPE)
data$EVTYPE <- gsub("STRONG WINDS","STRONG WIND", data$EVTYPE)
data$EVTYPE <- gsub("COASTAL FLOODING", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("RIVER FLOOD", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("RECORD WARMTH", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub("RECORD HEAT", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub("^FREEZE$", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub("HEATWAVE", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub("HURRICANE/TYPHOON", "HURRICANE", data$EVTYPE)
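Because `gsub` replaces a pattern wherever it occurs, a short pattern can also fire inside a longer label. The sketch below compares unanchored patterns with patterns anchored by `^` and `$`. The `labels` vector is a hypothetical sample, not the full set of event types:

```r
# Hypothetical labels, already trimmed and upper-cased
labels <- c("FREEZE", "FROST/FREEZE", "RIVER FLOOD", "RIVER FLOODING")

# Unanchored: the pattern also matches inside longer labels
unanchored <- gsub("FREEZE", "FROST/FREEZE", labels)
unanchored <- gsub("RIVER FLOOD", "FLOOD", unanchored)

# Anchored: ^ and $ restrict the match to the whole label
anchored <- gsub("^FREEZE$", "FROST/FREEZE", labels)
anchored <- gsub("^RIVER FLOOD$", "FLOOD", anchored)

print(unanchored)
print(anchored)
```

Which variant is appropriate depends on whether partial matches should be merged. Checking `sort(unique(...))` on the result after each replacement is a cheap way to spot unintended labels.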
### Effects on Public Health (Injuries and Fatalities)
The next steps investigate the types of events that caused the largest total numbers of fatalities and injuries from 1950 to 2011.
#### first aggregate the number of injuries for each type of event in a new dataset "injuries"
injuries <- aggregate(INJURIES~EVTYPE, data=data, sum)
Now exclude all events with zero injuries using the package “dplyr”, then sort the injuries dataset by decreasing number of injuries.
injuries_noZero<- injuries %>% filter(INJURIES > 0)
injuries_Ordered<-injuries_noZero[with(injuries_noZero,order(-injuries_noZero$INJURIES)),]
Next, plot the 10 most significant events by total number of injuries with the ggplot2 package. EVTYPE is first converted to a factor whose levels follow the sorted order, so that the ggplot output isn’t sorted alphabetically.
injuries_Ordered$EVTYPE <- factor(injuries_Ordered$EVTYPE , levels = injuries_Ordered$EVTYPE)
ggplot(injuries_Ordered[1:10,], aes(x=EVTYPE, y=INJURIES, fill=INJURIES))+
geom_bar(stat ="identity") +
theme_bw() +
theme(plot.title = element_text(color="BLACK", size=20, face="bold"),
axis.text.x = element_text(angle=65, vjust=0.5, size=12)
) +
ggtitle("Injuries by Top 10 Events") +
xlab("Types of Weather Events") +
ylab("Total number of Injuries") +
coord_flip() +
scale_y_continuous(breaks = seq(0,100000, by = 10000)) +
scale_fill_continuous(name="Total Injuries")
FIGURE 1 Plot of Injuries by top 10 Events
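The re-levelling step before the plot matters because a factor's levels are alphabetical by default, and a discrete plot axis follows the level order. A short base R sketch with made-up totals (the `totals` data frame is hypothetical):

```r
# Hypothetical totals, already sorted by decreasing count
totals <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "HEAT"),
  INJURIES = c(300, 20, 10),
  stringsAsFactors = FALSE
)

# Default factor levels are sorted alphabetically
default_levels <- levels(factor(totals$EVTYPE))

# Passing levels explicitly keeps the order of the sorted data
ordered_levels <- levels(factor(totals$EVTYPE, levels = totals$EVTYPE))

print(default_levels)
print(ordered_levels)
```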
Now aggregate the fatalities by event type, remove the events with zero fatalities, and sort by decreasing number of fatalities.
fatalites <- aggregate(FATALITIES~EVTYPE, data=data, sum)
fatalites_NoZero <-fatalites %>% filter(FATALITIES > 0)
fatalites_Ordered <-fatalites_NoZero[with(fatalites_NoZero,order(-fatalites_NoZero$FATALITIES)),]
fatalites_Ordered$EVTYPE <- factor(fatalites_Ordered$EVTYPE , levels = fatalites_Ordered$EVTYPE)
ggplot(fatalites_Ordered[1:10,], aes(x=EVTYPE, y=FATALITIES, fill=FATALITIES)) +
theme_bw() +
theme(plot.title = element_text(color="BLACK", size=20, face="bold"))+
geom_bar(stat ="identity",fill="blue", colour="black") +
ggtitle("Fatalities by Top 10 Weather Events") +
xlab("Types of Weather Events") +
ylab("Total Number of Fatalities") +
coord_flip()+scale_y_continuous(breaks = seq(0,6000, by = 500)) +
scale_fill_continuous(name="Total Number")
FIGURE 2 A Plot of top 10 fatalities
It is evident that tornadoes cause the most injuries in the U.S. with 91,346 cases, followed by marine thunderstorm wind with about 7,000 cases and flood with 6,800 cases. Tornadoes with 5,633 cases, excessive heat with 1,903 cases, and flash flood with 995 cases caused the most fatalities in the U.S. in this time period.
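Injuries and fatalities can also be totalled in a single call by passing both columns to `aggregate` through `cbind`. This is only a sketch on made-up records (the `events` data frame is hypothetical), shown as an alternative to the two separate aggregations used in this report:

```r
# Hypothetical event records
events <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(1, 0, 2, 0),
  INJURIES   = c(15, 2, 0, 0)
)

# Sum both outcomes per event type in one aggregate() call
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = events, FUN = sum)

# Sort by injuries, largest first
health <- health[order(-health$INJURIES), ]
print(health)
```

Having both totals in one table makes it easier to compare the two rankings side by side.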
Next, the total economic damage in dollars caused by each type of event is computed to identify the events with the largest costs. Property and crop damages are summed over the years from 1950 to 2011.
Summary Property Damage
summary(data$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
Summary Crop Damage
summary(data$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.527 0.000 990.000
Summary Crop Damage Exponent
summary(data$CROPDMGEXP)
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Summary Property Damage Exponent
summary(data$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
Page 12 of the NOAA documentation explains what the symbols in the -DMGEXP variables stand for, e.g. “b” and “B” stand for billions of dollars.
symbol <- c("", "+", "-", "?", 0:9, "h", "H", "k", "K", "m", "M", "b", "B")
fact <- c(rep(0,4), 0:9, 2, 2, 3, 3, 6, 6, 9, 9)
mult <- data.frame (symbol, fact)
Create new cost variables for property and crop damage that hold the damage as plain numeric dollar amounts.
data$damage.prop <- data$PROPDMG*10^mult[match(data$PROPDMGEXP,mult$symbol),2]
data$damage.crop <- data$CROPDMG*10^mult[match(data$CROPDMGEXP,mult$symbol),2]
data$damage <- data$damage.prop + data$damage.crop
damage <- aggregate (damage~EVTYPE, data, sum)
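To make the conversion concrete, the sketch below applies the same `symbol` and `fact` lookup to a few made-up records. The `toy` data frame and its values are hypothetical. Blank and punctuation codes map to an exponent of 0, so those values are left unscaled:

```r
# Same lookup vectors as in the report
symbol <- c("", "+", "-", "?", 0:9, "h", "H", "k", "K", "m", "M", "b", "B")
fact   <- c(rep(0, 4), 0:9, 2, 2, 3, 3, 6, 6, 9, 9)

# Hypothetical damage records
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Look up the exponent for each record and scale the value to dollars
toy$exponent <- fact[match(toy$PROPDMGEXP, symbol)]
toy$dollars  <- toy$PROPDMG * 10^toy$exponent
print(toy)
```

A symbol that is missing from the lookup would yield `NA` from `match`, so checking `anyNA(toy$exponent)` is a quick guard against unmapped codes.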
Plot the damage in billions of dollars by dividing the total cost by one billion.
damage$billion <- damage$damage / 1e9
damage <- damage [order(damage$billion, decreasing=TRUE),]
damage$EVTYPE <- factor(damage$EVTYPE , levels = damage$EVTYPE)
ggplot(damage[1:10,], aes(x=EVTYPE, y=billion))+
theme_bw() +
theme(plot.title = element_text(color="BLACK", size=20, face="bold"))+
geom_bar(stat ="identity", fill = rainbow (10, start=0, end=0.5))+
ggtitle("Total Damage by Top 10 Weather Events") +
xlab("Weather Events Type") +
ylab("Total damage in billion USD") +
coord_flip() +
scale_y_continuous(breaks = seq(0,200, by = 25))
FIGURE 3 Plot Total damage by weather events
The data suggest that floods and hurricanes caused the most economic damage, with about 160 and 86 billion dollars respectively from 1950 to 2011. Tornadoes, which have the greatest effect on human health, are the third most damaging weather event in economic terms. These results are shown in Figure 3.
Tornado events have the strongest impact on public health, as indicated by the total number of injuries and fatalities from 1950 to 2011.
Floods, on the other hand, caused the greatest economic damage in the United States from 1950 to 2011.