Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The analysis has two parts.
1. Health impacts: The first part measures the effect of weather events on public health, using casualties (deaths and injuries combined) as the metric. The greatest harm was caused by TORNADO, with a total of 96979 casualties.
2. Economic impacts: The second part measures the effect of weather events on the economy, using two criteria for economic damage:
* Total property damage (in billions)
* Total crop damage (in billions, the same unit the conversion code produces for property damage)
The data for this assignment come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. Some documentation of the database is also available in the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The first step is to download the data file if it does not already exist locally.
filename <- "repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(filename)) {
  fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileURL, filename, method = "curl")
}
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(ggplot2)
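# Read the compressed CSV and keep only the seven columns used below:
# EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP
raw_data <- read.csv(bzfile(filename))[, c(8, 23, 24, 25, 26, 27, 28)]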
# Looking at the raw_data
head(raw_data,50)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.00 K 0
## 2 TORNADO 0 0 2.50 K 0
## 3 TORNADO 0 2 25.00 K 0
## 4 TORNADO 0 2 2.50 K 0
## 5 TORNADO 0 2 2.50 K 0
## 6 TORNADO 0 6 2.50 K 0
## 7 TORNADO 0 1 2.50 K 0
## 8 TORNADO 0 0 2.50 K 0
## 9 TORNADO 1 14 25.00 K 0
## 10 TORNADO 0 0 25.00 K 0
## 11 TORNADO 0 3 2.50 M 0
## 12 TORNADO 0 3 2.50 M 0
## 13 TORNADO 1 26 250.00 K 0
## 14 TORNADO 0 12 0.00 K 0
## 15 TORNADO 0 6 25.00 K 0
## 16 TORNADO 4 50 25.00 K 0
## 17 TORNADO 0 2 25.00 K 0
## 18 TORNADO 0 0 25.00 K 0
## 19 TORNADO 0 0 25.00 K 0
## 20 TORNADO 0 0 25.00 K 0
## 21 TORNADO 0 0 25.00 K 0
## 22 TORNADO 0 0 2.50 K 0
## 23 TORNADO 0 0 2.50 K 0
## 24 TORNADO 0 1 25.00 K 0
## 25 TORNADO 0 1 25.00 K 0
## 26 TORNADO 1 8 25.00 K 0
## 27 TORNADO 0 2 25.00 K 0
## 28 TORNADO 0 1 25.00 K 0
## 29 TORNADO 0 6 25.00 K 0
## 30 TORNADO 0 2 2.50 K 0
## 31 TORNADO 0 0 2.50 K 0
## 32 TORNADO 0 12 2.50 K 0
## 33 TORNADO 0 0 25.00 K 0
## 34 TORNADO 6 195 2.50 M 0
## 35 TORNADO 0 2 25.00 K 0
## 36 TORNADO 7 12 250.00 K 0
## 37 TORNADO 0 0 2.50 K 0
## 38 TORNADO 2 3 25.00 K 0
## 39 TORNADO 0 2 2.50 K 0
## 40 TORNADO 0 0 25.00 K 0
## 41 TORNADO 0 0 2.50 K 0
## 42 TORNADO 0 1 25.00 K 0
## 43 TORNADO 0 0 2.50 K 0
## 44 TORNADO 0 0 25.00 K 0
## 45 TORNADO 0 0 25.00 K 0
## 46 TORNADO 0 0 0.03 K 0
## 47 TORNADO 0 1 25.00 K 0
## 48 TORNADO 0 4 250.00 K 0
## 49 TORNADO 0 26 250.00 K 0
## 50 TORNADO 0 3 2.50 K 0
# Let's look at the different event types
head(unique(raw_data$EVTYPE),50)
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## [21] FLASH FLOODING HIGH WINDS
## [23] FUNNEL CLOUD TORNADO F0
## [25] THUNDERSTORM WINDS LIGHTNING THUNDERSTORM WINDS/HAIL
## [27] HEAT WIND
## [29] LIGHTING HEAVY RAINS
## [31] LIGHTNING AND HEAVY RAIN FUNNEL
## [33] WALL CLOUD FLOODING
## [35] THUNDERSTORM WINDS HAIL FLOOD
## [37] COLD HEAVY RAIN/LIGHTNING
## [39] FLASH FLOODING/THUNDERSTORM WI WALL CLOUD/FUNNEL CLOUD
## [41] THUNDERSTORM WATERSPOUT
## [43] EXTREME COLD HAIL 1.75)
## [45] LIGHTNING/HEAVY RAIN HIGH WIND
## [47] BLIZZARD BLIZZARD WEATHER
## [49] WIND CHILL BREAKUP FLOODING
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
As you can see, the event types are ambiguous, and we need to deal with that first. For example, SNOW and SNOW/ICE actually mean the same thing. So let us try to remove the ambiguity from the event types.
# Checking the number of event types before processing the event types
length(unique(raw_data$EVTYPE))
## [1] 985
event_types <- toupper(raw_data$EVTYPE)
# replace all punctuation characters with blank spaces
event_types <- gsub("[[:blank:][:punct:]+]", " ", event_types)
event_types<-gsub("^ ","",event_types)
length(unique(event_types))
## [1] 867
event_types <- gsub("^ +| +$", "", event_types) # remove leading and trailing spaces
event_types <- gsub(" {2,}", " ", event_types) # collapse runs of spaces into a single space
# updating the data frame
raw_data$EVTYPE <- event_types
As you can see, we now have fewer event types than before: the count dropped from 985 to 867. Further cleaning of the data is possible, but for now let us move on to the analysis.
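As one possible next step for the further clean-up mentioned above, near-duplicate labels could be merged with a small table of pattern-to-label rules. The sketch below is illustrative only: the `rules` vector, the `canonicalise` helper, and the sample labels are assumptions, not part of the original analysis, and none of the results in this report use it.

```r
# Sample labels, assumed to be already upper-cased and stripped of punctuation
# (the state event_types is in after the cleaning step).
sample_types <- c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WINS",
                  "FLASH FLOODING", "FLASH FLOOD", "HAIL")

# Hypothetical rules: names are regular expressions, values are canonical labels.
rules <- c("^(TSTM|THUNDERSTORM) WIN?D?S?$" = "THUNDERSTORM WIND",
           "^FLASH FLOOD(ING)?$"            = "FLASH FLOOD")

# Replace every label matching a rule with that rule's canonical label.
# The first matching rule wins; labels matching no rule are left unchanged.
canonicalise <- function(x, rules) {
  out <- x
  matched <- rep(FALSE, length(x))
  for (i in seq_along(rules)) {
    hit <- !matched & grepl(names(rules)[i], x)
    out[hit] <- rules[[i]]
    matched <- matched | hit
  }
  out
}

merged <- canonicalise(sample_types, rules)
table(merged)
```

Anchoring each pattern with `^` and `$` keeps compound labels such as THUNDERSTORM WINDS HAIL untouched, so deciding how to classify those remains a separate, explicit choice.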
Let us try to analyse the total number of casualties caused by these events from the year 1950 to 2011.
raw_data$casualties <- raw_data$FATALITIES + raw_data$INJURIES
casualties <- aggregate(raw_data$casualties, by=list(Event=raw_data$EVTYPE), FUN=sum)
# Taking out the top 10 Events causing maximum number of casualties.
Top_Casualties <- head(casualties[order(casualties$x, decreasing = TRUE), ], 10)
library(ggplot2)
p <- ggplot(Top_Casualties, aes(Event, x))
p + geom_bar(stat = "identity") +
  xlab("Events") + ylab("Casualties") +
  coord_flip() +
  theme(axis.text.x = element_text(face = "bold", color = "#993333", size = 6, angle = 45),
        axis.text.y = element_text(face = "bold", color = "#993333", size = 6, angle = 45))
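By default ggplot2 lays out a discrete axis in factor-level order, which is alphabetical for plain character data, so the bars in a plot like this one are not sorted by casualty count. One way to sort them, sketched here on made-up numbers, is to convert `Event` into a factor whose levels follow `x` using base R's `reorder()`. The data frame `toy_top` and its values are hypothetical and are not results from the storm database.

```r
# Made-up totals, used only to illustrate the ordering (not real results)
toy_top <- data.frame(Event = c("FLOOD", "TORNADO", "LIGHTNING"),
                      x = c(30, 50, 10),
                      stringsAsFactors = FALSE)

# reorder() returns a factor whose levels are sorted by ascending x
toy_top$Event <- reorder(toy_top$Event, toy_top$x)
levels(toy_top$Event)
```

Mapping the reordered `Event` in `aes()` would then, together with `coord_flip()`, draw the event with the largest total at the top of the chart.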
Cleaning and transforming the Property and Crop data
raw_data$PROP_US <- 0
raw_data$CROP_US <- 0
raw_data$PROP_US <- ifelse(raw_data$PROPDMGEXP %in% "H" | raw_data$PROPDMGEXP %in% "h",
raw_data$PROPDMG*0.0000001, raw_data$PROP_US)
raw_data$CROP_US <- ifelse(raw_data$CROPDMGEXP %in%"H"| raw_data$CROPDMGEXP %in%"h",
raw_data$CROPDMG*0.0000001, raw_data$CROP_US)
raw_data$PROP_US <- ifelse(raw_data$PROPDMGEXP %in% "K"| raw_data$PROPDMGEXP %in% "k",
raw_data$PROPDMG*0.000001, raw_data$PROP_US)
raw_data$CROP_US <- ifelse(raw_data$CROPDMGEXP %in% "K"| raw_data$CROPDMGEXP %in% "k",
raw_data$CROPDMG*0.000001, raw_data$CROP_US)
raw_data$PROP_US <- ifelse(raw_data$PROPDMGEXP %in% "M"| raw_data$PROPDMGEXP %in% "m",
raw_data$PROPDMG*0.001, raw_data$PROP_US)
raw_data$CROP_US <- ifelse(raw_data$CROPDMGEXP %in% "M"| raw_data$CROPDMGEXP %in% "m",
raw_data$CROPDMG*0.001, raw_data$CROP_US)
raw_data$PROP_US <- ifelse(raw_data$PROPDMGEXP %in% "B"| raw_data$PROPDMGEXP %in% "b",
raw_data$PROPDMG*1, raw_data$PROP_US)
raw_data$CROP_US <- ifelse(raw_data$CROPDMGEXP %in% "B"| raw_data$CROPDMGEXP %in% "b",
raw_data$CROPDMG*1, raw_data$CROP_US)
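The chain of `ifelse()` calls above scales each damage figure to billions according to its exponent code (H, K, M, or B, in either case). The same mapping can be written more compactly as a named lookup vector. This is a minimal sketch under stated assumptions: the names `to_billions`, `damage_in_billions`, and `toy` are hypothetical, and treating every other exponent code as zero simply mirrors the code above rather than anything stated in the NOAA documentation.

```r
# Multipliers that convert a damage figure into billions,
# using the same factors as the ifelse() chain.
to_billions <- c(H = 1e-7, K = 1e-6, M = 1e-3, B = 1)

damage_in_billions <- function(amount, exponent) {
  # Look up the multiplier by upper-cased exponent code
  mult <- to_billions[toupper(as.character(exponent))]
  # Assumption: unknown or blank codes contribute nothing
  mult[is.na(mult)] <- 0
  unname(amount * mult)
}

# Toy rows (not taken from the storm database)
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3, 10),
                  PROPDMGEXP = c("K", "m", "B", "", "h"),
                  stringsAsFactors = FALSE)
toy$PROP_US <- damage_in_billions(toy$PROPDMG, toy$PROPDMGEXP)
toy
```

Keeping the multipliers in one vector makes the unit explicit and lets the same function serve both the property and the crop columns.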
Property_damage <- aggregate(raw_data$PROP_US, by=list(Event=raw_data$EVTYPE), FUN=sum)
# Taking out the top 10 Events causing maximum number of Property damage.
Top_Property_Damages <- head(Property_damage[order(Property_damage$x, decreasing = TRUE), ], 10)
library(ggplot2)
p <- ggplot(Top_Property_Damages, aes(Event, x))
p + geom_bar(stat = "identity") +
  xlab("Events") + ylab("Property Damages") +
  coord_flip() +
  theme(axis.text.x = element_text(face = "bold", color = "#993333", size = 6, angle = 45),
        axis.text.y = element_text(face = "bold", color = "#993333", size = 6, angle = 45))
The graph makes it clear that the highest property damage was caused by FLOOD.
Crop_damage <- aggregate(raw_data$CROP_US, by=list(Event=raw_data$EVTYPE), FUN=sum)
# Taking out the top 10 events causing the most crop damage.
Top_Crop_Damages <- head(Crop_damage[order(Crop_damage$x, decreasing = TRUE), ], 10)
library(ggplot2)
p <- ggplot(Top_Crop_Damages, aes(Event, x))
p + geom_bar(stat = "identity") +
  xlab("Events") + ylab("Crop Damages") +
  coord_flip() +
  theme(axis.text.x = element_text(face = "bold", color = "#993333", size = 6, angle = 45),
        axis.text.y = element_text(face = "bold", color = "#993333", size = 6, angle = 45))
The graph makes it clear that the highest crop damage was caused by DROUGHT.
From the above research we can conclude that :