In this report I discuss the economic and public health problems caused by storms and other severe weather events that occurred in the United States of America between 1950 and 2011. The data used for this analysis was obtained from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. In this analysis I found that tornadoes were the leading cause of both deaths and injuries in the US. From the economic point of view there are two conclusions: the damage to crops came mostly from drought, while the damage to property came mostly from floods.
To reproduce this analysis, the “dplyr”, “plyr” and “ggplot2” packages are recommended; the code below loads “dplyr” and “ggplot2”.
After downloading and unzipping the file from the given website, the first step is to read the CSV and store it in a variable.
data <- read.csv("repdata_data_StormData.csv")
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
After exploring the dataset I realised that only a few columns were relevant to the analysis of economic and public health impact, so I created a subset containing only those columns: FATALITIES and INJURIES for public health; PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP for the damage caused to property and crops; and EVTYPE, the type of natural disaster itself. For this I used the “dplyr” package.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dat1 <- data %>%
select(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
The columns PROPDMGEXP and CROPDMGEXP, which accompany the economic damage figures, both contain exponent codes that need to be replaced with their corresponding numeric values.
unique(dat1$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(dat1$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
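So we need to replace these codes with exponents of ten (ignoring case): “K” becomes 3 (×1,000), “M” becomes 6 (×1,000,000) and “B” becomes 9 (×1,000,000,000). Everything else, including “”, “?”, “+”, “-” and the digits, is given exponent 0, which means the damage value is multiplied by 10^0 = 1 and kept as recorded.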
dat1$PROPDMGEXP <- as.character(dat1$PROPDMGEXP)
dat1$PROPDMGEXP[is.na(dat1$PROPDMGEXP)] <- "0" # NAs get exponent 0
dat1$PROPDMGEXP[!grepl("K|M|B",dat1$PROPDMGEXP,ignore.case = TRUE)] <- "0" # anything other than K, M, B gets exponent 0
dat1$PROPDMGEXP[grep("K",dat1$PROPDMGEXP,ignore.case = TRUE)] <- "3"
dat1$PROPDMGEXP[grep("M",dat1$PROPDMGEXP,ignore.case = TRUE)] <- "6"
dat1$PROPDMGEXP[grep("B",dat1$PROPDMGEXP,ignore.case = TRUE)] <- "9"
dat1$PROPDMGEXP <- as.numeric(dat1$PROPDMGEXP)
dat1$TOTALPROPDMG <- dat1$PROPDMG*10^dat1$PROPDMGEXP
dat1$CROPDMGEXP <- as.character(dat1$CROPDMGEXP)
dat1$CROPDMGEXP[is.na(dat1$CROPDMGEXP)] <- "0" # NAs get exponent 0
dat1$CROPDMGEXP[!grepl("K|M|B",dat1$CROPDMGEXP,ignore.case = TRUE)] <- "0" # anything other than K, M, B gets exponent 0
dat1$CROPDMGEXP[grep("K",dat1$CROPDMGEXP,ignore.case = TRUE)] <- "3"
dat1$CROPDMGEXP[grep("M",dat1$CROPDMGEXP,ignore.case = TRUE)] <- "6"
dat1$CROPDMGEXP[grep("B",dat1$CROPDMGEXP,ignore.case = TRUE)] <- "9"
dat1$CROPDMGEXP <- as.numeric(dat1$CROPDMGEXP)
dat1$TOTALCROPDMG <- dat1$CROPDMG*10^dat1$CROPDMGEXP
Now we finally have a dataset, with the total damage stored in TOTALPROPDMG and TOTALCROPDMG, that can be used for the analysis.
To assess the impact of these natural events on public health, we look at the number of FATALITIES and INJURIES caused by each EVTYPE.
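The exponent conversion above can also be sketched as a small lookup helper. This is only an illustration under the same rules as the code above (K, M, B matched case-insensitively, every other code given exponent 0); the function name `exp_to_power` and the toy data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): map an exponent
# code to a power of ten. K/M/B are matched case-insensitively; every other
# code, including "", "?", "+", "-" and digits, gets power 0 (multiplier 1).
exp_to_power <- function(code) {
  lookup <- c(K = 3, M = 6, B = 9)
  power <- unname(lookup[toupper(code)])  # unmatched codes come back as NA
  power[is.na(power)] <- 0
  power
}

# Made-up toy values, for illustration only
toy <- data.frame(
  DMG = c(25, 2.5, 1, 10, 3),
  EXP = c("K", "m", "B", "", "?"),
  stringsAsFactors = FALSE
)
toy$TOTAL <- toy$DMG * 10^exp_to_power(toy$EXP)
print(toy)
```

Because the codes in this dataset are single characters, an exact lookup gives the same result as the `grepl()` pattern match, and it keeps the whole mapping in one place.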
fatality <- aggregate(FATALITIES ~ EVTYPE,data = dat1,FUN = "sum") #deaths from each EVTYPE
top5f <- fatality[order(-fatality$FATALITIES),][1:5, ] #top5 EVTYPE deaths
injury <- aggregate(INJURIES ~ EVTYPE,data = dat1,FUN = "sum") #injury from each EVTYPE
top5i <- injury[order(-injury$INJURIES),][1:5, ] #top5 EVTYPE injuries
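The aggregate-then-sort pattern used here is the same for deaths and injuries, so it could be wrapped in one function. The following is a minimal sketch, assuming a data frame with an EVTYPE column; the helper name `top_events` and the toy numbers are made up for illustration.

```r
# Hypothetical helper (illustration only): total `value` per event type and
# keep the n event types with the largest totals.
top_events <- function(df, value, n = 5) {
  totals <- aggregate(df[[value]], by = list(EVTYPE = df$EVTYPE), FUN = sum)
  names(totals)[2] <- value                      # aggregate() names it "x"
  totals <- totals[order(-totals[[value]]), ]    # largest total first
  head(totals, n)
}

# Made-up toy counts, not taken from the NOAA data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0),
  INJURIES   = c(10, 2, 5, 1, 3, 0),
  stringsAsFactors = FALSE
)
top_fatal  <- top_events(toy, "FATALITIES", n = 2)
top_injury <- top_events(toy, "INJURIES", n = 2)
print(top_fatal)
print(top_injury)
```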
Having calculated the top 5 event types by number of deaths and by number of injuries, we plot the data. For the plotting I will use the “ggplot2” package.
library(ggplot2)
ggplot(data=top5f, aes(x=reorder(EVTYPE, FATALITIES), y=FATALITIES, color=EVTYPE)) + geom_bar(stat="identity",fill="white") + xlab("Event Type") + ylab("Total number of fatalities") + ggtitle("Top 5 Events with most deaths")+coord_flip()
Now let’s make the same plot for the injuries caused by each EVTYPE, to compare the two and come to a conclusion.
ggplot(data=top5i,aes(x=reorder(EVTYPE, INJURIES),y=INJURIES,color=EVTYPE)) + geom_bar(stat ="identity",fill="white") + xlab("Event type") + ylab("Total number of injuries") + ggtitle("Top 5 events with most injuries")+coord_flip()
In both of these bar plots we observe that the main culprit behind most of the deaths and injuries is the tornado.
To find out the damage caused to the US economy, we need to look at the damage to crops and to goods (property).
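The two plots rank deaths and injuries separately. If a single ranking of overall harm is wanted, one option is to add both counts per event type. The sketch below does this on a made-up toy data frame; the CASUALTIES column is a hypothetical name, and treating a death and an injury as equal weight is an assumption of the sketch, not a choice made in the analysis.

```r
# Made-up toy counts, not taken from the NOAA data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0),
  INJURIES   = c(10, 2, 5, 1, 3, 0),
  stringsAsFactors = FALSE
)

# Sum both columns per event type in one call
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Hypothetical combined measure: deaths plus injuries, equally weighted
health$CASUALTIES <- health$FATALITIES + health$INJURIES
health <- health[order(-health$CASUALTIES), ]
print(health)
```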
cropdmg <- aggregate(TOTALCROPDMG ~ EVTYPE,data = dat1,FUN = "sum") #crop damage
top5c <- cropdmg[order(-cropdmg$TOTALCROPDMG),][1:5,]
propmg <- aggregate(TOTALPROPDMG ~ EVTYPE,data = dat1,FUN = "sum")
top5p <- propmg[order(-propmg$TOTALPROPDMG),][1:5,]
Now let’s plot the data for further findings.
ggplot(data=top5c, aes(x=reorder(EVTYPE, TOTALCROPDMG), y=TOTALCROPDMG, color=EVTYPE)) + geom_bar(stat="identity",fill="white") + xlab("Event Type") + ylab("Total damage to crops")
ggplot(data=top5p, aes(x=reorder(EVTYPE, TOTALPROPDMG), y=TOTALPROPDMG, color=EVTYPE)) + geom_bar(stat="identity",fill="white") + xlab("Event Type") + ylab("Total damage to property") +coord_flip()
This gives an interesting conclusion: the main cause of crop damage is drought, whereas the main cause of property damage is its polar opposite, floods.
This brings us to the end of my analysis.
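Crop and property damage are ranked separately above. As a rough sketch of how the two could be combined into one economic total per event type and reported in billions, the example below uses `tapply()` on made-up values; the TOTALDMG column name and all numbers are hypothetical.

```r
# Made-up toy damage values, not taken from the NOAA data
toy <- data.frame(
  EVTYPE       = c("FLOOD", "DROUGHT", "FLOOD", "HAIL", "DROUGHT"),
  TOTALPROPDMG = c(5e9, 1e8, 2e9, 5e8, 0),
  TOTALCROPDMG = c(1e8, 3e9, 0, 2e8, 1e9),
  stringsAsFactors = FALSE
)

# Hypothetical combined measure: property plus crop damage per record
toy$TOTALDMG <- toy$TOTALPROPDMG + toy$TOTALCROPDMG

# Sum per event type, sort from largest to smallest, express in billions
totals   <- tapply(toy$TOTALDMG, toy$EVTYPE, sum)
totals   <- totals[order(totals, decreasing = TRUE)]
total_bn <- totals / 1e9
print(round(total_bn, 2))
```

Scaling to billions keeps the numbers readable in a table or on a plot axis; it does not change the ranking.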