This analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including the estimates of any fatalities, injuries, Crop and property damages. The database has data from year 1950 and end in November 2011. The Data analysis must report the storm event most harmful (both fatality and injuries) to the population along with reporing the most economic loss (both property and crop) by storm event type. The analysis points to Tornados are most harmful to humans (nearly 97,000) and Flooding is the most damaging ($150 Billions) weather event.
The data required to do this analysis can be downloaded from the national weather service website. The data is compressed by .bz2 and the features are comma separated CSV format. After unloading the data rom above URL, we can unzip it using R utility function ‘bunzip2’. Following is the R code to achieve this.
options(scipen=999)
library(ggplot2)
library(data.table)
library(dplyr)
library(knitr)
library(R.utils)
# read data
url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "./repdata-data-StormData.csv.bz2"
if(!file.exists("./repdata-data-StormData.csv.bz2")){
file.create("./repdata-data-StormData.csv.bz2")
download.file(url, destfile)
}
if(!file.exists("./repdata-data-StormData.csv")){
bunzip2("repdata-data-StormData.csv.bz2","repdata-data-StormData.csv",remove=FALSE,skip=TRUE)
}
As the file contain data for number of years, the file may have thousands of rows and tens of columns/features. Also, in order to do our analysis, we don’t need to load all the columns. First, read some sample data and determine the necessary columns for further analysis.
# read sample data...sampling.
noaa.smpl<-read.table(file="repdata-data-StormData.csv",header=TRUE,sep=",",quote = "\"",nrow=4000)
str(noaa.smpl)
head(noaa.smpl[,c('STATE','EVTYPE','FATALITIES','INJURIES','PROPDMG','CROPDMG','PROPDMGEXP','CROPDMGEXP')])
group_by(noaa.smpl,EVTYPE) %>% summarize(fatality=sum(FATALITIES),inj=sum(INJURIES))
## 'data.frame': 4000 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 1203 levels "1/10/1972 0:00:00",..: 416 416 199 849 76 76 79 16 182 182 ...
## $ BGN_TIME : int 130 145 1600 900 1500 2000 100 900 2000 2000 ...
## $ TIME_ZONE : Factor w/ 2 levels "CST","EST": 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 68 levels "","AUTAUGA","BALDWIN",..: 50 3 30 46 23 40 6 63 64 30 ...
## $ STATE : Factor w/ 1 level "AL": 1 1 1 1 1 1 1 1 1 1 ...
## $ EVTYPE : Factor w/ 3 levels "HAIL","TORNADO",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : logi NA NA NA NA NA NA ...
## $ BGN_LOCATI: logi NA NA NA NA NA NA ...
## $ END_DATE : logi NA NA NA NA NA NA ...
## $ END_TIME : logi NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : logi NA NA NA NA NA NA ...
## $ END_LOCATI: logi NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 3 levels "","K","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: logi NA NA NA NA NA NA ...
## $ WFO : Factor w/ 21 levels "","AT","ATA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: logi NA NA NA NA NA NA ...
## $ ZONENAMES : logi NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : logi NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## STATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPDMGEXP CROPDMGEXP
## 1 AL TORNADO 0 15 25.0 0 K NA
## 2 AL TORNADO 0 0 2.5 0 K NA
## 3 AL TORNADO 0 2 25.0 0 K NA
## 4 AL TORNADO 0 2 2.5 0 K NA
## 5 AL TORNADO 0 2 2.5 0 K NA
## 6 AL TORNADO 0 6 2.5 0 K NA
## # A tibble: 3 × 3
## EVTYPE fatality inj
## <fctr> <dbl> <dbl>
## 1 HAIL 0 0
## 2 TORNADO 225 3537
## 3 TSTM WIND 21 67
We need only the following columns to do answer our qestions. Skip the remaining columns using ‘NULL’ value in the read.table function.
‘EVTYPE’,‘FATALITIES’,‘INJURIES’,‘PROPDMG’,‘CROPDMG’,‘PROPDMGEXP’,‘CROPDMGEXP’
cc<-c(rep('NULL',6),NA,NA,rep('NULL',14),rep(NA,6),rep('NULL',9))
noaa<-read.table(file="repdata-data-StormData.csv",header=TRUE,sep=",",quote = "\"",colClasses=cc)
str(noaa)
## 'data.frame': 902297 obs. of 8 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Across the United States, the following storm event types are most harmful with respect to population health.
harmful.events<-noaa %>% group_by(EVTYPE) %>%
summarize(harmful = sum(FATALITIES)+sum(INJURIES)) %>%
arrange(desc(harmful))
harmful.events
## # A tibble: 985 × 2
## EVTYPE harmful
## <fctr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
## # ... with 975 more rows
In order to find the storm event types have caused most economic loss, we need to analyze PROPDMG, PROPDMGEXP (for property) and CROPDMG,CROPDMGEXP (for crops). We first need to convert the codes in PROPDMGEXP, CROPDMGEXP columns into multiplyers using the following R code. Also, make sure there are no ‘NA’ values in the PROPDMG and CROPDMG features.
table(noaa$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(noaa$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
sum(is.na(noaa$PROPDMG))
## [1] 0
sum(is.na(noaa$CROPDMG))
## [1] 0
mult<-function(exp) {
if (exp=='h' | exp=='H') return(100)
if (exp=='k' | exp=='K') return(1000)
if (exp=='m' | exp=='M') return(1000000)
if (exp=='b' | exp=='B') return(1000000000)
if (exp=='1') return(10)
if (exp=='2') return(100)
if (exp=='3') return(1000)
if (exp=='4') return(10000)
if (exp=='5') return(100000)
if (exp=='6') return(1000000)
if (exp=='7') return(10000000)
if (exp=='8') return(100000000)
return(0)
}
options(digits=11)
Create two new columns to store the multiplyer in the data table which can be used to calulate the total ecomonic loss by storm even type in Billion dollars.
#multiplier columns
noaa$propmult <- as.numeric(lapply(noaa$PROPDMGEXP,mult))
noaa$cropmult <- as.numeric(lapply(noaa$CROPDMGEXP,mult))
noaa.ecomomics <- noaa %>% mutate(proploss=propmult * PROPDMG,croploss= cropmult * CROPDMG) %>%
group_by(EVTYPE) %>% summarize(ttl=(sum(proploss)+sum(croploss))/1000000000) %>%
arrange(desc(ttl))
noaa.ecomomics
## # A tibble: 985 × 2
## EVTYPE ttl
## <fctr> <dbl>
## 1 FLOOD 150.31967825
## 2 HURRICANE/TYPHOON 71.91371280
## 3 TORNADO 57.36233359
## 4 STORM SURGE 43.32354100
## 5 HAIL 18.76122167
## 6 FLASH FLOOD 18.24399061
## 7 DROUGHT 15.01867200
## 8 HURRICANE 14.61022901
## 9 RIVER FLOOD 10.14840450
## 10 ICE STORM 8.96704131
## # ... with 975 more rows
The first histogram will show the storm events most harmful to the populations.The histograms will show first 15 most harmful strom events.
# Q1
g<-ggplot(harmful.events[1:15,],aes(x=reorder(EVTYPE,harmful),y=harmful,fill=EVTYPE,color=EVTYPE))
g<-g+geom_histogram(stat="identity")
g<-g+xlab("Storm events")+ylab("Number of humans harmed")+ggtitle("Total Fatalities & Injuries by event type")
g<-g+theme(axis.text.x = element_text(angle = 90, hjust = 1))
g<-g+geom_text(aes(label=round(harmful)),position = position_dodge(.5),vjust=0,color="black")
g
The second histogram will show the strom events which causes most economic loss.The histograms will show first 15 most loss causing events.
#Q2
g<-ggplot(noaa.ecomomics[1:15,],aes(x=reorder(EVTYPE,ttl),y=ttl,fill=EVTYPE,color=EVTYPE))
g<-g+geom_histogram(stat="identity")
g<-g+xlab("Storm events")+ylab("Economic loss in Billions")+ggtitle("Property and crop loss (in Billions) by event type")
g<-g+theme(axis.text.x = element_text(angle = 90, hjust = 1))
g<-g+geom_text(aes(label=round(ttl,3)),position = position_dodge(.5),vjust=0,color="black")
g
Remove all variables
## remove datasets
rm(noaa.ecomomics,noaa,noaa.smpl,harmful.events)
End of the document. (To create yourfilename.md file, use knit(“yourfilename.Rmd”))