This is my second project on reproducible research. Reproducible research is very important. The dataset comes from the U.S. National Oceanic and Atmospheric Administration (NOAA). I use it to analyze which types of storm event have had the largest impact on population health and on the economy.
I first processed the dataset and extracted the variables needed for the analysis, then checked them for missing values, and finally plotted the totals to see which types of storm event had the largest impact.
The plots show that tornadoes caused the greatest health impact on the population and floods caused the greatest economic impact.
stormData<-read.csv("repdata_data_StormData.csv.bz2",header = TRUE, sep = ",")
dim(stormData)
## [1] 902297 37
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
summary(stormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
head(stormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6
The variables needed for our analysis are INJURIES, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP and EVTYPE (the event type, which is our target variable).
mainVar<-c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
finalData<- stormData[,mainVar]
head(finalData)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
tail(finalData)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER 0 0 0 K 0 K
## 902293 HIGH WIND 0 0 0 K 0 K
## 902294 HIGH WIND 0 0 0 K 0 K
## 902295 HIGH WIND 0 0 0 K 0 K
## 902296 BLIZZARD 0 0 0 K 0 K
## 902297 HEAVY SNOW 0 0 0 K 0 K
sum(is.na(finalData$FATALITIES))
## [1] 0
sum(is.na(finalData$INJURIES))
## [1] 0
sum(is.na(finalData$PROPDMG))
## [1] 0
sum(is.na(finalData$PROPDMGEXP))
## [1] 0
sum(is.na(finalData$CROPDMG))
## [1] 0
sum(is.na(finalData$CROPDMGEXP))
## [1] 0
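The six separate checks above could also be written as a single call. A minimal sketch, assuming a small stand-in data frame (`toy` is hypothetical and has one NA planted for illustration; it is not the storm data):

```r
# Hypothetical stand-in for finalData, with one NA planted on purpose
toy <- data.frame(
  FATALITIES = c(0, 1, NA),
  INJURIES   = c(15, 0, 2),
  PROPDMGEXP = c("K", "K", "M"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
naCounts <- colSums(is.na(toy))
print(naCounts)
```

Applied to the real data, `colSums(is.na(finalData))` would be expected to report zero for every column, in line with the individual results above.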
After checking each variable, there are no missing values. So, now we will transform these variables.
max(finalData$EVTYPE)
## [1] "WND"
sort(table(finalData$EVTYPE), decreasing = TRUE)[1:15]
##
##               HAIL          TSTM WIND  THUNDERSTORM WIND            TORNADO
##             288661             219940              82563              60652
##        FLASH FLOOD              FLOOD THUNDERSTORM WINDS          HIGH WIND
##              54277              25326              20843              20212
##          LIGHTNING         HEAVY SNOW         HEAVY RAIN       WINTER STORM
##              15754              15708              11723              11433
##     WINTER WEATHER       FUNNEL CLOUD   MARINE TSTM WIND
##               7026               6839               6175
Now we will group similar events together with the help of the grep function.
finalData$EVENT <- "OTHER"
finalData$EVENT[grep("HAIL", finalData$EVTYPE, ignore.case = TRUE)] <- "HAIL"
finalData$EVENT[grep("HEAT", finalData$EVTYPE, ignore.case = TRUE)] <- "HEAT"
finalData$EVENT[grep("FLOOD", finalData$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
finalData$EVENT[grep("WIND", finalData$EVTYPE, ignore.case = TRUE)] <- "WIND"
finalData$EVENT[grep("STORM", finalData$EVTYPE, ignore.case = TRUE)] <- "STORM"
finalData$EVENT[grep("SNOW", finalData$EVTYPE, ignore.case = TRUE)] <- "SNOW"
finalData$EVENT[grep("TORNADO", finalData$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
finalData$EVENT[grep("WINTER", finalData$EVTYPE, ignore.case = TRUE)] <- "WINTER"
finalData$EVENT[grep("RAIN", finalData$EVTYPE, ignore.case = TRUE)] <- "RAIN"
Now, checking the values:
table(finalData$EVENT)
##
## FLOOD HAIL HEAT OTHER RAIN SNOW STORM TORNADO WIND WINTER
## 82686 289270 2648 48970 12241 17660 113156 60700 255362 19604
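Because each assignment overwrites the labels set by earlier ones, the order of the grep calls decides which group an event with several matching keywords ends up in. A minimal sketch of the same logic as a loop, run on a few made-up labels (`evtype`, `keywords` and `event` are hypothetical names used only for this illustration):

```r
# Hypothetical sample of raw EVTYPE labels
evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "WINTER STORM",
            "HEAVY RAIN/SNOW", "Flash Flood", "DROUGHT")

# Same keywords, in the same order as the grep() calls above
keywords <- c("HAIL", "HEAT", "FLOOD", "WIND", "STORM",
              "SNOW", "TORNADO", "WINTER", "RAIN")

# Start everything as OTHER, then apply the keywords one by one
event <- rep("OTHER", length(evtype))
for (k in keywords) {
  # A later keyword overwrites a label assigned by an earlier one
  event[grepl(k, evtype, ignore.case = TRUE)] <- k
}

print(data.frame(evtype, event))
```

In this sketch "THUNDERSTORM WIND" ends up in STORM rather than WIND, and "WINTER STORM" in WINTER rather than STORM, because those keywords come later in the list.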
Doing the same with the other variables:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
varia<- c("EVTYPE", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
finDamage <- finalData[, varia]
sym <- sort(unique(as.character(finDamage$PROPDMGEXP)))
multiply <- c(0,0,0,1,10,10,10,10,10,10,10,10,10,10^9,10^2,10^2,10^3,10^6,10^6)
convert <- data.frame(sym, multiply)
finDamage$Prop<- convert$multiply[match(finDamage$PROPDMGEXP, convert$sym)]
finDamage$Crop <- convert$multiply[match(finDamage$CROPDMGEXP, convert$sym)]
finDamage <- finDamage %>%
  mutate(PROPDMG = PROPDMG * Prop) %>%
  mutate(CROPDMG = CROPDMG * Crop) %>%
  mutate(DMG = PROPDMG + CROPDMG)
finDamageTol <- finDamage %>%
  group_by(EVTYPE) %>%
  summarize(TOLEVTYPE = sum(DMG)) %>%
  arrange(-TOLEVTYPE)
## `summarise()` ungrouping output (override with `.groups` argument)
head(finDamageTol,15)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 15 x 2
## EVTYPE TOLEVTYPE
## <chr> <dbl>
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352117607
## 4 STORM SURGE 43323541000
## 5 FLASH FLOOD 17562132111
## 6 DROUGHT 15018672000
## 7 HURRICANE 14610229010
## 8 RIVER FLOOD 10148404500
## 9 ICE STORM 8967041810
## 10 TROPICAL STORM 8382236550
## 11 WINTER STORM 6715441260
## 12 HIGH WIND 5908617580
## 13 WILDFIRE 5060586800
## 14 TSTM WIND 5038936340
## 15 STORM SURGE/TIDE 4642038000
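The conversion above pairs `multiply` with `sym` by position, so it relies on the order in which `sort()` returns the symbols (this can differ between locales), and any symbol that occurs only in CROPDMGEXP has no match in `sym` and yields NA. A minimal sketch of an alternative that looks each symbol up explicitly. `exp_to_multiplier` is a hypothetical helper, and its multipliers are copied from the `multiply` vector on the assumption that `sym` sorted to blank, -, ?, +, 0 to 8, B, h, H, K, m, M:

```r
# Hypothetical helper: map a damage exponent symbol to a numeric multiplier.
# Values mirror the multiply vector above (an assumption about its sort order).
exp_to_multiplier <- function(sym) {
  lookup <- c("+" = 1, "B" = 1e9, "H" = 1e2, "K" = 1e3, "M" = 1e6)
  sym <- toupper(as.character(sym))          # treat "k" like "K", "m" like "M"
  out <- unname(lookup[sym])                 # NA for anything not in lookup
  out[sym %in% as.character(0:8)] <- 10      # digits 0 to 8 map to 10
  out[is.na(out)] <- 0                       # blank, "-", "?" and unknowns map to 0
  out
}

# Made-up damage figures and exponent symbols for illustration
dmg    <- c(25, 2.5, 1, 3)
dmgExp <- c("K", "m", "", "k")
print(dmg * exp_to_multiplier(dmgExp))
```

Like the original mapping, this sketch counts damage with a blank exponent as zero; whether that is the right reading of the data is a separate choice.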
To analyze the health impact, we will calculate the total injuries and total fatalities for each event.
Health Impact
library(dplyr)
finFatalities <- finalData %>%
  select(EVTYPE, FATALITIES) %>%
  group_by(EVTYPE) %>%
  summarise(tolFatalities = sum(FATALITIES)) %>%
  arrange(-tolFatalities)
## `summarise()` ungrouping output (override with `.groups` argument)
head(finFatalities, 15)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 15 x 2
## EVTYPE tolFatalities
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
finInjuries <- finalData %>%
  select(EVTYPE, INJURIES) %>%
  group_by(EVTYPE) %>%
  summarise(tolInjuries = sum(INJURIES)) %>%
  arrange(-tolInjuries)
## `summarise()` ungrouping output (override with `.groups` argument)
head(finInjuries, 15)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 15 x 2
## EVTYPE tolInjuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
## 15 WILDFIRE 911
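The two summaries above can also be produced in one step. A minimal sketch using base R's `aggregate()` on a small made-up data frame (`toy` and `health` are hypothetical names, and the numbers are not from the storm data):

```r
# Made-up events for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(2, 1, 3, 0, 3),
  INJURIES   = c(10, 0, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Sum both columns per event type in a single call
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by total fatalities, largest first
health <- health[order(-health$FATALITIES), ]
print(health)
```

Keeping both totals in one table makes it easier to compare how an event type ranks on fatalities versus injuries.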
Health Impact
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
# Bar chart of the 15 event types with the most fatalities, largest first
g <- ggplot(finFatalities[1:15, ], aes(x = reorder(EVTYPE, -tolFatalities), y = tolFatalities)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("Top 15 Events With the Highest Total Fatalities") +
  labs(x = "Type of Event", y = "Total Fatalities")
print(g)
# Same chart for injuries
g1 <- ggplot(finInjuries[1:15, ], aes(x = reorder(EVTYPE, -tolInjuries), y = tolInjuries)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("Top 15 Events With the Highest Total Injuries") +
  labs(x = "Type of Event", y = "Total Injuries")
print(g1)
From the graphs, it can be concluded that tornadoes have caused the highest health impact, in both fatalities and injuries.
Now we will study the impact on the economy.
# Bar chart of the 15 event types with the largest combined property and crop damage
g2 <- ggplot(finDamageTol[1:15, ], aes(x = reorder(EVTYPE, -TOLEVTYPE), y = TOLEVTYPE)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("Top 15 Events With the Highest Impact on the Economy") +
  labs(x = "Type of Event", y = "Total Impact on Economy")
print(g2)
It can be concluded that floods have created the highest economic impact.