The purpose of this script is to explore the NOAA Storm Data set and find the severe weather events that most affect public health (cause the most fatalities and injuries) and that cause the most economic damage (property damage and crop damage).
To do this, I:
Loaded the data into R. The file is large and bzip2-compressed, so I read it through bzfile(), which decompresses it on the fly.
Used dplyr to select the columns of interest and summarise the data.
Used ggplot2 to visualize the data.
Concluded that TORNADO causes the most fatalities and injuries, and the most economic damage.
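As a small self-contained illustration of the loading step, the sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back through bzfile(). The toy data frame and the names toy, tmp and toy_back are made up for the example and are not part of the storm data.

```r
# Toy data standing in for the storm data (hypothetical values).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write it out as a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# bzfile() gives read.csv() a decompressing connection,
# so no separate unzip step is needed.
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
toy_back
```

The same pattern applies to the real file: only the path passed to bzfile() changes.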
# download the repdata-data-StormData.csv.bz2 file to your working directory
setwd("/Users/Tammy/online_courses/data_science")
dat <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), quote = "\"", stringsAsFactors = FALSE)
library(dplyr)
# convert to local dataframe
dat<- tbl_df(dat)
dat
## Source: local data frame [902,297 x 37]
##
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## .. ... ... ... ... ... ... ...
## Variables not shown: EVTYPE (chr), BGN_RANGE (dbl), BGN_AZI (chr),
## BGN_LOCATI (chr), END_DATE (chr), END_TIME (chr), COUNTY_END (dbl),
## COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (chr), END_LOCATI (chr),
## LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
## INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (chr), CROPDMG (dbl),
## CROPDMGEXP (chr), WFO (chr), STATEOFFIC (chr), ZONENAMES (chr), LATITUDE
## (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl), REMARKS
## (chr), REFNUM (dbl)
There are too many columns to work with, so let's select the ones of interest.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dat_select<- select(dat,
c(STATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG))
dat_select
## Source: local data frame [902,297 x 6]
##
## STATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 AL TORNADO 0 15 25.0 0
## 2 AL TORNADO 0 0 2.5 0
## 3 AL TORNADO 0 2 25.0 0
## 4 AL TORNADO 0 2 2.5 0
## 5 AL TORNADO 0 2 2.5 0
## 6 AL TORNADO 0 6 2.5 0
## 7 AL TORNADO 0 1 2.5 0
## 8 AL TORNADO 0 0 2.5 0
## 9 AL TORNADO 1 14 25.0 0
## 10 AL TORNADO 0 0 25.0 0
## .. ... ... ... ... ... ...
# summary statistics
str(dat_select)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 6 variables:
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
summary(dat_select)
## STATE EVTYPE FATALITIES
## Length:902297 Length:902297 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.0000
## Mode :character Mode :character Median : 0.0000
## Mean : 0.0168
## 3rd Qu.: 0.0000
## Max. :583.0000
## INJURIES PROPDMG CROPDMG
## Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 Median : 0.000
## Mean : 0.1557 Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.0000 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :1700.0000 Max. :5000.00 Max. :990.000
Rank fatalities and injuries by event type.
# desc() takes one vector, so each sort key gets its own desc() call
dat_select %>% group_by(EVTYPE) %>% summarise(sum_FATALITIES=sum(FATALITIES),
  sum_INJURIES=sum(INJURIES)) %>%
  arrange(desc(sum_FATALITIES), desc(sum_INJURIES))
## Source: local data frame [985 x 3]
##
## EVTYPE sum_FATALITIES sum_INJURIES
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
## .. ... ... ...
We see that TORNADO causes the most fatalities.
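To make the ranking step concrete, here is a minimal sketch on made-up totals, assuming a current version of dplyr in which desc() accepts a single vector. Each sort key gets its own desc() call, and the second key only breaks ties in the first. The event names A, B and C are placeholders.

```r
library(dplyr)

# Made-up event totals (not taken from the storm data).
toy <- data.frame(EVTYPE = c("A", "B", "C"),
                  sum_FATALITIES = c(10, 10, 3),
                  sum_INJURIES = c(5, 50, 100),
                  stringsAsFactors = FALSE)

# Sort by fatalities first; injuries decide the order of A and B,
# which are tied on fatalities.
ranked <- toy %>%
  arrange(desc(sum_FATALITIES), desc(sum_INJURIES))
ranked
```

C has the most injuries but still ranks last here, because the fatalities key takes priority.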
Boxplot of fatalities caused by severe weather events:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
fatalities <- dat_select %>%
  group_by(EVTYPE) %>%
  summarise(sum_FATALITIES=sum(FATALITIES), sum_INJURIES=sum(INJURIES)) %>%
  arrange(desc(sum_FATALITIES), desc(sum_INJURIES))
# ylim() keeps only totals between 5 and 1000; values outside are dropped
ggplot(fatalities) + geom_boxplot(aes(x=1, y=sum_FATALITIES)) + ylim(c(5,1000)) +
  xlab("severe weather events") + ylab("FATALITIES") +
  ggtitle("boxplot of FATALITIES caused by severe weather events")
## Warning: Removed 910 rows containing non-finite values (stat_boxplot).
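We see that most other severe weather events caused far fewer deaths, so the 5,633 deaths attributed to TORNADO stand out. Note that ylim(c(5,1000)) drops every total outside that range from the plot (the 910 rows in the warning), including TORNADO itself.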
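The "Removed 910 rows" warning comes from ylim(), which discards values outside the limits before the boxplot statistics are computed. A short base R sketch of the same filter, using the first four totals from the fatalities table and three made-up small values, shows which entries would be dropped:

```r
# First four values are from the fatalities ranking; the last three are made up.
sum_fatalities <- c(5633, 1903, 978, 224, 3, 0, 0)

# Same range as ylim(c(5, 1000)) in the plot.
outside <- sum_fatalities < 5 | sum_fatalities > 1000

sum_fatalities[outside]   # totals that would be removed from the plot
n_removed <- sum(outside)
n_removed
```

Both the largest totals and the many near-zero totals fall outside the range, so the boxplot describes only the middle group of event types.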
dat_select %>% group_by(EVTYPE) %>% summarise(sum_FATALITIES=sum(FATALITIES),
  sum_INJURIES=sum(INJURIES)) %>%
  arrange(desc(sum_INJURIES), desc(sum_FATALITIES))
## Source: local data frame [985 x 3]
##
## EVTYPE sum_FATALITIES sum_INJURIES
## 1 TORNADO 5633 91346
## 2 TSTM WIND 504 6957
## 3 FLOOD 470 6789
## 4 EXCESSIVE HEAT 1903 6525
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
## 7 ICE STORM 89 1975
## 8 FLASH FLOOD 978 1777
## 9 THUNDERSTORM WIND 133 1488
## 10 HAIL 15 1361
## .. ... ... ...
We see that TORNADO causes the most injuries as well.
Rank property damage and crop damage by event type.
dat_select %>% group_by(EVTYPE) %>% summarise(sum_PROPDMG=sum(PROPDMG),
  sum_CROPDMG=sum(CROPDMG)) %>%
  arrange(desc(sum_PROPDMG), desc(sum_CROPDMG))
## Source: local data frame [985 x 3]
##
## EVTYPE sum_PROPDMG sum_CROPDMG
## 1 TORNADO 3212258.2 100018.52
## 2 FLASH FLOOD 1420124.6 179200.46
## 3 TSTM WIND 1335965.6 109202.60
## 4 FLOOD 899938.5 168037.88
## 5 THUNDERSTORM WIND 876844.2 66791.45
## 6 HAIL 688693.4 579596.28
## 7 LIGHTNING 603351.8 3580.61
## 8 THUNDERSTORM WINDS 446293.2 18684.93
## 9 HIGH WIND 324731.6 17283.21
## 10 WINTER STORM 132720.6 1978.99
## .. ... ... ...
We see that TORNADO has the largest property damage total.
dat_select %>% group_by(EVTYPE) %>% summarise(sum_PROPDMG=sum(PROPDMG),
  sum_CROPDMG=sum(CROPDMG)) %>%
  arrange(desc(sum_CROPDMG), desc(sum_PROPDMG))
## Source: local data frame [985 x 3]
##
## EVTYPE sum_PROPDMG sum_CROPDMG
## 1 HAIL 688693.38 579596.28
## 2 FLASH FLOOD 1420124.59 179200.46
## 3 FLOOD 899938.48 168037.88
## 4 TSTM WIND 1335965.61 109202.60
## 5 TORNADO 3212258.16 100018.52
## 6 THUNDERSTORM WIND 876844.17 66791.45
## 7 DROUGHT 4099.05 33898.62
## 8 THUNDERSTORM WINDS 446293.18 18684.93
## 9 HIGH WIND 324731.56 17283.21
## 10 HEAVY RAIN 50842.14 11122.80
## .. ... ... ...
We see that HAIL has the largest crop damage total.
Add PROPDMG and CROPDMG into a new column called ECONOMYDMG, then total it by event type.
dat_select %>% mutate(ECONOMYDMG= PROPDMG + CROPDMG) %>% group_by(EVTYPE) %>%
summarise(sum_ECONOMYDMG=sum(ECONOMYDMG)) %>%
arrange(desc(sum_ECONOMYDMG))
## Source: local data frame [985 x 2]
##
## EVTYPE sum_ECONOMYDMG
## 1 TORNADO 3312276.7
## 2 FLASH FLOOD 1599325.1
## 3 TSTM WIND 1445168.2
## 4 HAIL 1268289.7
## 5 FLOOD 1067976.4
## 6 THUNDERSTORM WIND 943635.6
## 7 LIGHTNING 606932.4
## 8 THUNDERSTORM WINDS 464978.1
## 9 HIGH WIND 342014.8
## 10 WINTER STORM 134699.6
## .. ... ...
We see that TORNADO has the largest combined total of property damage and crop damage.
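A minimal sketch of the same mutate, group and summarise pattern on made-up rows may help show what each step contributes. The values below are hypothetical and small enough to check by hand.

```r
library(dplyr)

# Made-up rows (not taken from the storm data).
toy <- data.frame(EVTYPE = c("HAIL", "HAIL", "FLOOD"),
                  PROPDMG = c(1, 2, 10),
                  CROPDMG = c(4, 0, 0.5),
                  stringsAsFactors = FALSE)

toy_totals <- toy %>%
  mutate(ECONOMYDMG = PROPDMG + CROPDMG) %>%      # per-row sum of both damage columns
  group_by(EVTYPE) %>%                            # one group per event type
  summarise(sum_ECONOMYDMG = sum(ECONOMYDMG)) %>% # total per event type
  arrange(desc(sum_ECONOMYDMG))                   # largest total first
toy_totals
```

Here the two HAIL rows contribute 5 and 2, and the single FLOOD row contributes 10.5, so FLOOD ranks first.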
Economy_dmg<- dat_select %>%
mutate(ECONOMYDMG= PROPDMG + CROPDMG) %>% group_by(EVTYPE) %>%
summarise(sum_ECONOMYDMG=sum(ECONOMYDMG)) %>%
arrange(desc(sum_ECONOMYDMG))
Boxplot of the total economic damage for all severe weather events from 1950 to 2011:
ggplot(Economy_dmg) + geom_boxplot(aes(x=1, y=sum_ECONOMYDMG))
median(Economy_dmg$sum_ECONOMYDMG[Economy_dmg$sum_ECONOMYDMG>0])
## [1] 59
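Among event types with non-zero damage, the median total is 59. This figure is in the raw PROPDMG and CROPDMG units: the multipliers in PROPDMGEXP and CROPDMGEXP were not applied, so it is not a dollar amount.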
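The damage totals in this report sum PROPDMG and CROPDMG without the PROPDMGEXP and CROPDMGEXP columns, which were left out when the columns were selected. A minimal sketch of one way to convert to dollars is shown below on made-up rows. It assumes the common reading of the codes K, M and B as thousands, millions and billions; treating every other code as a multiplier of 1 is a simplification chosen for this example, and to_multiplier is a hypothetical helper name.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: K/M/B mean 1e3/1e6/1e9; any other code is treated as 1.
to_multiplier <- function(exp_code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(exp_code)])
  mult[is.na(mult)] <- 1   # unknown or empty codes fall back to 1
  mult
}

# Made-up rows (not taken from the storm data).
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

toy$PROPDMG_USD <- toy$PROPDMG * to_multiplier(toy$PROPDMGEXP)
toy
```

Applied before the group-wise sums, a conversion like this could change the ranking, since a single row coded B outweighs many rows coded K.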
From the boxplot above, we can see that the data are highly skewed, so let's apply a log2 transformation and draw the boxplot again.
ggplot(Economy_dmg) + geom_boxplot(aes(x=1, y=log2(sum_ECONOMYDMG+1))) +
  xlab("severe weather events") + ylab("log2 transformed economic damage") +
  ggtitle("boxplot of economic damage caused by severe weather events (log2 scale)")
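The +1 inside the log2() call matters for any event type whose total is zero, because log2(0) is -Inf and cannot be drawn. A short base R sketch on made-up totals:

```r
totals <- c(0, 1, 3, 1023)   # made-up totals, not from the storm data

raw_log <- log2(totals)       # the zero becomes -Inf
shifted <- log2(totals + 1)   # the zero maps to 0; ordering is preserved

raw_log
shifted
```

The shift changes small values noticeably but has little effect on large totals, which is why it is a common choice for skewed, zero-heavy data.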