In this data analysis of weather events, we begin by loading the data frame into R and viewing its structure. Then, we proceed to data cleaning and pre-processing for analysis. We create uniform variable names and subset the main data frame using only the columns we might need for analysis. Next, we address the two analysis questions by creating new, calculated variables for our question-specific data frames. We aggregate our data frames by event type and arrange our damage data in descending order. Finally, we take the top 10 results from each question-specific data frame and create bar plots to address which storm events are the most harmful in the US by either population health or economic measures.
First, we load and view the data in R.
res <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", na.strings="")
head(res)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 2 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 3 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 4 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 5 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 6 TORNADO 0 <NA> <NA> <NA> <NA> 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 <NA> <NA> 14.0 100 3 0 0
## 2 NA 0 <NA> <NA> 2.0 150 2 0 0
## 3 NA 0 <NA> <NA> 0.1 123 2 0 0
## 4 NA 0 <NA> <NA> 0.0 100 2 0 0
## 5 NA 0 <NA> <NA> 0.0 150 2 0 0
## 6 NA 0 <NA> <NA> 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0 <NA> <NA> <NA> <NA>
## 2 0 2.5 K 0 <NA> <NA> <NA> <NA>
## 3 2 25.0 K 0 <NA> <NA> <NA> <NA>
## 4 2 2.5 K 0 <NA> <NA> <NA> <NA>
## 5 2 2.5 K 0 <NA> <NA> <NA> <NA>
## 6 6 2.5 K 0 <NA> <NA> <NA> <NA>
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 <NA> 1
## 2 3042 8755 0 0 <NA> 2
## 3 3340 8742 0 0 <NA> 3
## 4 3458 8626 0 0 <NA> 4
## 5 3412 8642 0 0 <NA> 5
## 6 3450 8748 0 0 <NA> 6
We are told that the events in the database start in the year 1950 and end in Nov 2011. There are fewer events recorded in the earlier years in general. More recent years are more complete.
Let’s look at the structure of res.
str(res)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 34 levels " N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BGN_LOCATI: Factor w/ 54428 levels " Christiansburg",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_DATE : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_TIME : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_LOCATI: Factor w/ 34505 levels " CANTON"," TULIA",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WFO : Factor w/ 541 levels " CI","%SD","$AC",..: NA NA NA NA NA NA NA NA NA NA ...
## $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ZONENAMES : Factor w/ 25111 levels " "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436780 levels "\t","\t\t","\t\t\t\t",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Upon further inspection, we see a mix of lower and uppercase letters in some of our variables’ factor levels. We’re going to clean up some of those variables.
head(table(res$EVTYPE), 12)
##
## HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD
## 1 1 1
## LIGHTNING TSTM WIND TSTM WIND (G45)
## 1 4 1
## WATERSPOUT WIND ?
## 1 1 1
## ABNORMAL WARMTH ABNORMALLY DRY ABNORMALLY WET
## 4 2 1
tail(table(res$EVTYPE), 12)
##
## WINTER STORM/HIGH WIND WINTER STORM/HIGH WINDS WINTER STORMS
## 1 1 3
## Winter Weather WINTER WEATHER WINTER WEATHER MIX
## 19 7026 6
## WINTER WEATHER/MIX WINTERY MIX Wintry mix
## 1104 2 3
## Wintry Mix WINTRY MIX WND
## 1 90 1
res$EVTYPE <- as.factor(tolower(res$EVTYPE))
table(res$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
res$PROPDMGEXP <- as.factor(toupper(res$PROPDMGEXP))
table(res$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 7 19 1 9 21 281832 1 1994
res$CROPDMGEXP <- as.factor(toupper(res$CROPDMGEXP))
Next, let’s subset only those variables which we might be using to answer our analysis questions.
sd <- subset(res, select=c(STATE, WFO, EVTYPE, BGN_DATE, END_DATE, FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REMARKS))
summary(sd)
## STATE WFO EVTYPE
## TX : 83728 OUN : 17393 hail :288661
## KS : 53440 JAN : 13889 tstm wind :219942
## OK : 46802 LWX : 13174 thunderstorm wind: 82564
## MO : 35648 PHI : 12551 tornado : 60652
## IA : 31069 TSA : 12483 flash flood : 54277
## NE : 30271 (Other):690738 flood : 25327
## (Other):621339 NA's :142069 (Other) :170874
## BGN_DATE END_DATE FATALITIES
## 5/25/2011 0:00:00: 1202 4/27/2011 0:00:00: 1214 Min. : 0.0000
## 4/27/2011 0:00:00: 1193 5/25/2011 0:00:00: 1196 1st Qu.: 0.0000
## 6/9/2011 0:00:00 : 1030 6/9/2011 0:00:00 : 1021 Median : 0.0000
## 5/30/2004 0:00:00: 1016 4/4/2011 0:00:00 : 1007 Mean : 0.0168
## 4/4/2011 0:00:00 : 1009 5/30/2004 0:00:00: 998 3rd Qu.: 0.0000
## 4/2/2006 0:00:00 : 981 (Other) :653450 Max. :583.0000
## (Other) :895866 NA's :243411
## INJURIES PROPDMG PROPDMGEXP CROPDMG
## Min. : 0.0000 Min. : 0.00 K :424665 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 M : 11337 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 0 : 216 Median : 0.000
## Mean : 0.1557 Mean : 12.06 B : 40 Mean : 1.527
## 3rd Qu.: 0.0000 3rd Qu.: 0.50 5 : 28 3rd Qu.: 0.000
## Max. :1700.0000 Max. :5000.00 (Other): 77 Max. :990.000
## NA's :465934
## CROPDMGEXP REMARKS
## ? : 7 : 24013
## 0 : 19 Trees down.\n : 1110
## 2 : 1 Several trees were blown down.\n : 568
## B : 9 Trees were downed.\n : 446
## K :281853 Large trees and power lines were blown down.\n: 432
## M : 1995 (Other) :588295
## NA's:618413 NA's :287433
First, we copy the sd data frame into a new data frame called ph. Then, we create a new column called CASUALTIES that equals the sum of the number of INJURIES and FATALITIES recorded for each storm. Next, we aggregate the data frame by the total number of people historically harmed by each event type using the sum function. Finally, we arrange the data frame in descending order using the dplyr package to see which events are most harmful to human health. We store the top ten results for population health in a variable called p10.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ph <- sd
ph$CASUALTIES <- ph$INJURIES + ph$FATALITIES
phAgg <- aggregate(CASUALTIES ~ EVTYPE, ph, FUN=sum)
phArr <- arrange(phAgg, desc(CASUALTIES))
ph10 <- head(phArr, 10)
ph10$EVTYPE <- factor(ph10$EVTYPE)
We’re going to use the PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP variables to answer this question. Let’s again take a closer look at our data frame. What are our factor levels for the PROPDMGEXP column?
table(sd$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 1 8 5 216 25 13 4 4 28 4
## 7 8 B H K M
## 5 1 40 7 424665 11337
Our levels for our property damage exponent are -, ?, +, 0-8, B, H, K, M. Because the symbolic levels are unclear in the provided documentation, we will discard them by creating a new subsetted data frame, sdp. Then, we will calculate the appropriate integer relating to each of the exponent variables and store them in a new variable, PROPDMGFULL.
sdp <- subset(sd[!(sd$PROPDMGEXP=="-" | sd$PROPDMGEXP=="?" | sd$PROPDMGEXP=="+"),])
sdp$PROPDMGFULL <- with(sdp, ifelse(sdp$PROPDMGEXP == "H", 1e2, ## H stands for hundred
ifelse(sdp$PROPDMGEXP == "K", 1e3, ## K stands for thousand
ifelse(sdp$PROPDMGEXP == "M", 1e6, ## M stands for million
ifelse(sdp$PROPDMGEXP == "B", 1e9, ## B stands for billion
ifelse(sdp$PROPDMGEXP == 1, 1e1,
ifelse(sdp$PROPDMGEXP == 2, 1e2,
ifelse(sdp$PROPDMGEXP == 3, 1e3,
ifelse(sdp$PROPDMGEXP == 4, 1e4,
ifelse(sdp$PROPDMGEXP == 5, 1e5,
ifelse(sdp$PROPDMGEXP == 6, 1e6,
ifelse(sdp$PROPDMGEXP == 7, 1e7,
ifelse(sdp$PROPDMGEXP == 8, 1e8,
ifelse(sdp$PROPDMGEXP == 0, 1e0, NA))))))))))))))
Next, we repeat the above process with the crop variables.
table(sdp$CROPDMGEXP)
##
## ? 0 2 B K M
## 5 16 0 5 277988 1552
Our levels for our crop damage exponent are ?, 0, 2, B, K, M. There are only five instances where CROPDMGEXP=="?", so we will ignore them. We will copy the sdp data frame to a new one for our next calculations and calcluate the appropriate integer from the exponent column into a new variable called CROPDMGFULL.
sdc <- sdp
sdc$CROPDMGFULL <- with(sdc, ifelse(sdc$CROPDMGEXP == "H", 1e2,
ifelse(sdc$CROPDMGEXP == "K", 1e3,
ifelse(sdc$CROPDMGEXP == "M", 1e6,
ifelse(sdc$CROPDMGEXP == "B", 1e9,
ifelse(sdc$CROPDMGEXP == 2, 1e2,
ifelse(sdc$CROPDMGEXP == 0, 1e0, NA)))))))
Next, we calculate the total damage caused by each of the storms by adding together the complete cases of property damage and crop damage. To do this, we first create a new, logical variable that contains the complete cases of both PROPDMG and PROPDMGFULL. We then calculate the full amount of property damage by multiplying together the PROPDMG and PROPDMGFULL columns and storing the result in the new PROPTOTAL column. The analagous process is done for the crop variables, too. Finally, we add together the PROPTOTAL and CROPTOTAL variables and store them in our new variable, DMGTOTAL.
We begin this code chunk by copying the latest data frame into a new one called dat.
dat <- sdc
dat$CCPROP <- complete.cases(dat$PROPDMG + dat$PROPDMGFULL)
dat$PROPTOTAL <- with(dat, ifelse(dat$CCPROP == TRUE, dat$PROPDMG * dat$PROPDMGFULL, 0))
dat$CCCROP <- complete.cases(dat$CROPDMG + dat$CROPDMGFULL)
dat$CROPTOTAL <- with(dat, ifelse(dat$CCCROP == TRUE, dat$CROPDMG * dat$CROPDMGFULL, 0))
dat$DMGTOTAL <- with(dat, ifelse(!is.na(dat$PROPTOTAL) & !is.na(dat$CROPTOTAL),
dat$PROPTOTAL + dat$CROPTOTAL, NA))
Lastly, to answer the second question, we aggregate the total damage by event type using the sum function and store the resulting data frame in a variable called ec. Then, using the dplyr package, we arrange the aggregated data frame in descending order by total damage. Finally, we store the top ten results for economic damage in a variable called ec10.
ec <- aggregate(DMGTOTAL ~ EVTYPE, dat, FUN=sum)
ecArr <- arrange(ec, desc(DMGTOTAL))
ec10 <- head(ecArr, 10)
To answer question 1, we create a barplot of the top ten most historically harmful storm events to US Population Health.
print(ph10)
## EVTYPE CASUALTIES
## 1 tornado 96979
## 2 excessive heat 8428
## 3 tstm wind 7461
## 4 flood 7259
## 5 lightning 6046
## 6 heat 3037
## 7 flash flood 2755
## 8 ice storm 2064
## 9 thunderstorm wind 1621
## 10 winter storm 1527
## 1st barplot code
par(mar = c(6.7, 4.9, 3.0, 3.5), mgp= c(3.8, 0.5, 0))
barplot(ph10$CASUALTIES, names=ph10$EVTYPE, main="Most Harmful Storm Types to US Population", ylab = "Total Casualties", las=2, col=heat.colors(10), cex.axis=0.8, cex.names=0.8)
mtext("Historical Data 1950-2011")
Fig 1. We can see that tornadoes are by far the event that caused the most casualties in the US during 1950-2011. According to the data, tornadoes were responsible for 96,979 casualties in that time frame. The next storm event, ‘excessive heat’, falls almost 90,000 casualties short of tornadoes for the same time period at 8428 casualties.
In answering the second question, we build a barplot of the top ten most economically harmful storm events to US property and crops.
print(ec10)
## EVTYPE DMGTOTAL
## 1 flood 149828665250
## 2 hurricane/typhoon 71913712800
## 3 tornado 57350760234
## 4 storm surge 43323541000
## 5 flash flood 18210702822
## 6 hail 17789075356
## 7 hurricane 14557229010
## 8 river flood 10147679500
## 9 ice storm 8967041360
## 10 tropical storm 8155601550
## 2nd barplot code
par(mar = c(6.7, 4.9, 3.0, 3.5), mgp= c(3.8, 0.5, 0))
barplot(ec10$DMGTOTAL, names=ec10$EVTYPE, main="Most Economically Harmful Storm Types in US", ylab = "Total Damage in USD", las=2, col=heat.colors(10), cex.axis=0.8, cex.names=0.8)
mtext("Historical Data 1950-2011")
Fig 2. We see that floods were the most economically harmful event in the US during 1950-2011, causing the most combined property and crop damage. The total damage, according to the data, is almost $150 billion. The next highest event, hurricanes/typhoons, caused almost $72 billion in damages in the same time period. Tornadoes, while historically responsible for the most storm casualties in the US, fell into third place at $57 billion in property and crop damages during this time frame.