We would like to answer two questions:
To answer these questions, first we look at the dataset to understand the organization of the data, then we arrange some data frames to get the data into a useful (and less cluttered) form, and finally we apply some tools from R language to visualize the desired information.
We start by reading in the raw data. Since the data is in the compressed bz2 format, we must first uncompress it by using the “bunzip2” command, which can be found in the GIT bash shell. Use “bunzip2 repdata_data_StormData.csv.bz2”, and it will create the CSV file in the same folder. Then read the CSV file into R, as follows:
data <- read.csv("repdata_data_StormData.csv")
We’d like to know how the data is organized, so we use the R commands “str” and “names” to look at the data frame.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
And we can see that there are 4 columns of interest, in addition to the EVTYPE event type.
These are:
After looking at the table, it becomes apparent that the dplyr library can help us, so we access that. Also we will use the ggplot2 library to plot some graphs.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
We would like the data grouped by event type EVTYPE.
data_grouped <- group_by(data,data$EVTYPE)
And we would like four summaries, one for each of the columns we’re interested in.
fatalities_summary <- summarize(data_grouped,Fatalities=sum(FATALITIES))
injuries_summary <- summarize(data_grouped,Injuries=sum(INJURIES))
propdmg_summary <- summarize(data_grouped,PropDmg=sum(PROPDMG))
cropdmg_summary <- summarize(data_grouped,CropDmg=sum(CROPDMG))
We could now, for example, look for the event type that caused the greatest number of fatalities.
which.max(fatalities_summary$Fatalities)
## [1] 826
fatalities_summary[826,]
## # A tibble: 1 × 2
## `data$EVTYPE` Fatalities
## <chr> <dbl>
## 1 TORNADO 5633
Clearly, tornadoes are the deadliest event type.
To see what this means in the larger context, we can do a quick plot of the number of fatalities for each event type.
qplot(EVTYPE,FATALITIES,data=data)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We notice there’s a few event types at the top, and we’d like to see what those are and how many fatalities there are. It looks like we can get the top 10 or so by setting the threshold around 80.
However the question asks “most harmful”, and injuries and fatalities are both harmful, so let’s combine those two columns into a single data frame. And since the second question asks for economic consequences, we do the same for property damage and crop damage.
First though, we change the names in the summary tables so we can more conveniently access the data.
After combining the harm and economic consequences, we can then subset each table to get the top 10 or so contributors. We can quickly look at the structure of one of the tables to make sure it’s what we want, and optionally we can do a quick plot to see where we can threshold to get a convenient Top 10-or-so.
We can print out the top contributors to adverse health and to adverse economic consequences.
names(fatalities_summary) <- c("EvType", "Fatalities")
names(injuries_summary) <- c("EvType", "Injuries")
names(propdmg_summary) <- c("EvType", "PropDmg")
names(cropdmg_summary) <- c("EvType", "CropDmg")
harmful_summary <- merge(fatalities_summary,injuries_summary,by="EvType")
economic_summary <- merge(propdmg_summary,cropdmg_summary,by="EvType")
harmful_summary$Total <- harmful_summary$Fatalities + harmful_summary$Injuries
economic_summary$Total <- economic_summary$PropDmg + economic_summary$CropDmg
str(harmful_summary)
## 'data.frame': 985 obs. of 4 variables:
## $ EvType : chr " HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
## $ Fatalities: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Injuries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Total : num 0 0 0 0 0 0 0 0 0 0 ...
top_harmful <- subset(harmful_summary,harmful_summary$Total > 1000)
top_economic <- subset(economic_summary,economic_summary$Total > 100000)
top_harmful
## EvType Fatalities Injuries Total
## 130 EXCESSIVE HEAT 1903 6525 8428
## 153 FLASH FLOOD 978 1777 2755
## 170 FLOOD 470 6789 7259
## 244 HAIL 15 1361 1376
## 275 HEAT 937 2100 3037
## 310 HEAVY SNOW 127 1021 1148
## 359 HIGH WIND 248 1137 1385
## 411 HURRICANE/TYPHOON 64 1275 1339
## 427 ICE STORM 89 1975 2064
## 464 LIGHTNING 816 5230 6046
## 760 THUNDERSTORM WIND 133 1488 1621
## 834 TORNADO 5633 91346 96979
## 856 TSTM WIND 504 6957 7461
## 972 WINTER STORM 206 1321 1527
top_economic
## EvType PropDmg CropDmg Total
## 153 FLASH FLOOD 1420124.6 179200.46 1599325.1
## 170 FLOOD 899938.5 168037.88 1067976.4
## 244 HAIL 688693.4 579596.28 1268289.7
## 310 HEAVY SNOW 122252.0 2165.72 124417.7
## 359 HIGH WIND 324731.6 17283.21 342014.8
## 464 LIGHTNING 603351.8 3580.61 606932.4
## 760 THUNDERSTORM WIND 876844.2 66791.45 943635.6
## 786 THUNDERSTORM WINDS 446293.2 18684.93 464978.1
## 834 TORNADO 3212258.2 100018.52 3312276.7
## 856 TSTM WIND 1335965.6 109202.60 1445168.2
## 972 WINTER STORM 132720.6 1978.99 134699.6
It is useful to view the results in two forms: one as a table, and the other as a graph.
First we sort each table in descending order, so the greatest contributor is on top.
Then we plot each table to see how the contributions look.
harmful_sorted <- arrange(top_harmful,desc(top_harmful$Total))
economic_sorted <- arrange(top_economic,desc(top_economic$Total))
harmful_sorted
## EvType Fatalities Injuries Total
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
## 11 HIGH WIND 248 1137 1385
## 12 HAIL 15 1361 1376
## 13 HURRICANE/TYPHOON 64 1275 1339
## 14 HEAVY SNOW 127 1021 1148
economic_sorted
## EvType PropDmg CropDmg Total
## 1 TORNADO 3212258.2 100018.52 3312276.7
## 2 FLASH FLOOD 1420124.6 179200.46 1599325.1
## 3 TSTM WIND 1335965.6 109202.60 1445168.2
## 4 HAIL 688693.4 579596.28 1268289.7
## 5 FLOOD 899938.5 168037.88 1067976.4
## 6 THUNDERSTORM WIND 876844.2 66791.45 943635.6
## 7 LIGHTNING 603351.8 3580.61 606932.4
## 8 THUNDERSTORM WINDS 446293.2 18684.93 464978.1
## 9 HIGH WIND 324731.6 17283.21 342014.8
## 10 WINTER STORM 132720.6 1978.99 134699.6
## 11 HEAVY SNOW 122252.0 2165.72 124417.7
qplot(EvType, Total, data=harmful_sorted, xlab="Event Type", ylab="Fatalities and Injuries", main="Events Causing Greatest Health Damage") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
qplot(EvType, Total, data=economic_sorted, xlab="Event Type", ylab="Property and Crop Damage", main="Events Causing Greatest Economic Damage") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))