The source data for this research is the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The main goal of this research is to address the following questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Most of the records in the dataset have zero casualties and zero economic damage, so we can omit them while processing the data.
Another data processing step was to align event records with the official event names from NATIONAL WEATHER SERVICE INSTRUCTION 10-1605.
After that, we built two resulting datasets to address the main questions of the research.
Thus we found the most severe weather events. In addition, we built two maps showing, by U.S. county, the levels of tornado damage to population health and of flood damage to the economy.
First we need to read the source data. Because the file is in bzip2 format, read.csv() can read it directly from the compressed file without unzipping.
temp <- tempfile()
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",temp)
data <- read.csv(temp)
unlink(temp)
rm(temp)
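An optional refinement for repeated runs: cache the download so the large archive is not re-fetched every time. A small sketch (the local file name StormData.csv.bz2 is our own choice here):
# download only if the cached copy is missing
local.file <- "StormData.csv.bz2"
if (!file.exists(local.file)) {
    download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  local.file, mode="wb")
}
# data <- read.csv(local.file)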
Let’s see what is inside the data.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163 CST :547493 Min. : 0.0
## 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350 EST :245558 1st Qu.: 31.0
## Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261 MST : 68390 Median : 75.0
## Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891 PST : 28302 Mean :100.6
## 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703 AST : 6360 3rd Qu.:131.0
## Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700 HST : 2563 Max. :873.0
## (Other) :895866 (Other) :857229 (Other): 3631
## COUNTYNAME STATE EVTYPE BGN_RANGE
## JEFFERSON : 7840 TX : 83728 HAIL :288661 Min. : 0.000
## WASHINGTON: 7603 KS : 53440 TSTM WIND :219940 1st Qu.: 0.000
## JACKSON : 6660 OK : 46802 THUNDERSTORM WIND: 82563 Median : 0.000
## FRANKLIN : 6256 MO : 35648 TORNADO : 60652 Mean : 1.484
## LINCOLN : 5937 IA : 31069 FLASH FLOOD : 54277 3rd Qu.: 1.000
## MADISON : 5632 NE : 30271 FLOOD : 25326 Max. :3749.000
## (Other) :862369 (Other):621339 (Other) :170878
## BGN_AZI BGN_LOCATI END_DATE END_TIME
## :547332 :287743 :243411 :238978
## N : 86752 COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802
## W : 38446 Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314
## S : 37558 SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104
## E : 33178 SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483
## NW : 24041 NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184
## (Other):134990 (Other) :591444 (Other) :653450 (Other) :622432
## COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## Min. :0 Mode:logical Min. : 0.0000 :724837 :499225
## 1st Qu.:0 NA's:902297 1st Qu.: 0.0000 N : 28082 COUNTYWIDE : 19731
## Median :0 Median : 0.0000 S : 22510 SOUTH PORTION : 833
## Mean :0 Mean : 0.9862 W : 20119 NORTH PORTION : 780
## 3rd Qu.:0 3rd Qu.: 0.0000 E : 20047 CENTRAL PORTION: 617
## Max. :0 Max. :925.0000 NE : 14606 SPRINGFIELD : 575
## (Other): 72096 (Other) :380536
## LENGTH WIDTH F MAG FATALITIES
## Min. : 0.0000 Min. : 0.000 Min. :0.0 Min. : 0.0 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.000 Median :1.0 Median : 50.0 Median : 0.0000
## Mean : 0.2301 Mean : 7.503 Mean :0.9 Mean : 46.9 Mean : 0.0168
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000
## Max. :2315.0000 Max. :4400.000 Max. :5.0 Max. :22000.0 Max. :583.0000
## NA's :843563
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.0000 Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.0000 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.0000 Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 0.1557 Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.0000 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :1700.0000 Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## WFO STATEOFFIC
## :142069 :248769
## OUN : 17393 TEXAS, North : 12193
## JAN : 13889 ARKANSAS, Central and North Central: 11738
## LWX : 13174 IOWA, Central : 11345
## PHI : 12551 KANSAS, Southwest : 11212
## TSA : 12483 GEORGIA, North and Central : 11120
## (Other):690738 (Other) :595920
## ZONENAMES
## :594029
## :205988
## GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639
## GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592
## JEFFERSON - JEFFERSON : 303
## MADISON - MADISON : 302
## (Other) :100444
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## Min. : 0 Min. :-14451 Min. : 0 Min. :-14455
## 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0
## Median :3540 Median : 8707 Median : 0 Median : 0
## Mean :2875 Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. :9706 Max. : 17124 Max. :9706 Max. :106220
## NA's :47 NA's :40
## REMARKS REFNUM
## :287433 Min. : 1
## : 24013 1st Qu.:225575
## Trees down.\n : 1110 Median :451149
## Several trees were blown down.\n : 568 Mean :451149
## Trees were downed.\n : 446 3rd Qu.:676723
## Large trees and power lines were blown down.\n: 432 Max. :902297
## (Other) :588295
head(data,10)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE BGN_AZI
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO 0
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO 0
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO 0
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO 0
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO 0
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO 0
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL TORNADO 0
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL TORNADO 0
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL TORNADO 0
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL TORNADO 0
## BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F
## 1 0 NA 0 14.0 100 3
## 2 0 NA 0 2.0 150 2
## 3 0 NA 0 0.1 123 2
## 4 0 NA 0 0.0 100 2
## 5 0 NA 0 0.0 150 2
## 6 0 NA 0 1.5 177 2
## 7 0 NA 0 1.5 33 2
## 8 0 NA 0 0.0 33 1
## 9 0 NA 0 3.3 100 3
## 10 0 NA 0 2.3 100 3
## MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## 1 0 0 15 25.0 K 0 3040
## 2 0 0 0 2.5 K 0 3042
## 3 0 0 2 25.0 K 0 3340
## 4 0 0 2 2.5 K 0 3458
## 5 0 0 2 2.5 K 0 3412
## 6 0 0 6 2.5 K 0 3450
## 7 0 0 1 2.5 K 0 3405
## 8 0 0 0 2.5 K 0 3255
## 9 0 1 14 25.0 K 0 3334
## 10 0 0 0 25.0 K 0 3336
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 8812 3051 8806 1
## 2 8755 0 0 2
## 3 8742 0 0 3
## 4 8626 0 0 4
## 5 8642 0 0 5
## 6 8748 0 0 6
## 7 8631 0 0 7
## 8 8558 0 0 8
## 9 8740 3336 8738 9
## 10 8738 3337 8737 10
The raw data is huge and hard to process, so it would be good to find a way to reduce it. Our first question concerns the types of events that are most harmful to population health (i.e., the INJURIES and FATALITIES variables). Looking at the summary, we see that even the 3rd quartile of both variables equals zero. Therefore, we can get rid of a big share of observations while working on this question.
Let’s confirm this with a little exploration.
zero.harm.obs<-nrow(data[data$FATALITIES==0 & data$INJURIES==0,])
zero.harm.obs
## [1] 880368
Now we see that about 98% of all observations are not necessary for our first task. Therefore, we make another data.frame:
data2<-data[data$FATALITIES!=0 | data$INJURIES!=0,]
dim(data2)
## [1] 21929 37
Let’s do a little check: subtract the total number of injuries in the new dataset data2 from the total number of injuries in the raw data.
sum(data$INJURIES)-sum(data2$INJURIES)
## [1] 0
We see that we didn’t lose any valuable information.
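By the same logic, the fatality totals must match as well, since every row with a nonzero FATALITIES value was kept by the filter:
# should also return 0: no rows with nonzero FATALITIES were dropped
sum(data$FATALITIES)-sum(data2$FATALITIES)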
Now we reduce our data2 even more, moving along with the first question. In this code chunk, we aggregate the data by event type into two summed variables. After that, we sort it by the FATALITIES and then the INJURIES variable, both descending.
harm_sum<-aggregate(cbind(data2$FATALITIES,data2$INJURIES), by=list(data2$EVTYPE), FUN=sum)
harm_sum<-droplevels(harm_sum)
harm_sum<-harm_sum[order(-harm_sum[,2], -harm_sum[,3]),]
names(harm_sum) <- c("EVTYPE","FATALITIES","INJURIES")
str(harm_sum)
## 'data.frame': 220 obs. of 3 variables:
## $ EVTYPE : Factor w/ 220 levels "AVALANCE","AVALANCHE",..: 184 32 42 69 123 191 47 147 93 2 ...
## $ FATALITIES: num 5633 1903 978 937 816 ...
## $ INJURIES : num 91346 6525 1777 2100 5230 ...
Note that we dropped the unused levels of the EVTYPE factor with the droplevels() function.
And there is another important note: we assume that a fatality is a much more severe consequence than an injury. That’s why we sort the data first by the FATALITIES variable and then by INJURIES.
head(harm_sum,20)
## EVTYPE FATALITIES INJURIES
## 184 TORNADO 5633 91346
## 32 EXCESSIVE HEAT 1903 6525
## 42 FLASH FLOOD 978 1777
## 69 HEAT 937 2100
## 123 LIGHTNING 816 5230
## 191 TSTM WIND 504 6957
## 47 FLOOD 470 6789
## 147 RIP CURRENT 368 232
## 93 HIGH WIND 248 1137
## 2 AVALANCHE 224 170
## 214 WINTER STORM 206 1321
## 148 RIP CURRENTS 204 297
## 71 HEAT WAVE 172 309
## 37 EXTREME COLD 160 231
## 173 THUNDERSTORM WIND 133 1488
## 77 HEAVY SNOW 127 1021
## 38 EXTREME COLD/WIND CHILL 125 24
## 167 STRONG WIND 103 280
## 4 BLIZZARD 101 805
## 89 HIGH SURF 101 152
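Had we preferred a single ranking criterion instead of this two-level sort, we could collapse both measures into one harm score. A minimal sketch, assuming a purely illustrative weight of ten injuries per fatality (not used in the rest of this analysis):
# w is an assumed, illustrative weight: one fatality counted as ten injuries
w <- 10
head(harm_sum[order(-(w*harm_sum$FATALITIES + harm_sum$INJURIES)),])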
Now we must recall that the raw data was entered by hand and may contain typos and other inconsistencies. At the beginning we had 985 levels of the EVTYPE variable; after our preparation we have 220. But there are only 48 Storm Data event types in NATIONAL WEATHER SERVICE INSTRUCTION 10-1605.
First of all, we spot the abbreviation TSTM in one of the top records. It obviously stands for THUNDERSTORM. Since this record stands high in the list, it is important enough to fix “by hand”:
harm_sum$EVTYPE <-as.character(harm_sum$EVTYPE)
harm_sum$EVTYPE <- gsub("TSTM","THUNDERSTORM",harm_sum$EVTYPE)
Now we copy the table of event types from the Instruction, paste it into a standalone text file event.types.txt, and read it into an R dataset. We use TAB as the separator so that rows are not split up, i.e. we get a single column.
event.types<-read.table("../RepData_PeerAssessment2/event.types.txt",sep="\t")
head(event.types)
## V1
## 1 Astronomical Low Tide Z
## 2 Avalanche Z
## 3 Blizzard Z
## 4 Coastal Flood Z
## 5 Cold/Wind Chill Z
## 6 Debris Flow C
We need to bring the event types to upper case to compare them with our data. We also see a single-letter designator at the end of each row; let’s get rid of it.
event.types$V1 <- toupper(as.character(event.types$V1))
# drop the trailing designator: a space plus one letter
event.types$V1 <- substr(event.types$V1,1,nchar(event.types$V1)-2)
Our strategy for handling the typos is as follows. We use two functions. The first is LenLongStr(), which computes the length of the longest common substring.
# dynamic programming: num[i,j] holds the length of the common substring
# ending at s1[i] and s2[j]; maxlen tracks the longest seen so far
LenLongStr <- function(String1, String2) {
    s1 <- unlist(strsplit(String1,split=""))
    s2 <- unlist(strsplit(String2,split=""))
    num <- matrix(0,nchar(String1),nchar(String2))
    maxlen <- 0
    for (i in 1:nchar(String1)) {
        for (j in 1:nchar(String2)) {
            if (s1[i] == s2[j]) {
                if ((i==1) || (j==1)) {
                    num[i,j] <- 1
                }
                else {
                    num[i,j] <- 1+num[i-1,j-1]
                }
                if (num[i,j] > maxlen) {
                    maxlen <- num[i,j]
                }
            }
        }
    }
    maxlen
}
## this code is taken from http://snipplr.com/view/52222/longest-common-substring/
The second is adist(), which computes the generalized Levenshtein (edit) distance between two strings.
As a measure of similarity of two strings we take the ratio of the length of their longest common substring to the edit distance between them:
Similarity(String1, String2) = LenLongStr(String1, String2) / adist(String1, String2)
The logic is straightforward: the longer the common substring, the more similar the strings; the greater the edit distance, the less similar they are.
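A quick illustration of how this measure behaves on a near-duplicate versus an unrelated pair:
# near-duplicate: common substring "RIP CURRENT" (11 chars), edit distance 1,
# so the similarity score is high (11/1 = 11)
LenLongStr("RIP CURRENT","RIP CURRENTS")/as.numeric(adist("RIP CURRENT","RIP CURRENTS"))
# unrelated pair: short common substring and large edit distance,
# so the score is far below 1
LenLongStr("RIP CURRENT","DROUGHT")/as.numeric(adist("RIP CURRENT","DROUGHT"))
Note also that for an exact match adist() returns 0, so the ratio becomes infinite, and which.max() below conveniently picks exact matches first.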
Now we find the closest event type from the Instruction for every record in the events column of our data (the harm_sum dataset):
for(i in 1:nrow(harm_sum)) {
    # one row per official event type: column 1 holds the common-substring
    # length, column 2 the edit distance
    long.com.str <- matrix(ncol=2)
    long.com.str[1,1] <- LenLongStr(event.types[1,1], harm_sum[i,1])
    long.com.str[1,2] <- as.numeric(adist(event.types[1,1], harm_sum[i,1]))
    for(j in 2:nrow(event.types)) {
        long.com.str <- rbind(long.com.str, c(LenLongStr(event.types[j,1], harm_sum[i,1]),
                                              as.numeric(adist(event.types[j,1], harm_sum[i,1]))))
    }
    # the official name with the highest similarity ratio wins
    harm_sum[i,4] <- event.types[which.max(long.com.str[,1]/long.com.str[,2]),1]
}
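The same matching could be written more idiomatically, without growing a matrix row by row. A sketch of an equivalent formulation (match.event is a helper name introduced here just for illustration):
# score one raw EVTYPE against every official name and return the best match
match.event <- function(evtype) {
    scores <- sapply(event.types$V1, function(official)
        LenLongStr(official, evtype)/as.numeric(adist(official, evtype)))
    event.types$V1[which.max(scores)]
}
# equivalent to the loop above:
# harm_sum$V4 <- sapply(harm_sum$EVTYPE, match.event)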
Here is the outcome. The first column shows what we had before; the fourth column holds the results of our matching. As we can see, the outcome is pretty good: almost all of the top records are matched to appropriate options from the Instruction list.
head(harm_sum,50)
## EVTYPE FATALITIES INJURIES V4
## 184 TORNADO 5633 91346 TORNADO
## 32 EXCESSIVE HEAT 1903 6525 EXCESSIVE HEAT
## 42 FLASH FLOOD 978 1777 FLASH FLOOD
## 69 HEAT 937 2100 HEAT
## 123 LIGHTNING 816 5230 LIGHTNING
## 191 THUNDERSTORM WIND 504 6957 THUNDERSTORM WIND
## 47 FLOOD 470 6789 FLOOD
## 147 RIP CURRENT 368 232 RIP CURRENT
## 93 HIGH WIND 248 1137 HIGH WIND
## 2 AVALANCHE 224 170 AVALANCHE
## 214 WINTER STORM 206 1321 WINTER STORM
## 148 RIP CURRENTS 204 297 RIP CURRENT
## 71 HEAT WAVE 172 309 HEAT
## 37 EXTREME COLD 160 231 EXTREME COLD/WIND CHILL
## 173 THUNDERSTORM WIND 133 1488 THUNDERSTORM WIND
## 77 HEAVY SNOW 127 1021 HEAVY SNOW
## 38 EXTREME COLD/WIND CHILL 125 24 EXTREME COLD/WIND CHILL
## 167 STRONG WIND 103 280 STRONG WIND
## 4 BLIZZARD 101 805 BLIZZARD
## 89 HIGH SURF 101 152 HIGH SURF
## 74 HEAVY RAIN 98 251 HEAVY RAIN
## 39 EXTREME HEAT 96 155 EXCESSIVE HEAT
## 21 COLD/WIND CHILL 95 12 COLD/WIND CHILL
## 117 ICE STORM 89 1975 ICE STORM
## 210 WILDFIRE 75 911 WILDFIRE
## 109 HURRICANE/TYPHOON 64 1275 HURRICANE (TYPHOON)
## 176 THUNDERSTORM WINDS 64 908 THUNDERSTORM WIND
## 52 FOG 62 734 DENSE FOG
## 101 HURRICANE 61 46 HURRICANE (TYPHOON)
## 189 TROPICAL STORM 58 340 TROPICAL STORM
## 85 HEAVY SURF/HIGH SURF 42 48 HIGH SURF
## 120 LANDSLIDE 38 52 AVALANCHE
## 98 HIGH WINDS 35 302 HIGH WIND
## 16 COLD 35 48 COLD/WIND CHILL
## 217 WINTER WEATHER 33 398 WINTER WEATHER
## 196 TSUNAMI 33 129 TSUNAMI
## 200 UNSEASONABLY WARM AND DRY 29 0 THUNDERSTORM WIND
## 202 URBAN/SML STREAM FLD 28 79 MARINE STRONG WIND
## 219 WINTER WEATHER/MIX 28 72 WINTER WEATHER
## 187 TORNADOES, THUNDERSTORM WIND, HAIL 25 0 MARINE THUNDERSTORM WIND
## 211 WIND 23 86 HIGH WIND
## 31 DUST STORM 22 440 DUST STORM
## 44 FLASH FLOODING 19 8 FLASH FLOOD
## 23 DENSE FOG 18 342 DENSE FOG
## 49 FLOOD/FLASH FLOOD 17 15 FLASH FLOOD
## 40 EXTREME WINDCHILL 17 5 EXTREME COLD/WIND CHILL
## 146 RECORD/EXCESSIVE HEAT 17 0 EXCESSIVE HEAT
## 67 HAIL 15 1361 HAIL
## 131 MARINE STRONG WIND 14 22 MARINE STRONG WIND
## 17 COLD AND SNOW 14 0 HEAVY SNOW
str(harm_sum)
## 'data.frame': 220 obs. of 4 variables:
## $ EVTYPE : chr "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD" "HEAT" ...
## $ FATALITIES: num 5633 1903 978 937 816 ...
## $ INJURIES : num 91346 6525 1777 2100 5230 ...
## $ V4 : chr "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD" "HEAT" ...
Let’s finish the first question and make the resulting dataset.
harm_result<-aggregate(cbind(harm_sum$FATALITIES,harm_sum$INJURIES), by=list(harm_sum$V4), FUN=sum)
harm_result<-harm_result[order(-harm_result[,2], -harm_result[,3]),]
names(harm_result)<-c("EVTYPE","FATALITIES","INJURIES")
head(harm_result,20)
## EVTYPE FATALITIES INJURIES
## 38 TORNADO 5633 91364
## 12 EXCESSIVE HEAT 2022 6753
## 20 HEAT 1118 2415
## 14 FLASH FLOOD 1041 1804
## 27 LIGHTNING 819 5233
## 37 THUNDERSTORM WIND 741 9510
## 32 RIP CURRENT 577 529
## 15 FLOOD 476 6791
## 24 HIGH WIND 327 1675
## 13 EXTREME COLD/WIND CHILL 303 261
## 2 AVALANCHE 271 439
## 44 WINTER STORM 219 1430
## 22 HEAVY SNOW 169 1154
## 23 HIGH SURF 156 215
## 5 COLD/WIND CHILL 137 67
## 25 HURRICANE (TYPHOON) 133 1331
## 36 STRONG WIND 111 302
## 21 HEAVY RAIN 107 292
## 3 BLIZZARD 101 805
## 26 ICE STORM 93 2015
Looking at this table, it is clear that the tornado is the most harmful event by a wide margin. Let’s see how we could better prepare for it. It would be good to know where to concentrate medical personnel and rescue forces to save people’s lives. For that purpose, we will draw a map of U.S. counties shaded by levels of tornado casualties.
# load ggplot2, maps and color packages
library(ggplot2)
library(maps)
library(mapproj)
library(RColorBrewer)
# Load base of county fips
data(county.fips)
Now we will perform some data manipulation to get the necessary data. First, we subset the data by the event type TORNADO and aggregate it by county.
tornado.county<-data2[data2$EVTYPE=="TORNADO",]
tornado.county<-aggregate(cbind(tornado.county$FATALITIES,tornado.county$INJURIES), by=list(tornado.county$COUNTY,tornado.county$COUNTYNAME), FUN=sum)
names(tornado.county)<-c("COUNTY","COUNTYNAME","FATALITIES","INJURIES")
tornado.county$COUNTY<-tornado.county$COUNTY+1000
Now we build quantile breaks from the FATALITIES variable.
quant<-quantile(tornado.county[tornado.county$FATALITIES!=0, 3], probs=seq(0,1,1/8),names=F)
quant<-c(-1,quant)
quant<-unique(quant)
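Why the unique() call? With heavily skewed counts, several of the octile break points coincide, and cut() requires strictly increasing breaks. A toy illustration of the effect:
# many tied small values make several of the computed breaks identical
quantile(c(1,1,1,1,2,50), probs=seq(0,1,1/8), names=F)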
After deduplication only 7 unique break points remain, giving 6 buckets. Now we distribute counties among the “buckets” and assign colors.
tornado.county$colorBuckets <- as.numeric(cut(tornado.county$FATALITIES, quant))
table(tornado.county$colorBuckets)
##
## 1 2 3 4 5 6
## 1517 201 89 154 96 112
cnty.fips <- county.fips$fips[match(map("county", plot=FALSE)$names, county.fips$polyname)]
colorsmatched <- tornado.county$colorBuckets[match(cnty.fips, tornado.county$COUNTY)]
# unmatched counties stay NA: indexing colors with NA leaves them unfilled,
# whereas replacing NA with 0 would drop entries and misalign colors with polygons
colors <- brewer.pal(length(quant)-1, "OrRd")
And here is the map.
map("county", col = colors[colorsmatched], fill = TRUE, resolution = 0, lty = 0, projection = "polyconic")
leg.txt <- array()
quant[1]<-0
quant[length(quant)]<-quant[length(quant)]+1
for(i in 1:(length(quant)-1))
{
if(quant[i+1]==quant[i]+1)
leg.txt[i]<-as.character(quant[i])
else
leg.txt[i]<-paste0(quant[i],"-",quant[i+1]-1)
}
title("Tornado fatalities by U.S. Counties, 1950-2011")
legend("bottomright", leg.txt, horiz = F, fill = colors, cex = 0.8)
Thus, we see the places where life-saving forces and services should be concentrated during a tornado. Here is the top list of counties to pay attention to.
head(tornado.county[order(-tornado.county$FATALITIES, -tornado.county$INJURIES), -c(1,5)], 10)
## COUNTYNAME FATALITIES INJURIES
## 941 JASPER 163 1258
## 682 GENESEE 120 911
## 1264 MCLENNAN 114 610
## 953 JEFFERSON 107 1529
## 2142 WORCESTER 92 1253
## 423 COWLEY 77 290
## 576 ELKHART 62 510
## 1944 TUSCALOOSA 59 1057
## 1185 MADISON 59 1011
## 2092 WHITE 59 429
Now let’s turn to the second question of the task.
There are two variables in the data that estimate economic consequences: PROPDMG and CROPDMG. As in the first part, we’ll try to simplify the data.
zero.cons.obs<-nrow(data[data$PROPDMG==0 & data$CROPDMG==0,])
zero.cons.obs
## [1] 657266
Again, about 73% of the observations can be omitted while working on the second question.
data3<-data[data$PROPDMG!=0 | data$CROPDMG!=0,]
data3$EVTYPE<-droplevels(data3$EVTYPE)
dim(data3)
## [1] 245031 37
Additionally, we have two variables, PROPDMGEXP and CROPDMGEXP. From NATIONAL WEATHER SERVICE INSTRUCTION 10-1605 we know that “Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”
Let’s look closer at these two variables.
table(data3$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6 7 8 B
## 4357 1 0 5 209 0 1 1 4 18 3 2 0 40
## h H K m M
## 1 6 229057 7 11319
table(data3$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 145037 6 17 0 7 21 97960 1 1982
As we can see, some entries differ from the documented options (“K” for thousands, “M” for millions, and “B” for billions). We can make a guess about them; for example, a digit could be an exponent, i.e. 5 meaning “multiply by 10^5”. But since there are only a few such unknown values, we prefer to leave them unscaled rather than guess and risk spoiling the data.
Now let’s convert the PROPDMG and CROPDMG variables into actual dollar amounts.
data3$PROPDMGEXP<-toupper(data3$PROPDMGEXP)
data3$CROPDMGEXP<-toupper(data3$CROPDMGEXP)
data3[data3$PROPDMGEXP=="K",]$PROPDMG <- data3[data3$PROPDMGEXP=="K",]$PROPDMG*1000
data3[data3$PROPDMGEXP=="M",]$PROPDMG <- data3[data3$PROPDMGEXP=="M",]$PROPDMG*1000000
data3[data3$PROPDMGEXP=="B",]$PROPDMG <- data3[data3$PROPDMGEXP=="B",]$PROPDMG*1000000000
data3[data3$CROPDMGEXP=="K",]$CROPDMG <- data3[data3$CROPDMGEXP=="K",]$CROPDMG*1000
data3[data3$CROPDMGEXP=="M",]$CROPDMG <- data3[data3$CROPDMGEXP=="M",]$CROPDMG*1000000
data3[data3$CROPDMGEXP=="B",]$CROPDMG <- data3[data3$CROPDMGEXP=="B",]$CROPDMG*1000000000
Further analysis is very similar to the previous one, with one exception: this time we estimate the damage by summing two variables, PROPDMG and CROPDMG, not only one as we did before.
dmg_sum<-aggregate(cbind(data3$PROPDMG,data3$CROPDMG), by=list(data3$EVTYPE), FUN=sum)
dmg_sum<-dmg_sum[order(-(dmg_sum[,2]+dmg_sum[,3])),]
names(dmg_sum)<-c("EVTYPE","PROPDMG","CROPDMG")
str(dmg_sum)
## 'data.frame': 431 obs. of 3 variables:
## $ EVTYPE : Factor w/ 431 levels " HIGH SURF ADVISORY",..: 72 197 354 299 116 59 39 189 262 206 ...
## $ PROPDMG: num 1.45e+11 6.93e+10 5.69e+10 4.33e+10 1.57e+10 ...
## $ CROPDMG: num 5.66e+09 2.61e+09 4.15e+08 5.00e+03 3.03e+09 ...
head(dmg_sum,10)
## EVTYPE PROPDMG CROPDMG
## 72 FLOOD 144657709807 5661968450
## 197 HURRICANE/TYPHOON 69305840000 2607872800
## 354 TORNADO 56937160779 414953270
## 299 STORM SURGE 43323536000 5000
## 116 HAIL 15732267048 3025954473
## 59 FLASH FLOOD 16140812067 1421317100
## 39 DROUGHT 1046106000 13972566000
## 189 HURRICANE 11868319010 2741910000
## 262 RIVER FLOOD 5118945500 5029459000
## 206 ICE STORM 3944927860 5022113500
Thus, we have the desired dataset, but again with non-official event types. We already have a solution for that, so we will fix it in the same way as earlier.
Again, we start by correcting the abbreviation TSTM to THUNDERSTORM. Let’s also trim the whitespace from the edges of the strings.
dmg_sum$EVTYPE <-as.character(dmg_sum$EVTYPE)
dmg_sum$EVTYPE <- gsub("TSTM","THUNDERSTORM",dmg_sum$EVTYPE)
trim <- function (x) gsub("^\\s+|\\s+$", "", x)  # base R (>= 3.2.0) also provides trimws()
dmg_sum$EVTYPE <- trim(dmg_sum$EVTYPE)
We find the closest event type from the Instruction for every record in the events column of our data (this time the dmg_sum dataset):
# the same matching loop as before, now applied to dmg_sum
for(i in 1:nrow(dmg_sum)) {
    long.com.str <- matrix(ncol=2)
    long.com.str[1,1] <- LenLongStr(event.types[1,1], dmg_sum[i,1])
    long.com.str[1,2] <- as.numeric(adist(event.types[1,1], dmg_sum[i,1]))
    for(j in 2:nrow(event.types)) {
        long.com.str <- rbind(long.com.str, c(LenLongStr(event.types[j,1], dmg_sum[i,1]),
                                              as.numeric(adist(event.types[j,1], dmg_sum[i,1]))))
    }
    dmg_sum[i,4] <- event.types[which.max(long.com.str[,1]/long.com.str[,2]),1]
}
Here is the outcome. The first column shows what we had before; the fourth column holds the results of our matching. And again, the outcome is pretty good: almost all of the top records are matched to appropriate options from the Instruction list.
head(dmg_sum,50)
## EVTYPE PROPDMG CROPDMG V4
## 72 FLOOD 144657709807 5661968450 FLOOD
## 197 HURRICANE/TYPHOON 69305840000 2607872800 HURRICANE (TYPHOON)
## 354 TORNADO 56937160779 414953270 TORNADO
## 299 STORM SURGE 43323536000 5000 STORM SURGE/TIDE
## 116 HAIL 15732267048 3025954473 HAIL
## 59 FLASH FLOOD 16140812067 1421317100 FLASH FLOOD
## 39 DROUGHT 1046106000 13972566000 DROUGHT
## 189 HURRICANE 11868319010 2741910000 HURRICANE (TYPHOON)
## 262 RIVER FLOOD 5118945500 5029459000 FLASH FLOOD
## 206 ICE STORM 3944927860 5022113500 ICE STORM
## 363 TROPICAL STORM 7703890550 678346000 TROPICAL STORM
## 424 WINTER STORM 6688497251 26944000 WINTER STORM
## 174 HIGH WIND 5270046295 638571300 HIGH WIND
## 414 WILDFIRE 4765114000 295472800 WILDFIRE
## 369 THUNDERSTORM WIND 4484928495 554007350 THUNDERSTORM WIND
## 300 STORM SURGE/TIDE 4641188000 850000 STORM SURGE/TIDE
## 313 THUNDERSTORM WIND 3483121284 414843050 THUNDERSTORM WIND
## 195 HURRICANE OPAL 3172846000 19000000 HURRICANE (TYPHOON)
## 412 WILD/FOREST FIRE 3001829500 106796830 WILDFIRE
## 142 HEAVY RAIN/SEVERE WEATHER 2500000000 0 HEAVY RAIN
## 328 THUNDERSTORM WINDS 1735959023 190654788 THUNDERSTORM WIND
## 360 TORNADOES, THUNDERSTORM WIND, HAIL 1600000000 2500000 MARINE THUNDERSTORM WIND
## 138 HEAVY RAIN 694248090 733399800 HEAVY RAIN
## 54 EXTREME COLD 67737400 1292973000 EXTREME COLD/WIND CHILL
## 269 SEVERE THUNDERSTORM 1205360000 200000 MARINE THUNDERSTORM WIND
## 98 FROST/FREEZE 9480000 1094086000 FROST/FREEZE
## 148 HEAVY SNOW 932589142 134653100 HEAVY SNOW
## 225 LIGHTNING 928659447 12092090 LIGHTNING
## 12 BLIZZARD 659213950 112060000 BLIZZARD
## 182 HIGH WINDS 608323748 40720600 HIGH WIND
## 411 WILD FIRES 624100000 0 WILDFIRE
## 388 TYPHOON 600230000 825000 HURRICANE (TYPHOON)
## 49 EXCESSIVE HEAT 7753700 492402000 EXCESSIVE HEAT
## 87 FREEZE 205000 446225000 FROST/FREEZE
## 132 HEAT 1797000 401461500 HEAT
## 192 HURRICANE ERIN 258100000 136010000 HURRICANE (TYPHOON)
## 214 LANDSLIDE 324596000 20017000 AVALANCHE
## 68 FLASH FLOODING 307763604 15116050 FLASH FLOOD
## 66 FLASH FLOOD/FLOOD 272450006 555000 FLASH FLOOD
## 35 DAMAGING FREEZE 8000000 262100000 FROST/FREEZE
## 76 FLOOD/FLASH FLOOD 174039009 95034000 FLASH FLOOD
## 130 HAILSTORM 241000000 0 ICE STORM
## 302 STRONG WIND 175241450 64953500 STRONG WIND
## 21 COASTAL FLOOD 237665560 0 COASTAL FLOOD
## 386 TSUNAMI 144062000 20000 TSUNAMI
## 51 EXCESSIVE WETNESS 0 142000000 EXCESSIVE HEAT
## 263 River Flooding 106155000 28020000 DENSE FOG
## 23 COASTAL FLOODING 126640500 56000 COASTAL FLOOD
## 186 HIGH WINDS/COLD 110500000 7000000 HIGH WIND
## 81 FLOODING 108255006 8855500 FLOOD
Let’s finish the second question and make the resulting dataset.
dmg_result<-aggregate(cbind(dmg_sum$PROPDMG,dmg_sum$CROPDMG), by=list(dmg_sum$V4), FUN=sum)
dmg_result<-dmg_result[order(-(dmg_result[,2]+dmg_result[,3])),]
names(dmg_result)<-c("EVTYPE","PROPDMG","CROPDMG")
dmg_result$SUM.DMG <- dmg_result$PROPDMG + dmg_result$CROPDMG
# for neat comma-formatted numbers
library(scales)
comma(head(dmg_result,20))
## EVTYPE PROPDMG CROPDMG SUM.DMG
## 15 FLOOD 144,773,014,813 5,794,173,950 150,567,188,763
## 25 HURRICANE (TYPHOON) 85,256,411,010 5,506,117,800 90,762,528,810
## 40 TORNADO 56,942,036,579 414,963,070 57,356,999,649
## 37 STORM SURGE/TIDE 47,965,074,000 855,000 47,965,929,000
## 14 FLASH FLOOD 22,164,957,807 6,562,951,150 28,727,908,957
## 19 HAIL 15,738,269,598 3,026,794,473 18,765,064,071
## 9 DROUGHT 1,053,188,600 13,972,631,000 15,025,819,600
## 39 THUNDERSTORM WIND 9,762,981,078 1,225,463,988 10,988,445,066
## 26 ICE STORM 4,187,799,360 5,022,114,300 9,209,913,660
## 46 WILDFIRE 8,496,643,500 403,281,630 8,899,925,130
## 42 TROPICAL STORM 7,714,395,550 694,896,000 8,409,291,550
## 47 WINTER STORM 6,750,057,251 32,444,000 6,782,501,251
## 24 HIGH WIND 6,002,941,193 687,357,050 6,690,298,243
## 21 HEAVY RAIN 3,237,229,842 804,265,800 4,041,495,642
## 33 MARINE THUNDERSTORM WIND 2,813,902,400 48,750,000 2,862,652,400
## 17 FROST/FREEZE 19,200,000 1,910,431,000 1,929,631,000
## 13 EXTREME COLD/WIND CHILL 77,190,400 1,310,023,000 1,387,213,400
## 22 HEAVY SNOW 977,125,252 134,683,100 1,111,808,352
## 29 LIGHTNING 936,187,447 12,092,090 948,279,537
## 3 BLIZZARD 659,413,950 112,060,000 771,473,950
We see that in the economic sense the most severe event is FLOOD. Let’s draw the map again to understand which places are in the risk zone.
Now we will perform some data manipulation to get the necessary data. First, we subset the data to all event types containing FLOOD and aggregate it by county.
flood.county<-data3[grepl(".*FLOOD.*",data3$EVTYPE),]
flood.county<-aggregate(cbind(flood.county$PROPDMG,flood.county$CROPDMG), by=list(flood.county$COUNTY,flood.county$COUNTYNAME), FUN=sum)
names(flood.county)<-c("COUNTY","COUNTYNAME","PROPDMG","CROPDMG")
flood.county$COUNTYNAME <- droplevels(flood.county$COUNTYNAME)
flood.county$SUM.DMG <- flood.county$PROPDMG + flood.county$CROPDMG
flood.county$COUNTY<-flood.county$COUNTY+1000
Now we build quantile breaks from the SUM.DMG variable.
quant<-quantile(flood.county[, 5], probs=seq(0,1,1/7),names=F)
quant[1]<-0
# not strictly necessary here (the breaks are already distinct), but kept for safety
quant<-unique(quant)
We have 8 break points, giving 7 buckets. Now we distribute counties among the “buckets” and assign colors.
flood.county$colorBuckets <- as.numeric(cut(flood.county$SUM.DMG, quant))
table(flood.county$colorBuckets)
##
## 1 2 3 4 5 6 7
## 657 631 693 595 644 644 644
cnty.fips <- county.fips$fips[match(map("county", plot=FALSE)$names, county.fips$polyname)]
colorsmatched <- flood.county$colorBuckets[match(cnty.fips, flood.county$COUNTY)]
# as before, unmatched counties stay NA so they remain unfilled on the map
colors <- brewer.pal(length(quant)-1, "GnBu")
And here is the map.
map("county", col = colors[colorsmatched], fill = TRUE, resolution = 0, lty = 0, projection = "polyconic")
leg.txt <- array()
quant[1]<-0
quant[length(quant)]<-quant[length(quant)]+1
for(i in 1:(length(quant)-1))
{
if(quant[i+1]==quant[i]+1)
leg.txt[i]<-as.character(comma(quant[i]))
else
leg.txt[i]<-paste(comma(quant[i]),"-",comma(quant[i+1]-1),"USD")
}
title("Economic consequences of flood by U.S. Counties, 1996-2011")
legend("bottomleft", leg.txt, horiz = F, fill = colors, cex = 0.7, bg="transparent", bty="n")
Thus, we see the places that suffer most from floods. Here is the top list of counties to focus on when protecting property and crops from flood.
comma(head(flood.county[order(-flood.county$SUM.DMG), -c(1,6)], 10))
## COUNTYNAME PROPDMG CROPDMG SUM.DMG
## 2648 NAPA 115,116,210,000 66,900,000 115,183,110,000
## 9 ADAMS, CALHOUN AND JERSEY 5,000,000,000 5,000,000,000 10,000,000,000
## 2713 NDZ027 3,004,200,000 0 3,004,200,000
## 3685 SHELBY 2,081,050,520 0 2,081,050,520
## 773 DAVIDSON 1,525,390,500 5,000 1,525,395,500
## 1737 JEFFERSON 1,011,505,000 5,000 1,011,510,000
## 3937 TUNICA 1,000,331,000 0 1,000,331,000
## 1028 FLZ072 - 074 450,000,000 500,000,000 950,000,000
## 3721 SOMERSET 853,000,000 0 853,000,000
## 316 BROOME 813,296,500 0 813,296,500
In this research we studied the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Our goal was to extract information helpful for preparing for, and responding to, the most severe weather events.
head(harm_result)
## EVTYPE FATALITIES INJURIES
## 38 TORNADO 5633 91364
## 12 EXCESSIVE HEAT 2022 6753
## 20 HEAT 1118 2415
## 14 FLASH FLOOD 1041 1804
## 27 LIGHTNING 819 5233
## 37 THUNDERSTORM WIND 741 9510
head(tornado.county[order(-tornado.county$FATALITIES, -tornado.county$INJURIES), -c(1,5)])
## COUNTYNAME FATALITIES INJURIES
## 941 JASPER 163 1258
## 682 GENESEE 120 911
## 1264 MCLENNAN 114 610
## 953 JEFFERSON 107 1529
## 2142 WORCESTER 92 1253
## 423 COWLEY 77 290
comma(head(dmg_result))
## EVTYPE PROPDMG CROPDMG SUM.DMG
## 15 FLOOD 144,773,014,813 5,794,173,950 150,567,188,763
## 25 HURRICANE (TYPHOON) 85,256,411,010 5,506,117,800 90,762,528,810
## 40 TORNADO 56,942,036,579 414,963,070 57,356,999,649
## 37 STORM SURGE/TIDE 47,965,074,000 855,000 47,965,929,000
## 14 FLASH FLOOD 22,164,957,807 6,562,951,150 28,727,908,957
## 19 HAIL 15,738,269,598 3,026,794,473 18,765,064,071
comma(head(flood.county[order(-flood.county$SUM.DMG), -c(1,6)]))
## COUNTYNAME PROPDMG CROPDMG SUM.DMG
## 2648 NAPA 115,116,210,000 66,900,000 115,183,110,000
## 9 ADAMS, CALHOUN AND JERSEY 5,000,000,000 5,000,000,000 10,000,000,000
## 2713 NDZ027 3,004,200,000 0 3,004,200,000
## 3685 SHELBY 2,081,050,520 0 2,081,050,520
## 773 DAVIDSON 1,525,390,500 5,000 1,525,395,500
## 1737 JEFFERSON 1,011,505,000 5,000 1,011,510,000
Since this analysis does not call for recommendations from us, we simply wish everyone strong health and great weather!