In this report, we aim to identify which types of severe weather events are most harmful to population health, and which have the greatest economic consequences, across the United States from 1950 to November 2011. To investigate these questions, we obtained and explored the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storm and weather events in the United States and covers the period from 1950 to November 2011. From these data, we found that tornados are the most harmful weather events in terms of population health, and that floods, hurricanes/typhoons, and tornados had the greatest economic consequences.
From the NOAA website, we obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storm and weather events in the United States from 1950 to November 2011.
The data come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. We first download the file from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and then read it into R. The read.csv function can read the csv.bz2 file directly, without uncompressing it first, so that is what we do here.
data <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE)
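The download step described above is not shown in the code. A minimal sketch of it, under stated assumptions, could look like the following: the helper name fetch_if_missing is hypothetical, and the toy round trip only illustrates that read.csv reads a bzip2-compressed file without a separate decompression step.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when no local copy exists, and return the local path.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")  # binary mode for .bz2
  }
  dest
}

# Toy round trip: write a small CSV through a bzip2 connection ...
toy_path <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)
con <- bzfile(toy_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# ... and read it back directly; no download happens because the file exists.
roundtrip <- read.csv(fetch_if_missing("unused-url", toy_path),
                      stringsAsFactors = FALSE)
```

For the real data, the same helper would be called with the URL given above and "repdata-data-StormData.csv.bz2" as the destination.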
Let’s have a look at the dimensions of the dataset.
dim(data) # 902297 * 37
## [1] 902297 37
So the dataset has 902,297 rows and 37 variables.
We can look at the general structure of the data to see what those variables are and what class each one has.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We can also have a quick look at the first rows in this dataset.
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
First, we check whether there are any missing values. For this purpose we use the summary function, which reports the number of NA values for each numeric or logical variable (it does not report them for character variables).
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI
## Min. : 0.000 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character
## Median : 0.000 Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_DATE END_TIME COUNTY_END COUNTYENDN
## Length:902297 Length:902297 Min. :0 Mode:logical
## Class :character Class :character 1st Qu.:0 NA's:902297
## Mode :character Mode :character Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
## END_RANGE END_AZI END_LOCATI
## Min. : 0.0000 Length:902297 Length:902297
## 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Mode :character Mode :character
## Mean : 0.9862
## 3rd Qu.: 0.0000
## Max. :925.0000
##
## LENGTH WIDTH F MAG
## Min. : 0.0000 Min. : 0.000 Min. :0.0 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.:0.0 1st Qu.: 0.0
## Median : 0.0000 Median : 0.000 Median :1.0 Median : 50.0
## Mean : 0.2301 Mean : 7.503 Mean :0.9 Mean : 46.9
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.:1.0 3rd Qu.: 75.0
## Max. :2315.0000 Max. :4400.000 Max. :5.0 Max. :22000.0
## NA's :843563
## FATALITIES INJURIES PROPDMG
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
##
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
Some variables have missing values (for example, COUNTYENDN is entirely NA and LATITUDE has 47 NA values). For the purpose of our study, we are going to use the following variables: STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
Since none of those variables has any NA value in the dataset, we do not need any transformation to handle missing values. We do, however, create a subset of the initial dataset containing only those variables.
data_new <- data[,c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
We check the dimensions of this new dataset.
dim(data_new) # 902297 * 8
## [1] 902297 8
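The claim that the selected variables contain no missing values can be checked more directly than by reading the summary output. The sketch below uses a small made-up data frame (toy is not the storm data) and counts NA values per column. It also counts empty strings separately, since summary does not flag them and the exponent columns contain many.

```r
# Made-up stand-in for data_new; values are illustrative only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, 1, NA),
  PROPDMG    = c(25, 2.5, 0),
  PROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))

# Number of empty strings in each column (NA comparisons are dropped).
blank_counts <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))
```

On the real data, colSums(is.na(data_new)) should return zero for every column if the statement above holds.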
The type of event is given by the variable EVTYPE. The harmfulness of an event is given by two variables: FATALITIES and INJURIES.
We first add the INJURIES and FATALITIES variables together, then aggregate this sum by event type.
data_new$harm <- data_new$INJURIES + data_new$FATALITIES
harm <- aggregate(harm ~ EVTYPE, data= data_new, sum)
harm <- harm[order(harm$harm, decreasing = TRUE),]
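To make the aggregation step concrete, here is a small sketch on made-up rows (the numbers are not from the storm data). It also shows one optional refinement that the analysis above does not apply: normalizing case and surrounding whitespace in EVTYPE before grouping, which matters only if the labels turn out to contain such variants.

```r
# Made-up events; "Tornado " differs from "TORNADO" only by case and a space.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "Tornado ", "FLOOD", "FLOOD"),
  FATALITIES = c(1, 2, 0, 1),
  INJURIES   = c(10, 5, 3, 0),
  stringsAsFactors = FALSE
)
toy$harm <- toy$INJURIES + toy$FATALITIES

# Same steps as above: sum by event type, then sort in decreasing order.
raw <- aggregate(harm ~ EVTYPE, data = toy, sum)
raw <- raw[order(raw$harm, decreasing = TRUE), ]

# Optional refinement (an assumption, not the original method):
# trim whitespace and upper-case the labels before grouping.
toy$EVTYPE_clean <- toupper(trimws(toy$EVTYPE))
clean <- aggregate(harm ~ EVTYPE_clean, data = toy, sum)
clean <- clean[order(clean$harm, decreasing = TRUE), ]
```

In this toy case the raw grouping yields three event types, while the normalized grouping merges the two tornado spellings into one.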
The economic consequences can be assessed with the variables PROPDMG and CROPDMG, which are numerical values. These values must be scaled by the magnitude codes in PROPDMGEXP and CROPDMGEXP, respectively. Let’s have a look at those variables.
unique(data_new$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(data_new$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
These characters signify the magnitude of the number: “K” for thousands, “M” for millions, and “B” for billions, so 1.55 with “B” stands for $1,550,000,000. We can observe that the coding is inconsistent: for example, there is both an “m” and an “M”, which mean the same thing. We first convert the characters to upper case to clean them up.
data_new$PROPDMGEXP <- toupper(data_new$PROPDMGEXP)
unique(data_new$PROPDMGEXP)
## [1] "K" "M" "" "B" "+" "0" "5" "6" "?" "4" "2" "3" "H" "7" "-" "1" "8"
data_new$CROPDMGEXP <- toupper(data_new$CROPDMGEXP)
unique(data_new$CROPDMGEXP)
## [1] "" "M" "K" "B" "?" "0" "2"
We know from the documentation what the letters in PROPDMGEXP and CROPDMGEXP mean, but we have no information about the digits in those variables, which presumably come from an older coding scheme that may no longer be in use. Using the table function, we can see that the digits are far less frequent in the dataset than the letters.
table(data_new$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      H      K      M
##      4      5      1     40      7 424665  11337
table(data_new$CROPDMGEXP)
##
##             ?      0      2      B      K      M
## 618413      7     19      1      9 281853   1995
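So we do not scale the values carrying those digit codes (or the symbols “-”, “?”, and “+”): we apply a multiplier only for the letters, whose meaning we are sure of, and leave every other value as recorded (a multiplier of 1).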
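The counts printed by table above make it possible to quantify how rare the undocumented codes are. The following arithmetic sketch re-enters the PROPDMGEXP counts by hand (copied from the output above, with the first count belonging to the empty string) and computes the share of records whose code is a digit or a symbol.

```r
# Counts copied from the table(data_new$PROPDMGEXP) output above.
prop_counts <- c(blank = 465934, "-" = 1, "?" = 8, "+" = 5,
                 "0" = 216, "1" = 25, "2" = 13, "3" = 4, "4" = 4, "5" = 28,
                 "6" = 4, "7" = 5, "8" = 1,
                 B = 40, H = 7, K = 424665, M = 11337)

total <- sum(prop_counts)  # should equal the number of rows in the dataset

# Everything that is neither blank nor one of the letters B, H, K, M.
is_letter_or_blank <- names(prop_counts) %in% c("blank", "B", "H", "K", "M")
n_undocumented <- sum(prop_counts[!is_letter_or_blank])
share_percent <- 100 * n_undocumented / total
```

With these counts, 314 of 902,297 records (about 0.03%) carry a digit or symbol code, so leaving them unscaled affects few rows. This says nothing about the dollar amounts attached to those rows, which the sketch does not examine.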
First, we create a function to convert all the values, and then we create a new variable to store the converted values.
# Scale a damage figure by its magnitude code; any other code leaves it unchanged.
conv <- function(dmg, dmgexp){
  dmg * switch(dmgexp, H = 100, K = 1000, M = 10^6, B = 10^9, 1)
}
data_new$cProp <- mapply(conv, data_new$PROPDMG, data_new$PROPDMGEXP)
data_new$cCROP <- mapply(conv, data_new$CROPDMG, data_new$CROPDMGEXP)
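Calling conv once per row through mapply works, but on roughly 900,000 rows it can be slow. A vectorized lookup is one possible alternative; the sketch below is an assumption about how one might write it, not the method used in this report, and the names conv_vec and conv_ref are hypothetical. It is checked against the same switch logic on a few made-up values.

```r
# Multipliers for the letter codes; anything else maps to 1.
multipliers <- c(H = 100, K = 1000, M = 10^6, B = 10^9)

# Hypothetical vectorized counterpart of conv().
conv_vec <- function(dmg, dmgexp) {
  m <- unname(multipliers[dmgexp])  # NA for "", digits, and symbols
  m[is.na(m)] <- 1                  # same fallback as the switch default
  dmg * m
}

# Reference implementation, using the same switch expression as conv().
conv_ref <- function(dmg, dmgexp) {
  dmg * switch(dmgexp, H = 100, K = 1000, M = 10^6, B = 10^9, 1)
}

# Made-up inputs covering letters, a blank, and a digit code.
dmg    <- c(25, 2.5, 1.55, 3, 7, 4)
dmgexp <- c("K", "M", "B", "", "5", "H")

vec_result <- conv_vec(dmg, dmgexp)
ref_result <- mapply(conv_ref, dmg, dmgexp)
```

The lookup relies on the codes already being upper case, as they are after the toupper step above.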
Now we aggregate the dataset to prepare for plotting. First, we add the property and crop damages together.
data_new$cost <- data_new$cProp + data_new$cCROP
eco <- aggregate(cost ~ EVTYPE, data= data_new, sum)
eco <- eco[order(eco$cost, decreasing = TRUE),]
library(ggplot2)
harm15 <- harm[1:15,]
g <- ggplot(data = harm15, aes(EVTYPE, harm, fill = harm))
g <- g + geom_bar(stat = "identity")
g <- g + xlab("Top 15 events")
g <- g + ylab("harmful measurement (Fatalities + Injuries)")
g <- g + ggtitle("15 most harmful events \n from 1950 to November 2011")
g <- g + coord_flip()
g
As the plot above shows, tornados have the greatest impact on population health. Other categories also have a significant impact, such as thunderstorm winds, lightning, floods, and excessive heat.
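One detail worth noting: when EVTYPE is passed to ggplot as a character column, the bars are arranged alphabetically rather than by size. A possible refinement, offered as a sketch rather than as what was done here, is to turn EVTYPE into a factor ordered by the plotted value with reorder from base R. The data frame below is made up for illustration.

```r
# Made-up totals; not taken from the storm data.
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "HEAT"),
                  harm   = c(10, 100, 50),
                  stringsAsFactors = FALSE)

# reorder() returns a factor whose levels are sorted by increasing harm.
toy$EVTYPE <- reorder(toy$EVTYPE, toy$harm)
bar_order <- levels(toy$EVTYPE)

# If ggplot2 is installed, the same plotting calls then draw sorted bars;
# with coord_flip() the largest value ends up at the top.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  g_sorted <- ggplot2::ggplot(toy, ggplot2::aes(EVTYPE, harm, fill = harm)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::coord_flip()
}
```

The same reorder call could be applied to harm15 and eco15 before plotting, using harm and cost respectively as the ordering value.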
eco15 <- eco[1:15,]
g <- ggplot(data = eco15, aes(EVTYPE, cost, fill = cost))
g <- g + geom_bar(stat = "identity")
g <- g + xlab("Top 15 events")
g <- g + ylab("economic consequence measurement in $\n (Property damages + Crop damages)")
g <- g + ggtitle("15 events with greatest economic consequences \n from 1950 to November 2011")
g <- g + coord_flip()
g
As the plot above shows, floods are the most costly category of weather events. Hurricanes/typhoons, storm surges, and tornados are also very costly.
This project was conducted with the following tools and systems: