This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States between 1950 and 2011, including when and where they occurred, as well as estimates of any fatalities, injuries, and property damage.
In summary, we look at how these events affect communities from two points of view:
Which types of events are most harmful with respect to population health.
Which types of events have the greatest economic consequences.
First of all, let’s load the database into a data frame. We download the file from the URL if it doesn’t already exist in the working directory.
if(!file.exists("StormData.csv.bz2"))
{
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "./StormData.csv.bz2" )
}
DataRaw <- read.csv("StormData.csv.bz2")
For this section, we need to load the following R package:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Now we can inspect the content of the data frame to identify the variables we need for the analysis.
str(DataRaw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels ""," "," "," ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The data frame has many variables, but to answer the two main questions (see the synopsis) we need only the following: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. We therefore create a new data frame with only these variables and remove the initial data frame to free memory.
DataInterest <- select(DataRaw, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
rm(DataRaw)
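As a possible alternative to loading all 37 columns and dropping most of them afterwards, `read.csv` can skip columns at read time when their entry in `colClasses` is `"NULL"`. The sketch below illustrates this on a small temporary file only. The file, its contents, and the names `keep`, `header`, `classes` and `small` are made up for illustration and are not part of the original analysis.

```r
# Toy CSV standing in for the storm file (hypothetical contents).
tmp <- tempfile(fileext = ".csv")
writeLines(c("STATE,EVTYPE,FATALITIES,REMARKS",
             "AL,TORNADO,0,none",
             "AL,FLOOD,1,none"), tmp)

# Columns to keep (assumed names, matching the toy header).
keep <- c("EVTYPE", "FATALITIES")

# Read a single row just to learn the column names.
header <- names(read.csv(tmp, nrows = 1))

# "NULL" tells read.csv to skip a column; NA lets it guess the type.
classes <- ifelse(header %in% keep, NA, "NULL")

small <- read.csv(tmp, colClasses = classes)
names(small)
```

With the real file this would still decompress and scan every line, so any benefit would mainly be in memory use rather than in reading time.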
The new data frame DataInterest now contains only the variables of interest for our analysis.
str(DataInterest)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
We will group the events by type and sum the fatalities and the injuries separately. The results will be saved in two new data frames.
# summarise total fatalities per event type, sorted in decreasing order, in a new data frame
EventFatalities <- DataInterest %>% select(EVTYPE, FATALITIES) %>%
group_by(EVTYPE) %>% summarise(Total_Fatalities=sum(FATALITIES)) %>% arrange(-Total_Fatalities)
# summarise total injuries per event type, sorted in decreasing order, in a new data frame
EventInjuries <- DataInterest %>% select(EVTYPE, INJURIES) %>%
group_by(EVTYPE) %>% summarise(Total_Injuries=sum(INJURIES)) %>% arrange(-Total_Injuries)
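To make the grouping step concrete, the following sketch applies the same sum-by-event-type logic to a tiny made-up data frame, using base R's `aggregate` instead of dplyr so that it runs without extra packages. The data frame `toy` and its values are invented for illustration; only the column names follow the analysis above.

```r
# Invented records: two tornado rows, one flood row, one heat row.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(3, 1, 2, 4),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type (one row per type).
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort so that the deadliest event type comes first.
totals <- totals[order(-totals$FATALITIES), ]
totals
```

The two tornado rows collapse into a single row with 5 fatalities, which is what `group_by` followed by `summarise` does for each of the 985 event types in the real data.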
Now we can see the top ten event types by fatalities and by injuries between 1950 and 2011.
head(EventFatalities, 10) #show top ten events fatalities
## # A tibble: 10 x 2
## EVTYPE Total_Fatalities
## <fct> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
head(EventInjuries, 10) #show top ten events injuries
## # A tibble: 10 x 2
## EVTYPE Total_Injuries
## <fct> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
To get the dollar amounts of the property and crop damages, we need to extract the numeric exponent from the PROPDMGEXP and CROPDMGEXP variables and create two new columns with this information. But first, we will inspect the unique values of these variables:
unique(DataInterest$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(DataInterest$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
We can see different values (digits, letters and symbols). We need to transform this information into a numeric exponent to get the final dollar amount. Digits need no transformation, but the letters (after conversion to upper case) and the undefined symbols are mapped as follows:
"", "-", "+", "?" -> 0
H -> 2
K -> 3
M -> 6
B -> 9
The following R statements do this for us:
# convert characters to upper case
DataInterest$PROPDMGEXP <- toupper(DataInterest$PROPDMGEXP)
DataInterest$CROPDMGEXP <- toupper(DataInterest$CROPDMGEXP)
# convert the undefined symbols to 0
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("", "-", "+", "?", "0")] <- "0"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("", "-", "+", "?", "0")] <- "0"
# convert the character 'H' to '2'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("H")] <- "2"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("H")] <- "2"
# convert the character 'K' to '3'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("K")] <- "3"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("K")] <- "3"
# convert the character 'M' to '6'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("M")] <- "6"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("M")] <- "6"
# convert the character 'B' to '9'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("B")] <- "9"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("B")] <- "9"
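The repeated assignments above can also be expressed as a single lookup. The sketch below is one possible alternative, not the code used in this report: the names `exp_lookup` and `to_exponent` are hypothetical, and it assumes the same mapping as above (undefined symbols to 0, H to 2, K to 3, M to 6, B to 9, digits kept as they are).

```r
# Letter codes and the power of ten each one stands for (assumed mapping).
exp_lookup <- c("H" = 2, "K" = 3, "M" = 6, "B" = 9)

to_exponent <- function(code) {
  code <- toupper(as.character(code))
  # Start from 0, so "", "-", "+", "?" and any unknown code give exponent 0.
  out <- rep(0, length(code))
  # Single digits are already exponents.
  is_digit <- grepl("^[0-9]$", code)
  out[is_digit] <- as.numeric(code[is_digit])
  # Letters are translated through the lookup vector.
  is_letter <- code %in% names(exp_lookup)
  out[is_letter] <- exp_lookup[code[is_letter]]
  out
}

to_exponent(c("K", "m", "", "?", "5", "B"))
```

One difference from the step-by-step version: any code outside the mapping falls back to 0 here, whereas it would become `NA` in a later `as.numeric` call.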
Now, we will add two new variables with the dollar amounts:
DataInterest$PROPDMGDOLLARS <- DataInterest$PROPDMG*10^as.numeric(DataInterest$PROPDMGEXP)
DataInterest$CROPDMGDOLLARS <- DataInterest$CROPDMG*10^as.numeric(DataInterest$CROPDMGEXP)
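As a quick check of this formula, the first two property records in the `str` output above have PROPDMG values 25 and 2.5 with the exponent code K, which becomes "3" after the conversion. The sketch below applies the same expression to those two values only; the vector names are chosen for illustration.

```r
# First two PROPDMG values from the str() output.
propdmg    <- c(25, 2.5)
# Their exponent code "K", already converted to "3".
propdmgexp <- c("3", "3")

# Same expression as in the analysis: value times ten to the exponent.
dollars <- propdmg * 10^as.numeric(propdmgexp)
dollars
```

The results, 25000 and 2500 dollars, are the values that end up in PROPDMGDOLLARS for these two rows.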
We will group the events by type and sum the property and the crop damages separately. The results will be saved in two new data frames: one for property damage, named EventPropertyDamage, and another for crop damage, named EventCropDamage.
# summarise property damage
EventPropertyDamage <- DataInterest %>% select(EVTYPE, PROPDMGDOLLARS) %>%
group_by(EVTYPE) %>% summarise(Total_Property_Damages=sum(PROPDMGDOLLARS)) %>% arrange(-Total_Property_Damages)
# summarise for crop damage
EventCropDamage <- DataInterest %>% select(EVTYPE, CROPDMGDOLLARS) %>%
group_by(EVTYPE) %>% summarise(Total_Crop_Damages=sum(CROPDMGDOLLARS)) %>% arrange(-Total_Crop_Damages)
Finally, we can inspect the top ten event types by property damage and by crop damage in these two tables:
head(EventPropertyDamage, 10)
## # A tibble: 10 x 2
## EVTYPE Total_Property_Damages
## <fct> <dbl>
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56947380676
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16822673978
## 6 HAIL 15735267513
## 7 HURRICANE 11868319010
## 8 TROPICAL STORM 7703890550
## 9 WINTER STORM 6688497251
## 10 HIGH WIND 5270046295
head(EventCropDamage, 10)
## # A tibble: 10 x 2
## EVTYPE Total_Crop_Damages
## <fct> <dbl>
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954473
## 6 HURRICANE 2741910000
## 7 HURRICANE/TYPHOON 2607872800
## 8 FLASH FLOOD 1421317100
## 9 EXTREME COLD 1292973000
## 10 FROST/FREEZE 1094086000
Figure 1 shows bar plots of the top ten event types by fatalities and by injuries:
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.6)
barplot(EventFatalities$Total_Fatalities[1:10], names.arg = EventFatalities$EVTYPE[1:10], las=3, col="red", main="Top Ten Fatalities Events", ylab="Number of Fatalities")
barplot(EventInjuries$Total_Injuries[1:10], names.arg = EventInjuries$EVTYPE[1:10], las=3, col="yellow", main="Top Ten Injuries Events", ylab="Number of Injuries")
Figure 1
Figure 2 shows the economic impact of these events:
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.6)
barplot(EventPropertyDamage$Total_Property_Damages[1:10], names.arg = EventPropertyDamage$EVTYPE[1:10], col = "green", main = "Top Ten Economic Impact in Properties", ylab="Dollars", las=3)
barplot(EventCropDamage$Total_Crop_Damages[1:10], names.arg = EventCropDamage$EVTYPE[1:10], col = "green", main = "Top Ten Economic Impact in Crops", ylab="Dollars", las=3)
Figure 2
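Because the totals run to eleven or twelve digits, the dollar axis can be hard to read. A minimal sketch of one option, rescaling to billions of dollars before plotting, is shown below. It uses the top three property damage totals from the table above; the names `top_property` and `in_billions` are chosen for illustration.

```r
# Top three property damage totals, copied from the table above.
top_property <- c("FLOOD"             = 144657709807,
                  "HURRICANE/TYPHOON" = 69305840000,
                  "TORNADO"           = 56947380676)

# Express the totals in billions of dollars, rounded to one decimal.
in_billions <- round(top_property / 1e9, 1)
in_billions
```

A vector rescaled this way could be passed to `barplot` with `ylab = "Billions of dollars"` in place of the raw totals.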