This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify the severe weather events with the greatest economic consequences and those most harmful to human health. Answering these questions requires some data processing: weather events are grouped into categories, and the results are reported per category.
Let’s start by loading the required packages, reading the data from the working directory, and identifying the key variables for this project.
require(knitr)
## Loading required package: knitr
raw_data<-read.csv("repdata_data_StormData.csv.bz2")
str(raw_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Key variables for harm to human health:
EVTYPE: Type of weather event
FATALITIES: Number of deaths
INJURIES: Number of people injured
Key variables for economic consequences:
EVTYPE: Type of weather event
PROPDMG: Estimated property damage
CROPDMG: Estimated crop damage
I will create a new data set called data containing only those variables:
to_use <- c(grep("EVTYPE",names(raw_data)),grep("FATALITIES",names(raw_data)),grep("INJURIES",names(raw_data)),grep("PROPDMG",names(raw_data)),grep("CROPDMG",names(raw_data)))
data <- raw_data[,to_use]
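As a side note, `grep()` matches substrings, so the pattern "PROPDMG" also matches a column named `PROPDMGEXP` (and likewise for "CROPDMG"). A minimal sketch on a toy data frame illustrates the difference between substring selection and selection by exact name. The `toy` object and its values are made up for illustration and are not part of the analysis.

```r
# Toy data frame reusing a few column names from the storm data (values are made up)
toy <- data.frame(EVTYPE = "TORNADO", PROPDMG = 25, PROPDMGEXP = "K",
                  CROPDMG = 0, CROPDMGEXP = "")

# Substring match: returns both PROPDMG and PROPDMGEXP
by_pattern <- names(toy)[grep("PROPDMG", names(toy))]

# Exact-name selection: keeps only the columns listed explicitly
by_name <- names(toy[, c("EVTYPE", "PROPDMG", "CROPDMG")])

print(by_pattern)
print(by_name)
```

Selecting by exact name is more predictable when only the listed variables are wanted.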
Exploring the event types shows that many are duplicated because of inconsistent upper- and lower-case spelling. The number of unique event types is:
print(length(unique(raw_data$EVTYPE)))
## [1] 985
To solve that problem, I will use the function tolower() to convert all event types to lower case.
data$EVTYPE <- tolower(data$EVTYPE)
After the conversion, the number of unique event types in the edited data set is:
print(length(unique(data$EVTYPE)))
## [1] 898
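Case is not the only source of duplicates: the `str()` output above shows at least one level with a leading space (" HIGH SURF ADVISORY"). The sketch below, on a small made-up vector, shows how `trimws()` could be combined with `tolower()`. Apart from that first string, the values in `ev` are hypothetical examples, and this step is not part of the original analysis.

```r
# Small illustrative vector; only the first spelling is taken from the str() output
ev <- c(" HIGH SURF ADVISORY", "High Surf Advisory", "high surf advisory", "TSTM WIND")

# Lower-casing alone leaves the leading-space variant as a separate value
lower_only <- tolower(ev)

# Lower-casing plus trimming surrounding whitespace merges it with the others
lower_trim <- trimws(tolower(ev))

n_lower <- length(unique(lower_only))
n_trim <- length(unique(lower_trim))
print(c(lower_only = n_lower, lower_trim = n_trim))
```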
This number is still very high.
I am going to define categories so that the weather events can be grouped into 8 groups:
1. Water: damage caused mainly by water, such as precipitation, flooding, etc.
2. Wind: damage caused mainly by moving air, such as hurricanes, wind, blizzards, etc.
3. Storm: damage caused mainly by storms, lightning, etc.
4. Snow: damage caused mainly by snow, avalanches, etc.
5. Fire: damage caused mainly by fire, smoke, etc.
6. LowTemp: damage caused mainly by low temperatures, frost, hypothermia, etc.
7. HighTemp: damage caused mainly by record high temperatures, heat, etc.
8. Other: all remaining events not covered by the groups above
The next step is to create a new variable called Damage_g in data that labels each event with one of these groups.
#One keyword vector is created per group; each holds the keywords to look up in EVTYPE
Wind_damage <- c("tornado","tornados", "wind","hurricane","winds","blizzard","severe turbulence","blowing dust","microburst","beach erosin",
"blow-out tide","wnd","torndao","typhoon")
Water_damage <- c("hail", "rain","rains","funnel cloud","funnel","cloud","clouds","record rainfall","precipitation","water",
"rip current", "flooding","flood","floods","waterspout","high tides","surf","water","marine mishap","high seas",
"urban/small","unseasonably wet","wayterspout" ,"stream","excessive wetness","waves","wet","flash floooding",
"heavy seas","showers","small stream","marine accident","precip","coastal erosion","tsunami","astronomical low tide",
"coastal surge","heavy shower","beach erosion","remnants of floyd","rough seas","rogue wave","drowning")
Storm_damage <- c("storm","lighting","thunderstorm","thunderstorms","tstm","rainstorm","ligntning","urban/small strm fldg","northern lights")
LowTemp_damage <- c("ice","cold", "cool","freeze","record low","low temperature record","freezing drizzle","glaze","freezing",
"wintry mix","winter weather","frost","icy","hypothermia","low temperature","frost","hypothermia/exposure","low temp",
"hyperthermia/exposure")
HighTemp_damage <- c("heat","high temperature","record high","record high temperatures","record warmth","dry","warm","drought","hot")
Fire_damage <- c("fire","fires","smoke")
Snow_damage <- c("avalanche","snow","record snowfall","sleet","snowpack","avalance")
Other_damage <- c("apache county","dust devil","high","mudslides","urban and small","downburst","gustnado and","mud slide","mud slides","other",
"record temperatures","southeast","mud/rock slide","landslide" ,"landslides","excessive","saharan dust","mild pattern",
"swells","urban small","temperature record","landslump","record temperature","summary","no severe weather","dam break",
"none","landspout","winter weather mix", "rock slide","dam break","gustnado","mudslide","heavy mix","dam failure",
"?","winter mix","wintery mix","seiche","monthly temperature","vog","tropical depression","driest month","red flag criteria",
"dust devel","volcanic ash","volcanic ashfall","volcanic","volcanic ash plume","volcanic eruption","fog")
damage <- list("Other" = Other_damage,"Wind"=Wind_damage,"Water" = Water_damage,"Storm" = Storm_damage,"LowTemp" = LowTemp_damage,
"HighTemp" = HighTemp_damage,"Fire" = Fire_damage,"Snow" = Snow_damage)
#damage is a list containing the keywords for all the groups. Each element of the list is a group
data$Damage_g <- rep("",dim(data)[1]) # Define the new variable with the groups
for(i in seq_along(damage)){ #This loop fills Damage_g with the matching group; groups later in the list overwrite earlier matches
for(j in seq_along(damage[[i]])){
data$Damage_g[grep(damage[[i]][j],data$EVTYPE)] <- names(damage)[[i]]
}
}
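Because the loop assigns groups one after another, an event that matches keywords from several groups ends up in the group that comes last in the list. The sketch below illustrates this on a toy vector. It swaps the inner loop for a single alternation pattern per group built with `paste(..., collapse = "|")`, which is an alternative formulation rather than the code used above. The `groups` list and `ev` vector are made up for illustration.

```r
# Hypothetical, much-reduced keyword lists; order matters (later groups win)
groups <- list(Wind = c("wind"), Storm = c("tstm", "storm"))

# Made-up event types, already lower-cased
ev <- c("high wind", "tstm wind", "winter storm", "heat")

grp <- rep("", length(ev))
for (g in names(groups)) {
  # One regular expression per group, e.g. "tstm|storm"
  pattern <- paste(groups[[g]], collapse = "|")
  grp[grepl(pattern, ev)] <- g
}

print(data.frame(ev, grp))
```

Here "tstm wind" matches both groups and is labelled Storm because Storm comes after Wind, while "heat" matches nothing and keeps the empty label. Keywords containing regular-expression metacharacters would need escaping before being combined this way.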
Now, let’s calculate the total number of fatalities and injuries per event group. The result will be stored in a data frame called health:
fatal <- aggregate(FATALITIES ~ Damage_g, data= data, sum)
injur <- aggregate(INJURIES ~ Damage_g, data = data, sum)
health <- merge(fatal, injur, "Damage_g")
Next, health is ordered by the total number of fatalities per event group:
health <- health[order(health$FATALITIES, decreasing = TRUE),]
print(health)
## Damage_g FATALITIES INJURIES
## 8 Wind 6333 95397
## 2 HighTemp 3181 9276
## 7 Water 2478 11359
## 6 Storm 1088 11765
## 4 Other 944 6410
## 3 LowTemp 635 3377
## 5 Snow 396 1336
## 1 Fire 90 1608
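The two `aggregate()` calls followed by `merge()` could also be written as a single `aggregate()` call with `cbind()` on the left-hand side of the formula. This is a minimal sketch on made-up numbers; the `toy` and `health_toy` objects are hypothetical and only show the shape of the call.

```r
# Made-up events with the same column names as the analysis data
toy <- data.frame(Damage_g = c("Wind", "Wind", "Water", "Fire"),
                  FATALITIES = c(1, 2, 0, 1),
                  INJURIES = c(10, 5, 3, 0))

# Sum both outcome columns per group in one step
health_toy <- aggregate(cbind(FATALITIES, INJURIES) ~ Damage_g, data = toy, FUN = sum)

# Order by fatalities, largest first
health_toy <- health_toy[order(health_toy$FATALITIES, decreasing = TRUE), ]
print(health_toy)
```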
To calculate the total property damage and the total crop damage, we follow a process analogous to the one described above:
prop <- aggregate(PROPDMG ~ Damage_g, data = data, sum)
crop <- aggregate(CROPDMG ~ Damage_g, data= data, sum)
economic <- merge(prop, crop, "Damage_g")
economic$Total <- economic$PROPDMG + economic$CROPDMG #A new "Total" variable is created so the data frame can be ordered by it
economic <- economic[order(economic$Total, decreasing = TRUE),]
print(economic)
## Damage_g PROPDMG CROPDMG Total
## 7 Water 3226169.69 961333.34 4187503.03
## 8 Wind 3717726.33 133088.72 3850815.05
## 6 Storm 2893713.27 211303.49 3105016.76
## 4 Other 646000.41 4653.56 650653.97
## 5 Snow 154726.53 2195.72 156922.25
## 1 Fire 125323.29 9565.74 134889.03
## 3 LowTemp 111770.98 20289.95 132060.93
## 2 HighTemp 9069.51 35396.80 44466.31
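Note that these totals add up the raw PROPDMG and CROPDMG values. The data set also contains the PROPDMGEXP and CROPDMGEXP columns, which are commonly read as magnitude codes. The sketch below shows one way such codes might be applied before summing. The mapping (K = thousands, M = millions, B = billions, anything else left unscaled) is an assumption that should be checked against the NOAA documentation, and `to_dollars()` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: scale a damage value by its magnitude code.
# ASSUMPTION: K = 1e3, M = 1e6, B = 1e9; unknown or empty codes leave the value unchanged.
to_dollars <- function(value, exp_code) {
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(exp_code))
  m <- unname(multiplier[code])  # NA for codes that are not in the lookup
  m[is.na(m)] <- 1
  value * m
}

# Made-up rows for illustration
toy <- data.frame(PROPDMG = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "M", "B", ""))
toy$PROPDMG_USD <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy)
```

Applying such a conversion could change the ranking of the groups, so the figures in this report should be read as sums of the unscaled values.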
The answers to the proposed questions are detailed below:
kable(health, row.names = FALSE)
| Damage_g | FATALITIES | INJURIES |
|---|---|---|
| Wind | 6333 | 95397 |
| HighTemp | 3181 | 9276 |
| Water | 2478 | 11359 |
| Storm | 1088 | 11765 |
| Other | 944 | 6410 |
| LowTemp | 635 | 3377 |
| Snow | 396 | 1336 |
| Fire | 90 | 1608 |
barplot(t(health[,2:3]), names.arg = health$Damage_g, legend = c("FATALITIES", "INJURIES"),
main = "Total fatalities and injuries by groups of events", col = c("red","yellow"),
las=2)
The results are clear-cut: Wind events cause by far the most harm to human health across the U.S. since 1950, with 6,333 fatalities and 95,397 injuries.
Taken separately, Wind events are both the leading cause of death and the leading cause of injury among severe weather events.
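To put the totals in proportion, the share of each group can be computed from the table above. The sketch below re-enters the printed values by hand, so it runs on its own. The column names `FATAL_SHARE` and `INJURY_SHARE` are introduced here for illustration only.

```r
# Values copied from the health table printed above
health <- data.frame(
  Damage_g = c("Wind", "HighTemp", "Water", "Storm", "Other", "LowTemp", "Snow", "Fire"),
  FATALITIES = c(6333, 3181, 2478, 1088, 944, 635, 396, 90),
  INJURIES = c(95397, 9276, 11359, 11765, 6410, 3377, 1336, 1608)
)

# Percentage of all fatalities and injuries attributed to each group
health$FATAL_SHARE <- round(100 * health$FATALITIES / sum(health$FATALITIES), 1)
health$INJURY_SHARE <- round(100 * health$INJURIES / sum(health$INJURIES), 1)
print(health)
```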
kable(economic, row.names = FALSE)
| Damage_g | PROPDMG | CROPDMG | Total |
|---|---|---|---|
| Water | 3226169.69 | 961333.34 | 4187503.03 |
| Wind | 3717726.33 | 133088.72 | 3850815.05 |
| Storm | 2893713.27 | 211303.49 | 3105016.76 |
| Other | 646000.41 | 4653.56 | 650653.97 |
| Snow | 154726.53 | 2195.72 | 156922.25 |
| Fire | 125323.29 | 9565.74 | 134889.03 |
| LowTemp | 111770.98 | 20289.95 | 132060.93 |
| HighTemp | 9069.51 | 35396.80 | 44466.31 |
barplot(t(economic[,2:3]), names.arg = economic$Damage_g, legend = c("PROPERTY", "CROP"),
main = "Total property and crop damage by groups of events", col = c("blue","green"),
las=2)
Aggregating crop and property damage shows that Water, Wind and Storm are the weather event groups with the greatest economic consequences across the U.S. since 1950. Wind events are the main cause of property damage and Water events the main cause of crop damage.