The exploration of the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database was conducted in less than a week, and the most striking lesson was how much of that time was spent tidying the data. A thorough analysis would require at least another week of data cleansing.
However, this is an academic exercise, and despite the limited time and data quality, some interesting facts emerged. The frequency and severity of relevant weather events clearly differ across the United States and between regions, although data quality and time limited how far that regional analysis could be taken. If this study were to continue, it would be very helpful to develop a data dictionary defining each data element and to standardize the reference values (especially the Event column). Data issues aside, the sections below describe some interesting information on the impact of weather events across the nation.
This data certainly reinforces the idea that Mother Nature is extremely powerful. Probably the most surprising finding was how many lives have been lost to heat-related weather incidents. Heat seems to be the easiest of these phenomena to counter: making fans, portable air-conditioning units, and public shelters available is a low-cost way to prevent loss of life.
That said, severe weather events cause a significant amount of damage across the United States, resulting in the loss of life and property. The more we know about the frequency and likelihood of these severe events, the better prepared we can be to protect both people and property. Observations begin in 1950 and end in November 2011; as you might expect, the completeness of the data improves for more recent events.
The first step in the process is to load the data and study it for problems such as missing data elements and invalid values (for example, times that are out of range, or alphabetic characters in numeric fields). One thing I immediately noticed is that there are no field definitions, as one would normally expect. Also, the Event field is not standardized, so events like Flood and Flash Flood are counted separately, as are Heat and Extreme Heat. Due to time, this latter problem was not fully addressed, but a standardized category for grouping event types would be extremely helpful.
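To illustrate the kind of grouping that would help, the sketch below collapses a few related EVTYPE spellings into broader buckets. The patterns and group names here are illustrative assumptions, not official NOAA categories, and this mapping is not applied in the analysis that follows.
# Illustrative only: collapse related EVTYPE spellings into broader groups.
# The regular expressions and group names are assumptions, not NOAA categories.
normalizeEvent <- function(ev) {
    ev <- toupper(ev)
    ev[grepl("FLOOD", ev)] <- "FLOOD"
    ev[grepl("HEAT", ev)] <- "HEAT"
    ev[grepl("TSTM|THUNDERSTORM", ev)] <- "THUNDERSTORM WIND"
    ev
}
normalizeEvent(c("FLASH FLOOD", "EXTREME HEAT", "TSTM WIND"))
# -> "FLOOD" "HEAT" "THUNDERSTORM WIND"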
library(knitr)
##
## Attaching package: 'knitr'
##
## The following object is masked _by_ '.GlobalEnv':
##
## opts_chunk
opts_chunk$set(echo = TRUE, results = "show")
opts_chunk$fig.height=6
opts_chunk$fig.width=8
library(dplyr)
NOAA <- tbl_df(read.csv(bzfile("repdata-data-StormData.csv.bz2"),
                        na.strings = "NA", stringsAsFactors = FALSE))
To answer the two questions, on storm events causing loss of life and/or injuries and on economic impact, this study reviews the following fields from the data:
- BGN_DATE: Renamed Date
- BGN_TIME: Renamed Time
- EVTYPE: Renamed Event
- FATALITIES: Renamed Fatalities
- INJURIES: Renamed Injuries
- PROPDMG: Renamed PropDmg
- PROPDMGEXP: Renamed PUnits
- CROPDMG: Renamed CropDmg
- CROPDMGEXP: Renamed CUnits
- Combined Date and Time to create a dateTime field
- LATITUDE: Combined with LATITUDE_E and renamed lat
- LATITUDE_E: Combined with LATITUDE and renamed lat
- LONGITUDE: Combined with LONGITUDE_ and renamed long
- LONGITUDE_: Combined with LONGITUDE and renamed long
This data set is extremely messy, and getting it to the point of being truly accurate would require more time than is available in this course.
As a result, some assumptions were made:
- The first fields requiring attention were LONGITUDE and LATITUDE, since the maps in the sections below rely on them. In many cases the LONG/LAT data was not provided; in those cases a center of mass for the state was computed using the "maps" package and placed in the LONG/LAT fields. This is not the most precise method, but the packages for calculating an accurate center of mass were not available for this version of R (3.1.3), so the median of the state boundary coordinates was used as a close approximation.
- The date and time were provided as two separate character fields and needed to be converted.
- The units of measure for damages were inconsistently applied, and there was limited documentation available to define the data elements. As a result, the following multipliers were used (in both upper and lower case):
- K = 1,000
- M = 1,000,000
- B = 1,000,000,000
library(dplyr)
library(lubridate)
# There are 49 records with a missing latitude or an inconsistent longitude.
# They are all in GU and AS -- both US territories.
# LONGITUDE should be negative (west), but it is left to the user to make it
# negative. Two records are the only ones already holding negative numbers,
# and their lat/long coordinates fall in the middle of the Pacific, not near
# Guam, which is where those observations are from.
FIELDS <- select(NOAA, REFNUM, STATE , BGN_DATE, BGN_TIME, EVTYPE, FATALITIES,
INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,
LATITUDE, LATITUDE_E, LONGITUDE, LONGITUDE_)%>%
filter(!is.na(LATITUDE) & LONGITUDE >= 0 )
names(FIELDS) <- c("REFNUM", "ST", "Date", "Time", "Event", "Fatalities",
"Injuries", "PropDmg", "PUnits", "CropDmg", "CUnits",
"lat", "lat1", "long", "long1")
FIELDS$Date <- mdy_hms(FIELDS$Date)
# Pad the extension fields, append them to the base values, and rescale the
# result into decimal degrees (longitude is made negative for the western
# hemisphere).
FIELDS <- mutate(FIELDS, lat1 = ifelse(lat1 == 0, "0000", as.character(lat1)))
FIELDS$temp <- paste(as.character(FIELDS$lat), as.character(FIELDS$lat1), sep = "")
FIELDS$lat <- as.numeric(sprintf("%.6f", as.numeric(FIELDS$temp) * 10^-6))
FIELDS <- mutate(FIELDS, long1 = ifelse(long1 == 0, "0000", as.character(long1)))
FIELDS$temp <- paste(as.character(FIELDS$long), as.character(FIELDS$long1), sep = "")
FIELDS$long <- as.numeric(sprintf("%.6f", as.numeric(FIELDS$temp) * -10^-6))
FIELDS <- select(FIELDS, -temp, -lat1, -long1)
sum(is.na(FIELDS))
## [1] 0
The 49 records filtered above have a missing latitude or an invalid longitude. All of these observations lie outside the continental US, in one of the US territories.
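The filter above can be checked directly; the sketch below recovers the excluded records (the counts come from the comments in the code above).
# Sketch: inspect the records excluded by the latitude/longitude filter.
dropped <- filter(NOAA, is.na(LATITUDE) | is.na(LONGITUDE) | LONGITUDE < 0)
nrow(dropped)           # 49, per the comments above
unique(dropped$STATE)   # "GU" and "AS", both US territories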
In combining the Date and Time fields to create the dateTime field, it was discovered that the Time field uses inconsistent formats.
There are 248,767 records in a 24-hour, 4-digit character format, while 653,536 are in a 12-hour format designating AM and PM; some of these (75,757 records) carry 24-hour values yet still include a PM designation. Additionally, there are six records in the 24-hour format with invalid values, such as a digit greater than 5 in the third position or the letter "O" used instead of the number zero (0). The invalid times are replaced below with a time of "0000".
head(FIELDS$Time)
## [1] "0130" "0145" "1600" "0900" "1500" "2000"
tail(FIELDS$Time)
## [1] "03:00:00 PM" "10:30:00 PM" "02:48:00 PM" "02:58:00 PM" "10:21:00 AM"
## [6] "08:00:00 PM"
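The malformed 24-hour values can also be located before parsing; the check below is a sketch of one way to do so, written against the descriptions above.
# Sketch: flag 4-digit times whose minutes digit exceeds 5, and 4-character
# times containing a non-digit (such as the letter "O" in place of zero).
four <- grepl("^[0-9]{4}$", FIELDS$Time)
badMinutes <- four & substr(FIELDS$Time, 3, 3) > "5"
nonDigit <- nchar(FIELDS$Time) == 4 & !four
FIELDS$Time[badMinutes | nonDigit]   # the invalid values described above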
The next step in tidying the data is to convert the date and time to a POSIXct field.
library(dplyr)
library(lubridate)
# Combined Date and Time in 24 hour format into dateTime Field
FIELDS$dateTime <- paste(FIELDS$Date, FIELDS$Time, sep = " ")
FIELDS$dateTime <- parse_date_time(FIELDS$dateTime,
c("%y%m%d %I%M%S %p", "%y%m%d %H%M", "%y%m%d %H%M%S"))
## Warning: 5 failed to parse.
Here the 5 malformed 24-hour Time values discovered above are replaced with "0000":
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
FIELDS <- mutate(FIELDS, Time = ifelse(!is.na(dateTime), Time, "0000"))
FIELDS$dateTime <- paste(FIELDS$Date, FIELDS$Time, sep = " ")
FIELDS$dateTime <- parse_date_time(FIELDS$dateTime,
c("%y%m%d %I%M%S %p", "%y%m%d %H%M", "%y%m%d %H%M%S"))
All missing or invalid values are replaced, and the dateTime field is in the desired data type.
sum(is.na(FIELDS))
## [1] 0
class(FIELDS$dateTime)
## [1] "POSIXct" "POSIXt"
The last task to perform in tidying the data is to convert damages to dollars.
As mentioned above, the data is not in good shape, and some assumptions have been made for the sake of time. The following units will be used to convert the property and crop damages into dollars:
- K = 1,000
- M = 1,000,000
- B = 1,000,000,000
If the fields PROPDMGEXP (renamed PUnits) and CROPDMGEXP (renamed CUnits) do not contain one of the above values in either upper or lower case, the observation will be ignored. Without a data dictionary, it is impractical to know what the various other values represent.
library(dplyr)
unique(FIELDS$PUnits)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(FIELDS$CUnits)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
# Recognized unit codes (upper and lower case), per the assumptions above.
Units <- c("K", "k", "M", "m", "B")
# Scale each damage figure into dollars according to its unit code. A figure
# whose own unit code is unrecognized is left unscaled; such raw values are
# negligible next to the scaled figures.
Damages <- select(FIELDS, ST, Event, PropDmg, PUnits,
                  CropDmg, CUnits, Fatalities, Injuries, long, lat, dateTime)%>%
    filter(PUnits %in% Units | CUnits %in% Units)%>%
    mutate(PropDmg = as.numeric(ifelse(PUnits %in% c("K", "k"),
                                       PropDmg * 1000, PropDmg)))%>%
    mutate(PropDmg = as.numeric(ifelse(PUnits %in% c("M", "m"),
                                       PropDmg * 1000000, PropDmg)))%>%
    mutate(PropDmg = as.numeric(ifelse(PUnits %in% c("B", "b"),
                                       PropDmg * 1000000000, PropDmg)))%>%
    mutate(CropDmg = as.numeric(ifelse(CUnits %in% c("K", "k"),
                                       CropDmg * 1000, CropDmg)))%>%
    mutate(CropDmg = as.numeric(ifelse(CUnits %in% c("M", "m"),
                                       CropDmg * 1000000, CropDmg)))%>%
    mutate(CropDmg = as.numeric(ifelse(CUnits %in% c("B", "b"),
                                       CropDmg * 1000000000, CropDmg)))
sum(is.na(Damages))
## [1] 0
Now the data is ready to work with.
1. Across the United States, which types of events (as indicated in the Event variable) are most harmful with respect to population health?
Tornados are among the top killers in the nation but, as the following figure shows, they are not the top killer everywhere. The map below shows that tornados tend to occur in a fairly broad band running from the Southwest up through the Midwest and into the Northeast of the country.
To be counted in this category, an observation must list either a fatality or an injury from the event.
library(dplyr)
library(ggplot2)
library(maps)
all_states <- map_data("state") # map_data {ggplot2}
# Developing a center-of-mass LAT/LONG for each state, to fill in events where
# the state is provided but LAT/LONG is not. (All51, the ST/state-name lookup
# table built in a later chunk, must be defined before this join runs.)
States_COM <- select(all_states, lat, long, region)%>%
    group_by(region)%>%
    summarize(median(lat), median(long))
names(States_COM) <- c("State", "latCOM", "longCOM")
States_COM <- inner_join(States_COM, All51, by = "State")%>%
    select(State, ST, latCOM, longCOM)
# The records in FIELDS that do not have a matching ST code are not joined in.
# These are all observations outside of the continental US, including Alaska
# and Hawaii.
# Next, get the observations with the highest fatalities per state, join that
# list to the center-of-mass long/lat for the state, and show the event and
# fatality count on the map. Different events should appear in different
# regions.
TornadoFatality<- select(FIELDS, ST, Event, long, lat, Fatalities)%>%
filter(Event == "TORNADO" & Fatalities != 0 )%>%
group_by(ST, Event)
#tally(PropDmg)%>%
#top_n(1)
TornadoInjury<- select(FIELDS, ST, Event, long, lat, Injuries)%>%
filter(Event == "TORNADO" & Injuries != 0 )%>%
group_by(ST, Event)
#tally(CropDmg)%>%
#top_n(1)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoFatality <- filter(TornadoFatality, lat > 18.00, lat < 48.98,
                          long > -126.62, long < -62.36,
                          lat != 0 | long != 0)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoInjury <- filter(TornadoInjury, lat > 18.00, lat < 48.98,
                        long > -126.62, long < -62.36,
                        lat != 0 | long != 0)
Fatal <- c("Fatalities")
Injury <- c("Injuries")
p <- ggplot()
p <- p + geom_polygon( data=all_states, aes(x=long, y=lat,
group = group),colour="black", fill="steelblue4" ) +
labs(title="Tornado Event Fatalities and Injuries") +
xlab("Longitude") + ylab("Latitude")
p <- p + geom_jitter( data=TornadoInjury,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, color=Injury)) +
geom_jitter( data=TornadoFatality,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, size = Fatalities,color=Fatal)) +
scale_size(name="Number Fat/Inj")
p
The table below depicts the top event in fatalities and injuries per state across the continental United States. CO and RI each had two events tied for the top fatality total, so there are 53 records.
In general, the top events causing fatalities vary by region and include:
- Avalanche
- Heat
- Flood
- High Surf
- High Wind
- Lightning
- Rip Current
- Tornado
- Thunderstorm Wind
library(dplyr)
All51 <- data.frame(ST = c("CT", "ME", "MA", "NH", "RI", "VT","NJ", "NY",
"PA","IL", "IN", "MI", "OH", "WI","IA", "KS", "MN", "MO", "NE",
"ND", "SD","DE", "FL", "GA", "MD", "NC", "SC", "VA", "DC", "WV",
"AL","KY", "MS", "TN","AR", "LA", "OK", "TX","AZ", "CO", "ID",
"MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR","WA"),
State = c("connecticut", "maine", "massachusetts", "new hampshire",
"rhode island", "vermont", "new jersey", "new york", "pennsylvania",
"illinois", "indiana", "michigan", "ohio", "wisconsin", "iowa",
"kansas", "minnesota", "missouri", "nebraska", "north dakota",
"south dakota", "delaware", "florida", "georgia", "maryland",
"north carolina", "south carolina", "virginia", "district of columbia",
"west virginia", "alabama", "kentucky", "mississippi", "tennessee",
"arkansas", "louisiana", "oklahoma", "texas", "arizona", "colorado",
"idaho", "montana", "nevada", "new mexico", "utah", "wyoming",
"alaska", "california", "hawaii", "oregon", "washington"),
stringsAsFactors = FALSE)
WeatherFatality <- select(FIELDS, ST, Event, Fatalities, Injuries)
WeatherFatality <- left_join(All51, WeatherFatality, by = "ST")
WeatherFatalityF<- group_by(WeatherFatality, ST, Event)%>%
filter(Fatalities != 0)%>%
tally(Fatalities)%>%
top_n(1)
## Selecting by n
WeatherFatalityI<- group_by(WeatherFatality, ST, Event)%>%
filter( Injuries != 0)%>%
tally(Injuries)%>%
top_n(1)
## Selecting by n
WeatherFatalityR <- inner_join(WeatherFatalityF, WeatherFatalityI, by = c("ST"))
names(WeatherFatalityR) <- c("ST", "Event", "Fatalities",
"Event", "Injuries")
WeatherFatalityR <- WeatherFatalityR[order(WeatherFatalityR$ST),]
print(tbl_df(WeatherFatalityR), n=52)
## Source: local data frame [53 x 5]
##
## ST Event Fatalities Event.1 Injuries
## 1 AK AVALANCHE 33 ICE STORM 34
## 2 AL TORNADO 617 TORNADO 7929
## 3 AR TORNADO 379 TORNADO 5116
## 4 AZ FLASH FLOOD 62 DUST STORM 179
## 5 CA EXCESSIVE HEAT 110 WILDFIRE 623
## 6 CO AVALANCHE 48 TORNADO 261
## 7 CO LIGHTNING 48 TORNADO 261
## 8 CT HIGH WIND 6 TORNADO 703
## 9 DC EXCESSIVE HEAT 20 EXCESSIVE HEAT 316
## 10 DE EXCESSIVE HEAT 7 TORNADO 73
## 11 FL RIP CURRENT 172 TORNADO 3340
## 12 GA TORNADO 180 TORNADO 3926
## 13 HI HIGH SURF 21 STRONG WIND 20
## 14 IA TORNADO 81 TORNADO 2208
## 15 ID AVALANCHE 16 THUNDERSTORM WIND 74
## 16 IL HEAT 653 TORNADO 4145
## 17 IN TORNADO 252 TORNADO 4224
## 18 KS TORNADO 236 TORNADO 2721
## 19 KY TORNADO 125 TORNADO 2806
## 20 LA TORNADO 153 TORNADO 2637
## 21 MA TORNADO 108 TORNADO 1758
## 22 MD EXCESSIVE HEAT 88 EXCESSIVE HEAT 461
## 23 ME LIGHTNING 6 LIGHTNING 70
## 24 MI TORNADO 243 TORNADO 3362
## 25 MN TORNADO 99 TORNADO 1976
## 26 MO TORNADO 388 TORNADO 4330
## 27 MS TORNADO 450 TORNADO 6244
## 28 MT LIGHTNING 9 WILD/FOREST FIRE 33
## 29 NC TORNADO 126 TORNADO 2536
## 30 ND TORNADO 25 TORNADO 326
## 31 NE TORNADO 54 TORNADO 1158
## 32 NH TSTM WIND 6 LIGHTNING 85
## 33 NJ EXCESSIVE HEAT 39 EXCESSIVE HEAT 300
## 34 NM FLASH FLOOD 16 TORNADO 155
## 35 NV HEAT 54 FLOOD 50
## 36 NY EXCESSIVE HEAT 93 TORNADO 315
## 37 OH TORNADO 191 TORNADO 4438
## 38 OK TORNADO 296 TORNADO 4829
## 39 OR HIGH WIND 19 HIGH WIND 50
## 40 PA EXCESSIVE HEAT 359 TORNADO 1241
## 41 RI HEAT 2 TORNADO 23
## 42 RI HIGH SURF 2 TORNADO 23
## 43 SC TORNADO 59 TORNADO 1314
## 44 SD TORNADO 18 TORNADO 452
## 45 TN TORNADO 368 TORNADO 4748
## 46 TX TORNADO 538 TORNADO 8207
## 47 UT AVALANCHE 44 WINTER STORM 415
## 48 VA TORNADO 36 TORNADO 914
## 49 VT FLOOD 4 TSTM WIND 24
## 50 WA AVALANCHE 35 TORNADO 303
## 51 WI TORNADO 96 TORNADO 1601
## 52 WV FLASH FLOOD 24 TSTM WIND 142
## .. .. ... ... ... ...
Fatalities and injuries appear to be closely associated with the seasons, as the barplot below shows:
library(ggplot2)
library(lubridate)
x <- select (FIELDS, dateTime, Fatalities) %>%
filter(!is.na(Fatalities) & Fatalities != 0) %>%
group_by(month(dateTime, label=TRUE, abbr=TRUE)) %>%
summarize(mean(Fatalities))
names(x) <- c("Month", "Fatalities")
g <- ggplot(x, mapping = aes(x = Month, y = round(Fatalities, digits = 2))) +
geom_bar(stat= "identity", fill = "red")
g <- g + labs(title="Weather Related Fatalities by Month (mean)
from all Events Across United States") +
xlab("Month") + ylab("Fatalities (mean by month)") +
theme( axis.line = element_line(colour = "black"),
axis.ticks = element_line( color = "blue", size = 2),
axis.text = element_text( angle = 90),
panel.background = element_rect(fill = "whitesmoke"))
print(g)
2. Across the United States, which types of events have the greatest economic consequences?
The following map shows the property and crop damages across the United States due to tornados alone:
library(dplyr)
library(ggplot2)
library(maps)
all_states <- map_data("state") # map_data {ggplot2}
# The center of mass is not exact, but it is the best available on R 3.1.3
# with limited time.
# Developing a center-of-mass LAT/LONG for each state, to fill in events where
# the state is provided but LAT/LONG is not. (All51 is defined above.)
States_COM <- select(all_states, lat, long, region)%>%
    group_by(region)%>%
    summarize(median(lat), median(long))
names(States_COM) <- c("State", "latCOM", "longCOM")
States_COM <- inner_join(States_COM, All51, by = "State")%>%
    select(State, ST, latCOM, longCOM)
# The records in Damages that do not have a matching ST code are not joined
# in. These are all observations outside of the continental US, including
# Alaska and Hawaii.
# Next, get the observations with the highest property and crop damages per
# state caused by tornados, join that list to the center-of-mass long/lat for
# the state, and show the event and damages on the map.
TornadoDamagesProp<- select(Damages, ST, Event, long, lat, PropDmg)%>%
filter(Event == "TORNADO" & PropDmg != 0 )%>%
group_by(ST, Event)
#tally(PropDmg)%>%
#top_n(1)
TornadoDamagesCrop<- select(Damages, ST, Event, long, lat, CropDmg)%>%
filter(Event == "TORNADO" & CropDmg != 0 )%>%
group_by(ST, Event)
#tally(CropDmg)%>%
#top_n(1)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoDamagesProp <- filter(TornadoDamagesProp, lat > 18.00, lat < 48.98,
                             long > -126.62, long < -62.36,
                             lat != 0 | long != 0)
# Filter LAT/LONG values that fall outside of the map boundaries
TornadoDamagesCrop <- filter(TornadoDamagesCrop, lat > 18.00, lat < 48.98,
                             long > -126.62, long < -62.36,
                             lat != 0 | long != 0)
PropDmgC <- c("Prop Dmg")
TypeDmg <- c("Crop Dmg")
p <- ggplot()
p <- p + geom_polygon( data=all_states, aes(x=long, y=lat,
group = group),colour="black", fill="steelblue4" ) +
labs(title="Tornado Event Damages") +
xlab("Longitude") + ylab("Latitude")
p <- p + geom_jitter( data=TornadoDamagesCrop,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, color=TypeDmg)) +
scale_size(name="US $") +
geom_jitter( data=TornadoDamagesProp,
position=position_jitter(width=0.5, height=0.5),
aes(x=long, y=lat, size = PropDmg,color=PropDmgC))
p
The table below depicts the top events in property and crop damages per state across the continental United States.
library(dplyr)
# All51 (the ST / state-name lookup table) was defined above and is reused here.
printCurrency <- function(value, currency.sym="$",
digits=2, sep=",", decimal=".") {
v1 <- as.character(value)
for(i in 1:length(value)){
if (is.na(value[i])){Unit<- "UNK"; value[i]<-0; Multiplier<-0
} else if(value[i] < 1000000){
Unit <- "K"
Multiplier <- 0.001
} else if(value[i] < 1000000000){
Unit <- "M"
Multiplier <- 0.000001
} else if(value[i] >= 1000000000){
Unit <- "B"
Multiplier <- 0.000000001
}
x <- paste(currency.sym, formatC(value[i]*Multiplier, format = "f",
big.mark = sep, digits=digits, decimal.mark=decimal), sep="")
v1[i] <- paste(x, Unit)
}
v1
}
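# Usage illustration (not part of the original review):
# printCurrency(c(123456, 2.5e9)) returns "$123.46 K" "$2.50 B"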
WeatherDamages <- select(Damages, ST, Event, PropDmg, CropDmg)
WeatherDamages <- left_join(All51, WeatherDamages, by = "ST")
WeatherDamagesP<- group_by(WeatherDamages, ST, Event)%>%
tally(PropDmg)%>%
top_n(1)
## Selecting by n
WeatherDamagesC<- group_by(WeatherDamages, ST, Event)%>%
filter( CropDmg != 0)%>%
tally(CropDmg)%>%
top_n(1)
## Selecting by n
WeatherDamagesR <- left_join(WeatherDamagesP, WeatherDamagesC, by = c("ST"))
names(WeatherDamagesR) <- c("ST", "EventPropDmg", "PropCost",
"EventCropDmg", "CropCost")
# The LEFT JOIN produces NA for states without a qualifying crop-damage
# record; replace those NAs with blanks/zeros below.
WeatherDamagesR <- mutate( WeatherDamagesR,
EventCropDmg = ifelse(is.na(EventCropDmg), "", EventCropDmg))
WeatherDamagesR <- mutate( WeatherDamagesR,
CropCost = ifelse(is.na(CropCost), 0 , CropCost))
WeatherDamagesR <- WeatherDamagesR[order(WeatherDamagesR$ST),]
WeatherDamagesR$PropCost <- printCurrency(WeatherDamagesR$PropCost)
WeatherDamagesR$CropCost <- printCurrency(WeatherDamagesR$CropCost)
print(tbl_df(WeatherDamagesR), n=53)
## Source: local data frame [51 x 5]
##
## ST EventPropDmg PropCost EventCropDmg CropCost
## 1 AK FLOOD $157.13 M HIGH WIND $157.00 K
## 2 AL TORNADO $6.32 B HEAT $400.10 M
## 3 AR TORNADO $2.59 B FLOOD $141.06 M
## 4 AZ HAIL $2.83 B TROPICAL STORM $200.00 M
## 5 CA FLOOD $116.75 B EXTREME COLD $731.16 M
## 6 CO HAIL $1.42 B HAIL $116.49 M
## 7 CT TORNADO $596.24 M HAIL $30.00 K
## 8 DC TROPICAL STORM $127.60 M DROUGHT $5.00 K
## 9 DE COASTAL FLOOD $40.15 M DROUGHT $29.10 M
## 10 FL HURRICANE/TYPHOON $27.60 B HURRICANE/TYPHOON $955.20 M
## 11 GA TORNADO $3.26 B DROUGHT $717.28 M
## 12 HI FLASH FLOOD $156.51 M HIGH WIND $2.60 M
## 13 IA TORNADO $2.29 B DROUGHT $2.01 B
## 14 ID FLOOD $114.19 M TSTM WIND/HAIL $6.00 M
## 15 IL RIVER FLOOD $5.02 B RIVER FLOOD $5.01 B
## 16 IN TORNADO $2.59 B FLOOD $698.20 M
## 17 KS TORNADO $2.67 B HAIL $259.41 M
## 18 KY TORNADO $888.77 M DROUGHT $226.00 M
## 19 LA STORM SURGE $31.74 B DROUGHT $587.43 M
## 20 MA TORNADO $756.04 M THUNDERSTORM WIND $1.25 M
## 21 MD TROPICAL STORM $538.50 M DROUGHT $99.72 M
## 22 ME ICE STORM $318.23 M HEAVY RAINS $500.00 K
## 23 MI TORNADO $1.07 B DROUGHT $150.00 M
## 24 MN TORNADO $1.90 B HAIL $140.70 M
## 25 MO TORNADO $4.80 B FLOOD $593.04 M
## 26 MS HURRICANE/TYPHOON $13.49 B ICE STORM $5.00 B
## 27 MT HAIL $94.73 M HAIL $34.34 M
## 28 NC HURRICANE $4.98 B HURRICANE $1.43 B
## 29 ND FLOOD $3.92 B HAIL $189.36 M
## 30 NE TORNADO $1.72 B HAIL $737.99 M
## 31 NH ICE STORM $64.93 M FLOOD $200.00 K
## 32 NJ FLOOD $2.11 B DROUGHT $80.00 M
## 33 NM WILD/FOREST FIRE $1.51 B DROUGHT $14.40 M
## 34 NV FLOOD $677.94 M FLOOD $6.00 M
## 35 NY FLASH FLOOD $1.83 B DROUGHT $100.20 M
## 36 OH TORNADO $2.28 B DROUGHT $200.00 M
## 37 OK TORNADO $3.27 B DROUGHT $1.10 B
## 38 OR FLOOD $722.17 M HAIL $36.03 M
## 39 PA TORNADO $1.79 B DROUGHT $539.40 M
## 40 RI FLOOD $92.86 M $0.00 K
## 41 SC TORNADO $531.51 M HURRICANE $20.30 M
## 42 SD TORNADO $231.21 M HAIL $64.27 M
## 43 TN FLOOD $4.25 B TSTM WIND $9.15 M
## 44 TX TROPICAL STORM $5.49 B DROUGHT $6.37 B
## 45 UT FLOOD $331.76 M HIGH WIND $1.11 M
## 46 VA HURRICANE/TYPHOON $512.00 M DROUGHT $297.48 M
## 47 VT FLOOD $1.10 B FLASH FLOOD $14.72 M
## 48 WA FLOOD $212.68 M HAIL $209.50 M
## 49 WI HAIL $961.72 M FLOOD $466.89 M
## 50 WV FLASH FLOOD $485.23 M DROUGHT $19.75 M
## 51 WY HAIL $111.22 M HAIL $1.88 M
Below is the unique list of events that top the property damage totals in at least one state:
unique(WeatherDamagesR$EventPropDmg)
## [1] "FLOOD" "TORNADO" "HAIL"
## [4] "TROPICAL STORM" "COASTAL FLOOD" "HURRICANE/TYPHOON"
## [7] "FLASH FLOOD" "RIVER FLOOD" "STORM SURGE"
## [10] "ICE STORM" "HURRICANE" "WILD/FOREST FIRE"
Below is the unique list of events that top the crop damage totals in at least one state:
unique(WeatherDamagesR$EventCropDmg)
## [1] "HIGH WIND" "HEAT" "FLOOD"
## [4] "TROPICAL STORM" "EXTREME COLD" "HAIL"
## [7] "DROUGHT" "HURRICANE/TYPHOON" "TSTM WIND/HAIL"
## [10] "RIVER FLOOD" "THUNDERSTORM WIND" "HEAVY RAINS"
## [13] "ICE STORM" "HURRICANE" ""
## [16] "TSTM WIND" "FLASH FLOOD"
It would be interesting to see whether, over time, early warning systems have reduced the number of fatalities and injuries, and whether property and crop damage continues to rise regardless of early warnings due to inflation and population expansion. We might also find that improved building standards and codes have helped limit the damages caused by storms. Who knows what patterns might be identified that would improve our understanding of the weather and the ways we can better prepare for events to reduce loss of life and property.
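As a first cut at that question, fatalities could be totaled by year using the fields already prepared above; the sketch below was not run as part of this review.
library(dplyr)
library(lubridate)
# Sketch: total weather-related fatalities per year, a starting point for
# judging whether early-warning systems have reduced the human toll over time.
YearlyFatal <- group_by(FIELDS, Year = year(dateTime))%>%
    summarize(Fatalities = sum(Fatalities))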
The code below was not used in this review due to a lack of time. However, it provides a foundation for further parsing and dissecting this data to improve our understanding.
library(dplyr)
library(lubridate)
library(scales)
# The nine divisions, grouped into four regions (listed after the code below).
NewEngland <- c("CT", "ME", "MA", "NH", "RI", "VT")
MidAtlantic <- c("NJ", "NY", "PA")
MidwestEastNC <- c("IL", "IN", "MI", "OH", "WI")
MidwestWestNC <- c("IA", "KS", "MN", "MO", "NE", "ND", "SD")
SouthSouthAtlantic <- c("DE", "FL", "GA", "MD", "NC", "SC", "VA", "DC", "WV")
SouthEastSC <- c("AL","KY", "MS", "TN")
SouthWestSC <- c("AR", "LA", "OK", "TX")
WestMountain <- c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY")
WestPacific <- c("AK", "CA", "HI", "OR","WA")
RegionsDivsions<- data.frame ( RegDiv = c("R1DIV1", "R1DIV2", "R2DIV3",
"R2DIV4", "R3DIV5", "R3DIV6", "R3DIV7", "R4DIV8", "R4DIV9"),
RegNm = c("NewEngland","MidAtlantic","MidwestEastNC",
"MidwestWestNC","SouthSouthAtlantic","SouthEastSC",
"SouthWestSC","WestMountain","WestPacific"),
stringsAsFactors = FALSE)
for (i in 1:nrow(RegionsDivsions)){
# Create a Data Frame for each Division of the nation for Property Damage
DataName <-paste(RegionsDivsions$RegDiv[i], "P", sep="")
df <- select(Damages, ST, dateTime, PropDmg)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & PropDmg !=0 )%>%
group_by(ST, dateTime)%>%
tally(PropDmg)
names(df)<- c("ST","dateTime", "PropDmg")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
# Create a Data Frame for each Region of the nation for Crop Damage
DataName <-paste(RegionsDivsions$RegDiv[i], "C", sep="")
df <- select(Damages, ST, dateTime, CropDmg)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & CropDmg !=0 )%>%
group_by(ST, dateTime)%>%
tally(CropDmg)
names(df)<- c("ST","dateTime", "CropDmg")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
# Create a Data Frame for each Region of the nation for Fatalities
DataName <-paste(RegionsDivsions$RegDiv[i], "F", sep="")
df <- select(Damages, ST, dateTime, Fatalities)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & Fatalities !=0 )%>%
group_by(ST, dateTime)%>%
tally(Fatalities)
names(df)<- c("ST","dateTime", "Fatalities")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
# Create a Data Frame for each Region of the nation for Injures
DataName <-paste(RegionsDivsions$RegDiv[i], "I", sep="")
df <- select(Damages, ST, dateTime, Injuries)%>%
filter(ST %in% get(RegionsDivsions$RegNm[i]) & Injuries !=0 )%>%
group_by(ST, dateTime)%>%
tally(Injuries)
names(df)<- c("ST","dateTime", "Injuries")
df <- mutate(df, Region = RegionsDivsions$RegNm[i])
assign(DataName,df)
}
# The various regions and divisions can be grouped as needed for further analysis
Prop <- rbind(R1DIV1P, R1DIV2P,R2DIV3P,R2DIV4P,R3DIV5P,R3DIV6P,R3DIV7P,R4DIV8P,R4DIV9P)
Crop<- rbind(R1DIV1C, R1DIV2C,R2DIV3C,R2DIV4C,R3DIV5C,R3DIV6C,R3DIV7C,R4DIV8C,R4DIV9C)
Fatal<- rbind(R1DIV1F, R1DIV2F,R2DIV3F,R2DIV4F,R3DIV5F,R3DIV6F,R3DIV7F,R4DIV8F,R4DIV9F)
Inj<- rbind(R1DIV1I, R1DIV2I,R2DIV3I,R2DIV4I,R3DIV5I,R3DIV6I,R3DIV7I,R4DIV8I,R4DIV9I)
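A more compact alternative to the assign() loop above would tag each observation with its region through a single lookup table and summarize all four measures at once; the sketch below assumes the same groupings.
# Sketch: one ST -> Region lookup replaces the per-division data frames.
RegionLookup <- data.frame(
    ST = c(NewEngland, MidAtlantic, MidwestEastNC, MidwestWestNC,
           SouthSouthAtlantic, SouthEastSC, SouthWestSC,
           WestMountain, WestPacific),
    Region = rep(RegionsDivsions$RegNm, times = c(6, 3, 5, 7, 9, 4, 4, 8, 5)),
    stringsAsFactors = FALSE)
ByRegion <- inner_join(Damages, RegionLookup, by = "ST")%>%
    group_by(Region, ST, dateTime)%>%
    summarize(PropDmg = sum(PropDmg), CropDmg = sum(CropDmg),
              Fatalities = sum(Fatalities), Injuries = sum(Injuries))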
Regions of the United States:
Region 1: Northeast
Division 1: New England
(Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont)
Division 2: Mid-Atlantic
(New Jersey, New York, and Pennsylvania)
Region 2: Midwest
Division 3: East North Central
(Illinois, Indiana, Michigan, Ohio, and Wisconsin)
Division 4: West North Central
(Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, and South Dakota)
Region 3: South
Division 5: South Atlantic
(Delaware, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, Washington D.C., and West Virginia)
Division 6: East South Central
(Alabama, Kentucky, Mississippi, and Tennessee)
Division 7: West South Central
(Arkansas, Louisiana, Oklahoma, and Texas)
Region 4: West
Division 8: Mountain
(Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, and Wyoming)
Division 9: Pacific
(Alaska, California, Hawaii, Oregon, and Washington)
Positive latitude is north of the equator (N), which is where the United States lies; negative latitude is south of the equator (S).
Positive longitude is east of the prime meridian (a north-south line that runs through Greenwich, England), while negative longitude is west of it, which is where the United States lies.
Latitude and longitude boundaries of the continental US:
- 48.987386 is the northernmost latitude
- 18.005611 is the southernmost latitude
- -124.626080 is the westernmost longitude
- -62.361014 is the easternmost longitude
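For reuse, these boundaries could be wrapped in a small helper; note that the map chunks above used slightly wider cutoffs (48.98 and -126.62).
library(dplyr)
# Sketch: keep only coordinates inside the continental US bounding box above.
inConus <- function(df) {
    filter(df, lat > 18.005611, lat < 48.987386,
           long > -124.626080, long < -62.361014)
}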