Severe weather conditions and events can have a deep impact on public health with an increase in fatalities and injuries as well as on economy, causing temporary and permanent damage. With this analysis we explore the NOAA Storm Database containing data on atmospheric events from 1950 to 2011 in the US and we try to determine the aspects most harmful to population health and economy.
The focus of this section is to download, load and prepare the data for the analysis. We first download the file, available at this URL:
file.url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file.out <- "data.csv.bz2"
if(!file.exists(file.out)) {download.file(file.url, file.out, method = "curl")}
We read the full file using read.csv (which can read bz2 compressed files) and then we convert it to a data.table:
library(data.table)
data <- read.csv(file.out, header = TRUE, stringsAsFactors = FALSE)
data <- data.table(data)
First of all we notice the dimensions of the dataset, since it will directly influence processing time for every transformation we will perform:
print(paste("No. of samples:", dim(data)[1]))
## [1] "No. of samples: 902297"
print(paste("No. of columns:", dim(data)[2]))
## [1] "No. of columns: 37"
The dataset is therefore made of 37 columns named:
colnames(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
whose classes are:
col.class <- sapply(data, class)
In this analysis we wil be particularly focused on public health related consequences and economic damage per event type. In order to speed up some computations we restrict the data we use to the date of the occurrence (the column BGN_DATE), the event type (EVTYPE), fatalities and injuries (FATALITIES and INJURIES respectively), property and crop damage (PROPDMG and CROPDMG, each expressed in billion of US$):
data.analysis <- data[, c("BGN_DATE",
"EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"CROPDMG"
)
]
We then transform the date column into a Date class and rename the columns:
data.analysis[, BGN_DATE := as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S"),]
colnames(data.analysis) <- make.names(c("Date",
"Type",
"Fatalities",
"Injuries",
"Property.damage",
"Crop.damage"
)
)
The new dataset has now 6 columns whose classes are:
col.class.analysis <- sapply(data.analysis, class)
col.class.analysis
## Date Type Fatalities Injuries Property.damage
## "Date" "character" "numeric" "numeric" "numeric"
## Crop.damage
## "numeric"
The tidied database can now be used for the analysis. We first check the presence of missing values:
for(column in colnames(data.analysis)) {
na.val = sum(as.numeric(is.na(column)))
print(paste("Missing values in ", column, ": ", na.val, sep = ""))
}
## [1] "Missing values in Date: 0"
## [1] "Missing values in Type: 0"
## [1] "Missing values in Fatalities: 0"
## [1] "Missing values in Injuries: 0"
## [1] "Missing values in Property.damage: 0"
## [1] "Missing values in Crop.damage: 0"
We can therefore provide a summary of the dataset without worrying about any strategy to replace missing values:
summary(data.analysis)
## Date Type Fatalities Injuries
## Min. :1950-01-03 Length:902297 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:1995-04-20 Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :2002-03-18 Mode :character Median : 0.0000 Median : 0.0000
## Mean :1998-12-27 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:2007-07-28 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :2011-11-30 Max. :583.0000 Max. :1700.0000
## Property.damage Crop.damage
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.000
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
We also provide the plot of the collected data per year to establish the significance of the study:
library(ggplot2)
n.events <- data.analysis[, .N, by = year(Date)] # count the events per year
g <- ggplot(data = n.events, aes(x = year, y = N))
g + geom_bar(stat = "identity") +
xlab("year") +
ylab("reported events") +
ggtitle("Time Evolution of Reported Atmospheric Events in the US")
Reported events from 1950 to 2011
We see that the number of reported events has grown in time most probably due to more rigorous records and time series. This will clearly affect the results of the analysis.
This section is focused on results. We provide a panoramic view on what had the largest impact on public health and economy using the previously introduced dataset.
From the data, we can recover the total amount of fatalities and injuries grouped by the type of atmospheric event:
data.type <- data.analysis[, .(Total.fatalities = sum(Fatalities),
Total.injuries = sum(Injuries)
),
by = Type
]
# other than the total amount, we add the percentage of fatalities and injuries
data.type[, Perc.fatalities := Total.fatalities / sum(Total.fatalities)]
data.type[, Perc.injuries := Total.injuries / sum(Total.injuries)]
We can then investigate the 10 most influencial causes of deaths and damage to people:
# order by percentage
data.fatalities <- data.type[order(-data.type$Perc.fatalities),]
data.injuries <- data.type[order(-data.type$Perc.injuries),]
The ordered datasets can themselves deliver a pretty good summary of the situation in terms of fatalities:
data.fatalities[1:10, .(Type, Total.fatalities)]
## Type Total.fatalities
## 1: TORNADO 5633
## 2: EXCESSIVE HEAT 1903
## 3: FLASH FLOOD 978
## 4: HEAT 937
## 5: LIGHTNING 816
## 6: TSTM WIND 504
## 7: FLOOD 470
## 8: RIP CURRENT 368
## 9: HIGH WIND 248
## 10: AVALANCHE 224
and injuries:
data.injuries[1:10, .(Type, Total.injuries)]
## Type Total.injuries
## 1: TORNADO 91346
## 2: TSTM WIND 6957
## 3: FLOOD 6789
## 4: EXCESSIVE HEAT 6525
## 5: LIGHTNING 5230
## 6: HEAT 2100
## 7: ICE STORM 1975
## 8: FLASH FLOOD 1777
## 9: THUNDERSTORM WIND 1488
## 10: HAIL 1361
We can the plot the results:
# create the plots
library(gridExtra)
g1 <- ggplot(data = data.fatalities[1:10,], aes(x = Type, y = Perc.fatalities)) +
geom_bar(stat = "identity") +
xlab("") +
ylab("Fraction of fatalities on total no.") +
scale_x_discrete(limits = data.fatalities$Type[1:10]) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
g2 <- ggplot(data = data.injuries[1:10,], aes(x = Type, y = Perc.injuries)) +
geom_bar(stat = "identity") +
xlab("") +
ylab("Fraction of injuries on total no.") +
scale_x_discrete(limits = data.injuries$Type[1:10]) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(g1, g2, ncol = 2)
Casualties and injuries due to severe weather conditions from 1950 to 2011 in the US
It seems therefore that across the US, the largest impact on public health was due to tornado events (and wind-related reports) with a minor component correlated with excessive heat and floods.
The same kind of analysis can be performed on the economic consequences of severe weather conditions. In this case we analyse data on property and crop damage:
data.type2 <- data.analysis[, .(Total.property.damage = sum(Property.damage),
Total.crop.damage = sum(Crop.damage)
),
by = Type
]
# other than the total amount, we add the percentage of property and crop damage
data.type2[, Perc.property.damage := Total.property.damage / sum(Total.property.damage)]
data.type2[, Perc.crop.damage := Total.crop.damage / sum(Total.crop.damage)]
As before, we rank the 10 most important causes of damage and plot the results:
# order by percentage
data.property <- data.type2[order(-data.type2$Total.property.damage),]
data.crop <- data.type2[order(-data.type2$Total.crop.damage),]
which can already be a good metric of the analysis in terms of property damage (in billions of US$):
data.property[1:10, .(Type, Total.property.damage)]
## Type Total.property.damage
## 1: TORNADO 3212258.2
## 2: FLASH FLOOD 1420124.6
## 3: TSTM WIND 1335965.6
## 4: FLOOD 899938.5
## 5: THUNDERSTORM WIND 876844.2
## 6: HAIL 688693.4
## 7: LIGHTNING 603351.8
## 8: THUNDERSTORM WINDS 446293.2
## 9: HIGH WIND 324731.6
## 10: WINTER STORM 132720.6
and crop damage (in billions of US$):
data.crop[1:10, .(Type, Total.crop.damage)]
## Type Total.crop.damage
## 1: HAIL 579596.28
## 2: FLASH FLOOD 179200.46
## 3: FLOOD 168037.88
## 4: TSTM WIND 109202.60
## 5: TORNADO 100018.52
## 6: THUNDERSTORM WIND 66791.45
## 7: DROUGHT 33898.62
## 8: THUNDERSTORM WINDS 18684.93
## 9: HIGH WIND 17283.21
## 10: HEAVY RAIN 11122.80
Another interesting detail can be the fraction of the cost of weather factors with respect to the top element (i.e. normalised to the top cause of expenses) for property damage:
data.property[1:10, .(Type, Perc.cost = Total.property.damage / data.property$Total.property.damage[1])]
## Type Perc.cost
## 1: TORNADO 1.00000000
## 2: FLASH FLOOD 0.44209541
## 3: TSTM WIND 0.41589609
## 4: FLOOD 0.28015758
## 5: THUNDERSTORM WIND 0.27296815
## 6: HAIL 0.21439540
## 7: LIGHTNING 0.18782792
## 8: THUNDERSTORM WINDS 0.13893441
## 9: HIGH WIND 0.10109136
## 10: WINTER STORM 0.04131691
and crop damage:
data.crop[1:10, .(Type, Perc.cost = Total.crop.damage / data.crop$Total.crop.damage[1])]
## Type Perc.cost
## 1: HAIL 1.00000000
## 2: FLASH FLOOD 0.30918152
## 3: FLOOD 0.28992229
## 4: TSTM WIND 0.18841149
## 5: TORNADO 0.17256584
## 6: THUNDERSTORM WIND 0.11523789
## 7: DROUGHT 0.05848661
## 8: THUNDERSTORM WINDS 0.03223784
## 9: HIGH WIND 0.02981939
## 10: HEAVY RAIN 0.01919060
We finally plot the results:
# create the plots
g1 <- ggplot(data = data.property[1:10,], aes(x = Type, y = Total.property.damage)) +
geom_bar(stat = "identity") +
xlab("") +
ylab("Property damage [billion of US$]") +
scale_x_discrete(limits = data.property$Type[1:10]) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
g2 <- ggplot(data = data.crop[1:10,], aes(x = Type, y = Total.crop.damage)) +
geom_bar(stat = "identity") +
xlab("") +
ylab("Crop damage [billion of US$]") +
scale_x_discrete(limits = data.crop$Type[1:10]) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(g1, g2, ncol = 2)
Property and crop damage due to severe weather conditions from 1950 to 2011 in the US
It seems therefore that tornado events are again responsible for a large part of the damage to properties, while the floods and hail have definitely a larger impact on agricultural-related damage.
The study is definitely not conclusive, but it may suggest that tornado and high speed winds play a central role in producing damage and public health issues, while most other factors have definitely more marginal parts. In terms of crop damage hail and heavy rain leading to floods seem to be the cause of most of the damage, leading the US government to spend more than threetimes the money for hail than for floods.