The aim of this report is to asses population health and economic effects of severe weather across the United Sates. Data stored in the NOAA Storm Database is used. The database covers the time period between 1950 and November 2011, but years prior to 1992 are excluded as the data are deemed incomplete. For the purposes of this report, impact on population health is measured as the number of injuries and fatalities, whilst economic impact is the US$ value of the damage to property and crops. The top ten events for each measure are presented. Tornados and excessive heat are most harmful with respect to population health, whilst floods and hurricanes have the greatest economic ipmact.
The working directory is set and the required R packages attached to the programme.
From the course website, the raw Storm Data is downloaded. More information about the data can be found in Storm Data Documentation provided by the National Weather Service.
dataLink <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dest <- "./dataStorms/storm.bz2"
dir <- "./dataStorms"
if(!file.exists(dir)){
dir.create(dir)
download.file(dataLink,dest)
#bunzip2("dataStorms/storm.csv.bz2", overwrite=T, remove=F)
downloadDate <- date()
downloadDate
}
Tha data is read in to the programme.
stormData <- read.csv("./dataStorms/storm.bz2",header = T)
There are 902,297 rows and 37 variables in the data.
dim(stormData)
## [1] 902297 37
However, this report is only interested in the fields that measure the economic and public health impact. Thus, only these fields are selected and only instances where at least one of the measures is greater than zero are retained, leaving only 8 fields with 254,633 rows.
df <- stormData %>%
select(c(2,8,23:28)) %>%
filter(FATALITIES >0 |INJURIES>0|PROPDMG>0|CROPDMG>0)
dim(df)
## [1] 254633 8
Next, from the BGN_DATE field, the year of the event is extracted and the distribution of total number of events by year is plotted.
len <- str_length(df$BGN_DATE)
df$years <- substr(df$BGN_DATE,len - 11,10)
df$years <- substr(df$years,1,4)
df$year <- as.numeric(df$years)
# check distribution of events by year
barplot(table(df$years),col="white")
The chart shows that the number of events recorded each year. There is a massive jump in the number of reported events from 1992 onwards. This suggests a shift in the recording pattern, probably due to lack of enough data prior to 1992 and/or a change in recrording policy and criteria. this leaves 228,247 rows.
#ignore evenys from prior 1992
df2 <- df %>%
select(10,2:8) %>%
filter(year >= 1992)
dim(df2)
## [1] 228247 8
A look at the field that contains the events (EVTYPE) reveals that there are 488 different events. However, table 2.1.1 in the Storm Data Documentation suggests a total of 48 “allowed events”. But, it appears there is no control of this list(via drop down list, for example) during data capturing into the database. Thus, some events are abreviated and others are mis-spelt. For example, a look at the first 20 shows “THUNDERSTORM WINDS” and “TSTM WIND”. Both of these should have been recorded as “Thunderstorm Wind”.
A proper cleaning strategy for this field requires more knowledge of the weather data, effort and time than is feasible within the time scales of this report. Thus, a simple “layman” attempt is adopted.
length(unique(df2$EVTYPE))
## [1] 488
unique(df2$EVTYPE)[20]
## [1] THUNDERSTORM WINDS/HAIL
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
A function (cleanCol) is created. This searches for parttens within the event type and allocates it to a more appropriate group by the analyst’s fair judgement. Applictaion of this function reduces to number of events to 98.
df2$EVTYPE <- toupper(df2$EVTYPE)
#function to clean even type columns
cleanCol <- function(colToClean){
if(grepl("HAIL",colToClean)){
clean <- "HAIL"
}else if(grepl("HURRICANE|TYPHOON",colToClean)){
clean <- "HURRICANE (TYPHOON)"
}else if(grepl("THUNDER|TSTM|STORM",colToClean)){
clean <- "STORMS"
}else if(grepl("FLOOD|FLD|WATER|STREAM|SURGE",colToClean)){
clean <- "FLOODS"
}else if(grepl("TOR|NADO",colToClean)){
clean <- "TORNADOS"
}else if(grepl("SNOW",colToClean)){
clean <- "SNOW"
}else if(grepl("HEAT|WARM",colToClean)){
clean <- "EXCESSIVE HEAT"
}else if(grepl("BLIZ",colToClean)){
clean <- "BLIZZARD"
}else if(grepl("COLD|CHILL",colToClean)){
clean <- "COLD/WIND CHILL"
}else if(grepl("FREEZE|FROST|ICE",colToClean)){
clean <- "FROST/FREEZE"
}else if(grepl("FOG",colToClean)){
clean <- "FOG"
}else if(grepl("RAIN|SHOWER|WETNESS|PREC|DRIZ",colToClean)){
clean <- "RAIN"
}else if(grepl("FIRE",colToClean)){
clean <- "WILD FIRE"
}else if(grepl("AVA",colToClean)){
clean <- "AVALANCHE"
}else if(grepl("LIG",colToClean)){
clean <- "LIGHTINING"
}else if(grepl("HIGH WIND|WINDS",colToClean)){
clean <- "LIGHTINING"
}
else{
colToClean
}
}
df2$EVTYPE <- sapply(df2$EVTYPE,cleanCol)
For each of the measures of economic impacts, the data contains seperate columns for exponents of the values stored in seperate columns. A llok at these shows that different values can represent the same exponent, Thousands, Millions, Billions, etc. Others, “-”,“?” and “+” are deemed invalid and given a value of zero.
unique(df2$PROPDMGEXP)
## [1] M K B m + 0 5 6 4 h 2 7 3 H -
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(df2$CROPDMGEXP)
## [1] M K m B ? 0 k
## Levels: ? 0 2 B k K m M
Seperate vectors to group these exponents values are created.
invalid <- c("-","?","+")
billion <- c("B","b","9")
hundredMil <- c("8")
tenMil <- c("7")
million <- c("M","m","6")
hunderdThou <- c("5")
thousand <- c("K","k","3")
hundered <- c("H","h","2")
ten <- c("1","0")
A function (expClean) is created which checks the value of these exponent columns and creates a new column with a corresponding numeric exponent value to be multiplied by the respective columns.
expClean <- function(expCol){
if(is.element(expCol,billion)){
colVal <- 1000000000
}else if(is.element(expCol,hundredMil)){
colVal <- 100000000
}else if(is.element(expCol,tenMil)){
colVal <- 10000000
}else if(is.element(expCol,million)){
colVal <- 1000000
}else if(is.element(expCol,hunderdThou)){
colVal <- 100000
}else if(is.element(expCol,thousand)){
colVal <- 1000
}else if(is.element(expCol,hundered)){
colVal <- 100
}else if(is.element(expCol,ten)){
colVal <- 10
}else{
colVal <- 0
}
}
df2$propDMGExp <- sapply(df2$PROPDMGEXP,expClean)
df2$cropDMGExp <- sapply(df2$CROPDMGEXP,expClean)
Actual values of property damage and crop damage are caluculated and added together to give the total economic loss. Injuries and fatalities are also added together to give total casualties.
df2$cropLoss <- df2$cropDMGExp * df2$CROPDMG
df2$propertyLoss <- df2$propDMGExp * df2$PROPDMG
df2$TotalEconomicLoss <- df2$cropLoss + df2$propertyLoss
df2$TotalEconomicLoss <- df2$TotalEconomicLoss
df2$TotalCasualties <- df2$INJURIES + df2$FATALITIES
A final tidy dataset with only the required fields is then created.
myData <- df2 %>%
select(year,EVTYPE,FATALITIES,INJURIES,TotalCasualties,cropLoss,propertyLoss,TotalEconomicLoss)
names(myData) <- c("YEAR","EVENT","FATALITIES","INJURIES","TOTALCASUALTIES","CROPLOSS","PROPERTYLOSS","TOTALECONOMICLOSS")
myData$FATALITIES <- myData$FATALITIES
myData$INJURIES <- myData$INJURIES
myData$TOTALCASUALTIES <- myData$TOTALCASUALTIES
casualities <- myData %>%
group_by(EVENT)%>%
summarise_each(funs(sum),TOTALCASUALTIES,FATALITIES,INJURIES)
tot <- casualities %>%
arrange(desc(TOTALCASUALTIES)) %>%
top_n(10)%>%
select(1:2)
## Selecting by INJURIES
fat <- casualities %>%
arrange(desc(FATALITIES)) %>%
top_n(10)%>%
select(1,3)
## Selecting by INJURIES
inj <- casualities %>%
arrange(desc(INJURIES)) %>%
top_n(10)%>%
select(1,4)
## Selecting by INJURIES
plot_grid <- matrix(c(1,1,2,3),nrow = 2,ncol = 2,byrow = T)
layout(plot_grid)
barplot(tot$TOTALCASUALTIES,col = rainbow(length(tot$EVENT)),main = "Total Casualties",
ylab = "Casualties")
legend("topright", legend = tot$EVENT,fill = rainbow(length(tot$EVENT)),
ncol = 2,cex = 0.5)
barplot(fat$FATALITIES,col = rainbow(length(fat$EVENT)),main = "Fatalities",
ylab = "Fatalities")
legend("topright", legend = fat$EVENT,fill = rainbow(length(fat$EVENT)),
ncol = 2,cex = 0.5)
barplot(inj$INJURIES,
col = rainbow(length(inj$EVENT)),main = "Injuries",ylab = "Injuries")
legend("topright", legend = inj$EVENT,fill = rainbow(length(inj$EVENT)),
ncol = 2,cex = 0.5)
The charts show the ten most devastating events with respect to population health impact. These are split into the total number of casualties, fatalities and injuries. Tornados are most harmfull with respect to population health followed by excessive heat. However, excessive heat results in more fatalities than tornados, which cause the most injuries.
economic <- myData %>%
group_by(EVENT)%>%
summarise_each(funs(sum),TOTALECONOMICLOSS,PROPERTYLOSS,CROPLOSS)
tot_eco <- economic %>%
arrange(desc(TOTALECONOMICLOSS)) %>%
top_n(10)%>%
select(1:2)
## Selecting by CROPLOSS
prop <- economic %>%
arrange(desc(PROPERTYLOSS)) %>%
top_n(10)%>%
select(1,3)
## Selecting by CROPLOSS
crop <- economic %>%
arrange(desc(CROPLOSS)) %>%
top_n(10)%>%
select(1,4)
## Selecting by CROPLOSS
plot_grid <- matrix(c(1,1,2,3),nrow = 2,ncol = 2,byrow = T)
layout(plot_grid)
barplot(tot_eco$TOTALECONOMICLOSS,col = rainbow(length(tot_eco$EVENT)),main = "Total Economic Loss",
ylab = "Economic Loss USD")
legend("topright", legend = tot_eco$EVENT,fill = rainbow(length(tot_eco$EVENT)),
ncol = 2,cex = 0.5)
barplot(prop$PROPERTYLOSS,col = rainbow(length(prop$EVENT)),main = "Property Loss",
ylab = "Property Loss USD")
legend("topright", legend = prop$EVENT,fill = rainbow(length(prop$EVENT)),
ncol = 2,cex = 0.5)
barplot(crop$CROPLOSS,
col = rainbow(length(crop$EVENT)),main = "Crop Loss",ylab = "Crop Loss USD")
legend("topright", legend = crop$EVENT,fill = rainbow(length(crop$EVENT)),
ncol = 2,cex = 0.5)
The charts show the ten most devastating events with respect to economic damage. These are split into total loss, property loss and crop loss. Floods cause the most total economic loss followed by hurricanes. Floods cause the most property damage and are also just second to droughts with respect to crop loss.