In this study we aim to analyse the National Weather Data across the U.S.
We will first try to identify the relevant set of data to use for better accuracy. Then we will try to assess which type of event have the greatest impact on the health population first, then economic. To do so, we will tidy up the database from National Weather Service, by realigning EVTYPE under EVCAT and remapping the exponent in order to overcome the input data quality.
We observe that events with higher human impacts are different than the one with higher economical impact.
This is done in the framework of coursera assessment using the data described under https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
This code section is used to: - load various library that might be required for processing or plotting the data - clear the working environment
commented code lines will be useful (in uncommented) to - install (name of package) - get details of the environment for reproducible research.
rm (list= ls())
##install.packages("gridExtra")
##install.packages("lattice")
library("ggplot2")
##library("R.utils")
library("data.table")
## Warning: package 'data.table' was built under R version 4.0.4
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("plyr")
## Warning: package 'plyr' was built under R version 4.0.4
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library("tidyr")
library("knitr")
library("xtable")
library("lubridate")
## Warning: package 'lubridate' was built under R version 4.0.4
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library("grid")
library("gridExtra")
## Warning: package 'gridExtra' was built under R version 4.0.4
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library("lattice")
## Warning: package 'lattice' was built under R version 4.0.4
##sessionInfo()
We will be using the data from this website https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
The first section, named Loading, will be downloading the file, decompressed it using the R function bunzip2 (format being bzip2), read the data, store it in a dataframe df, and provide a quick overview/summary of the data
The second section, named Subset, is used to deep dive further in the data set. From the first histogram we will identify the years were the data seems more complete and accurate. We found out that during the last 18 years there were as many events recorded than during the first 44 years which was why we decided to choose 1994 as the first year of our working dataset.In this section we left commented some functions used to explore further the data input.
In the third section, named remapping category, we are transferring the information from EVTYPE to category and cleaning some of the manual input of the database. to do so, we start by broad classification pf category (under 9X and 0 series), then we use more specific terms to fine tune to the original event type expected. (1. to 48 as per the data documentation).
In the fourth section, we process the damages data. We start to work on the the exponent of Crop and property data. Note this is hard mapping due to lack of time to develop a clean function to map various unexpected input to correct exponent. Then we create the top 10 value for each component of our study : injuries, fatalities and both, then crop, property and total amount.
if(!file.exists("~/repdata_data_StormData.csv.bz2")){
download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="~/repdata_data_StormData.csv.bz2")
}
if(!file.exists("~/repdata_data_StormData.csv")){
bunzip2(filename="~/repdata_data_StormData.csv.bz2", destname="~/repdata_data_StormData.csv", overwrite=TRUE, remove=FALSE)
}
if (!"StormData" %in% ls()){
df <- fread("~/repdata_data_StormData.csv")
}
summary(df)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
head(df)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1: 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2: 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3: 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4: 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5: 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6: 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1: TORNADO 0 0 NA
## 2: TORNADO 0 0 NA
## 3: TORNADO 0 0 NA
## 4: TORNADO 0 0 NA
## 5: TORNADO 0 0 NA
## 6: TORNADO 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1: 0 14.0 100 3 0 0 15 25.0
## 2: 0 2.0 150 2 0 0 0 2.5
## 3: 0 0.1 123 2 0 0 2 25.0
## 4: 0 0.0 100 2 0 0 2 2.5
## 5: 0 0.0 150 2 0 0 2 2.5
## 6: 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1: K 0 3040 8812
## 2: K 0 3042 8755
## 3: K 0 3340 8742
## 4: K 0 3458 8626
## 5: K 0 3412 8642
## 6: K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1: 3051 8806 1
## 2: 0 0 2
## 3: 0 0 3
## 4: 0 0 4
## 5: 0 0 5
## 6: 0 0 6
dt <- mutate(df, EVTYPE = toupper(EVTYPE), PROPDMGEXP = toupper(PROPDMGEXP), CROPDMGEXP = toupper(CROPDMGEXP), DATES = as.numeric(year(mdy_hms(BGN_DATE))))
t <- dt %>% group_by(DATES) %>% tally()
ggplot(dt,aes(DATES))+
geom_histogram(bins = 62, color="black", fill="white")+
theme_bw()+
geom_hline(aes(yintercept=mean((t$n))))
dt <- filter(dt, DATES>1993)
## sort(unique(dt$EVTYPE))
## sort(unique(dt$INJURIES))
## sort(unique(dt$FATALITIES))
## sort(unique((dt$PROPDMG))
## sort(unique(dt$PROPDMGEXP))
## sort(unique(dt$CROPDMGEXP))
dt$EVCAT<-""
dt$EVCAT[grepl("COLD|HYPERTHERMIA|SNOW|SNOWFALL|COOL|RECORD LOW|ICE|ICY|WINTER|LOW TEMPERATURE|LOW TEMP", dt$EVTYPE)] <- "90.COLD WEATHER"
dt$EVCAT[grepl("HOT|DRY|HIGH TEMPERATURES|HIGH TEMPERATURE|RECORD HIGH|WARM|WARMTH|RECORD TEMPERATURE", dt$EVTYPE)] <- "91.HOT WEATHER"
dt$EVCAT[grepl("RAIN|WET|PRECIPICATION|RAINFALL|SHOWER|PRECIPITATION|FLOOODING|PRECIPATATION|PRECIP", dt$EVTYPE)] <- "92.WET AND RAINY WEATHER"
dt$EVCAT[grepl("WIND|STORM|WND|WINDS", dt$EVTYPE)] <- "93.WIND"
dt$EVCAT[grepl("SLIDE", dt$EVTYPE)] <- "94.SLIDE"
dt$EVCAT[grepl("DUST", dt$EVTYPE)] <- "95.DUST"
dt$EVCAT[grepl("FIRE|SMOKE|FOG", dt$EVTYPE)] <- "96.FIRE SMOKE AND FOG"
dt$EVCAT[grepl("URBAN", dt$EVTYPE)] <- "96.URBAN"
dt$EVCAT[grepl("MARINE|COASTAL|BEACH|SEAS|TIDES|WAVE", dt$EVTYPE)] <- "97.MARINE BEACH COASTAL"
dt$EVCAT[grepl("LIGHTNING|LIGNTNING|LIGHTING", dt$EVTYPE)] <- "29.LIGHTNING"
dt$EVCAT[grepl("HEAT|UNSEASONABLY WARM|WARMTH", dt$EVTYPE)] <- "20.HEAT"
dt$EVCAT[grepl("HAIL", dt$EVTYPE)] <- "19.HAIL"
dt$EVCAT[grepl("FROST|FREEZE|FREEZING", dt$EVTYPE)] <- "17.FROST/FREEZE"
dt$EVCAT[grepl("FLOOD", dt$EVTYPE)] <- "15.FLOOD"
dt$EVCAT[grepl("SUMMARY|OTHER|NONE|NO SEVERE WEATHER", dt$EVTYPE)] <- "0.NA"
dt$EVCAT[grepl("THUNDERSTORM WIND|THUNDERSTORM WINDS|THUNDERSTORMS WINDS|THUNDERSTORMW WINDS|THUNDERSTROM WINDS|THUNDESTORM WINDS|THUNDERTSORM WIND|THUNDERTORM WINDS|THUNERSTORM WINDS|THUNDERSTORM|THUNDEERSTORM|THUNDERSTORMS|THUDERSTORM|THUNDERESTORM|THUNDERSTROM", dt$EVTYPE)] <- "39.THUNDERSTORM WIND"
dt$EVCAT[grepl("ASTRONOMICAL LOW TIDE|ASTRONOMICAL HIGH TIDE", dt$EVTYPE)]<- "1.ASTRONOMICAL LOW TIDE"
dt$EVCAT[grepl("AVALANCHE|AVALANCE", dt$EVTYPE)]<- "2.AVALANCHE"
dt$EVCAT[grepl("BLIZZARD", dt$EVTYPE)] <- "3.BLIZZARD"
dt$EVCAT[grepl("COASTAL FLOOD", dt$EVTYPE)] <- "4.COASTAL FLOOD"
dt$EVCAT[grepl("EXTREME COLD CHILL|EXTREME WIND CHILL", dt$EVTYPE)] <- "13.EXTREME COLD/WIND CHILL"
dt$EVCAT[grepl("COLD CHILL|WIND CHILL", dt$EVTYPE)] <- "5.COLD/WIND CHILL"
dt$EVCAT[grepl("DEBRIS FLOW", dt$EVTYPE)] <- "6.DEBRIS FLOW"
dt$EVCAT[grepl("DENSE FOG", dt$EVTYPE)] <- "7.DENSE FOG"
dt$EVCAT[grepl("DENSE SMOKE", dt$EVTYPE)] <- "8.DENSE SMOKE"
dt$EVCAT[grepl("DROUGHT", dt$EVTYPE)] <- "9.DROUGHT"
dt$EVCAT[grepl("DUST DEVIL", dt$EVTYPE)] <- "10.DUST DEVIL"
dt$EVCAT[grepl("DUST STORM", dt$EVTYPE)] <- "11.DUST STORM"
dt$EVCAT[grepl("EXCESSIVE HEAT", dt$EVTYPE)] <- "12.EXCESSIVE HEAT"
dt$EVCAT[grepl("FLASH FLOOD", dt$EVTYPE)] <- "14.FLASH FLOOD"
dt$EVCAT[grepl("FREEZING FOG", dt$EVTYPE)] <- "16.FREEZING FOG"
dt$EVCAT[grepl("FUNNEL CLOUD", dt$EVTYPE)] <- "18.FUNNEL CLOUD"
dt$EVCAT[grepl("HEAVY RAIN", dt$EVTYPE)] <- "21.HEAVY RAIN"
dt$EVCAT[grepl("HEAVY SNOW|THUNDERSNOW|RECORD SNOW|RECORD MAY SNOW", dt$EVTYPE)] <- "22.HEAVY SNOW"
dt$EVCAT[grepl("HIGH SURF|HEAVY SURF|SURF", dt$EVTYPE)] <- "23.HIGH SURF"
dt$EVCAT[grepl("HIGH WIND", dt$EVTYPE)] <- "24.HIGH WIND"
dt$EVCAT[grepl("HURRICANE|TYPHOON", dt$EVTYPE)] <- "25.HURRICANE TYPHOON"
dt$EVCAT[grepl("ICE STORM", dt$EVTYPE)] <- "26.ICE STORM"
dt$EVCAT[grepl("LAKESHORE FLOOD", dt$EVTYPE)] <- "27.LAKESHORE FLOOD"
dt$EVCAT[grepl("LAKE EFFECT SNOW|LAKE-EFFECT SNOW|HEAVY LAKE SNOW", dt$EVTYPE)] <- "28.LAKE-EFFECT SNOW"
dt$EVCAT[grepl("MARINE HAIL", dt$EVTYPE)] <- "30.MARINE HAIL"
dt$EVCAT[grepl("MARINE HIGH WIND", dt$EVTYPE)] <- "31.MARINE HIGH WIND"
dt$EVCAT[grepl("MARINE STRONG WIND", dt$EVTYPE)] <- "32.MARINE STRONG WIND"
dt$EVCAT[grepl("MARINE THUNDERSTORM WIND", dt$EVTYPE)] <- "33.MARINE THUNDERSTORM WIND"
dt$EVCAT[grepl("RIP CURRENT", dt$EVTYPE)] <- "34.RIP CURRENT"
dt$EVCAT[grepl("SEICHE", dt$EVTYPE)] <- "35.SEICHE"
dt$EVCAT[grepl("SLEET", dt$EVTYPE)] <- "36.SLEET"
dt$EVCAT[grepl("STORM TIDE", dt$EVTYPE)] <- "37.STORM TIDE"
dt$EVCAT[grepl("STRONG WIND", dt$EVTYPE)] <- "38.STRONG WIND"
dt$EVCAT[grepl("TORNADO|TORNDAO", dt$EVTYPE)] <- "40.TORNADO"
dt$EVCAT[grepl("TROPICAL DEPRESSION", dt$EVTYPE)] <- "41.TROPICAL DEPRESSION"
dt$EVCAT[grepl("TROPICAL STORM", dt$EVTYPE)] <- "42.TROPICAL STORM"
dt$EVCAT[grepl("TSUNAMI|TSTM|TSTMW", dt$EVTYPE)] <- "43.TSUNAMI"
dt$EVCAT[grepl("VOLCANIC ASH|VOLCANIC ERUPTION", dt$EVTYPE)] <- "44.VOLCANIC ASH"
dt$EVCAT[grepl("WATERSPOUT|WAYTERSPOUT", dt$EVTYPE)] <- "45.WATERSPOUT"
dt$EVCAT[grepl("WILDFIRE|BRUSH FIRE|WILD FIRES", dt$EVTYPE)] <- "46.WILDFIRE"
dt$EVCAT[grepl("WINTER STORM", dt$EVTYPE)] <- "47.WINTER STORM"
dt$EVCAT[grepl("WINTER WEATHER", dt$EVTYPE)] <- "48.WINTER WEATHER"
sort(unique(dt$EVCAT))
## [1] "" "0.NA"
## [3] "1.ASTRONOMICAL LOW TIDE" "10.DUST DEVIL"
## [5] "11.DUST STORM" "12.EXCESSIVE HEAT"
## [7] "14.FLASH FLOOD" "15.FLOOD"
## [9] "16.FREEZING FOG" "17.FROST/FREEZE"
## [11] "18.FUNNEL CLOUD" "19.HAIL"
## [13] "2.AVALANCHE" "20.HEAT"
## [15] "21.HEAVY RAIN" "22.HEAVY SNOW"
## [17] "23.HIGH SURF" "24.HIGH WIND"
## [19] "25.HURRICANE TYPHOON" "26.ICE STORM"
## [21] "27.LAKESHORE FLOOD" "28.LAKE-EFFECT SNOW"
## [23] "29.LIGHTNING" "3.BLIZZARD"
## [25] "30.MARINE HAIL" "31.MARINE HIGH WIND"
## [27] "33.MARINE THUNDERSTORM WIND" "34.RIP CURRENT"
## [29] "35.SEICHE" "36.SLEET"
## [31] "38.STRONG WIND" "39.THUNDERSTORM WIND"
## [33] "4.COASTAL FLOOD" "40.TORNADO"
## [35] "41.TROPICAL DEPRESSION" "42.TROPICAL STORM"
## [37] "43.TSUNAMI" "44.VOLCANIC ASH"
## [39] "45.WATERSPOUT" "46.WILDFIRE"
## [41] "47.WINTER STORM" "48.WINTER WEATHER"
## [43] "5.COLD/WIND CHILL" "7.DENSE FOG"
## [45] "8.DENSE SMOKE" "9.DROUGHT"
## [47] "90.COLD WEATHER" "91.HOT WEATHER"
## [49] "92.WET AND RAINY WEATHER" "93.WIND"
## [51] "94.SLIDE" "95.DUST"
## [53] "96.FIRE SMOKE AND FOG" "96.URBAN"
## [55] "97.MARINE BEACH COASTAL"
sort(unique(dt$CROPDMGEXP))
## [1] "" "?" "0" "2" "B" "K" "M"
sort(unique(dt$PROPDMGEXP))
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "H" "K" "M"
dt$CRPEXP <- mapvalues(dt$CROPDMGEXP,
c("","?","0","2","K","M","B"),
c("1","1","1","1e2","1e3","1e6","1e9"))
dt$PPTYEXP <- mapvalues(dt$PROPDMGEXP,
c("","?","-","+","0","1","2","H","3","K","4","5","6","M","7","8","B"),
c("1","1","1","1","1","1","1e2","1e2","1e3","1e3","1e4","1e5","1e6","1e6","1e7","1e8","1e9"))
dt$TOTCRP <- as.numeric(dt$CRPEXP) * dt$CROPDMG
dt$TOTPROP <- as.numeric(dt$PPTYEXP ) * dt$PROPDMG
dt$HUM <- dt$FATALITIES + dt$INJURIES
dt$TOT <- dt$TOTCRP + dt$TOTPROP
humanbycat <- head(arrange(aggregate(HUM ~ EVCAT, data=dt,sum), desc(HUM)), n=10)
totbycat <- head(arrange(aggregate(TOT ~ EVCAT, data=dt,sum), desc(TOT)), n=10)
fatalbycat <- head(arrange(aggregate(FATALITIES ~ EVCAT, data=dt,sum), desc(FATALITIES)),n=3)
injurbycat <- head(arrange(aggregate(INJURIES ~ EVCAT, data=dt,sum), desc(INJURIES)), n=3)
propbycat <- head(arrange(aggregate(TOTPROP ~ EVCAT, data=dt,sum),desc(TOTPROP)), n=3)
crpbycat <- head(arrange(aggregate(TOTCRP ~ EVCAT, data=dt,sum), desc(TOTCRP)), n=3)
We observe that Tornado are the event that have the most human impact (more than 23K people over 13 years) while for fatalities alone it is excessive heat. If we look at economical impact, then tornado will be on 4th position (25 Billions) after flood (150 Billions), Hurricane/Typhoon (80 Billions) and Wind (around 50 Billions)
humplot <- ggplot(humanbycat, aes(EVCAT, HUM)) +
geom_bar(stat = "identity", width = 0.5)+
labs( y = "TOTAL (people)", x = "Events", title="Total of human casualties per weather events in the US from 1994 to 2011")+
theme(axis.text.x = element_text(angle = 35, hjust = 1))
humplot
knitr::kables(
list(
knitr::kable(fatalbycat,valign="c",caption="FATALITIES"),
knitr::kable(injurbycat,valign="c",caption="INJURIES")
),
caption="Top 3 event types with highest injuries and fatalities"
)
|
|
totplot<- ggplot(totbycat, aes(EVCAT, TOT)) +
geom_bar(stat = "identity", width = 0.5)+
labs( y = "TOTAL (USD)", x = "Events", title="Total damage cost per weather events in the US from 1994 to 2011")+
theme(axis.text.x = element_text(angle = 35, hjust = 1))
totplot
knitr::kables(
list(
knitr::kable(propbycat,valign="c",caption="PROPERTIES"),
knitr::kable(crpbycat,valign="c",caption="CROPS")
),
caption="Top 3 event types with highest Properties and Crops"
)
|
|