The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Your data analysis must address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
The analysis “Superstorms: The most severe weather events in the U.S.” was written by hugominze and published on RPubs. It is based on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and examines which types of weather events were most harmful to population health and to economic goods in the U.S. from 1950 to 2011. The author concludes that superstorms such as tornadoes, hurricanes and typhoons top the list of weather events most harmful to both economic goods and population health.
This is a roadmap for our data preprocessing:
- bgn_date and end_date column
- propdmg (property damage) and cropdmg (crop damage) in whole numbers (without B, K etc.)
- evtype column
- state and state_nr column
- county and countyname column

First of all, we set the working directory and load all packages necessary for preparing and analyzing the data.
# set working directory!
# load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(snakecase)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(XML)
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(foreign)
Then, we create a data directory, download the NOAA storm data, and read it in.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("data")) {
dir.create("data")
}
if (!file.exists("./data/data.csv.bz2")) {
download.file(url, "./data/data.csv.bz2", method = "curl")
dataDownloaded <- date()
}
list.files("./data")
## [1] "data.csv.bz2" "forecastzones" "forecastzones.zip"
data1 <- read.csv("./data/data.csv.bz2")
#create tibble:
data <- as_tibble(data1)
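As an aside, readr::read_csv() (loaded with the tidyverse) can read the bz2-compressed file directly and already returns a tibble, so the as_tibble() step could be skipped; note that its column type guessing can differ from base read.csv(). A minimal sketch (not used here):
# alternative (not run): read the compressed csv straight into a tibble
# data_alt <- readr::read_csv("./data/data.csv.bz2")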
Let’s have a look at the data.
data
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(data)
## tibble [902,297 × 37] (S3: tbl_df/tbl/data.frame)
## $ STATE__ : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr [1:902297] "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
## $ COUNTY : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr [1:902297] "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr [1:902297] "" "" "" "" ...
## $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
## $ END_DATE : chr [1:902297] "" "" "" "" ...
## $ END_TIME : chr [1:902297] "" "" "" "" ...
## $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
## $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr [1:902297] "" "" "" "" ...
## $ END_LOCATI: chr [1:902297] "" "" "" "" ...
## $ LENGTH : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
## $ WFO : chr [1:902297] "" "" "" "" ...
## $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
## $ ZONENAMES : chr [1:902297] "" "" "" "" ...
## $ LATITUDE : num [1:902297] 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
## $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
## $ REMARKS : chr [1:902297] "" "" "" "" ...
## $ REFNUM : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...
There are 37 columns and 902,297 rows. For our purposes, we do not need all of the columns: we keep only those that are vital to the analysis and those that make each entry unique. We also want to make the column names easier to read.
So, we make the column names easy to read using snake case and unique names:
names(data) <- to_snake_case(names(data))
names(data) <- make.names(names(data), unique=TRUE)
data <- data %>%
rename(state_nr=state, state = "state.1") %>%
relocate(state, .after = state_nr) %>%
print
## # A tibble: 902,297 × 37
## state_nr state bgn_date bgn_time time_zone county countyname evtype bgn_range
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 AL 4/18/19… 0130 CST 97 MOBILE TORNA… 0
## 2 1 AL 4/18/19… 0145 CST 3 BALDWIN TORNA… 0
## 3 1 AL 2/20/19… 1600 CST 57 FAYETTE TORNA… 0
## 4 1 AL 6/8/195… 0900 CST 89 MADISON TORNA… 0
## 5 1 AL 11/15/1… 1500 CST 43 CULLMAN TORNA… 0
## 6 1 AL 11/15/1… 2000 CST 77 LAUDERDALE TORNA… 0
## 7 1 AL 11/16/1… 0100 CST 9 BLOUNT TORNA… 0
## 8 1 AL 1/22/19… 0900 CST 123 TALLAPOOSA TORNA… 0
## 9 1 AL 2/13/19… 2000 CST 125 TUSCALOOSA TORNA… 0
## 10 1 AL 2/13/19… 2000 CST 57 FAYETTE TORNA… 0
## # … with 902,287 more rows, and 28 more variables: bgn_azi <chr>,
## # bgn_locati <chr>, end_date <chr>, end_time <chr>, county_end <dbl>,
## # countyendn <lgl>, end_range <dbl>, end_azi <chr>, end_locati <chr>,
## # length <dbl>, width <dbl>, f <int>, mag <dbl>, fatalities <dbl>,
## # injuries <dbl>, propdmg <dbl>, propdmgexp <chr>, cropdmg <dbl>,
## # cropdmgexp <chr>, wfo <chr>, stateoffic <chr>, zonenames <chr>,
## # latitude <dbl>, longitude <dbl>, latitude_e <dbl>, longitude.1 <dbl>, …
For better readability, we convert the countyname, evtype and state values to lowercase:
data <- data %>%
mutate(countyname = tolower(countyname), evtype=tolower(evtype),
state = tolower(state)) %>%
print
## # A tibble: 902,297 × 37
## state_nr state bgn_date bgn_time time_zone county countyname evtype bgn_range
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 al 4/18/19… 0130 CST 97 mobile torna… 0
## 2 1 al 4/18/19… 0145 CST 3 baldwin torna… 0
## 3 1 al 2/20/19… 1600 CST 57 fayette torna… 0
## 4 1 al 6/8/195… 0900 CST 89 madison torna… 0
## 5 1 al 11/15/1… 1500 CST 43 cullman torna… 0
## 6 1 al 11/15/1… 2000 CST 77 lauderdale torna… 0
## 7 1 al 11/16/1… 0100 CST 9 blount torna… 0
## 8 1 al 1/22/19… 0900 CST 123 tallapoosa torna… 0
## 9 1 al 2/13/19… 2000 CST 125 tuscaloosa torna… 0
## 10 1 al 2/13/19… 2000 CST 57 fayette torna… 0
## # … with 902,287 more rows, and 28 more variables: bgn_azi <chr>,
## # bgn_locati <chr>, end_date <chr>, end_time <chr>, county_end <dbl>,
## # countyendn <lgl>, end_range <dbl>, end_azi <chr>, end_locati <chr>,
## # length <dbl>, width <dbl>, f <int>, mag <dbl>, fatalities <dbl>,
## # injuries <dbl>, propdmg <dbl>, propdmgexp <chr>, cropdmg <dbl>,
## # cropdmgexp <chr>, wfo <chr>, stateoffic <chr>, zonenames <chr>,
## # latitude <dbl>, longitude <dbl>, latitude_e <dbl>, longitude.1 <dbl>, …
Next, let us drop the columns that are irrelevant to our analysis:
data <- data %>%
select(-bgn_range,-bgn_azi,-bgn_locati,-end_time,-county_end,-countyendn,
-end_range,-end_azi,-end_locati,-wfo,-stateoffic,-zonenames,
-latitude,-longitude,
-latitude_e,-longitude.1) %>%
print
## # A tibble: 902,297 × 21
## state_nr state bgn_date bgn_time time_zone county countyname evtype end_date
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 al 4/18/195… 0130 CST 97 mobile torna… ""
## 2 1 al 4/18/195… 0145 CST 3 baldwin torna… ""
## 3 1 al 2/20/195… 1600 CST 57 fayette torna… ""
## 4 1 al 6/8/1951… 0900 CST 89 madison torna… ""
## 5 1 al 11/15/19… 1500 CST 43 cullman torna… ""
## 6 1 al 11/15/19… 2000 CST 77 lauderdale torna… ""
## 7 1 al 11/16/19… 0100 CST 9 blount torna… ""
## 8 1 al 1/22/195… 0900 CST 123 tallapoosa torna… ""
## 9 1 al 2/13/195… 2000 CST 125 tuscaloosa torna… ""
## 10 1 al 2/13/195… 2000 CST 57 fayette torna… ""
## # … with 902,287 more rows, and 12 more variables: length <dbl>, width <dbl>,
## # f <int>, mag <dbl>, fatalities <dbl>, injuries <dbl>, propdmg <dbl>,
## # propdmgexp <chr>, cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>,
## # refnum <dbl>
bgn_date and end_date column

We are going to create begin and end date columns in date format: bgn_date and end_date. As long as they are stored as characters, they are hard to work with, so we convert them to date objects. Since the exact times and time zones are not important for our purpose, we keep only the dates of the weather events and drop the times and time zones. First, we fill empty end dates with the begin date:
data <- data %>% mutate(
end_date=ifelse(end_date=="", bgn_date, end_date)
) %>%
print
## # A tibble: 902,297 × 21
## state_nr state bgn_date bgn_time time_zone county countyname evtype end_date
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 1 al 4/18/195… 0130 CST 97 mobile torna… 4/18/19…
## 2 1 al 4/18/195… 0145 CST 3 baldwin torna… 4/18/19…
## 3 1 al 2/20/195… 1600 CST 57 fayette torna… 2/20/19…
## 4 1 al 6/8/1951… 0900 CST 89 madison torna… 6/8/195…
## 5 1 al 11/15/19… 1500 CST 43 cullman torna… 11/15/1…
## 6 1 al 11/15/19… 2000 CST 77 lauderdale torna… 11/15/1…
## 7 1 al 11/16/19… 0100 CST 9 blount torna… 11/16/1…
## 8 1 al 1/22/195… 0900 CST 123 tallapoosa torna… 1/22/19…
## 9 1 al 2/13/195… 2000 CST 125 tuscaloosa torna… 2/13/19…
## 10 1 al 2/13/195… 2000 CST 57 fayette torna… 2/13/19…
## # … with 902,287 more rows, and 12 more variables: length <dbl>, width <dbl>,
## # f <int>, mag <dbl>, fatalities <dbl>, injuries <dbl>, propdmg <dbl>,
## # propdmgexp <chr>, cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>,
## # refnum <dbl>
data$bgn_date <- sapply(strsplit(data$bgn_date, " "), function(x){x[1]})
data$end_date <- sapply(strsplit(data$end_date, " "), function(x){x[1]})
data$bgn_date <- mdy(data$bgn_date)
data$end_date <- mdy(data$end_date)
data <- data %>%
select(state_nr, state, bgn_date, end_date, county:refnum, - bgn_time,
-time_zone) %>%
print
## # A tibble: 902,297 × 19
## state_nr state bgn_date end_date county countyname evtype length width
## <dbl> <chr> <date> <date> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 al 1950-04-18 1950-04-18 97 mobile tornado 14 100
## 2 1 al 1950-04-18 1950-04-18 3 baldwin tornado 2 150
## 3 1 al 1951-02-20 1951-02-20 57 fayette tornado 0.1 123
## 4 1 al 1951-06-08 1951-06-08 89 madison tornado 0 100
## 5 1 al 1951-11-15 1951-11-15 43 cullman tornado 0 150
## 6 1 al 1951-11-15 1951-11-15 77 lauderdale tornado 1.5 177
## 7 1 al 1951-11-16 1951-11-16 9 blount tornado 1.5 33
## 8 1 al 1952-01-22 1952-01-22 123 tallapoosa tornado 0 33
## 9 1 al 1952-02-13 1952-02-13 125 tuscaloosa tornado 3.3 100
## 10 1 al 1952-02-13 1952-02-13 57 fayette tornado 2.3 100
## # … with 902,287 more rows, and 10 more variables: f <int>, mag <dbl>,
## # fatalities <dbl>, injuries <dbl>, propdmg <dbl>, propdmgexp <chr>,
## # cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>, refnum <dbl>
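As an aside, lubridate can also parse the full “month/day/year hours:minutes:seconds” strings in one step, which would make the strsplit() calls unnecessary. A minimal sketch (not used above):
# example (not run): parse the full timestamp directly, then keep only the date
# as_date(mdy_hms("4/18/1950 0:00:00"))  # -> "1950-04-18"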
propdmg (property damage) and cropdmg (crop damage) in whole numbers (without B, K etc.)

Currently, the column propdmgexp stores the exponent of each damage value as a letter code. For example, propdmg = 25 with propdmgexp = "K" represents 25,000 dollars of property damage. We want to turn these codes into multipliers and multiply them with propdmg to get the actual dollar amounts. Only then can we sum the damages up.
First, we change the exponent abbreviations such as B for billion or M for million into numbers:
data <- data %>%
mutate(
propdmgexp= ifelse(propdmgexp=="B", 1000000000, propdmgexp),
propdmgexp=ifelse(propdmgexp=="b", 1000000000, propdmgexp),
propdmgexp=ifelse(propdmgexp=="M", 1000000, propdmgexp),
propdmgexp=ifelse(propdmgexp=="m", 1000000, propdmgexp),
propdmgexp=ifelse(propdmgexp=="K", 1000, propdmgexp),
propdmgexp=ifelse(propdmgexp=="k", 1000, propdmgexp),
propdmgexp=ifelse(propdmgexp=="H", 100, propdmgexp),
propdmgexp=ifelse(propdmgexp=="h", 100, propdmgexp),
propdmgexp=ifelse(propdmgexp=="1", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="2", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="3", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="4", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="5", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="6", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="7", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="8", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="0", 10, propdmgexp),
propdmgexp=ifelse(propdmgexp=="+", 1, propdmgexp),
propdmgexp=ifelse(propdmgexp=="-", 0, propdmgexp),
propdmgexp=ifelse(propdmgexp=="", 0, propdmgexp),
propdmgexp=ifelse(propdmgexp=="?", 0, propdmgexp),
) %>%
print
## # A tibble: 902,297 × 19
## state_nr state bgn_date end_date county countyname evtype length width
## <dbl> <chr> <date> <date> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 al 1950-04-18 1950-04-18 97 mobile tornado 14 100
## 2 1 al 1950-04-18 1950-04-18 3 baldwin tornado 2 150
## 3 1 al 1951-02-20 1951-02-20 57 fayette tornado 0.1 123
## 4 1 al 1951-06-08 1951-06-08 89 madison tornado 0 100
## 5 1 al 1951-11-15 1951-11-15 43 cullman tornado 0 150
## 6 1 al 1951-11-15 1951-11-15 77 lauderdale tornado 1.5 177
## 7 1 al 1951-11-16 1951-11-16 9 blount tornado 1.5 33
## 8 1 al 1952-01-22 1952-01-22 123 tallapoosa tornado 0 33
## 9 1 al 1952-02-13 1952-02-13 125 tuscaloosa tornado 3.3 100
## 10 1 al 1952-02-13 1952-02-13 57 fayette tornado 2.3 100
## # … with 902,287 more rows, and 10 more variables: f <int>, mag <dbl>,
## # fatalities <dbl>, injuries <dbl>, propdmg <dbl>, propdmgexp <chr>,
## # cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>, refnum <dbl>
Then, we convert propdmgexp to numeric and multiply it with propdmg:
data <- data %>%
mutate(propdmgexp=as.numeric(propdmgexp)) %>%
mutate(`propdmgexp*exp` = propdmg * propdmgexp)%>%
select(-propdmg, -propdmgexp) %>%
print
## # A tibble: 902,297 × 18
## state_nr state bgn_date end_date county countyname evtype length width
## <dbl> <chr> <date> <date> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 al 1950-04-18 1950-04-18 97 mobile tornado 14 100
## 2 1 al 1950-04-18 1950-04-18 3 baldwin tornado 2 150
## 3 1 al 1951-02-20 1951-02-20 57 fayette tornado 0.1 123
## 4 1 al 1951-06-08 1951-06-08 89 madison tornado 0 100
## 5 1 al 1951-11-15 1951-11-15 43 cullman tornado 0 150
## 6 1 al 1951-11-15 1951-11-15 77 lauderdale tornado 1.5 177
## 7 1 al 1951-11-16 1951-11-16 9 blount tornado 1.5 33
## 8 1 al 1952-01-22 1952-01-22 123 tallapoosa tornado 0 33
## 9 1 al 1952-02-13 1952-02-13 125 tuscaloosa tornado 3.3 100
## 10 1 al 1952-02-13 1952-02-13 57 fayette tornado 2.3 100
## # … with 902,287 more rows, and 9 more variables: f <int>, mag <dbl>,
## # fatalities <dbl>, injuries <dbl>, cropdmg <dbl>, cropdmgexp <chr>,
## # remarks <chr>, refnum <dbl>, propdmgexp*exp <dbl>
Let’s do the same with cropdmgexp and cropdmg:
data <-data %>%
mutate(
cropdmgexp= ifelse(cropdmgexp=="B", 1000000000, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="b", 1000000000, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="M", 1000000, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="m", 1000000, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="K", 1000, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="k", 1000, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="H", 100, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="h", 100, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="1", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="2", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="3", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="4", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="5", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="6", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="7", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="8", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="0", 10, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="+", 1, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="-", 0, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="", 0, cropdmgexp),
cropdmgexp=ifelse(cropdmgexp=="?", 0, cropdmgexp),
) %>%
print
## # A tibble: 902,297 × 18
## state_nr state bgn_date end_date county countyname evtype length width
## <dbl> <chr> <date> <date> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 al 1950-04-18 1950-04-18 97 mobile tornado 14 100
## 2 1 al 1950-04-18 1950-04-18 3 baldwin tornado 2 150
## 3 1 al 1951-02-20 1951-02-20 57 fayette tornado 0.1 123
## 4 1 al 1951-06-08 1951-06-08 89 madison tornado 0 100
## 5 1 al 1951-11-15 1951-11-15 43 cullman tornado 0 150
## 6 1 al 1951-11-15 1951-11-15 77 lauderdale tornado 1.5 177
## 7 1 al 1951-11-16 1951-11-16 9 blount tornado 1.5 33
## 8 1 al 1952-01-22 1952-01-22 123 tallapoosa tornado 0 33
## 9 1 al 1952-02-13 1952-02-13 125 tuscaloosa tornado 3.3 100
## 10 1 al 1952-02-13 1952-02-13 57 fayette tornado 2.3 100
## # … with 902,287 more rows, and 9 more variables: f <int>, mag <dbl>,
## # fatalities <dbl>, injuries <dbl>, cropdmg <dbl>, cropdmgexp <chr>,
## # remarks <chr>, refnum <dbl>, propdmgexp*exp <dbl>
data <- data %>%
mutate(cropdmgexp=as.numeric(cropdmgexp)) %>%
mutate(`cropdmgexp*exp` = cropdmg * cropdmgexp) %>%
select(-cropdmg, -cropdmgexp) %>%
print
## # A tibble: 902,297 × 17
## state_nr state bgn_date end_date county countyname evtype length width
## <dbl> <chr> <date> <date> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 al 1950-04-18 1950-04-18 97 mobile tornado 14 100
## 2 1 al 1950-04-18 1950-04-18 3 baldwin tornado 2 150
## 3 1 al 1951-02-20 1951-02-20 57 fayette tornado 0.1 123
## 4 1 al 1951-06-08 1951-06-08 89 madison tornado 0 100
## 5 1 al 1951-11-15 1951-11-15 43 cullman tornado 0 150
## 6 1 al 1951-11-15 1951-11-15 77 lauderdale tornado 1.5 177
## 7 1 al 1951-11-16 1951-11-16 9 blount tornado 1.5 33
## 8 1 al 1952-01-22 1952-01-22 123 tallapoosa tornado 0 33
## 9 1 al 1952-02-13 1952-02-13 125 tuscaloosa tornado 3.3 100
## 10 1 al 1952-02-13 1952-02-13 57 fayette tornado 2.3 100
## # … with 902,287 more rows, and 8 more variables: f <int>, mag <dbl>,
## # fatalities <dbl>, injuries <dbl>, remarks <chr>, refnum <dbl>,
## # propdmgexp*exp <dbl>, cropdmgexp*exp <dbl>
Ok, now we rename the new columns back into propdmg and cropdmg:
data <- data %>%
rename(propdmg = "propdmgexp*exp") %>%
rename(cropdmg = `cropdmgexp*exp`) %>%
print
## # A tibble: 902,297 × 17
## state_nr state bgn_date end_date county countyname evtype length width
## <dbl> <chr> <date> <date> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 al 1950-04-18 1950-04-18 97 mobile tornado 14 100
## 2 1 al 1950-04-18 1950-04-18 3 baldwin tornado 2 150
## 3 1 al 1951-02-20 1951-02-20 57 fayette tornado 0.1 123
## 4 1 al 1951-06-08 1951-06-08 89 madison tornado 0 100
## 5 1 al 1951-11-15 1951-11-15 43 cullman tornado 0 150
## 6 1 al 1951-11-15 1951-11-15 77 lauderdale tornado 1.5 177
## 7 1 al 1951-11-16 1951-11-16 9 blount tornado 1.5 33
## 8 1 al 1952-01-22 1952-01-22 123 tallapoosa tornado 0 33
## 9 1 al 1952-02-13 1952-02-13 125 tuscaloosa tornado 3.3 100
## 10 1 al 1952-02-13 1952-02-13 57 fayette tornado 2.3 100
## # … with 902,287 more rows, and 8 more variables: f <int>, mag <dbl>,
## # fatalities <dbl>, injuries <dbl>, remarks <chr>, refnum <dbl>,
## # propdmg <dbl>, cropdmg <dbl>
And we are done with cleaning up the propdmg and cropdmg columns.
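As an aside, the two ifelse() chains above could be written more compactly with a single named lookup vector. This is only a sketch (not run here) and assumes the same multiplier choices as above: every digit becomes 10, "+" becomes 1, and "", "-", "?" become 0.
# map each exponent code to its multiplier
exp_lookup <- c(B = 1e9, b = 1e9, M = 1e6, m = 1e6, K = 1e3, k = 1e3,
H = 100, h = 100, "0" = 10, "1" = 10, "2" = 10, "3" = 10,
"4" = 10, "5" = 10, "6" = 10, "7" = 10, "8" = 10, "+" = 1)
to_multiplier <- function(code) {
out <- unname(exp_lookup[code])  # look up each code by name
out[is.na(out)] <- 0             # "", "-", "?" and anything unexpected become 0
out
}
# equivalent to the two chains plus the multiplications above:
# data %>% mutate(propdmg = propdmg * to_multiplier(propdmgexp),
#                 cropdmg = cropdmg * to_multiplier(cropdmgexp))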
evtype column

Next, let’s have a look at the event types:
evtype_count <- data %>%
count(evtype) %>%
arrange(desc(n)) %>%
print
## # A tibble: 898 × 2
## evtype n
## <chr> <int>
## 1 hail 288661
## 2 tstm wind 219942
## 3 thunderstorm wind 82564
## 4 tornado 60652
## 5 flash flood 54277
## 6 flood 25327
## 7 thunderstorm winds 20843
## 8 high wind 20214
## 9 lightning 15754
## 10 heavy snow 15708
## # … with 888 more rows
There are 898 event types in the data, even though the official documentation only defines 48 categories of weather events! So many of the event types above can be grouped under one documented category, or are simply erroneous.
It would take a lot of time to clean the data manually, for example to decide whether an entry like “tstm wind/hail” belongs in the category “hail” or “thunderstorm wind”. Instead of rectifying every row, I will make sure that only a small percentage of the economic damage and harm to population health is lost when I erase rows with erroneous evtype entries.
This is our roadmap for this undertaking:
1. Calculate the percentage of damage and harm attributed to event types that are not in the documentation.
2. Change wrong event type names into event type names that are available in the documentation until that percentage is low enough to delete the rest.
Let’s first check what percentage of the economic damage and harm to population health is attributed to these undocumented event types.
These are the weather events from the documentation:
weatherevents <- tolower(c("Astronomical Low Tide", "Avalanche", "Blizzard",
"Coastal Flood", "Cold/Wind Chill", "Debris Flow",
"Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
"Dust Storm", "Excessive Heat",
"Extreme Cold/Wind Chill", "Flash Flood", "Flood",
"Freezing Fog", "Frost/Freeze", "Funnel Cloud",
"Hail", "Heat", "Heavy Rain", "Heavy Snow",
"High Surf", "High Wind", "Hurricane/Typhoon",
"Ice Storm", "Lakeshore Flood", "Lake-Effect Snow",
"Lightning", "Marine Hail", "Marine High Wind",
"Marine Strong Wind", "Marine Thunderstorm Wind",
"Rip Current", "Seiche", "Sleet", "Storm Surge/Tide",
"Strong Wind", "Thunderstorm Wind", "Tornado",
"Tropical Depression", "Tropical Storm", "Tsunami",
"Volcanic Ash", "Waterspout", "Wildfire",
"Winter Storm", "Winter Weather"))
weatherevents <- sort(weatherevents)
weatherevents
## [1] "astronomical low tide" "avalanche"
## [3] "blizzard" "coastal flood"
## [5] "cold/wind chill" "debris flow"
## [7] "dense fog" "dense smoke"
## [9] "drought" "dust devil"
## [11] "dust storm" "excessive heat"
## [13] "extreme cold/wind chill" "flash flood"
## [15] "flood" "freezing fog"
## [17] "frost/freeze" "funnel cloud"
## [19] "hail" "heat"
## [21] "heavy rain" "heavy snow"
## [23] "high surf" "high wind"
## [25] "hurricane/typhoon" "ice storm"
## [27] "lake-effect snow" "lakeshore flood"
## [29] "lightning" "marine hail"
## [31] "marine high wind" "marine strong wind"
## [33] "marine thunderstorm wind" "rip current"
## [35] "seiche" "sleet"
## [37] "storm surge/tide" "strong wind"
## [39] "thunderstorm wind" "tornado"
## [41] "tropical depression" "tropical storm"
## [43] "tsunami" "volcanic ash"
## [45] "waterspout" "wildfire"
## [47] "winter storm" "winter weather"
These are the event types that are not in the documentation:
data %>% filter(!evtype %in% weatherevents) %>%
count(evtype) %>%
arrange(desc(n))
Let’s calculate the percentages of the event types which are not in the documentation:
#not in documentation
nodoc <- data %>% filter(!evtype %in% weatherevents) %>%
summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))
#in documentation
doc <- data %>% filter(evtype %in% weatherevents) %>%
summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))
# in and out of documentation
all <- data %>% summarize(sum(propdmg), sum(cropdmg), sum(injuries),
sum(fatalities))
test <- full_join(nodoc, doc)
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
test1 <- full_join(test, all) %>%
mutate(type= c("nan_doc", "doc", "all")) %>%
select(type, "sum(propdmg)":"sum(fatalities)") %>%
print
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
## # A tibble: 3 × 5
## type `sum(propdmg)` `sum(cropdmg)` `sum(injuries)` `sum(fatalities)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 nan_doc 82982093201 11858392160 12558 2023
## 2 doc 344336626839 37245802350 127970 13122
## 3 all 427318720040 49104194510 140528 15145
percentages <- tibble(propdmg = test1$`sum(propdmg)`[1] / test1$`sum(propdmg)`[3],
cropdmg = test1$`sum(cropdmg)`[1] / test1$`sum(cropdmg)`[3],
fatalities = test1$`sum(fatalities)`[1] / test1$`sum(fatalities)`[3],
injuries = test1$`sum(injuries)`[1] / test1$`sum(injuries)`[3]) %>%
print
## # A tibble: 1 × 4
## propdmg cropdmg fatalities injuries
## <dbl> <dbl> <dbl> <dbl>
## 1 0.194 0.241 0.134 0.0894
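As an aside, the same shares could be computed in one pipeline with a logical flag and summarize(across()); a minimal sketch:
data %>%
mutate(documented = evtype %in% weatherevents) %>%
group_by(documented) %>%
summarize(across(c(propdmg, cropdmg, injuries, fatalities), sum)) %>%
mutate(across(propdmg:fatalities, ~ .x / sum(.x))) # each group's share of the total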
All in all, a substantial share of the damage and harm would be lost if we simply erased the rows whose event types are not in the official documentation (about 19% of property damage and 24% of crop damage), so we cannot just delete them. Instead, let’s look at which event types are both undocumented and account for the most damage or harm to health. We might not be able to fix every wrong event type value, but we can fix the most important ones.
Let’s start with property damage:
data %>% filter(!evtype %in% weatherevents) %>%
group_by(evtype) %>%
summarize(sum_propdmg = sum(propdmg)) %>%
arrange(desc(sum_propdmg))
Wow, that’s a lot of property damage that would have been lost if we had just erased those rows (the first two entries alone account for more than 50 billion dollars).
Change wrong event type names into event type names that are available in the documentation until the percentage is low enough to delete the rest

Let’s change the event type names above to the ones in the documentation:
# "storm surge/tide"
data$evtype <- replace(data$evtype, data$evtype=="storm surge", "storm surge/tide")
# "hurricane/typhoon"
data$evtype <- replace(data$evtype, data$evtype=="hurricane opal", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="typhoon", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane erin", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane opal/high winds", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane erin", "hurricane/typhoon")
#"flood"
data$evtype <- replace(data$evtype, data$evtype=="river flood", "flood")
data$evtype <- replace(data$evtype, data$evtype=="flooding", "flood")
data$evtype <- replace(data$evtype, data$evtype=="major flood", "flood")
data$evtype <- replace(data$evtype, data$evtype=="river flooding", "flood")
data$evtype <- replace(data$evtype, data$evtype=="coastal flooding", "coastal flood")
data$evtype <- replace(data$evtype, data$evtype=="flood/rain/winds", "flood")
#"thunderstorm wind"
data$evtype <- replace(data$evtype, data$evtype=="tstm wind", "thunderstorm wind")
data$evtype <- replace(data$evtype, data$evtype=="thunderstorm winds", "thunderstorm wind")
data$evtype <- replace(data$evtype, data$evtype=="severe thunderstorm", "thunderstorm wind")
#"tornado"
data$evtype <- replace(data$evtype, data$evtype=="tornadoes, tstm wind, hail", "tornado")
#"wildfire"
data$evtype <- replace(data$evtype, data$evtype=="wild/forest fire", "wildfire")
data$evtype <- replace(data$evtype, data$evtype=="wild fires", "wildfire")
data$evtype <- replace(data$evtype, data$evtype=="wildfires", "wildfire")
#"heavy rain"
data$evtype <- replace(data$evtype, data$evtype=="heavy rain/severe weather", "heavy rain")
data$evtype <- replace(data$evtype, data$evtype=="excessive wetness", "heavy rain")
#hail
data$evtype <- replace(data$evtype, data$evtype=="hailstorm", "hail")
#"flash flood"
data$evtype <- replace(data$evtype, data$evtype=="flash flooding", "flash flood")
data$evtype <- replace(data$evtype, data$evtype=="flash flood/flood", "flash flood")
data$evtype <- replace(data$evtype, data$evtype=="flood/flash flood", "flash flood")
data$evtype <- replace(data$evtype, data$evtype=="heavy surf/high surf", "coastal flood")
#avalanche
data$evtype <- replace(data$evtype, data$evtype=="landslide", "avalanche")
#high wind
data$evtype <- replace(data$evtype, data$evtype=="high winds", "high wind")
data$evtype <- replace(data$evtype, data$evtype=="high winds/cold", "high wind")
#frost/freeze
data$evtype <- replace(data$evtype, data$evtype=="extreme cold", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="freeze", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="damaging freeze", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="ice", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="glaze", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="cold", "frost/freeze")
#rip current
data$evtype <- replace(data$evtype, data$evtype=="rip currents", "rip current")
#"excessive heat"
data$evtype <- replace(data$evtype, data$evtype=="extreme heat", "excessive heat")
data$evtype <- replace(data$evtype, data$evtype=="heat wave", "excessive heat")
data$evtype <- replace(data$evtype, data$evtype=="unseasonably warm and dry", "excessive heat")
data$evtype <- replace(data$evtype, data$evtype=="record/excessive heat", "excessive heat")
#"dense fog"
data$evtype <- replace(data$evtype, data$evtype=="fog", "dense fog")
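As an aside, the long chain of replace() calls above could be expressed as a single lookup with dplyr::recode(). A minimal sketch (not run here, and covering only a few of the mappings above):
# "old name" = "new name", spliced into recode() with !!!
evtype_map <- c(
"storm surge" = "storm surge/tide",
"hurricane" = "hurricane/typhoon",
"typhoon" = "hurricane/typhoon",
"tstm wind" = "thunderstorm wind",
"thunderstorm winds" = "thunderstorm wind",
"rip currents" = "rip current"
)
# data <- data %>% mutate(evtype = recode(evtype, !!!evtype_map))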
Now, let us check again which undocumented event types account for the most property damage:
data %>% filter(!evtype %in% weatherevents) %>%
group_by(evtype) %>%
summarize(sum_propdmg = sum(propdmg)) %>%
arrange(desc(sum_propdmg))
And let us do the same with the crop damage:
data %>% filter(!evtype %in% weatherevents) %>%
group_by(evtype) %>%
summarize(sum_cropdmg = sum(cropdmg)) %>%
arrange(desc(sum_cropdmg))
And fatalities:
data %>% filter(!evtype %in% weatherevents) %>%
group_by(evtype) %>%
summarize(sum_fatalities = sum(fatalities)) %>%
arrange(desc(sum_fatalities))
And injuries:
data %>% filter(!evtype %in% weatherevents) %>%
group_by(evtype) %>%
summarize(sum_injuries = sum(injuries)) %>%
arrange(desc(sum_injuries))
Let’s check what percentage of the damage and harm the undocumented event types still account for. If the percentage is low, we can simply erase those rows from our data.
#not in documentation
nodoc <- data %>% filter(!evtype %in% weatherevents) %>%
summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))
#in documentation
doc <- data %>% filter(evtype %in% weatherevents) %>%
summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))
# in and out of documentation
all <- data %>% summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))
test <- full_join(nodoc, doc)
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
test1 <- full_join(test, all) %>%
mutate(type= c("nodoc", "doc", "all")) %>%
select(type, "sum(propdmg)":"sum(fatalities)") %>%
print
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
## # A tibble: 3 × 5
## type `sum(propdmg)` `sum(cropdmg)` `sum(injuries)` `sum(fatalities)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 nodoc 662252162 519866330 1276 360
## 2 doc 426656467878 48584328180 139252 14785
## 3 all 427318720040 49104194510 140528 15145
percentages <- tibble(propdmg = test1$`sum(propdmg)`[1] / test1$`sum(propdmg)`[3],
cropdmg = test1$`sum(cropdmg)`[1] / test1$`sum(cropdmg)`[3],
fatalities = test1$`sum(fatalities)`[1] / test1$`sum(fatalities)`[3],
injuries = test1$`sum(injuries)`[1] / test1$`sum(injuries)`[3]) %>%
print
## # A tibble: 1 × 4
## propdmg cropdmg fatalities injuries
## <dbl> <dbl> <dbl> <dbl>
## 1 0.00155 0.0106 0.0238 0.00908
The percentage of damage and harm attributed to undocumented event types is now very low, so we can filter out the event types that are not in the documentation:
data <- data %>% filter(evtype %in% weatherevents) %>%
mutate(evtype=as.factor(evtype)) %>%
print
## # A tibble: 884,267 × 17
## state_nr state bgn_date end_date county countyname evtype length width
## <dbl> <chr> <date> <date> <dbl> <chr> <fct> <dbl> <dbl>
## 1 1 al 1950-04-18 1950-04-18 97 mobile tornado 14 100
## 2 1 al 1950-04-18 1950-04-18 3 baldwin tornado 2 150
## 3 1 al 1951-02-20 1951-02-20 57 fayette tornado 0.1 123
## 4 1 al 1951-06-08 1951-06-08 89 madison tornado 0 100
## 5 1 al 1951-11-15 1951-11-15 43 cullman tornado 0 150
## 6 1 al 1951-11-15 1951-11-15 77 lauderdale tornado 1.5 177
## 7 1 al 1951-11-16 1951-11-16 9 blount tornado 1.5 33
## 8 1 al 1952-01-22 1952-01-22 123 tallapoosa tornado 0 33
## 9 1 al 1952-02-13 1952-02-13 125 tuscaloosa tornado 3.3 100
## 10 1 al 1952-02-13 1952-02-13 57 fayette tornado 2.3 100
## # … with 884,257 more rows, and 8 more variables: f <int>, mag <dbl>,
## # fatalities <dbl>, injuries <dbl>, remarks <chr>, refnum <dbl>,
## # propdmg <dbl>, cropdmg <dbl>
Then, I had a first glance at the top entries for fatalities, injuries, propdmg and cropdmg. I noticed a “flood” entry whose property damage seemed implausibly high and that would completely distort the results if it were wrong (see the first entry for propdmg, refnum 605943):
data %>% filter(evtype=="flood") %>%
arrange(desc(propdmg))
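To judge whether that entry is plausible, one could pull it up by its refnum and read its remarks field, for example:
data %>% filter(refnum == 605943) %>%
select(bgn_date, state, countyname, propdmg, remarks)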