Preface

Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

Questions

Your data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.

Synopsis

The analysis “Superstorms: The most severe weather events in the U.S.” is written by hugominze and published on RPubs. It is based on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and examines which types of weather events were most harmful to population health and to economic goods in the U.S. from 1950 to 2011. The author concludes that superstorms such as tornadoes, hurricanes and typhoons top the list of weather events most harmful to both economic goods and population health.

Data Preprocessing

The steps of data preprocessing

This is a roadmap for our data preprocessing:

  1. Set working directory, load packages
  2. Download and read data in
  3. Take first glance at data
  4. Make column names easy to read and unique
  5. Convert character values to lower case
  6. Filter out columns that we do not use later on
  7. Create a bgn_date and end_date column
  8. Create columns that display propdmg (property damage) and cropdmg (crop damage) in whole numbers (without B, K etc.)
  9. Clean up the evtype column
  10. Clean up state and state_nr column
  11. Clean up county and countyname column

First of all, we set the working directory and load all packages necessary for preparing and analyzing the data.

1. Set working directory, load packages

# set working directory!

# load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(snakecase)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(XML)
library(rvest)
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(foreign)

Then, we create a data directory, download the NOAA storm data and read it in.

2. Download and read data in

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if (!file.exists("data")) {
        dir.create("data")
}

if (!file.exists("./data/data.csv.bz2")) {
        download.file(url, "./data/data.csv.bz2", method = "curl")
        dataDownloaded <- date()
}

list.files("./data")
## [1] "data.csv.bz2"      "forecastzones"     "forecastzones.zip"
data1 <- read.csv("./data/data.csv.bz2")
#create tibble:
data <- as_tibble(data1)

3. Take first glance at data

Let’s have a look at the data.

data
names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
str(data)
## tibble [902,297 × 37] (S3: tbl_df/tbl/data.frame)
##  $ STATE__   : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr [1:902297] "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr [1:902297] "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr [1:902297] "" "" "" "" ...
##  $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ END_DATE  : chr [1:902297] "" "" "" "" ...
##  $ END_TIME  : chr [1:902297] "" "" "" "" ...
##  $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
##  $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr [1:902297] "" "" "" "" ...
##  $ END_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ LENGTH    : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
##  $ WFO       : chr [1:902297] "" "" "" "" ...
##  $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
##  $ ZONENAMES : chr [1:902297] "" "" "" "" ...
##  $ LATITUDE  : num [1:902297] 3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
##  $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
##  $ REMARKS   : chr [1:902297] "" "" "" "" ...
##  $ REFNUM    : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...

4. Make column names easier to read and unique

There are 37 columns and 902,297 rows. For our purpose, we do not need all of these columns; we keep the ones that are vital to our analysis and the ones that make each entry unique. We also want to make the column names easier to read.

So, we make the column names easy to read using snake case and unique names:

names(data) <- to_snake_case(names(data))
names(data) <- make.names(names(data), unique=TRUE)
data <- data %>%
        rename(state_nr=state, state = "state.1") %>% 
        relocate(state, .after = state_nr) %>% 
        print
## # A tibble: 902,297 × 37
##    state_nr state bgn_date bgn_time time_zone county countyname evtype bgn_range
##       <dbl> <chr> <chr>    <chr>    <chr>      <dbl> <chr>      <chr>      <dbl>
##  1        1 AL    4/18/19… 0130     CST           97 MOBILE     TORNA…         0
##  2        1 AL    4/18/19… 0145     CST            3 BALDWIN    TORNA…         0
##  3        1 AL    2/20/19… 1600     CST           57 FAYETTE    TORNA…         0
##  4        1 AL    6/8/195… 0900     CST           89 MADISON    TORNA…         0
##  5        1 AL    11/15/1… 1500     CST           43 CULLMAN    TORNA…         0
##  6        1 AL    11/15/1… 2000     CST           77 LAUDERDALE TORNA…         0
##  7        1 AL    11/16/1… 0100     CST            9 BLOUNT     TORNA…         0
##  8        1 AL    1/22/19… 0900     CST          123 TALLAPOOSA TORNA…         0
##  9        1 AL    2/13/19… 2000     CST          125 TUSCALOOSA TORNA…         0
## 10        1 AL    2/13/19… 2000     CST           57 FAYETTE    TORNA…         0
## # … with 902,287 more rows, and 28 more variables: bgn_azi <chr>,
## #   bgn_locati <chr>, end_date <chr>, end_time <chr>, county_end <dbl>,
## #   countyendn <lgl>, end_range <dbl>, end_azi <chr>, end_locati <chr>,
## #   length <dbl>, width <dbl>, f <int>, mag <dbl>, fatalities <dbl>,
## #   injuries <dbl>, propdmg <dbl>, propdmgexp <chr>, cropdmg <dbl>,
## #   cropdmgexp <chr>, wfo <chr>, stateoffic <chr>, zonenames <chr>,
## #   latitude <dbl>, longitude <dbl>, latitude_e <dbl>, longitude.1 <dbl>, …

5. Convert character values to lower case

For better readability, we convert the character values in the countyname, evtype and state columns to lower case:

data <- data %>% 
        mutate(countyname = tolower(countyname), evtype=tolower(evtype), 
               state = tolower(state)) %>% 
        print
## # A tibble: 902,297 × 37
##    state_nr state bgn_date bgn_time time_zone county countyname evtype bgn_range
##       <dbl> <chr> <chr>    <chr>    <chr>      <dbl> <chr>      <chr>      <dbl>
##  1        1 al    4/18/19… 0130     CST           97 mobile     torna…         0
##  2        1 al    4/18/19… 0145     CST            3 baldwin    torna…         0
##  3        1 al    2/20/19… 1600     CST           57 fayette    torna…         0
##  4        1 al    6/8/195… 0900     CST           89 madison    torna…         0
##  5        1 al    11/15/1… 1500     CST           43 cullman    torna…         0
##  6        1 al    11/15/1… 2000     CST           77 lauderdale torna…         0
##  7        1 al    11/16/1… 0100     CST            9 blount     torna…         0
##  8        1 al    1/22/19… 0900     CST          123 tallapoosa torna…         0
##  9        1 al    2/13/19… 2000     CST          125 tuscaloosa torna…         0
## 10        1 al    2/13/19… 2000     CST           57 fayette    torna…         0
## # … with 902,287 more rows, and 28 more variables: bgn_azi <chr>,
## #   bgn_locati <chr>, end_date <chr>, end_time <chr>, county_end <dbl>,
## #   countyendn <lgl>, end_range <dbl>, end_azi <chr>, end_locati <chr>,
## #   length <dbl>, width <dbl>, f <int>, mag <dbl>, fatalities <dbl>,
## #   injuries <dbl>, propdmg <dbl>, propdmgexp <chr>, cropdmg <dbl>,
## #   cropdmgexp <chr>, wfo <chr>, stateoffic <chr>, zonenames <chr>,
## #   latitude <dbl>, longitude <dbl>, latitude_e <dbl>, longitude.1 <dbl>, …

6. Filter out columns that we do not use later on

Let us filter out the irrelevant columns for our analysis:

data <- data %>% 
        select(-bgn_range,-bgn_azi,-bgn_locati,-end_time,-county_end,-countyendn,
                 -end_range,-end_azi,-end_locati,-wfo,-stateoffic,-zonenames,
               -latitude,-longitude,
                 -latitude_e,-longitude.1) %>%
        print
## # A tibble: 902,297 × 21
##    state_nr state bgn_date  bgn_time time_zone county countyname evtype end_date
##       <dbl> <chr> <chr>     <chr>    <chr>      <dbl> <chr>      <chr>  <chr>   
##  1        1 al    4/18/195… 0130     CST           97 mobile     torna… ""      
##  2        1 al    4/18/195… 0145     CST            3 baldwin    torna… ""      
##  3        1 al    2/20/195… 1600     CST           57 fayette    torna… ""      
##  4        1 al    6/8/1951… 0900     CST           89 madison    torna… ""      
##  5        1 al    11/15/19… 1500     CST           43 cullman    torna… ""      
##  6        1 al    11/15/19… 2000     CST           77 lauderdale torna… ""      
##  7        1 al    11/16/19… 0100     CST            9 blount     torna… ""      
##  8        1 al    1/22/195… 0900     CST          123 tallapoosa torna… ""      
##  9        1 al    2/13/195… 2000     CST          125 tuscaloosa torna… ""      
## 10        1 al    2/13/195… 2000     CST           57 fayette    torna… ""      
## # … with 902,287 more rows, and 12 more variables: length <dbl>, width <dbl>,
## #   f <int>, mag <dbl>, fatalities <dbl>, injuries <dbl>, propdmg <dbl>,
## #   propdmgexp <chr>, cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>,
## #   refnum <dbl>

7. Create a bgn_date and end_date column

We are going to create begin and end date columns in date format: bgn_date and end_date. As long as they are stored as character strings, they are hard to work with, so we convert them to date objects. Where end_date is empty, we first fill it with bgn_date. Since the times and time zones are not important for our purpose, we keep only the dates of the weather events and drop the times and time zones:

data <- data %>% mutate(
        end_date=ifelse(end_date=="", bgn_date, end_date)
        ) %>% 
        print
## # A tibble: 902,297 × 21
##    state_nr state bgn_date  bgn_time time_zone county countyname evtype end_date
##       <dbl> <chr> <chr>     <chr>    <chr>      <dbl> <chr>      <chr>  <chr>   
##  1        1 al    4/18/195… 0130     CST           97 mobile     torna… 4/18/19…
##  2        1 al    4/18/195… 0145     CST            3 baldwin    torna… 4/18/19…
##  3        1 al    2/20/195… 1600     CST           57 fayette    torna… 2/20/19…
##  4        1 al    6/8/1951… 0900     CST           89 madison    torna… 6/8/195…
##  5        1 al    11/15/19… 1500     CST           43 cullman    torna… 11/15/1…
##  6        1 al    11/15/19… 2000     CST           77 lauderdale torna… 11/15/1…
##  7        1 al    11/16/19… 0100     CST            9 blount     torna… 11/16/1…
##  8        1 al    1/22/195… 0900     CST          123 tallapoosa torna… 1/22/19…
##  9        1 al    2/13/195… 2000     CST          125 tuscaloosa torna… 2/13/19…
## 10        1 al    2/13/195… 2000     CST           57 fayette    torna… 2/13/19…
## # … with 902,287 more rows, and 12 more variables: length <dbl>, width <dbl>,
## #   f <int>, mag <dbl>, fatalities <dbl>, injuries <dbl>, propdmg <dbl>,
## #   propdmgexp <chr>, cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>,
## #   refnum <dbl>
data$bgn_date <- sapply(strsplit(data$bgn_date, " "), function(x){x[1]})
data$end_date <- sapply(strsplit(data$end_date, " "), function(x){x[1]})

data$bgn_date <- mdy(data$bgn_date)
data$end_date <- mdy(data$end_date)

data <- data %>% 
        select(state_nr, state, bgn_date, end_date, county:refnum, - bgn_time, 
               -time_zone) %>% 
        print
## # A tibble: 902,297 × 19
##    state_nr state bgn_date   end_date   county countyname evtype  length width
##       <dbl> <chr> <date>     <date>      <dbl> <chr>      <chr>    <dbl> <dbl>
##  1        1 al    1950-04-18 1950-04-18     97 mobile     tornado   14     100
##  2        1 al    1950-04-18 1950-04-18      3 baldwin    tornado    2     150
##  3        1 al    1951-02-20 1951-02-20     57 fayette    tornado    0.1   123
##  4        1 al    1951-06-08 1951-06-08     89 madison    tornado    0     100
##  5        1 al    1951-11-15 1951-11-15     43 cullman    tornado    0     150
##  6        1 al    1951-11-15 1951-11-15     77 lauderdale tornado    1.5   177
##  7        1 al    1951-11-16 1951-11-16      9 blount     tornado    1.5    33
##  8        1 al    1952-01-22 1952-01-22    123 tallapoosa tornado    0      33
##  9        1 al    1952-02-13 1952-02-13    125 tuscaloosa tornado    3.3   100
## 10        1 al    1952-02-13 1952-02-13     57 fayette    tornado    2.3   100
## # … with 902,287 more rows, and 10 more variables: f <int>, mag <dbl>,
## #   fatalities <dbl>, injuries <dbl>, propdmg <dbl>, propdmgexp <chr>,
## #   cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>, refnum <dbl>

8. Create columns that display propdmg (property damage) and cropdmg (crop damage) in whole numbers (without B, K etc.)

In our data, the column propdmgexp currently stores the exponent of the damage figure. We want to multiply propdmg by the multiplier encoded in propdmgexp to get the actual numbers. Only then can we sum the damages up.

First, we change the exponent abbreviations, such as B for billion or m for million, into numbers:

data <- data %>% 
        mutate(
                propdmgexp= ifelse(propdmgexp=="B", 1000000000, propdmgexp), 
                propdmgexp=ifelse(propdmgexp=="b", 1000000000, propdmgexp), 
                propdmgexp=ifelse(propdmgexp=="M", 1000000, propdmgexp), 
                propdmgexp=ifelse(propdmgexp=="m", 1000000, propdmgexp), 
                propdmgexp=ifelse(propdmgexp=="K", 1000, propdmgexp), 
                propdmgexp=ifelse(propdmgexp=="k", 1000, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="H", 100, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="h", 100, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="1", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="2", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="3", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="4", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="5", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="6", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="7", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="8", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="0", 10, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="+", 1, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="-", 0, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="", 0, propdmgexp),
                propdmgexp=ifelse(propdmgexp=="?", 0, propdmgexp),
                ) %>%
        print
## # A tibble: 902,297 × 19
##    state_nr state bgn_date   end_date   county countyname evtype  length width
##       <dbl> <chr> <date>     <date>      <dbl> <chr>      <chr>    <dbl> <dbl>
##  1        1 al    1950-04-18 1950-04-18     97 mobile     tornado   14     100
##  2        1 al    1950-04-18 1950-04-18      3 baldwin    tornado    2     150
##  3        1 al    1951-02-20 1951-02-20     57 fayette    tornado    0.1   123
##  4        1 al    1951-06-08 1951-06-08     89 madison    tornado    0     100
##  5        1 al    1951-11-15 1951-11-15     43 cullman    tornado    0     150
##  6        1 al    1951-11-15 1951-11-15     77 lauderdale tornado    1.5   177
##  7        1 al    1951-11-16 1951-11-16      9 blount     tornado    1.5    33
##  8        1 al    1952-01-22 1952-01-22    123 tallapoosa tornado    0      33
##  9        1 al    1952-02-13 1952-02-13    125 tuscaloosa tornado    3.3   100
## 10        1 al    1952-02-13 1952-02-13     57 fayette    tornado    2.3   100
## # … with 902,287 more rows, and 10 more variables: f <int>, mag <dbl>,
## #   fatalities <dbl>, injuries <dbl>, propdmg <dbl>, propdmgexp <chr>,
## #   cropdmg <dbl>, cropdmgexp <chr>, remarks <chr>, refnum <dbl>

Then, we multiply the columns propdmg and propdmgexp:

data <- data %>% 
        mutate(propdmgexp=as.numeric(propdmgexp)) %>% 
        mutate(`propdmgexp*exp` = propdmg * propdmgexp)%>%
        select(-propdmg, -propdmgexp) %>%
        print
## # A tibble: 902,297 × 18
##    state_nr state bgn_date   end_date   county countyname evtype  length width
##       <dbl> <chr> <date>     <date>      <dbl> <chr>      <chr>    <dbl> <dbl>
##  1        1 al    1950-04-18 1950-04-18     97 mobile     tornado   14     100
##  2        1 al    1950-04-18 1950-04-18      3 baldwin    tornado    2     150
##  3        1 al    1951-02-20 1951-02-20     57 fayette    tornado    0.1   123
##  4        1 al    1951-06-08 1951-06-08     89 madison    tornado    0     100
##  5        1 al    1951-11-15 1951-11-15     43 cullman    tornado    0     150
##  6        1 al    1951-11-15 1951-11-15     77 lauderdale tornado    1.5   177
##  7        1 al    1951-11-16 1951-11-16      9 blount     tornado    1.5    33
##  8        1 al    1952-01-22 1952-01-22    123 tallapoosa tornado    0      33
##  9        1 al    1952-02-13 1952-02-13    125 tuscaloosa tornado    3.3   100
## 10        1 al    1952-02-13 1952-02-13     57 fayette    tornado    2.3   100
## # … with 902,287 more rows, and 9 more variables: f <int>, mag <dbl>,
## #   fatalities <dbl>, injuries <dbl>, cropdmg <dbl>, cropdmgexp <chr>,
## #   remarks <chr>, refnum <dbl>, propdmgexp*exp <dbl>

Let’s do the same with cropdmgexp and cropdmg:

data <-data %>% 
        mutate(
                cropdmgexp= ifelse(cropdmgexp=="B", 1000000000, cropdmgexp), 
                cropdmgexp=ifelse(cropdmgexp=="b", 1000000000, cropdmgexp), 
                cropdmgexp=ifelse(cropdmgexp=="M", 1000000, cropdmgexp), 
                cropdmgexp=ifelse(cropdmgexp=="m", 1000000, cropdmgexp), 
                cropdmgexp=ifelse(cropdmgexp=="K", 1000, cropdmgexp), 
                cropdmgexp=ifelse(cropdmgexp=="k", 1000, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="H", 100, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="h", 100, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="1", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="2", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="3", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="4", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="5", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="6", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="7", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="8", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="0", 10, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="+", 1, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="-", 0, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="", 0, cropdmgexp),
                cropdmgexp=ifelse(cropdmgexp=="?", 0, cropdmgexp),
                ) %>%
        print
## # A tibble: 902,297 × 18
##    state_nr state bgn_date   end_date   county countyname evtype  length width
##       <dbl> <chr> <date>     <date>      <dbl> <chr>      <chr>    <dbl> <dbl>
##  1        1 al    1950-04-18 1950-04-18     97 mobile     tornado   14     100
##  2        1 al    1950-04-18 1950-04-18      3 baldwin    tornado    2     150
##  3        1 al    1951-02-20 1951-02-20     57 fayette    tornado    0.1   123
##  4        1 al    1951-06-08 1951-06-08     89 madison    tornado    0     100
##  5        1 al    1951-11-15 1951-11-15     43 cullman    tornado    0     150
##  6        1 al    1951-11-15 1951-11-15     77 lauderdale tornado    1.5   177
##  7        1 al    1951-11-16 1951-11-16      9 blount     tornado    1.5    33
##  8        1 al    1952-01-22 1952-01-22    123 tallapoosa tornado    0      33
##  9        1 al    1952-02-13 1952-02-13    125 tuscaloosa tornado    3.3   100
## 10        1 al    1952-02-13 1952-02-13     57 fayette    tornado    2.3   100
## # … with 902,287 more rows, and 9 more variables: f <int>, mag <dbl>,
## #   fatalities <dbl>, injuries <dbl>, cropdmg <dbl>, cropdmgexp <chr>,
## #   remarks <chr>, refnum <dbl>, propdmgexp*exp <dbl>
data <- data %>% 
        mutate(cropdmgexp=as.numeric(cropdmgexp)) %>% 
        mutate(`cropdmgexp*exp` = cropdmg * cropdmgexp) %>%
        select(-cropdmg, -cropdmgexp) %>%
        print
## # A tibble: 902,297 × 17
##    state_nr state bgn_date   end_date   county countyname evtype  length width
##       <dbl> <chr> <date>     <date>      <dbl> <chr>      <chr>    <dbl> <dbl>
##  1        1 al    1950-04-18 1950-04-18     97 mobile     tornado   14     100
##  2        1 al    1950-04-18 1950-04-18      3 baldwin    tornado    2     150
##  3        1 al    1951-02-20 1951-02-20     57 fayette    tornado    0.1   123
##  4        1 al    1951-06-08 1951-06-08     89 madison    tornado    0     100
##  5        1 al    1951-11-15 1951-11-15     43 cullman    tornado    0     150
##  6        1 al    1951-11-15 1951-11-15     77 lauderdale tornado    1.5   177
##  7        1 al    1951-11-16 1951-11-16      9 blount     tornado    1.5    33
##  8        1 al    1952-01-22 1952-01-22    123 tallapoosa tornado    0      33
##  9        1 al    1952-02-13 1952-02-13    125 tuscaloosa tornado    3.3   100
## 10        1 al    1952-02-13 1952-02-13     57 fayette    tornado    2.3   100
## # … with 902,287 more rows, and 8 more variables: f <int>, mag <dbl>,
## #   fatalities <dbl>, injuries <dbl>, remarks <chr>, refnum <dbl>,
## #   propdmgexp*exp <dbl>, cropdmgexp*exp <dbl>

Ok, now we rename the new columns back into propdmg and cropdmg:

data <- data %>% 
        rename(propdmg = "propdmgexp*exp") %>% 
        rename(cropdmg = `cropdmgexp*exp`) %>% 
        print
## # A tibble: 902,297 × 17
##    state_nr state bgn_date   end_date   county countyname evtype  length width
##       <dbl> <chr> <date>     <date>      <dbl> <chr>      <chr>    <dbl> <dbl>
##  1        1 al    1950-04-18 1950-04-18     97 mobile     tornado   14     100
##  2        1 al    1950-04-18 1950-04-18      3 baldwin    tornado    2     150
##  3        1 al    1951-02-20 1951-02-20     57 fayette    tornado    0.1   123
##  4        1 al    1951-06-08 1951-06-08     89 madison    tornado    0     100
##  5        1 al    1951-11-15 1951-11-15     43 cullman    tornado    0     150
##  6        1 al    1951-11-15 1951-11-15     77 lauderdale tornado    1.5   177
##  7        1 al    1951-11-16 1951-11-16      9 blount     tornado    1.5    33
##  8        1 al    1952-01-22 1952-01-22    123 tallapoosa tornado    0      33
##  9        1 al    1952-02-13 1952-02-13    125 tuscaloosa tornado    3.3   100
## 10        1 al    1952-02-13 1952-02-13     57 fayette    tornado    2.3   100
## # … with 902,287 more rows, and 8 more variables: f <int>, mag <dbl>,
## #   fatalities <dbl>, injuries <dbl>, remarks <chr>, refnum <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>

And we are done with cleaning up the propdmg and cropdmg columns.
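
As a side note, the long ifelse() chains above could be written more compactly with a lookup vector. The following is only a sketch of that idea; exp_lookup and decode_exp are hypothetical names, not objects used elsewhere in this analysis.

exp_lookup <- c(b = 1e9, m = 1e6, k = 1e3, h = 1e2, `+` = 1, `-` = 0, `?` = 0)

decode_exp <- function(x) {
        x <- tolower(x)
        out <- unname(exp_lookup[x])           # letter and symbol codes
        out[x %in% as.character(0:8)] <- 10    # digit codes count as 10, as in the chain above
        out[is.na(out)] <- 0                   # "" and anything else unrecognised
        out
}

# hypothetical usage, equivalent in spirit to the two mutate() steps above:
# data %>% mutate(propdmg = propdmg * decode_exp(propdmgexp))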

9. Clean up the evtype column

Next, let’s have a look at the event types:

evtype_count <- data %>% 
        count(evtype) %>% 
        arrange(desc(n)) %>% 
        print
## # A tibble: 898 × 2
##    evtype                  n
##    <chr>               <int>
##  1 hail               288661
##  2 tstm wind          219942
##  3 thunderstorm wind   82564
##  4 tornado             60652
##  5 flash flood         54277
##  6 flood               25327
##  7 thunderstorm winds  20843
##  8 high wind           20214
##  9 lightning           15754
## 10 heavy snow          15708
## # … with 888 more rows

There are 898 distinct event types in the data, even though the documentation only defines 48 categories of weather event types! So many of the event types above can be grouped under one documented category, or are simply erroneous.

It would take a lot of time to clean the data manually, for example to decide whether a value like “tstm wind/hail” should go into the category “hail” or “thunderstorm wind”. Instead of rectifying every row, I will make sure that only a small percentage of economic damage and harm to population health is lost when I erase the rows with erroneous evtype entries.

This is our roadmap for this undertaking:

  9.1. Calculate the percentage of damage and harm attributable to event types that are not in the documentation.
  9.2. Change wrong event type names into names from the documentation until that percentage is low enough to delete the remaining rows.

Let’s first check what percentage of economic damage and harm to population health is affected by the wrong event types.

These are the weather events from the documentation:

weatherevents <- tolower(c("Astronomical Low Tide", "Avalanche", "Blizzard", 
                           "Coastal Flood", "Cold/Wind Chill", "Debris Flow", 
                           "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", 
                           "Dust Storm", "Excessive Heat", 
                           "Extreme Cold/Wind Chill", "Flash Flood", "Flood", 
                           "Freezing Fog", "Frost/Freeze", "Funnel Cloud", 
                           "Hail", "Heat", "Heavy Rain", "Heavy Snow", 
                           "High Surf", "High Wind", "Hurricane/Typhoon", 
                           "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", 
                           "Lightning", "Marine Hail", "Marine High Wind", 
                           "Marine Strong Wind", "Marine Thunderstorm Wind", 
                           "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", 
                           "Strong Wind", "Thunderstorm Wind", "Tornado", 
                           "Tropical Depression", "Tropical Storm", "Tsunami", 
                           "Volcanic Ash", "Waterspout", "Wildfire", 
                           "Winter Storm", "Winter Weather"))
weatherevents <- sort(weatherevents)
weatherevents
##  [1] "astronomical low tide"    "avalanche"               
##  [3] "blizzard"                 "coastal flood"           
##  [5] "cold/wind chill"          "debris flow"             
##  [7] "dense fog"                "dense smoke"             
##  [9] "drought"                  "dust devil"              
## [11] "dust storm"               "excessive heat"          
## [13] "extreme cold/wind chill"  "flash flood"             
## [15] "flood"                    "freezing fog"            
## [17] "frost/freeze"             "funnel cloud"            
## [19] "hail"                     "heat"                    
## [21] "heavy rain"               "heavy snow"              
## [23] "high surf"                "high wind"               
## [25] "hurricane/typhoon"        "ice storm"               
## [27] "lake-effect snow"         "lakeshore flood"         
## [29] "lightning"                "marine hail"             
## [31] "marine high wind"         "marine strong wind"      
## [33] "marine thunderstorm wind" "rip current"             
## [35] "seiche"                   "sleet"                   
## [37] "storm surge/tide"         "strong wind"             
## [39] "thunderstorm wind"        "tornado"                 
## [41] "tropical depression"      "tropical storm"          
## [43] "tsunami"                  "volcanic ash"            
## [45] "waterspout"               "wildfire"                
## [47] "winter storm"             "winter weather"

These are the event types that are not in the documentation:

data %>% filter(!evtype %in% weatherevents) %>% 
        count(evtype) %>% 
        arrange(desc(n))

9.1. Calculate the percentages of the event types that are not in the documentation

Let’s calculate the percentages of the event types which are not in the documentation:

#not in documentation
nodoc <- data %>% filter(!evtype %in% weatherevents) %>% 
        summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))

#in documentation
doc <- data %>% filter(evtype %in% weatherevents) %>% 
        summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))

# in and out of documentation
all <- data %>% summarize(sum(propdmg), sum(cropdmg), sum(injuries), 
                          sum(fatalities))

test <- full_join(nodoc, doc)
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
test1 <- full_join(test, all) %>% 
        mutate(type= c("nan_doc", "doc", "all")) %>% 
        select(type, "sum(propdmg)":"sum(fatalities)") %>% 
        print
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
## # A tibble: 3 × 5
##   type    `sum(propdmg)` `sum(cropdmg)` `sum(injuries)` `sum(fatalities)`
##   <chr>            <dbl>          <dbl>           <dbl>             <dbl>
## 1 nan_doc    82982093201    11858392160           12558              2023
## 2 doc       344336626839    37245802350          127970             13122
## 3 all       427318720040    49104194510          140528             15145
percentages <- tibble(propdmg = test1$`sum(propdmg)`[1] / test1$`sum(propdmg)`[3], 
          cropdmg = test1$`sum(cropdmg)`[1] / test1$`sum(cropdmg)`[3],
          fatalities = test1$`sum(fatalities)`[1] / test1$`sum(fatalities)`[3],
          injuries = test1$`sum(injuries)`[1] / test1$`sum(injuries)`[3]) %>% 
        print
## # A tibble: 1 × 4
##   propdmg cropdmg fatalities injuries
##     <dbl>   <dbl>      <dbl>    <dbl>
## 1   0.194   0.241      0.134   0.0894

All in all, a lot would get lost if we simply erased the rows whose event types are not in the official documentation: about 19% of the property damage, 24% of the crop damage, 13% of the fatalities and 9% of the injuries. So we cannot just erase these rows. Instead, let's look at which event types are both undocumented and responsible for the most damage and harm to health. We may not be able to fix every wrong event type value, but we can fix the most important ones.

Let’s start with property damage:

data %>% filter(!evtype %in% weatherevents) %>% 
        group_by(evtype) %>% 
        summarize(sum_propdmg = sum(propdmg)) %>% 
        arrange(desc(sum_propdmg))

Wow, that is a lot of property damage that would have been lost if we had just erased those rows (the first two rows alone account for more than 50 billion).

9.2. Change wrong event type names into event type names that are available in the documentation until the percentage is low enough to delete the rest

Let’s change the event type names above to the ones in the documentation.

# "storm surge/tide"

data$evtype <- replace(data$evtype, data$evtype=="storm surge", "storm surge/tide")


# "hurricane/typhoon"
data$evtype <- replace(data$evtype, data$evtype=="hurricane opal", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="typhoon", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane erin", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane opal/high winds", "hurricane/typhoon")
data$evtype <- replace(data$evtype, data$evtype=="hurricane erin", "hurricane/typhoon")

#"flood"
data$evtype <- replace(data$evtype, data$evtype=="river flood", "flood")
data$evtype <- replace(data$evtype, data$evtype=="flooding", "flood")
data$evtype <- replace(data$evtype, data$evtype=="major flood", "flood")
data$evtype <- replace(data$evtype, data$evtype=="river flooding", "flood")
data$evtype <- replace(data$evtype, data$evtype=="coastal flooding", "coastal flood")
data$evtype <- replace(data$evtype, data$evtype=="flood/rain/winds", "flood")

#"thunderstorm wind" 
data$evtype <- replace(data$evtype, data$evtype=="tstm wind", "thunderstorm wind")
data$evtype <- replace(data$evtype, data$evtype=="thunderstorm winds", "thunderstorm wind")
data$evtype <- replace(data$evtype, data$evtype=="severe thunderstorm", "thunderstorm wind")

#"tornado"
data$evtype <- replace(data$evtype, data$evtype=="tornadoes, tstm wind, hail", "tornado")


#"wildfire" 
data$evtype <- replace(data$evtype, data$evtype=="wild/forest fire", "wildfire")
data$evtype <- replace(data$evtype, data$evtype=="wild fires", "wildfire")
data$evtype <- replace(data$evtype, data$evtype=="wildfires", "wildfire")


#"heavy rain"
data$evtype <- replace(data$evtype, data$evtype=="heavy rain/severe weather", "heavy rain")
data$evtype <- replace(data$evtype, data$evtype=="excessive wetness", "heavy rain")

#hail
data$evtype <- replace(data$evtype, data$evtype=="hailstorm", "hail")

#"flash flood"
data$evtype <- replace(data$evtype, data$evtype=="flash flooding", "flash flood")
data$evtype <- replace(data$evtype, data$evtype=="flash flood/flood", "flash flood")
data$evtype <- replace(data$evtype, data$evtype=="flood/flash flood", "flash flood")
data$evtype <- replace(data$evtype, data$evtype=="heavy surf/high surf", "coastal flood")
    
#avalanche
data$evtype <- replace(data$evtype, data$evtype=="landslide", "avalanche")

#high wind
data$evtype <- replace(data$evtype, data$evtype=="high winds", "high wind")
data$evtype <- replace(data$evtype, data$evtype=="high winds/cold", "high wind")

#frost/freeze
data$evtype <- replace(data$evtype, data$evtype=="extreme cold", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="freeze", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="damaging freeze", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="ice", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="glaze", "frost/freeze")
data$evtype <- replace(data$evtype, data$evtype=="cold", "frost/freeze")

#rip current
data$evtype <- replace(data$evtype, data$evtype=="rip currents", "rip current")

#"excessive heat"  
data$evtype <- replace(data$evtype, data$evtype=="extreme heat", "excessive heat")
data$evtype <- replace(data$evtype, data$evtype=="heat wave", "excessive heat")
data$evtype <- replace(data$evtype, data$evtype=="unseasonably warm and dry", "excessive heat")
data$evtype <- replace(data$evtype, data$evtype=="record/excessive heat", "excessive heat")
#"dense fog"
data$evtype <- replace(data$evtype, data$evtype=="fog", "dense fog")
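
As an aside, this long series of replace() calls could also be expressed as a single recoding step with a named lookup vector. The sketch below only repeats a few of the mappings; evtype_fix is a hypothetical name, not an object used elsewhere in this analysis.

evtype_fix <- c("storm surge"        = "storm surge/tide",
                "hurricane"          = "hurricane/typhoon",
                "tstm wind"          = "thunderstorm wind",
                "thunderstorm winds" = "thunderstorm wind",
                "wild/forest fire"   = "wildfire")

# hypothetical usage (evtype is still a character column at this point):
# data <- data %>% mutate(evtype = recode(evtype, !!!evtype_fix))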

Now, let us check again which event types have most propdmg and are not in the documentation:

data %>% filter(!evtype %in% weatherevents) %>% 
        group_by(evtype) %>% 
        summarize(sum_propdmg = sum(propdmg)) %>% 
        arrange(desc(sum_propdmg))

And let us do the same with the crop damage:

data %>% filter(!evtype %in% weatherevents) %>% 
        group_by(evtype) %>% 
        summarize(sum_cropdmg = sum(cropdmg)) %>% 
        arrange(desc(sum_cropdmg))

And fatalities:

data %>% filter(!evtype %in% weatherevents) %>% 
        group_by(evtype) %>% 
        summarize(sum_fatalities = sum(fatalities)) %>% 
        arrange(desc(sum_fatalities))

And injuries:

data %>% filter(!evtype %in% weatherevents) %>% 
        group_by(evtype) %>% 
        summarize(sum_injuries = sum(injuries)) %>% 
        arrange(desc(sum_injuries))

Let’s check what percentage of damage and harm the event types that are not documented still account for. If the percentage is low, we can just erase them from our data.

#not in documentation
nodoc <- data %>% filter(!evtype %in% weatherevents) %>% 
        summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))

#in documentation
doc <- data %>% filter(evtype %in% weatherevents) %>% 
        summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))

# in and out of documentation
all <- data %>% summarize(sum(propdmg), sum(cropdmg), sum(injuries), sum(fatalities))

test <- full_join(nodoc, doc)
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
test1 <- full_join(test, all) %>% 
        mutate(type= c("nodoc", "doc", "all")) %>% 
        select(type, "sum(propdmg)":"sum(fatalities)") %>% 
        print
## Joining, by = c("sum(propdmg)", "sum(cropdmg)", "sum(injuries)", "sum(fatalities)")
## # A tibble: 3 × 5
##   type  `sum(propdmg)` `sum(cropdmg)` `sum(injuries)` `sum(fatalities)`
##   <chr>          <dbl>          <dbl>           <dbl>             <dbl>
## 1 nodoc      662252162      519866330            1276               360
## 2 doc     426656467878    48584328180          139252             14785
## 3 all     427318720040    49104194510          140528             15145
percentages <- tibble(propdmg = test1$`sum(propdmg)`[1] / test1$`sum(propdmg)`[3], 
          cropdmg = test1$`sum(cropdmg)`[1] / test1$`sum(cropdmg)`[3],
          fatalities = test1$`sum(fatalities)`[1] / test1$`sum(fatalities)`[3],
          injuries = test1$`sum(injuries)`[1] / test1$`sum(injuries)`[3]) %>% 
        print
## # A tibble: 1 × 4
##   propdmg cropdmg fatalities injuries
##     <dbl>   <dbl>      <dbl>    <dbl>
## 1 0.00155  0.0106     0.0238  0.00908

The percentage of data that is not in the documentation is now pretty low (below 3% for every measure). With such a low percentage, we can filter out the evtypes that are not in the documentation:

data <- data %>% filter(evtype %in% weatherevents) %>% 
        mutate(evtype=as.factor(evtype)) %>% 
        print
## # A tibble: 884,267 × 17
##    state_nr state bgn_date   end_date   county countyname evtype  length width
##       <dbl> <chr> <date>     <date>      <dbl> <chr>      <fct>    <dbl> <dbl>
##  1        1 al    1950-04-18 1950-04-18     97 mobile     tornado   14     100
##  2        1 al    1950-04-18 1950-04-18      3 baldwin    tornado    2     150
##  3        1 al    1951-02-20 1951-02-20     57 fayette    tornado    0.1   123
##  4        1 al    1951-06-08 1951-06-08     89 madison    tornado    0     100
##  5        1 al    1951-11-15 1951-11-15     43 cullman    tornado    0     150
##  6        1 al    1951-11-15 1951-11-15     77 lauderdale tornado    1.5   177
##  7        1 al    1951-11-16 1951-11-16      9 blount     tornado    1.5    33
##  8        1 al    1952-01-22 1952-01-22    123 tallapoosa tornado    0      33
##  9        1 al    1952-02-13 1952-02-13    125 tuscaloosa tornado    3.3   100
## 10        1 al    1952-02-13 1952-02-13     57 fayette    tornado    2.3   100
## # … with 884,257 more rows, and 8 more variables: f <int>, mag <dbl>,
## #   fatalities <dbl>, injuries <dbl>, remarks <chr>, refnum <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>

Then, I had a first glance at the top damages for fatalities, injuries, propdmg and cropdmg. I noticed a “flood” entry that seemed far too high and that would completely distort the data if it were false (see the first entry for propdmg, refnum 605943):

data %>% filter(evtype=="flood") %>% 
        arrange(desc(propdmg))

When I could not find a flood of that magnitude as a significant event in U.S. history on Google, I concluded that the entry must be erroneous and deleted it.
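
Before dropping the row, its narrative in the remarks column can also be consulted as a sanity check; a small sketch (not part of the original cleaning steps):

# look up the suspicious entry and read its event narrative
data %>% filter(refnum == 605943) %>% 
        select(bgn_date, countyname, propdmg, remarks)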

data <- data[!data$refnum==605943,]
data

We are done with cleaning up the evtype column.

10. Clean up state and state_nr column

When having a first glance at the state and state_nr columns, I realized that there are 71 state values and 70 state_nr values in the database, even though there are actually only 50 states in the US:

#there are 70 "state_nr" types in the database:
data %>% group_by(state_nr) %>% summarize(n())
#there are 71 "state" types in the database:
data %>% group_by(state) %>% summarize(n())

After doing a bit of research, I realized that the state column contains not only the 50 states of the United States but also the federal district and territories. For example, the District of Columbia (federal district, “dc” in the dataset) and Guam (territory, “gu” in the dataset) are both in the database.

data %>% filter(state == "gu" | state == "dc") %>% 
        summarize(unique(state))

I next want to have a look at the state and territory abbreviations that officially exist, and then see whether the data changes significantly if I keep only the rows that match them. So, first, I parse information about the U.S. states and territories from Wikipedia:

html <- read_html("https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations") %>% 
        html_elements(".mw-parser-output , td:nth-child(1)") %>% 
        html_table(fill = TRUE)

coln <- html[[1]] %>% select(X1:X5) %>% 
        slice(11:12) %>%
        unlist(., use.names = FALSE)

rows <- html[[1]] %>% select(X1:X5) %>% 
        slice(13:81)

realstates <- tibble(rows) %>% 
        rename(territory = X1, territory_type = X2, ansi_alphabetic = X4, ansi_numeric = X5) %>%
        select(-X3) %>% 
        mutate(territory = tolower(territory), territory_type= tolower(territory_type), ansi_alphabetic = tolower(ansi_alphabetic), ansi_numeric = as.numeric(ansi_numeric)) %>%
        filter(ansi_alphabetic != "") %>%
        mutate(both = paste(ansi_numeric, ansi_alphabetic)) %>% 
        print
## # A tibble: 60 × 5
##    territory            territory_type   ansi_alphabetic ansi_numeric both 
##    <chr>                <chr>            <chr>                  <dbl> <chr>
##  1 alabama              state            al                         1 1 al 
##  2 alaska               state            ak                         2 2 ak 
##  3 arizona              state            az                         4 4 az 
##  4 arkansas             state            ar                         5 5 ar 
##  5 california           state            ca                         6 6 ca 
##  6 colorado             state            co                         8 8 co 
##  7 connecticut          state            ct                         9 9 ct 
##  8 delaware             state            de                        10 10 de
##  9 district of columbia federal district dc                        11 11 dc
## 10 florida              state            fl                        12 12 fl
## # … with 50 more rows

Now, I rename the column state to ansi_alphabetic and state_nr to ansi_numeric because these names are more precise. I also create another column holding both the ansi_numeric and ansi_alphabetic values, in order to prepare the data for comparison:

data <- data %>% rename(ansi_alphabetic="state",ansi_numeric="state_nr") %>% 
        mutate(both= paste(ansi_numeric, ansi_alphabetic))

Now I calculate the sums of the damages with and without the join: without_join holds the sums over the full dataset, inner_join the sums over only the rows that match an entry in realstates:

without_join <- data %>% summarize(propdmg=sum(propdmg), cropdmg=sum(cropdmg), injuries=sum(injuries), fatalities=sum(fatalities)) %>% 
        print
## # A tibble: 1 × 4
##        propdmg     cropdmg injuries fatalities
##          <dbl>       <dbl>    <dbl>      <dbl>
## 1 311656467878 48551828180   139252      14785
inner_join <- data %>% inner_join(realstates) %>% summarize(propdmg=sum(propdmg), cropdmg=sum(cropdmg), injuries=sum(injuries), fatalities=sum(fatalities)) %>% 
        print
## Joining, by = c("ansi_numeric", "ansi_alphabetic", "both")
## # A tibble: 1 × 4
##        propdmg     cropdmg injuries fatalities
##          <dbl>       <dbl>    <dbl>      <dbl>
## 1 311648870438 48551728180   139193      14760

Now I check if there is any major difference between without_join and inner_join:

without_join-inner_join

The amount of information that would get lost is not that high; we can live with it. Thus, let's join realstates and the data:

data <- data %>% inner_join(realstates) %>% 
        print
## Joining, by = c("ansi_numeric", "ansi_alphabetic", "both")
## # A tibble: 875,666 × 20
##    ansi_numeric ansi_alphabetic bgn_date   end_date   county countyname evtype 
##           <dbl> <chr>           <date>     <date>      <dbl> <chr>      <fct>  
##  1            1 al              1950-04-18 1950-04-18     97 mobile     tornado
##  2            1 al              1950-04-18 1950-04-18      3 baldwin    tornado
##  3            1 al              1951-02-20 1951-02-20     57 fayette    tornado
##  4            1 al              1951-06-08 1951-06-08     89 madison    tornado
##  5            1 al              1951-11-15 1951-11-15     43 cullman    tornado
##  6            1 al              1951-11-15 1951-11-15     77 lauderdale tornado
##  7            1 al              1951-11-16 1951-11-16      9 blount     tornado
##  8            1 al              1952-01-22 1952-01-22    123 tallapoosa tornado
##  9            1 al              1952-02-13 1952-02-13    125 tuscaloosa tornado
## 10            1 al              1952-02-13 1952-02-13     57 fayette    tornado
## # … with 875,656 more rows, and 13 more variables: length <dbl>, width <dbl>,
## #   f <int>, mag <dbl>, fatalities <dbl>, injuries <dbl>, remarks <chr>,
## #   refnum <dbl>, propdmg <dbl>, cropdmg <dbl>, both <chr>, territory <chr>,
## #   territory_type <chr>

We are done with cleaning up the state columns.

Now I tidy the data up a bit by relocating columns and deleting columns that are no longer useful:

data <- data %>% select(-both, -length, -width, -f, -mag) %>% 
        relocate(c(territory, territory_type), .after=ansi_alphabetic) %>% 
  relocate(c(propdmg, cropdmg), .after=injuries) %>% 
        print
## # A tibble: 875,666 × 15
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1950-04-18 1950-04-18
##  2            1 al              alabama   state          1950-04-18 1950-04-18
##  3            1 al              alabama   state          1951-02-20 1951-02-20
##  4            1 al              alabama   state          1951-06-08 1951-06-08
##  5            1 al              alabama   state          1951-11-15 1951-11-15
##  6            1 al              alabama   state          1951-11-15 1951-11-15
##  7            1 al              alabama   state          1951-11-16 1951-11-16
##  8            1 al              alabama   state          1952-01-22 1952-01-22
##  9            1 al              alabama   state          1952-02-13 1952-02-13
## 10            1 al              alabama   state          1952-02-13 1952-02-13
## # … with 875,656 more rows, and 9 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>

11. Clean up county and countyname column

Let us take a look at the county columns, county and countyname. I realized that the countyname column does not always contain county names; sometimes it holds forecast zone names instead. These begin with three letters of which the third is a z (for example alz for an Alabama zone), followed by the zone number.
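
To illustrate the pattern on a few made-up values (this snippet is only an illustration; the values are not from the data):

# "alz001" and "txz123" match the zone code pattern; "mobile" does not
grepl("^..z[0-9]", c("alz001", "mobile", "txz123"))
# expected: TRUE FALSE TRUE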

We will later parse some data about counties and zone codes from the web. But first, let’s have a look at the data to see how many rows have zone names instead of county names, and how much damage/harm would be lost if we just deleted them.

data %>% filter(grepl("^..z[0-9]", countyname))
zone_code_sum <- data %>% filter(grepl("^..z[0-9]", countyname)) %>% 
        summarise(propdmg = sum(propdmg), cropdmg = sum(cropdmg), 
                  injuries = sum(injuries), fatalities = sum(fatalities)) %>% 
        print
## # A tibble: 1 × 4
##        propdmg     cropdmg injuries fatalities
##          <dbl>       <dbl>    <dbl>      <dbl>
## 1 161456906136 31172916550    18662       5020
data_sum <- data %>% summarize(propdmg=sum(propdmg), cropdmg=sum(cropdmg), injuries=sum(injuries), fatalities=sum(fatalities)) %>% 
        print
## # A tibble: 1 × 4
##        propdmg     cropdmg injuries fatalities
##          <dbl>       <dbl>    <dbl>      <dbl>
## 1 311648870438 48551728180   139193      14760
zone_code_sum/data_sum

There is too much data that would be lost if we just deleted these rows. We need to find another solution.

I realized that there are a couple of events causing a lot of damage and harm to health that have a county value of 0 and a countyname matching the pattern “^..z[0-9]”.

data %>% filter((county==0) & (grepl("^..z[0-9]", countyname)))

There is no county in the U.S. that has 0 as its number. And often, when county == 0, countyname matches the zone code pattern “^..z[0-9]”. But the number in the zone code is often the county number within the state. So let's replace the 0 in the county column: we extract the number from the zone code in countyname and put it into the county column instead:

county_fix <- data %>% 
        filter((county == 0) & (grepl("^..z[0-9][0-9][0-9]", countyname))) %>% 
        mutate(county = str_sub(countyname, 4L, 6L), county = as.numeric(county))

data <- data %>% full_join(county_fix) %>% 
        filter(!((county == 0) & (grepl("^..z[0-9][0-9][0-9]", countyname))))
## Joining, by = c("ansi_numeric", "ansi_alphabetic", "territory", "territory_type", "bgn_date", "end_date", "county", "countyname", "evtype", "fatalities", "injuries", "propdmg", "cropdmg", "remarks", "refnum")

I downloaded some data about forecast zones and county codes from the web:

# general: https://www.weather.gov/gis/AWIPSShapefiles

#forecast zones from: https://www.weather.gov/gis/PublicZones, download link: https://www.weather.gov/source/gis/Shapefiles/WSOM/z_08se21.zip

if (!file.exists("./data/forecastzones.zip")) {
  download.file("https://www.weather.gov/source/gis/Shapefiles/WSOM/z_08se21.zip", "./data/forecastzones.zip")
unzip("./data/forecastzones.zip", exdir = "./data/forecastzones")
}

forecast_zones <- read.dbf("./data/forecastzones/z_08se21.dbf")
forecast_zones <-as_tibble(forecast_zones)
forecast_zones <- forecast_zones %>% 
  select(STATE, NAME, ZONE)
forecast_zones <- forecast_zones %>% 
  rename(ansi_alphabetic=STATE, zone=ZONE, zonename=NAME) %>% 
  mutate(ansi_alphabetic=tolower(ansi_alphabetic), zone=tolower(zone), zonename=tolower(zonename)) %>% 
  mutate(zone=as.numeric(zone)) %>% 
        print
## # A tibble: 3,908 × 3
##    ansi_alphabetic zonename      zone
##    <chr>           <chr>        <dbl>
##  1 gu              kayangel        13
##  2 gu              koror           11
##  3 gu              pohnpei         41
##  4 gu              yap             21
##  5 gu              sorol           24
##  6 gu              ngulu           22
##  7 gu              lukunor         33
##  8 gu              ulithi          23
##  9 gu              satawal         26
## 10 gu              ailinglaplap    64
## # … with 3,898 more rows
#county zones from https://www.weather.gov/NWR/counties

file_counties <- "https://www.weather.gov/source/nwr/SameCode.txt" 
counties <- read_csv(file_counties, col_names = FALSE) %>% 
  rename(county = X1, countyname = X2, ansi_alphabetic = X3) %>% 
  mutate(county = str_sub(county, -3, -1), county = as.numeric(county), 
         countyname = tolower(countyname), ansi_alphabetic = tolower(ansi_alphabetic)) %>%
  print
## Rows: 3295 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3,295 × 3
##    county countyname ansi_alphabetic
##     <dbl> <chr>      <chr>          
##  1      1 autauga    al             
##  2      3 baldwin    al             
##  3      5 barbour    al             
##  4      7 bibb       al             
##  5      9 blount     al             
##  6     11 bullock    al             
##  7     13 butler     al             
##  8     15 calhoun    al             
##  9     17 chambers   al             
## 10     19 cherokee   al             
## # … with 3,285 more rows

Now, in each dataset (data, counties and forecast_zones), we create two separate columns: one combining ansi_alphabetic with the county/zone number, and one combining it with the county/zone name.

data <- data %>% 
  mutate(ansi_alphabetic_county=paste(ansi_alphabetic, county), 
         ansi_alphabetic_countyname = paste(ansi_alphabetic, countyname)) %>% 
  print
## # A tibble: 875,666 × 17
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1950-04-18 1950-04-18
##  2            1 al              alabama   state          1950-04-18 1950-04-18
##  3            1 al              alabama   state          1951-02-20 1951-02-20
##  4            1 al              alabama   state          1951-06-08 1951-06-08
##  5            1 al              alabama   state          1951-11-15 1951-11-15
##  6            1 al              alabama   state          1951-11-15 1951-11-15
##  7            1 al              alabama   state          1951-11-16 1951-11-16
##  8            1 al              alabama   state          1952-01-22 1952-01-22
##  9            1 al              alabama   state          1952-02-13 1952-02-13
## 10            1 al              alabama   state          1952-02-13 1952-02-13
## # … with 875,656 more rows, and 11 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>,
## #   ansi_alphabetic_county <chr>, ansi_alphabetic_countyname <chr>
counties <- counties %>% 
  mutate(ansi_alphabetic_county=paste(ansi_alphabetic, county), 
         ansi_alphabetic_countyname = paste(ansi_alphabetic, countyname)) %>% 
  print
## # A tibble: 3,295 × 5
##    county countyname ansi_alphabetic ansi_alphabetic_county ansi_alphabetic_cou…
##     <dbl> <chr>      <chr>           <chr>                  <chr>               
##  1      1 autauga    al              al 1                   al autauga          
##  2      3 baldwin    al              al 3                   al baldwin          
##  3      5 barbour    al              al 5                   al barbour          
##  4      7 bibb       al              al 7                   al bibb             
##  5      9 blount     al              al 9                   al blount           
##  6     11 bullock    al              al 11                  al bullock          
##  7     13 butler     al              al 13                  al butler           
##  8     15 calhoun    al              al 15                  al calhoun          
##  9     17 chambers   al              al 17                  al chambers         
## 10     19 cherokee   al              al 19                  al cherokee         
## # … with 3,285 more rows
forecast_zones <- forecast_zones %>% 
  mutate(ansi_alphabetic_zone=paste(ansi_alphabetic, zone), ansi_alphabetic_zonename=paste(ansi_alphabetic, zonename)) 

Now we filter out the rows in our “data” dataset that do not contain valid county numbers or county names. First, we check whether the rows that would be thrown out are too relevant to delete:

data_test <- data %>% 
        filter(!((ansi_alphabetic_county %in% counties$ansi_alphabetic_county) | 
                                       (ansi_alphabetic_county %in% forecast_zones$ansi_alphabetic_zone) | 
                                       (ansi_alphabetic_countyname %in% counties$ansi_alphabetic_countyname) | 
                                       (ansi_alphabetic_countyname %in% forecast_zones$ansi_alphabetic_zonename))) %>% 
        print
## # A tibble: 8,446 × 17
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1962-06-08 1962-06-08
##  2            1 al              alabama   state          1964-04-27 1964-04-27
##  3            1 al              alabama   state          1964-06-17 1964-06-17
##  4            1 al              alabama   state          1967-03-06 1967-03-06
##  5            1 al              alabama   state          1967-07-05 1967-07-05
##  6            1 al              alabama   state          1967-10-30 1967-10-30
##  7            1 al              alabama   state          1967-10-30 1967-10-30
##  8            1 al              alabama   state          1968-06-12 1968-06-12
##  9            1 al              alabama   state          1968-06-12 1968-06-12
## 10            1 al              alabama   state          1969-04-23 1969-04-23
## # … with 8,436 more rows, and 11 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>,
## #   ansi_alphabetic_county <chr>, ansi_alphabetic_countyname <chr>
data_test_sum <- data_test %>% summarize(propdmg=sum(propdmg), cropdmg=sum(cropdmg), injuries=sum(injuries), fatalities=sum(fatalities)) %>% 
        print
## # A tibble: 1 × 4
##       propdmg    cropdmg injuries fatalities
##         <dbl>      <dbl>    <dbl>      <dbl>
## 1 34064705150 1286301800     3362        719
sum_data <- data %>% summarize(propdmg=sum(propdmg), cropdmg=sum(cropdmg), injuries=sum(injuries), fatalities=sum(fatalities)) %>% 
        print
## # A tibble: 1 × 4
##        propdmg     cropdmg injuries fatalities
##          <dbl>       <dbl>    <dbl>      <dbl>
## 1 311648870438 48551728180   139193      14760
data_test_sum/sum_data
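
To make the comparison easier to read, the ratios can also be expressed as percentages (a minimal sketch that divides the two one-row summaries above element-wise):

round(100 * data_test_sum / sum_data, 1)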

Looking at the share of the “invalid” rows that we would potentially throw out, we can see that they still account for about 10% of the total propdmg, which is quite a lot. Ranking these rows by propdmg shows that the very first entry accounts for most of that property damage.

data_test %>% arrange(desc(propdmg))
data %>% filter(refnum==577616)
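
To see the mismatch more clearly, the relevant columns can be pulled out side by side (a quick look; the remarks column is still part of the dataset at this point):

data %>% 
        filter(refnum == 577616) %>% 
        select(refnum, territory, county, countyname, propdmg, remarks)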

When we look at this first entry, we can see that the county number does not match the remarks and the countyname. So let's correct its county value.

orleans <- data %>% filter(refnum==577616) %>% 
        mutate(county = 71)

data <- data %>% filter(refnum!=577616) %>% 
        bind_rows(orleans)
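
A quick check that the correction was applied (a minimal sketch):

data %>% filter(refnum == 577616) %>% select(refnum, county, countyname)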

Let’s have a look at the percentage again:

data <- data %>% 
  mutate(ansi_alphabetic_county = paste(ansi_alphabetic, county), 
         ansi_alphabetic_countyname = paste(ansi_alphabetic, countyname)) %>% 
  print
## # A tibble: 875,666 × 17
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1950-04-18 1950-04-18
##  2            1 al              alabama   state          1950-04-18 1950-04-18
##  3            1 al              alabama   state          1951-02-20 1951-02-20
##  4            1 al              alabama   state          1951-06-08 1951-06-08
##  5            1 al              alabama   state          1951-11-15 1951-11-15
##  6            1 al              alabama   state          1951-11-15 1951-11-15
##  7            1 al              alabama   state          1951-11-16 1951-11-16
##  8            1 al              alabama   state          1952-01-22 1952-01-22
##  9            1 al              alabama   state          1952-02-13 1952-02-13
## 10            1 al              alabama   state          1952-02-13 1952-02-13
## # … with 875,656 more rows, and 11 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>,
## #   ansi_alphabetic_county <chr>, ansi_alphabetic_countyname <chr>
data_test <- data %>% 
        filter(!((ansi_alphabetic_county %in% counties$ansi_alphabetic_county) | 
                                       (ansi_alphabetic_county %in% forecast_zones$ansi_alphabetic_zone) | 
                                       (ansi_alphabetic_countyname %in% counties$ansi_alphabetic_countyname) | 
                                       (ansi_alphabetic_countyname %in% forecast_zones$ansi_alphabetic_zonename)))

data_test_sum <- data_test %>% summarize(propdmg=sum(propdmg), cropdmg=sum(cropdmg), injuries=sum(injuries), fatalities=sum(fatalities))

sum_data <- data %>% summarize(propdmg=sum(propdmg), cropdmg=sum(cropdmg), injuries=sum(injuries), fatalities=sum(fatalities))

data_test_sum/sum_data

After correcting this highest-ranking entry (refnum 577616), the propdmg that would be lost by deleting the rows in data_test is no longer that high, so we can delete them without further ado. As the county variable contains not only county numbers but also forecast zone numbers, let us not lose the information about which number is a county and which one is a forecast zone. Most county numbers are also forecast zones, but sometimes the forecast zones are more fine-grained than the counties:

type_counties <- data %>% 
        filter(ansi_alphabetic_county %in% counties$ansi_alphabetic_county) %>% 
        mutate(zone_type = "county") %>%
        print
## # A tibble: 827,026 × 18
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1950-04-18 1950-04-18
##  2            1 al              alabama   state          1950-04-18 1950-04-18
##  3            1 al              alabama   state          1951-02-20 1951-02-20
##  4            1 al              alabama   state          1951-06-08 1951-06-08
##  5            1 al              alabama   state          1951-11-15 1951-11-15
##  6            1 al              alabama   state          1951-11-15 1951-11-15
##  7            1 al              alabama   state          1951-11-16 1951-11-16
##  8            1 al              alabama   state          1952-01-22 1952-01-22
##  9            1 al              alabama   state          1952-02-13 1952-02-13
## 10            1 al              alabama   state          1952-02-13 1952-02-13
## # … with 827,016 more rows, and 12 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>,
## #   ansi_alphabetic_county <chr>, ansi_alphabetic_countyname <chr>,
## #   zone_type <chr>
forecast_zone_type <- data %>% 
        filter(!ansi_alphabetic_county %in% counties$ansi_alphabetic_county) %>% 
        mutate(zone_type = "forecast_zone") %>% 
        print
## # A tibble: 48,640 × 18
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1962-06-08 1962-06-08
##  2            1 al              alabama   state          1964-04-27 1964-04-27
##  3            1 al              alabama   state          1964-06-17 1964-06-17
##  4            1 al              alabama   state          1967-03-06 1967-03-06
##  5            1 al              alabama   state          1967-07-05 1967-07-05
##  6            1 al              alabama   state          1967-10-30 1967-10-30
##  7            1 al              alabama   state          1967-10-30 1967-10-30
##  8            1 al              alabama   state          1968-06-12 1968-06-12
##  9            1 al              alabama   state          1968-06-12 1968-06-12
## 10            1 al              alabama   state          1969-04-23 1969-04-23
## # … with 48,630 more rows, and 12 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>,
## #   ansi_alphabetic_county <chr>, ansi_alphabetic_countyname <chr>,
## #   zone_type <chr>
data <- type_counties %>% full_join(forecast_zone_type) %>% 
        print
## Joining, by = c("ansi_numeric", "ansi_alphabetic", "territory", "territory_type", "bgn_date", "end_date", "county", "countyname", "evtype", "fatalities", "injuries", "propdmg", "cropdmg", "remarks", "refnum", "ansi_alphabetic_county", "ansi_alphabetic_countyname", "zone_type")
## # A tibble: 875,666 × 18
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1950-04-18 1950-04-18
##  2            1 al              alabama   state          1950-04-18 1950-04-18
##  3            1 al              alabama   state          1951-02-20 1951-02-20
##  4            1 al              alabama   state          1951-06-08 1951-06-08
##  5            1 al              alabama   state          1951-11-15 1951-11-15
##  6            1 al              alabama   state          1951-11-15 1951-11-15
##  7            1 al              alabama   state          1951-11-16 1951-11-16
##  8            1 al              alabama   state          1952-01-22 1952-01-22
##  9            1 al              alabama   state          1952-02-13 1952-02-13
## 10            1 al              alabama   state          1952-02-13 1952-02-13
## # … with 875,656 more rows, and 12 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>,
## #   ansi_alphabetic_county <chr>, ansi_alphabetic_countyname <chr>,
## #   zone_type <chr>

Now we can delete the rows that do not match our counties and forecast_zones datasets:

data <- data %>% 
        filter((ansi_alphabetic_county %in% counties$ansi_alphabetic_county) | (ansi_alphabetic_county %in% forecast_zones$ansi_alphabetic_zone) | 
        (ansi_alphabetic_countyname %in% forecast_zones$ansi_alphabetic_zonename)) %>% 
        select(-ansi_alphabetic_county, -ansi_alphabetic_countyname) %>% 
        print
## # A tibble: 867,200 × 16
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##           <dbl> <chr>           <chr>     <chr>          <date>     <date>    
##  1            1 al              alabama   state          1950-04-18 1950-04-18
##  2            1 al              alabama   state          1950-04-18 1950-04-18
##  3            1 al              alabama   state          1951-02-20 1951-02-20
##  4            1 al              alabama   state          1951-06-08 1951-06-08
##  5            1 al              alabama   state          1951-11-15 1951-11-15
##  6            1 al              alabama   state          1951-11-15 1951-11-15
##  7            1 al              alabama   state          1951-11-16 1951-11-16
##  8            1 al              alabama   state          1952-01-22 1952-01-22
##  9            1 al              alabama   state          1952-02-13 1952-02-13
## 10            1 al              alabama   state          1952-02-13 1952-02-13
## # … with 867,190 more rows, and 10 more variables: county <dbl>,
## #   countyname <chr>, evtype <fct>, fatalities <dbl>, injuries <dbl>,
## #   propdmg <dbl>, cropdmg <dbl>, remarks <chr>, refnum <dbl>, zone_type <chr>

Let's tidy the dataset up a bit:

data <- data %>% rename(zone = county, zone_description = countyname) %>% 
  relocate(zone_type, .after = zone_description) %>% 
  select(-remarks) %>% 
  mutate(ansi_numeric = as.factor(ansi_numeric), 
         ansi_alphabetic = as.factor(ansi_alphabetic), 
         territory = as.factor(territory), 
         territory_type = as.factor(territory_type), 
         zone = as.factor(zone), 
         zone_type = as.factor(zone_type)) %>% 
  print
## # A tibble: 867,200 × 15
##    ansi_numeric ansi_alphabetic territory territory_type bgn_date   end_date  
##    <fct>        <fct>           <fct>     <fct>          <date>     <date>    
##  1 1            al              alabama   state          1950-04-18 1950-04-18
##  2 1            al              alabama   state          1950-04-18 1950-04-18
##  3 1            al              alabama   state          1951-02-20 1951-02-20
##  4 1            al              alabama   state          1951-06-08 1951-06-08
##  5 1            al              alabama   state          1951-11-15 1951-11-15
##  6 1            al              alabama   state          1951-11-15 1951-11-15
##  7 1            al              alabama   state          1951-11-16 1951-11-16
##  8 1            al              alabama   state          1952-01-22 1952-01-22
##  9 1            al              alabama   state          1952-02-13 1952-02-13
## 10 1            al              alabama   state          1952-02-13 1952-02-13
## # … with 867,190 more rows, and 9 more variables: zone <fct>,
## #   zone_description <chr>, zone_type <fct>, evtype <fct>, fatalities <dbl>,
## #   injuries <dbl>, propdmg <dbl>, cropdmg <dbl>, refnum <dbl>

The relevant columns for our analysis are cleaned up. The data is ready to be processed.

Description of the dataset resulting from the preprocessing

At the end of the data preprocessing chapter, the clean data should contain 867,200 rows and 15 columns.
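
As a quick sanity check on these dimensions (a minimal sketch):

dim(data)
stopifnot(nrow(data) == 867200, ncol(data) == 15)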

These are the columns/variables:

TERRITORY/STATE

The information in the following four columns is scraped from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations

- ansi_numeric

numeric ANSI code for all states and territories of the U.S.

- ansi_alphabetic

alphabetic ANSI code for all states and territories of the U.S.

- territory

the state or territory name

- territory_type

the type of territory. The types are: state, Federal district, Insular area (Commonwealth), Insular area (Territory), Insular Areas

DATE

- bgn_date

beginning date of the event in the county or forecast zone(s) of the specific state

- end_date

end date of the event in the county or forecast zone(s) of the specific state

COUNTY/FORECAST ZONE

The basis for the counties is this site: https://www.weather.gov/NWR/counties, which links to this file: https://www.weather.gov/source/nwr/SameCode.txt. The basis for the forecast zones is this site: https://www.weather.gov/gis/PublicZones, download link: https://www.weather.gov/source/gis/Shapefiles/WSOM/z_08se21.zip

- zone

either a county number or a forecast zone number. If a weather event hit several forecast zones, only one of them (usually the first) appears in the zone column

- zone_description

further information on the zone, like the county name, but sometimes also the range of forecast zones that the event hit.

- zone_type

either county or forecast_zone. If the type is county, the zone might also be a forecast zone. If the row says forecast_zone, it is not a county at the same time, since forecast zones can be more detailed than counties.

EVENTS

- evtype

there are 43 unique event types (see the quick check after this list). You can find further information on the weather events here: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

HARM TO HUMAN HEALTH

- fatalities

- injuries

ECONOMIC DAMAGE

- propdmg

property damage

- cropdmg

crop damage

UNIQUE ID

- refnum

Each entry has a unique reference number.
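
Two of the claims above, the 43 distinct event types and the split between counties and forecast zones, can be verified directly (a quick sketch):

n_distinct(data$evtype)
data %>% count(zone_type)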

Data Processing

Now we summarize the data into the sums of the population health harms (fatalities, injuries) and the economic damages (property damage, crop damage) per event type:

sum_damage_harm <- data %>% 
        group_by(evtype) %>% 
        summarise(fatalities = sum(fatalities), injuries = sum(injuries), 
                  propdmg = sum(propdmg), cropdmg = sum(cropdmg)) %>%
        arrange(desc(fatalities)) %>%
        print
## # A tibble: 43 × 5
##    evtype            fatalities injuries     propdmg     cropdmg
##    <fct>                  <dbl>    <dbl>       <dbl>       <dbl>
##  1 tornado                 5658    91346 58537110397   417454710
##  2 excessive heat          1919     6595    18328700   502952000
##  3 flash flood             1028     1799 16892384703  1532014150
##  4 heat                     905     2037     1797000   401461500
##  5 lightning                816     5225   928609917    12092090
##  6 thunderstorm wind        700     9353 10908914922  1159705900
##  7 rip current              524      513      163000           0
##  8 flood                    473     6793 34636860860 10740072950
##  9 high wind                253     1261  5880903295   648083200
## 10 frost/freeze             202      631    99301400  2411604000
## # … with 33 more rows

The data processing required for the analysis is completed.

Description of the dataset resulting from the processing

The columns/variables resulting from the data processing in sum_damage_harm are:

- evtype

see evtype in the data preprocessing step

- fatalities

sum of fatalities of the cleaned “data” dataset

- injuries

sum of injuries of the cleaned “data” dataset

- propdmg

sum of propdmg of the cleaned “data” dataset

- cropdmg

sum of cropdmg of the cleaned “data” dataset

Results

1. Question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Now we plot the calculated sums as a bar plot.

Figure 1

sum_damage_harm %>% 
        gather(harmtype, harmcount, c(2,3)) %>% 
        mutate(evtype = fct_reorder(evtype, harmcount)) %>% 
        ggplot(aes(x=harmcount, y=evtype, fill=harmtype)) + 
        geom_bar(stat = "identity") +
        labs(y="Event Types", x="Cases", title="Most harmful weather events for population health in the U.S.", caption= "Figure 1: Most harmful weather events for population health in the U.S.")

In the figure above we can see how harmful each weather event type is to population health in the U.S.

Answer

We can see that tornadoes are by far the most harmful events, for both fatalities and injuries. In general, when injuries are high, fatalities are also high. There are a few exceptions, such as excessive heat, where the number of injuries is relatively low compared to the number of fatalities.
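
To make these exceptions easier to spot, we can compute the ratio of injuries to fatalities per event type (a small sketch; the injuries_per_fatality column is introduced here only for illustration):

sum_damage_harm %>% 
        filter(fatalities > 0) %>%   # avoid division by zero
        mutate(injuries_per_fatality = injuries / fatalities) %>% 
        select(evtype, fatalities, injuries, injuries_per_fatality) %>% 
        arrange(injuries_per_fatality)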

2. Question: Across the United States, which types of events have the greatest economic consequences?

Figure 2

sum_damage_harm %>% 
        gather(dmgtype, dmgcount, c(4,5)) %>% 
        mutate(evtype = fct_reorder(evtype, dmgcount)) %>% 
        ggplot(aes(x=dmgcount, y=evtype, fill=dmgtype)) + 
        geom_bar(stat = "identity") +
        labs(y="Event Types", x="Cases", title="Economic damage caused by weather events in the U.S.", caption="Figure 2: Economic damage caused by weather events in the U.S.")

Answer

The economic damage graph shows that the economic damage is highest for weather events such as hurricanes/typhoons and tornadoes. However, it is interesting to see that the economic damages are dominated by property damage. The highest crop damages are caused by flood and drought.
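
The crop damage observation can be double-checked by ranking the event types by cropdmg alone (a quick sketch):

sum_damage_harm %>% 
        select(evtype, cropdmg) %>% 
        arrange(desc(cropdmg)) %>% 
        head(10)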