Analysis of harmful weather effects on population health in the United States

Synopsis

The analysis answers two questions based on NOAA Storm Database. 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? To address this question I determined which event types resulted in most direct and indirect casualties. The data contained many typos and input errors which were cleaned before analysis.

2.Across the United States, which types of events have the greatest economic consequences? To address this questions I’ve checked which events resulted in most property and crop damage. Units of measurement (B, M, K for billions, millions, and thousands, respectively, were taken into account).

Data Processing

The data was loaded from raw compressed csv.bz2 as follows. EVTYPE field was cleaned from the following errors (typos): 1. EVTYPE set to ‘THUNDERSTORM’ for all the entries with ‘THUNDERSTORM’ and ‘TSTM’ 2. set to ‘TORNADO’ for all the entries with ‘TORNADO’

I. Question 1

    # read the data
    data <- read.csv("StormData.csv.bz2")
    # we'll use dplyr to filter and summarise:
    library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

    library(data.table)

## Warning: package 'data.table' was built under R version 3.4.2

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

    # clean the data
    data$EVTYPE[data$EVTYPE %like% "TORNADO"] <- "TORNADO"
    data$EVTYPE[data$EVTYPE %like% "THUNDERSTORM"] <- "THUNDERSTORM"
    data$EVTYPE[data$EVTYPE %like% "TSTM"] <- "THUNDERSTORM"
    # group by EVTYPE and summarize by two fields (FATALITIES, INJURIES)
    grouped_data <- group_by(data, EVTYPE)
    # summarize, check and remove is any NAs
    summ <- summarize(grouped_data, total = sum(FATALITIES, na.rm = TRUE) + sum(INJURIES, na.rm = TRUE))

## Warning: package 'bindrcpp' was built under R version 3.4.2

    # output 10 event types with the highest fatalities + injuries:
    arrange(summ, desc(total))[1:10,]

## # A tibble: 10 x 2
##            EVTYPE total
##            <fctr> <dbl>
##  1        TORNADO 97068
##  2   THUNDERSTORM 10273
##  3 EXCESSIVE HEAT  8428
##  4          FLOOD  7259
##  5      LIGHTNING  6046
##  6           HEAT  3037
##  7    FLASH FLOOD  2755
##  8      ICE STORM  2064
##  9   WINTER STORM  1527
## 10      HIGH WIND  1385

    # calculate the % of total for the most harmful EVTYPE:
    arrange(summ, desc(total))[1,2]/sum(summ$total)

##       total
## 1 0.6235378

    # prepare the data for vizualization - top 5 EVTYPE and other:
    viz <- arrange(summ, desc(total))[1:5,]
    viz[6,] <- c('OTHER', sum(summ$total) - sum(viz$total))
    viz$total <- as.numeric(viz$total)
    library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.2

    # build the barplot
    ggplot(data=viz, aes(x=EVTYPE, y=total)) + ggtitle("Events most harmful to population") + labs(y = "Fatalities and Injuries", x = "Events") + geom_bar(stat="identity")

Question 2

    #converting from units to $:
    data$PROPDMG[data$PROPDMGEXP %like% "B"] <- data$PROPDMG[data$PROPDMGEXP %like% "B"]*1000000000
    data$PROPDMG[data$PROPDMGEXP %like% "M"] <- data$PROPDMG[data$PROPDMGEXP %like% "M"]*1000000
    data$PROPDMG[data$PROPDMGEXP %like% "K"] <- data$PROPDMG[data$PROPDMGEXP %like% "K"]*1000
    data$CROPDMG[data$CROPDMGEXP %like% "B"] <- data$CROPDMG[data$CROPDMGEXP %like% "B"]*1000000000
    data$CROPDMG[data$CROPDMGEXP %like% "M"] <- data$CROPDMG[data$CROPDMGEXP %like% "M"]*1000000
    data$CROPDMG[data$CROPDMGEXP %like% "K"] <- data$CROPDMG[data$CROPDMGEXP %like% "K"]*1000
    # group by EVTYPE and summarize by two fields (PROPDMG, CROPDMG)
    grouped_data <- group_by(ungroup(data), EVTYPE)
    # summarize, check and remove is any NAs
    summ <- summarize(grouped_data, total = sum(PROPDMG, na.rm = TRUE) + sum(CROPDMG, na.rm = TRUE))
    # output 10 event types with the highest fatalities + injuries:
    arrange(summ, desc(total))[1:10,]

## # A tibble: 10 x 2
##               EVTYPE        total
##               <fctr>        <dbl>
##  1             FLOOD 150319678257
##  2 HURRICANE/TYPHOON  71913712800
##  3           TORNADO  58999059560
##  4       STORM SURGE  43323541000
##  5              HAIL  18752904943
##  6       FLASH FLOOD  17562129167
##  7           DROUGHT  15018672000
##  8         HURRICANE  14610229010
##  9      THUNDERSTORM  12245295473
## 10       RIVER FLOOD  10148404500

    # calculate the % of total for EVTYPE with the most damage:
    arrange(summ, desc(total))[1,2]/sum(summ$total)

##     total
## 1 0.31555

Results of the analysis.

It was found that THUNDERSTORM has by far caused the most casualties in all the period for which the data is available (97,068 out of 155,673 or 62.4%). Flood, however, caused the most property and crop damage - USD 150.3B out of total USD 476.4B (31.6%).

Analysis of harmful weather effects on population health in the United States

Mikhail Chepkiy

13 November 2017

Synopsis

Data Processing

Results of the analysis.