Synopsis

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data processing

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Answering the questions through further cleaning and exploration

Q.1 Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

For this question I will need variables EVTYPE, FATALITIES, and INJURIES

Missing values in these columns: 0
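A count like the one above can be obtained with `is.na()` and `colSums()`. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the storm data):

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, NA, 0),
  INJURIES   = c(3, 2, 0)
)

# Count NA values in each column of interest
na_counts <- colSums(is.na(toy[, c("EVTYPE", "FATALITIES", "INJURIES")]))
na_counts
```

A result of zero in every column would correspond to the "no missing values" case reported here.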

I use the dplyr package to prepare the data for this question. I group the data by event type and sum the fatalities and injuries across all counties, which shows which events cause the most harm to human life and health (injuries) compared with other events.

library(dplyr)

# subsetting the data for our use 

total_health <- (sum(data$FATALITIES)+sum(data$INJURIES))

health <- data %>% select(EVTYPE, FATALITIES, INJURIES) %>% 
        group_by(EVTYPE) %>% 
        summarise(Tot_fatalities = sum(FATALITIES),
                  Tot_injuries = sum(INJURIES),
                  Total = Tot_fatalities + Tot_injuries ,
                  percentage = Total/total_health) %>%
        arrange(desc(percentage))

head(health)
## # A tibble: 6 x 5
##   EVTYPE         Tot_fatalities Tot_injuries Total percentage
##   <chr>                   <dbl>        <dbl> <dbl>      <dbl>
## 1 TORNADO                  5633        91346 96979     0.623 
## 2 EXCESSIVE HEAT           1903         6525  8428     0.0541
## 3 TSTM WIND                 504         6957  7461     0.0479
## 4 FLOOD                     470         6789  7259     0.0466
## 5 LIGHTNING                 816         5230  6046     0.0388
## 6 HEAT                      937         2100  3037     0.0195
# checking how the totals are distributed across quantiles
quantiles <- quantile(health$Total, c(0 ,0.75, 0.80, 0.85, 0.90, 0.95, 1))
quantiles
##      0%     75%     80%     85%     90%     95%    100% 
##     0.0     0.0     1.0     2.0     7.6    64.8 96979.0

From the data I can deduce that tornadoes cause the most harm to human health, accounting for 62.29% of all fatalities and injuries. There are 985 different event types. The data is also skewed, meaning a few event types cause far more harm than the rest, so I checked how the totals are distributed.

Note that 75% of event types have a total of zero, and most of the harm is concentrated in the top 5% of event types.
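One way to quantify this kind of concentration is a cumulative share over the sorted totals. This is only a sketch with hypothetical numbers (the vector `totals` and its labels are invented for illustration and are not the storm data):

```r
# Hypothetical per-event totals, sorted from largest to smallest
totals <- sort(c(A = 900, B = 60, C = 30, D = 10, E = 0, F = 0),
               decreasing = TRUE)

# Cumulative share of the overall total covered by the top k event types
cum_share <- cumsum(totals) / sum(totals)
round(cum_share, 2)

# Number of event types needed to cover at least 95% of the total
n_needed <- which(cum_share >= 0.95)[1]
n_needed
```

Applied to `health$Total`, the same idea would show how few event types account for most fatalities and injuries.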

I separate the largest values from the skewed distribution to find more insights: event types below 0.9% of the total are grouped as "Other".

# plot ready data
health <- health %>%
        mutate(Storm = if_else(percentage < 0.009, "Other", EVTYPE)) %>% 
        group_by(Storm) %>% 
        summarize(percentage = sum(percentage)) %>%
        arrange(percentage)

library(ggplot2)

# Pie chart: a stacked column drawn in polar coordinates
pie <- ggplot(health, aes(x = '', y = percentage, fill = Storm))

pie +
        coord_polar('y', start = 0) + geom_col(color = 'black') + 
        geom_text(aes(label = paste0(round(percentage * 100), "%")),
                  position = position_stack(vjust = 0.5)) +
        theme(panel.background = element_blank(), 
              axis.line = element_blank(), 
              axis.text = element_blank(), 
              axis.ticks = element_blank(),
              axis.title = element_blank(), 
              plot.title = element_text(hjust = 0.5, size = 15)) +
        ggtitle("Share of Fatalities and Injuries by Storm Type (1950-2011)") +
        guides(fill = guide_legend(reverse = TRUE))

Q.2 Across the United States, which types of events have the greatest economic consequences?

To answer this question I use the columns EVTYPE, PROPDMG, and CROPDMG; the latter two appear to be the cost of the property and crop damage caused by an event. Two more columns, PROPDMGEXP and CROPDMGEXP, give the exponent or scale of the damage cost in the previous columns (online research gave me this insight).

I will check whether there are any null values in these columns.

Missing values in PROPDMG, CROPDMG: 0. So there are no missing values.

I will subset the data.

# filtering the data
economics <- data %>%
        select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG,
               CROPDMGEXP)

# finding the exponent codes used for the damage cost
union(unique(economics$PROPDMGEXP),
      unique(economics$CROPDMGEXP))
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
## [20] "k"

There are 20 distinct codes in the output above. (Note that upper and lower case have the same value, e.g. “h” and “H” = 100.)

Note: EXP = exponent. These are the possible values of CROPDMGEXP and PROPDMGEXP:

H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character

  • H,h = hundreds = 100
  • K,k = kilos = thousands = 1,000
  • M,m = millions = 1,000,000
  • B,b = billions = 1,000,000,000
  • (+) = 1
  • (-) = 0
  • (?) = 0
  • blank/empty character = 0
  • numeric 0..8 = 10
# first I will convert PROPDMG

economics <- economics %>%
        mutate(PROPDMG = case_when(
                grepl("[Hh]", PROPDMGEXP)  ~ PROPDMG * 100,   # multiply by hundred
                grepl("[Kk]", PROPDMGEXP)  ~ PROPDMG * 1e3,   # multiply by thousand
                grepl("[Mm]", PROPDMGEXP)  ~ PROPDMG * 1e6,   # multiply by million
                grepl("[Bb]", PROPDMGEXP)  ~ PROPDMG * 1e9,   # multiply by billion
                grepl("[0-9]", PROPDMGEXP) ~ PROPDMG * 10,    # numeric codes: multiply by ten
                TRUE                       ~ PROPDMG          # "+", "-", "?", blank: unchanged
        ))

# Now I will convert CROPDMG

economics <- economics %>%
        mutate(CROPDMG = case_when(
                grepl("[Hh]", CROPDMGEXP)  ~ CROPDMG * 100,   # multiply by hundred
                grepl("[Kk]", CROPDMGEXP)  ~ CROPDMG * 1e3,   # multiply by thousand
                grepl("[Mm]", CROPDMGEXP)  ~ CROPDMG * 1e6,   # multiply by million
                grepl("[Bb]", CROPDMGEXP)  ~ CROPDMG * 1e9,   # multiply by billion
                grepl("[0-9]", CROPDMGEXP) ~ CROPDMG * 10,    # numeric codes: multiply by ten
                TRUE                       ~ CROPDMG          # "+", "-", "?", blank: unchanged
        ))
head(economics)
##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO   25000          K       0           
## 2 TORNADO    2500          K       0           
## 3 TORNADO   25000          K       0           
## 4 TORNADO    2500          K       0           
## 5 TORNADO    2500          K       0           
## 6 TORNADO    2500          K       0
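
The conversion above can also be written as a lookup from exponent code to multiplier. The sketch below uses base R only; `exp_multiplier` is a hypothetical helper name, not part of the original analysis, and it assumes the same rules as the conversion code: H/K/M/B in either case, any digit treated as 10, and every other code left at a multiplier of 1.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Codes other than H/K/M/B and digits ("+", "-", "?", blank) fall back to 1,
# mirroring the final branch of the conversion code.
exp_multiplier <- function(exp_code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(exp_code)
  mult <- unname(lookup[code])           # NA where the code is not H/K/M/B
  mult[grepl("^[0-9]$", code)] <- 10     # numeric codes are treated as 10
  mult[is.na(mult)] <- 1                 # everything else: value unchanged
  mult
}

# Example: scale a few made-up damage values by their exponent codes
dmg  <- c(25, 2.5, 3, 7, 1)
code <- c("K", "m", "5", "+", "")
dmg * exp_multiplier(code)
```

With such a helper, each damage column could be converted in a single vectorised multiplication, for example `PROPDMG * exp_multiplier(PROPDMGEXP)`.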

I have cleaned the data in the steps above. Note that the code treats “+”, “?”, “-”, and "" as a multiplier of 1 (an exponent of 0), rather than the 0 listed above for “-”, “?”, and blank; that is why those values are left unchanged in the final branch.
Now I transform the data for plotting and answering the question.

# plot ready
economics <- economics %>% 
        group_by(EVTYPE) %>% 
        summarise(Total_propdmg = sum(PROPDMG),
                  Total_cropdmg = sum(CROPDMG),
                  Total = Total_cropdmg + Total_propdmg) %>%
        arrange(desc(Total))

#exploring data 
summary(economics$Total_propdmg)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 4.338e+08 5.105e+04 1.447e+11
summary(economics$Total_cropdmg)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 4.985e+07 0.000e+00 1.397e+10
summary(economics$Total)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 4.837e+08 8.500e+04 1.503e+11

From the above summaries I can say that most of the values are zero, since the median is zero, and that the largest values lie in the top quartile.

I will plot the top 10 events by cost.

top <- economics[11,]$Total # cutoff: total of the 11th-ranked event type
economics <- economics %>% 
        mutate(plot = if_else(Total > top, 
                              EVTYPE, 'Other'))  %>% 
        group_by(plot) %>% 
        summarise(Total = sum(Total)) %>%
        arrange(desc(Total))

event <- economics$plot
total<- economics$Total
total <- total/1e6  # In millions

# bar plot
p <- ggplot() + geom_bar(aes(
        x = reorder(event, total),
        y = total,
        fill = total
),
stat = "identity",
show.legend = FALSE)

p + ggtitle("Storm impact on USA economy (1950-2011)") +
        theme(plot.title = element_text(hjust = 0.5)) +
        theme(
                plot.title = element_text(size = 15)
        ) +
        xlab("") + ylab("Cost of Damage ( in million US $)") +
        coord_flip() +
        scale_fill_gradient(low = "blue", high = "red")

Result

I answered the above questions and conclude:

  • Human health (deaths and injuries) is most affected by tornadoes.

Limitation / Notes