Synopsis

I used the data described below to answer two questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

In the data processing section I read the data and describe it.
I then address each question by subsetting the data and doing exploratory analysis appropriate for that question.
Missing values are present in the overall data, but I did not clean them because none occur in the subsets needed for the questions.
I tried to answer both questions properly and to explain the steps where needed.
The libraries used are ggplot2 and dplyr.
I also include limitations at the end.

Data processing

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data link and Supporting documentation

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

DATA

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

The code below reads the data.

# Read the data 
setwd('D:/Rishikesh_Data_Science/R_proggrame_coursera/Reproducible Research/Assignment_2_week_4')
data <- read.csv('repdata_data_StormData.csv.bz2')

head(data,3)

##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3

NA_percent <- mean(is.na(data))*100

About 5.23% of all values are missing, which is fairly low; I will check how much of the missing data affects each question.
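One way to check missingness for only the columns a question needs is to compute the share of NA values per column. The sketch below is illustrative: it uses a small made-up data frame (`toy`) instead of the storm data, and `na_share` is a hypothetical helper name, not something used elsewhere in this report.

```r
# Toy stand-in for the storm data (values are made up for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
  FATALITIES = c(0, 1, 0, 2),
  INJURIES   = c(15, 0, 0, 3),
  F          = c(3, NA, NA, NA)   # a column with missing values
)

# Hypothetical helper: percentage of NA values in each selected column
na_share <- function(df, cols) {
  colMeans(is.na(df[, cols, drop = FALSE])) * 100
}

na_share(toy, c("EVTYPE", "FATALITIES", "INJURIES"))  # columns needed for Q.1
na_share(toy, "F")                                    # a column with gaps
```

Running the same helper on the real data frame with the columns for each question would show whether the overall missingness matters for that question.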

Answering the questions by further cleaning and exploring

Q.1 Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

For this question I will need the variables EVTYPE, FATALITIES, and INJURIES.

Missing values in these columns: 0

I use the dplyr package to prepare the data for this question. I group the data by event type and sum the fatalities and injuries over all records, which shows which events do the most harm to human life and health compared with the others.

library(dplyr)

# subsetting the data for our use 

total_health <- (sum(data$FATALITIES)+sum(data$INJURIES))

health <- data %>% select(EVTYPE, FATALITIES, INJURIES) %>% 
        group_by(EVTYPE) %>% 
        summarise(Tot_fatalities = sum(FATALITIES),
                  Tot_injuries = sum(INJURIES),
                  Total = Tot_fatalities + Tot_injuries ,
                  percentage = Total/total_health) %>%
        arrange(desc(percentage))

head(health)

## # A tibble: 6 x 5
##   EVTYPE         Tot_fatalities Tot_injuries Total percentage
##   <chr>                   <dbl>        <dbl> <dbl>      <dbl>
## 1 TORNADO                  5633        91346 96979     0.623 
## 2 EXCESSIVE HEAT           1903         6525  8428     0.0541
## 3 TSTM WIND                 504         6957  7461     0.0479
## 4 FLOOD                     470         6789  7259     0.0466
## 5 LIGHTNING                 816         5230  6046     0.0388
## 6 HEAT                      937         2100  3037     0.0195

# finding which quantiles hold most of the total
quantiles <- quantile(health$Total, c(0 ,0.75, 0.80, 0.85, 0.90, 0.95, 1))
quantiles

##      0%     75%     80%     85%     90%     95%    100% 
##     0.0     0.0     1.0     2.0     7.6    64.8 96979.0

From the data I can deduce that tornadoes cause the most harm to human health: they account for about 62.3% of all fatalities and injuries combined. There are 985 different event types. The data are also heavily skewed, meaning a few event types cause far more harm than the rest, so I checked how the totals are distributed.

Note that at least 75% of event types have a total of zero, and the largest totals are concentrated in the top 5% of event types.
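To make this kind of skew concrete, one option is to look at the cumulative share of the total held by the largest event types, together with the share of event types that have a zero total. This is a minimal sketch on made-up numbers; `totals`, `cum_share`, and `zero_share` are illustrative names and the values are not taken from the storm data.

```r
# Toy totals per event type (made-up numbers)
totals <- c(TORNADO = 900, HEAT = 60, FLOOD = 30, HAIL = 10,
            FOG = 0, DUST = 0, SLEET = 0, FROST = 0)

sorted    <- sort(totals, decreasing = TRUE)
cum_share <- cumsum(sorted) / sum(sorted)   # cumulative share of the overall total

# Share of event types with a total of zero
zero_share <- mean(totals == 0)

round(cum_share, 2)
zero_share
```

Applied to the `Total` column of the grouped data, the same two quantities would show how few event types carry most of the harm.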

I keep the event types with the largest shares and group the rest (those below 0.9% of the total) as “Other” to draw out more insight from the skewed distribution.

# plot-ready data
health <- health %>%
        mutate(Storm = if_else(percentage < 0.009, "Other", EVTYPE)) %>% 
        group_by(Storm) %>% 
        summarize(percentage = sum(percentage)) %>%
        arrange(percentage)

library(ggplot2)

# Pie chart
pie <- ggplot(health, aes(x = '', y = percentage, fill = Storm))


pie +
        coord_polar('y', start = 0) + geom_col(color = 'black') + 
        geom_text(aes(label = paste0(round(percentage*100), "%")), position = position_stack(vjust = 0.5)) +
        theme(panel.background = element_blank(), 
              axis.line = element_blank(), 
              axis.text = element_blank(), 
              axis.ticks = element_blank(),
              axis.title = element_blank(), 
              plot.title = element_text(hjust = 0.5, size = 15)) +
        ggtitle("Share of Fatalities and Injuries by Storm Type (1950-2011)") +
        guides(fill = guide_legend(reverse = TRUE))

Q.2 Across the United States, which types of events have the greatest economic consequences?

To answer this question I use the columns EVTYPE, PROPDMG, and CROPDMG, which I take to be the cost of damage to property and crops in an event. There are two more columns, PROPDMGEXP and CROPDMGEXP, which give the exponent or scale to apply to the damage values in the previous columns (online research gave me this insight).

I will check whether there are any missing values in these columns.

Missing values in PROPDMG, CROPDMG: 0, so there are no missing values.

I will subset the data.

# filtering the data
economics <- data %>%
        select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG,
               CROPDMGEXP)

# finding Exponential factor for cost 
union(unique(economics$PROPDMGEXP),
      unique(economics$CROPDMGEXP))

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
## [20] "k"

I identify 20 distinct values from the above analysis. (Note that upper and lower case have the same value, e.g. “h” and “H” = 100.)

Note: EXP = exponent. These are the possible values of CROPDMGEXP and PROPDMGEXP:

H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character

H,h = hundreds = 100
K,k = kilos = thousands = 1,000
M,m = millions = 1,000,000
B,b = billions = 1,000,000,000
(+) = 1
(-) = 0
(?) = 0
blank/empty character = 0
numeric 0..8 = 10
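The table above can also be written as a small lookup function. This is only a sketch: `exp_to_multiplier` is a hypothetical helper name, and it follows the table exactly as listed, whereas the conversion code in this report leaves values with “+”, “-”, “?”, or a blank exponent unscaled.

```r
# Hypothetical helper: map a damage-exponent code to a numeric multiplier,
# following the table above as listed
exp_to_multiplier <- function(exp_code) {
  code   <- toupper(exp_code)                  # "k" and "K" are treated alike
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9,
              "+" = 1, "-" = 0, "?" = 0)
  mult <- unname(lookup[code])                 # NA for codes not in the lookup
  mult[grepl("^[0-8]$", code)] <- 10           # numeric codes 0..8
  mult[code == ""] <- 0                        # blank / empty character
  mult
}

exp_to_multiplier(c("K", "m", "", "5", "+", "?"))
```

A vector of multipliers like this could be multiplied directly with PROPDMG or CROPDMG; any code outside the table comes back as NA, which makes unexpected values easy to spot.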

# first I will convert PROPDMG

economics <- economics %>%
        mutate(PROPDMG = case_when(
                grepl("[Hh]", PROPDMGEXP)  ~ PROPDMG * 100,   # multiply by hundred
                grepl("[Kk]", PROPDMGEXP)  ~ PROPDMG * 1e3,   # multiply by thousand
                grepl("[Mm]", PROPDMGEXP)  ~ PROPDMG * 1e6,   # multiply by million
                grepl("[Bb]", PROPDMGEXP)  ~ PROPDMG * 1e9,   # multiply by billion
                grepl("[0-9]", PROPDMGEXP) ~ PROPDMG * 10,    # numeric exponent codes
                TRUE                       ~ PROPDMG          # "+", "-", "?", blank: unchanged
        ))

# Now I will convert CROPDMG

economics <- economics %>%
        mutate(CROPDMG = case_when(
                grepl("[Hh]", CROPDMGEXP)  ~ CROPDMG * 100,   # multiply by hundred
                grepl("[Kk]", CROPDMGEXP)  ~ CROPDMG * 1e3,   # multiply by thousand
                grepl("[Mm]", CROPDMGEXP)  ~ CROPDMG * 1e6,   # multiply by million
                grepl("[Bb]", CROPDMGEXP)  ~ CROPDMG * 1e9,   # multiply by billion
                grepl("[0-9]", CROPDMGEXP) ~ CROPDMG * 10,    # numeric exponent codes
                TRUE                       ~ CROPDMG          # "+", "-", "?", blank: unchanged
        ))
head(economics)

##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO   25000          K       0           
## 2 TORNADO    2500          K       0           
## 3 TORNADO   25000          K       0           
## 4 TORNADO    2500          K       0           
## 5 TORNADO    2500          K       0           
## 6 TORNADO    2500          K       0

I have cleaned the data in the steps above. Note that for “+”, “?”, “-”, and blank exponents the code applies a multiplier of 1 (it treats them as an exponent of 0), which is why these values are kept unchanged in the final fallback branch.
Now I transform the data for plotting and for answering the question.

# plot ready
economics <- economics %>% 
        group_by(EVTYPE) %>% 
        summarise(Total_propdmg = sum(PROPDMG),
                  Total_cropdmg = sum(CROPDMG),
                  Total = Total_cropdmg + Total_propdmg) %>%
        arrange(desc(Total))

#exploring data 
summary(economics$Total_propdmg)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 4.338e+08 5.105e+04 1.447e+11

summary(economics$Total_cropdmg)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 4.985e+07 0.000e+00 1.397e+10

summary(economics$Total)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 4.837e+08 8.500e+04 1.503e+11

From the summaries above I can say that most of the totals are zero, since the median is zero, and that the largest values lie in the top quartile.

I will plot the 10 costliest event types.

top <- economics[11,]$Total # total of the 11th-ranked event type, used as the cutoff
economics <- economics %>% 
        mutate(plot = if_else(Total > top, 
                              EVTYPE,'Other'))  %>% 
        group_by(plot) %>% 
        summarise(Total = sum(Total)) %>%
        arrange(desc(Total))

event <- economics$plot
total<- economics$Total
total <- total/1e6  # In millions

# bar plot
p <- ggplot() + geom_bar(aes(
        x = reorder(event, total),
        y = total,
        fill = total
),
stat = "identity",
show.legend = FALSE)

p + ggtitle("Storm impact on US economy (1950-2011)") +
        theme(plot.title = element_text(hjust = 0.5)) +
        theme(
                plot.title = element_text(size = 15)
        ) +
        xlab("") + ylab("Cost of Damage ( in million US $)") +
        coord_flip() +
        scale_fill_gradient(low = "blue", high = "red")

Result

I answered the above questions and conclude:

Human health (deaths and injuries) is affected most by tornadoes.
Between 1950 and 2011, tornadoes caused 96,979 fatalities and injuries, which is about 62% of the total across all event types.
The other top culprits were excessive heat, TSTM (thunderstorm) wind, and flood.
Flood ranked first for economic damage (property and crops).
Flood damage cost the US economy around 150.3 billion dollars.

Limitation / Notes

For the first analysis I ignored the fact that the same event (storm) can appear under multiple types/names; since the tornado share of 62% is more than half, merging them would not have changed the result.
For the meaning of the exponents I relied on an online article, which I assume is correct.
For the second question I plotted only the top 10 event types, as there are 985 event types, which are hard to show on one graph. Note that “Other” contains all event types except the first ten.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. (Coursera question note)
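The first limitation could be reduced by normalising event-type labels before grouping. The sketch below is an assumption about how that might look, not a step used in this analysis: `clean_evtype` is a hypothetical helper, the sample labels in `raw` are made up, and the single abbreviation rule (TSTM to THUNDERSTORM) is only an illustration of the idea.

```r
# Hypothetical helper: light normalisation of event-type labels
clean_evtype <- function(x) {
  x <- toupper(trimws(x))                                   # unify case, drop outer whitespace
  x <- gsub("\\s+", " ", x, perl = TRUE)                    # collapse repeated spaces
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)   # expand one common abbreviation
  x
}

# Made-up labels for illustration
raw <- c("TSTM WIND", " Thunderstorm Wind", "tstm  wind", "FLOOD")
table(clean_evtype(raw))
```

Grouping on the cleaned labels would merge such variants into one category; a fuller treatment would need a reviewed mapping of event names rather than a single rule.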

Storm Data Analysis

Rishikesh Pillay

27 March 2021