Title: Course Project 2 R Markdown

Synopsis

This R Markdown document contains an analysis of the damage caused by severe weather events (natural calamities) in the US. The damage is analysed in terms of two themes: impact on population health and impact on the economy. The impact on population health is assessed with two variables, fatalities and injuries, while the impact on the economy is assessed with another two variables, crop damage and property damage. The results are shown in a single figure of four plots giving the top ten event types for each measure of negative impact. Overall, tornadoes appear to cause the most damage on every variable, so proactive steps are required to minimise the damage from tornadoes.

1. Data Pre-Processing

1.1 Installing All Required Libraries

The following command uses a simple approach to load all the required libraries in a single call. Warnings and package start-up messages are hidden.

# Names of all packages required for this analysis
lib_names <- c("readr","dplyr","tidyr","ggplot2","ggpubr")

# Load every package in one call; require() returns TRUE/FALSE per package
lapply(lib_names, require, character.only = TRUE)
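
In the .Rmd source, hiding the warnings and start-up messages is usually handled through knitr chunk options. A minimal sketch, assuming the document is knitted with knitr, is a global setup call such as:

# Hide warnings and package start-up messages for all subsequent chunks
knitr::opts_chunk$set(message = FALSE, warning = FALSE)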

1.2 Download Required Data to Working Directory

First, we create a destfile variable holding the location where the downloaded data will be stored. Second, we create a url variable with the web link from which the data will be downloaded. Finally, the download.file function downloads the data into the current working directory under the name raw_data_P2.zip. The code output is currently hidden, and the download can take a few minutes because the data file is fairly large. We also set cache = TRUE so that the next time the same chunk runs it does not take long.

        # Destination path in the current working directory
        destfile <- paste(getwd(),"/raw_data_P2.zip", sep = "")
        
        # Source URL for the storm data
        url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        
        # Download the compressed data file
        download.file(url = url, destfile = destfile)
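
As a minimal sketch of the caching idea mentioned above (reusing the names from the chunk above), the download could also be skipped whenever the file is already present in the working directory:

        # Only download if the archive is not already in the working directory
        if (!file.exists(destfile)) {
                download.file(url = url, destfile = destfile)
        }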

1.3 Read Data

Generally, after downloading data in a compressed format we would have to decompress it first. However, the read_csv function from the readr package (part of the tidyverse) can read a compressed CSV file directly, so we read the downloaded file as-is and name the result data1 for further analysis.

data1 <- read_csv("raw_data_P2.zip")
## Rows: 902297 Columns: 37
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
## dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
## lgl  (1): COUNTYENDN
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
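
As the message above suggests, the column-specification printout can be silenced by passing show_col_types = FALSE; a minor variant of the same call:

data1 <- read_csv("raw_data_P2.zip", show_col_types = FALSE)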

1.4 Data Cleaning

A tidy, tabular format is usually considered clean data. Thus, using the as_tibble function from the tidyverse, we convert data1 into a tibble and name the transformed version data2.

data2 <- as_tibble(data1)
data2
## # A tibble: 902,297 x 37
##    STATE__ BGN_DATE  BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##      <dbl> <chr>     <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
##  1       1 4/18/195~ 0130     CST           97 MOBILE     AL    TORNA~         0
##  2       1 4/18/195~ 0145     CST            3 BALDWIN    AL    TORNA~         0
##  3       1 2/20/195~ 1600     CST           57 FAYETTE    AL    TORNA~         0
##  4       1 6/8/1951~ 0900     CST           89 MADISON    AL    TORNA~         0
##  5       1 11/15/19~ 1500     CST           43 CULLMAN    AL    TORNA~         0
##  6       1 11/15/19~ 2000     CST           77 LAUDERDALE AL    TORNA~         0
##  7       1 11/16/19~ 0100     CST            9 BLOUNT     AL    TORNA~         0
##  8       1 1/22/195~ 0900     CST          123 TALLAPOOSA AL    TORNA~         0
##  9       1 2/13/195~ 2000     CST          125 TUSCALOOSA AL    TORNA~         0
## 10       1 2/13/195~ 2000     CST           57 FAYETTE    AL    TORNA~         0
## # ... with 902,287 more rows, and 28 more variables: BGN_AZI <chr>,
## #   BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>, COUNTY_END <dbl>,
## #   COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>,
## #   LENGTH <dbl>, WIDTH <dbl>, F <dbl>, MAG <dbl>, FATALITIES <dbl>,
## #   INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>,
## #   CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>,
## #   LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>, ...
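
Only a few of the 37 columns are needed for the questions below. As an optional sketch (not part of the original analysis, and the object name is illustrative), the tibble could be trimmed to just those variables with dplyr's select:

# Optional: keep only the variables used in the analysis that follows
data2_small <- data2 %>%
        select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)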

2. Main Data Processing and Analysis

This part contains our main analysis of the relevant variables in the dataset.

2.1 Most Harmful Events to Public Health

The following code extracts the ten event types associated with the most fatalities and injuries.

# Total fatalities by event type: keep the ten largest and plot as a bar chart
Fatalities <- data2 %>% group_by(EVTYPE) %>% summarise(FATALITIES=sum(FATALITIES)) %>% arrange(desc(FATALITIES)) %>%
        head(10) %>% ggplot(aes(x=FATALITIES,y=EVTYPE)) + geom_col() + 
        ggtitle("FATALITIES") 

# Total injuries by event type: keep the ten largest and plot as a bar chart
Injuries <- data2 %>% group_by(EVTYPE) %>% summarise(INJURIES=sum(INJURIES)) %>% arrange(desc(INJURIES)) %>%
        head(10) %>% ggplot(aes(x=INJURIES,y=EVTYPE)) + geom_col() + 
        ggtitle("INJURIES")

# Place the two population-health plots side by side
pop_h <- ggarrange(Fatalities, Injuries,
          labels = c("A","B"),ncol=2,nrow=1)

2.2 Most Costly Events to the Economy

The following code extracts the ten event types causing the greatest losses to the economy. The variables assessed against event type are crop damage and property damage.

# Total property damage by event type: keep the ten largest and plot as a bar chart
Property_dmg <- data2 %>% group_by(EVTYPE) %>% summarise(PROPDMG=sum(PROPDMG)) %>% arrange(desc(PROPDMG)) %>%
        head(10) %>% ggplot(aes(x=PROPDMG,y=EVTYPE)) + geom_col() + 
        ggtitle("PROPERTY DAMAGE") 

# Total crop damage by event type: keep the ten largest and plot as a bar chart
Crop_dmg <- data2 %>% group_by(EVTYPE) %>% summarise(CROPDMG=sum(CROPDMG)) %>% arrange(desc(CROPDMG)) %>%
        head(10) %>% ggplot(aes(x=CROPDMG,y=EVTYPE)) + geom_col() + 
        ggtitle("CROP DAMAGE")

# Place the two economic-damage plots side by side
eco_dmg <- ggarrange(Crop_dmg,Property_dmg, labels = c("C","D"),ncol = 2,nrow = 1)
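
Note that PROPDMG and CROPDMG are raw figures accompanied by the exponent columns PROPDMGEXP and CROPDMGEXP (codes such as "K", "M" and "B" for thousands, millions and billions). The plots above sum the raw figures as reported; a hedged sketch of how the property damage could instead be scaled to approximate dollar amounts (not part of the original analysis, and the names are illustrative):

# Map the common exponent codes to numeric multipliers; other codes default to 1
exp_to_num <- function(e) {
        dplyr::case_when(
                toupper(e) == "K" ~ 1e3,
                toupper(e) == "M" ~ 1e6,
                toupper(e) == "B" ~ 1e9,
                TRUE              ~ 1
        )
}

# Approximate property damage in dollars per event type
prop_cost <- data2 %>%
        mutate(PROP_COST = PROPDMG * exp_to_num(PROPDMGEXP)) %>%
        group_by(EVTYPE) %>%
        summarise(PROP_COST = sum(PROP_COST)) %>%
        arrange(desc(PROP_COST))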

3. Results

Using the ggarrange function we combine all four plots into a single figure.

# Add a section title above each pair of panels
pop_h1 <- annotate_figure(pop_h, top = text_grob("Population Health", color = "red", size = 20, face = "bold"))
eco_dmg1 <- annotate_figure(eco_dmg, top = text_grob("Economic Damage", color = "red", size = 20, face = "bold"))

# Stack the population-health and economic-damage panels into one figure
ggarrange(pop_h1, eco_dmg1, ncol = 1, nrow = 2)
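
If a standalone image of the combined figure is also needed, ggsave() from ggplot2 can export it; a minimal sketch (the file name and dimensions are illustrative):

# Save the combined panel to disk; the plot is passed explicitly for clarity
final_fig <- ggarrange(pop_h1, eco_dmg1, ncol = 1, nrow = 2)
ggsave("storm_damage_panels.png", plot = final_fig, width = 12, height = 10)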

4. Conclusion

This data analysis was undertaken to identify the most dangerous events (natural calamities) with negative consequences for public health and the economy. The identified events should help the US government make effective policies and response strategies against these events.

THANK YOU!