1. Download the file into your current working directory.

2. Read the content of the file into an object.

AfghanDisaster <- read.csv("D:/Data Science/R Programming/5 Reproducible Research/Handson/AfghanDisaster.csv", header = TRUE)

str(AfghanDisaster)
## 'data.frame':    630 obs. of  13 variables:
##  $ INC_TYPE  : Factor w/ 4 levels "Avalanche","Flood",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ INC_DATE  : Factor w/ 65 levels "01-02-15","01-03-15",..: 11 24 26 26 31 41 44 52 52 52 ...
##  $ PROV_CODE : int  15 15 15 15 15 14 32 13 13 13 ...
##  $ PROV_NAME : Factor w/ 29 levels "Badakhshan","Badghis",..: 1 1 1 1 1 21 11 16 16 16 ...
##  $ DIST_CODE : int  1527 1515 1525 1525 1515 1404 3201 1301 1314 1303 ...
##  $ DIST_NAME : Factor w/ 168 levels "Abkamari","Achin",..: 168 41 142 142 41 116 96 11 115 110 ...
##  $ DEAD      : int  1 3 3 3 2 0 0 0 0 0 ...
##  $ INJURED   : int  0 0 5 5 1 0 0 0 0 0 ...
##  $ MISSING   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AFF_FAM   : int  1 0 8 0 0 8 6 2 1 3 ...
##  $ AFF_IND   : int  7 3 8 0 0 51 42 17 9 17 ...
##  $ HOUSES_DAM: int  0 0 0 0 0 8 6 2 1 3 ...
##  $ HOUSES_DES: int  0 0 0 0 0 0 0 0 0 0 ...

3. Examine the Afghan Disaster Data. Suggest pre-processing steps that can be taken to improve this dataset.

4. Change the data types of the columns in the AfghanDisaster csv into more suitable data types
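
A minimal sketch for this step, assuming the dates are day-month-year strings (values such as 26-04-15 suggest this) and treating the province codes as categorical identifiers rather than quantities. The small `sample_df` data frame is hypothetical and only mirrors a few values from the output in this document.

```r
# Hypothetical three-row sample mirroring columns of AfghanDisaster
sample_df <- data.frame(
  INC_TYPE  = c("Landslide and Mudflow", "Flood", "Flood"),
  INC_DATE  = c("26-04-15", "08-05-15", "10-05-15"),
  PROV_CODE = c(16L, 21L, 17L),
  DEAD      = c(NA, NA, 1L),
  stringsAsFactors = TRUE
)

# Dates are read as factors; convert via character, assuming dd-mm-yy
sample_df$INC_DATE <- as.Date(as.character(sample_df$INC_DATE),
                              format = "%d-%m-%y")

# Codes identify provinces, so store them as factors instead of integers
sample_df$PROV_CODE <- factor(sample_df$PROV_CODE)

str(sample_df)
```

The same two conversions could be applied to AfghanDisaster itself, with DIST_CODE handled like PROV_CODE.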

5. How many rows are there in the dataset? How many rows are there without NAs? List the rows with NAs.

Number of rows in the dataset

nrow(AfghanDisaster)
## [1] 630

Number of rows without NAs

nrow(na.omit(AfghanDisaster))
## [1] 614

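Rows with NAs. The filter used here (>= 2) lists only rows with two or more NAs. The counts above (630 - 614) imply that 16 rows contain at least one NA, so a threshold of > 0 would list them all.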

NA_AfghanDisaster <-  apply(AfghanDisaster, 1, FUN=function (x) {sum(is.na(x))})
AfghanDisaster[NA_AfghanDisaster>=2,]
##                  INC_TYPE INC_DATE PROV_CODE  PROV_NAME DIST_CODE
## 471 Landslide and Mudflow 26-04-15        16     Takhar      1609
## 472 Landslide and Mudflow 26-04-15        15 Badakhshan      1502
## 477                 Flood 08-05-15        21       Ghor      2105
## 478                 Flood 08-05-15        21       Ghor      2101
## 479                 Flood 08-05-15        21       Ghor      2105
## 534 Landslide and Mudflow 08-05-15        15 Badakhshan      1509
## 538                 Flood 09-05-15        28     Faryab      2803
## 553                 Flood 10-05-15        17     Kunduz      1701
## 554                 Flood 10-05-15        17     Kunduz      1706
## 555                 Flood 10-05-15         9    Baghlan       901
## 556                 Flood 10-05-15        16     Takhar      1601
##          DIST_NAME DEAD INJURED MISSING AFF_FAM AFF_IND HOUSES_DAM
## 471         Rostaq   NA      NA      NA      80     560         NA
## 472 Yaftal-e-Sufla   NA      NA      NA     120     840         NA
## 477        Shahrak   NA      NA      NA       4      28          4
## 478    Chaghcharan   NA      NA      NA      33     231         33
## 479        Shahrak   NA      NA      NA       5      35          5
## 534        Teshkan   NA      NA      NA     216    1512         NA
## 538     Pashtunkot   NA      NA      NA      22     154         22
## 553         Kunduz    1      NA      NA     279    1953        180
## 554       Khanabad   NA      NA      NA      25     175         18
## 555  Pul-e- khumri   NA      NA      NA       7      49          5
## 556        Taloqan   NA      NA      NA      NA      NA         NA
##     HOUSES_DES
## 471         NA
## 472          0
## 477         NA
## 478         NA
## 479         NA
## 534        216
## 538          0
## 553         99
## 554          7
## 555          2
## 556         NA
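
The per-row NA count can also be computed with `rowSums(is.na(...))`, which avoids the row-wise `apply()`. A sketch on a hypothetical four-row data frame (`toy` and its values are made up for illustration), showing how the threshold changes which rows are listed:

```r
toy <- data.frame(
  DEAD    = c(1, NA, NA, 0),
  INJURED = c(0, NA, 2, 0),
  MISSING = c(0, NA, 0, 0)
)

# Number of NA cells in each row
na_per_row <- rowSums(is.na(toy))

rows_any_na   <- toy[na_per_row > 0, ]   # at least one NA
rows_two_plus <- toy[na_per_row >= 2, ]  # two or more NAs

nrow(rows_any_na)
nrow(toy) - nrow(na.omit(toy))  # same count, derived from na.omit()
```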

6. Discuss a viable strategy to replace the data that is missing.

One way to handle data that is missing or recorded as NA is to replace it with 0.

7. Implement the strategy you suggested: remove all rows with NA, replace all NAs with zero, or replace NAs with the mean.

new_AfghanDisaster <- read.csv("D:/Data Science/R Programming/5 Reproducible Research/Handson/AfghanDisaster.csv", header = TRUE)

new_AfghanDisaster[is.na(new_AfghanDisaster)] <- 0
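
The question names three strategies. The sketch below applies each one to a hypothetical numeric vector (`houses_des`, with made-up values) to show how the choice changes the resulting mean. It is an illustration under those assumptions, not a recommendation for this dataset.

```r
houses_des <- c(0, 10, NA, 20, NA)

# Strategy 1: drop the missing observations
dropped <- houses_des[!is.na(houses_des)]

# Strategy 2: replace NA with 0
zeroed <- houses_des
zeroed[is.na(zeroed)] <- 0

# Strategy 3: replace NA with the mean of the observed values
imputed <- houses_des
imputed[is.na(imputed)] <- mean(houses_des, na.rm = TRUE)

mean(dropped)  # 10
mean(zeroed)   # 6
mean(imputed)  # 10
```

Zero-filling pulls the mean down because it treats unreported values as confirmed zeros. Mean imputation leaves the mean unchanged but understates the variance.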

8. Calculate the average value of HOUSES DESTROYED after you have normalized the data. What can you say about the data?

mean(new_AfghanDisaster$HOUSES_DES)
## [1] 4.684127

9. Which type of disaster caused the most injuries?

aggregate(new_AfghanDisaster$INJURED , by=list(INC_TYPE=new_AfghanDisaster$INC_TYPE), FUN=sum)
##                INC_TYPE  x
## 1             Avalanche 52
## 2                 Flood 87
## 3        Heavy Snowfall 11
## 4 Landslide and Mudflow  2
disasterSum <- tapply(new_AfghanDisaster$INJURED, new_AfghanDisaster$INC_TYPE, FUN=sum)
disasterSum
##             Avalanche                 Flood        Heavy Snowfall 
##                    52                    87                    11 
## Landslide and Mudflow 
##                     2

With NAs counted as zero, floods caused the most injuries (87), followed by avalanches (52).

10. Which type of disaster caused the most deaths?

deathSum <-aggregate(DEAD ~ INC_TYPE, new_AfghanDisaster, sum)
deathSum
##                INC_TYPE DEAD
## 1             Avalanche  189
## 2                 Flood   78
## 3        Heavy Snowfall   16
## 4 Landslide and Mudflow   67
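
To pull out the answer programmatically instead of reading it off the table, `which.max()` can be applied to the per-type totals. This sketch rebuilds the death totals shown above as a named vector (`death_totals` is an illustrative name, not an object from the analysis).

```r
death_totals <- c(
  "Avalanche"             = 189,
  "Flood"                 = 78,
  "Heavy Snowfall"        = 16,
  "Landslide and Mudflow" = 67
)

# Name of the disaster type with the largest total
deadliest <- names(which.max(death_totals))
deadliest

# Full ranking, highest first
ranking <- sort(death_totals, decreasing = TRUE)
ranking
```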

11. Assuming each house costs USD 1000, insert a new column to the dataframe you have been using which shows the economic loss.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
AfghanDisaster_loss <-mutate(new_AfghanDisaster, ECO_LOSS = (HOUSES_DAM + HOUSES_DES)* 1000)

str(AfghanDisaster_loss)
## 'data.frame':    630 obs. of  14 variables:
##  $ INC_TYPE  : Factor w/ 4 levels "Avalanche","Flood",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ INC_DATE  : Factor w/ 65 levels "01-02-15","01-03-15",..: 11 24 26 26 31 41 44 52 52 52 ...
##  $ PROV_CODE : int  15 15 15 15 15 14 32 13 13 13 ...
##  $ PROV_NAME : Factor w/ 29 levels "Badakhshan","Badghis",..: 1 1 1 1 1 21 11 16 16 16 ...
##  $ DIST_CODE : int  1527 1515 1525 1525 1515 1404 3201 1301 1314 1303 ...
##  $ DIST_NAME : Factor w/ 168 levels "Abkamari","Achin",..: 168 41 142 142 41 116 96 11 115 110 ...
##  $ DEAD      : num  1 3 3 3 2 0 0 0 0 0 ...
##  $ INJURED   : num  0 0 5 5 1 0 0 0 0 0 ...
##  $ MISSING   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AFF_FAM   : num  1 0 8 0 0 8 6 2 1 3 ...
##  $ AFF_IND   : num  7 3 8 0 0 51 42 17 9 17 ...
##  $ HOUSES_DAM: num  0 0 0 0 0 8 6 2 1 3 ...
##  $ HOUSES_DES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ECO_LOSS  : num  0 0 0 0 0 8000 6000 2000 1000 3000 ...
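
The same column can be added without dplyr. A base R sketch on a hypothetical two-row data frame, with the USD 1000 house cost held in a named constant (`HOUSE_COST_USD` and `toy_loss` are illustrative names) so the assumption is visible in one place:

```r
HOUSE_COST_USD <- 1000  # cost per house, as assumed in the question

toy_loss <- data.frame(HOUSES_DAM = c(8, 0), HOUSES_DES = c(0, 3))

# transform() evaluates the expression using the data frame's columns
toy_loss <- transform(toy_loss,
                      ECO_LOSS = (HOUSES_DAM + HOUSES_DES) * HOUSE_COST_USD)
toy_loss
```

Like the `mutate()` call above, this values a damaged house the same as a destroyed one.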

12. Plot a graph to show the ordered list of natural disasters that cause the most deaths and the most economic damage.

ecoDamageSum <- aggregate(ECO_LOSS ~ INC_TYPE, AfghanDisaster_loss, sum)


Q12 <- data.frame(deathSum$INC_TYPE, deathSum$DEAD,ecoDamageSum$ECO_LOSS)

library(ggplot2)

plotDeath <- ggplot(data = Q12, aes(x = deathSum.INC_TYPE, y = deathSum.DEAD, color = factor(deathSum.INC_TYPE))) +
  geom_point() +
  labs(color = "Disaster Type", x = "Natural Disaster", y = "Number of Deaths")

plotEcoLoss <- ggplot(data = Q12, aes(x = deathSum.INC_TYPE, y = ecoDamageSum.ECO_LOSS, color = factor(deathSum.INC_TYPE))) +
  geom_point() +
  labs(color = "Disaster Type", x = "Natural Disaster", y = "Economic Loss (USD)") +
  coord_cartesian(ylim = c(0, 12000000))

Plot of the number of deaths for each disaster type

plotDeath

Plot of the economic loss for each disaster type

plotEcoLoss

The plots show that avalanches caused the most deaths, while floods caused the highest economic loss.
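
The plots above place the disaster types in alphabetical order, which is the default for factor levels. To get the ordered view the question asks for, one option is to reorder the factor levels by the plotted value first. A sketch using base graphics and the death totals from question 10 (`plot_df` is an illustrative name); a factor releveled this way would also control the x-axis order in ggplot2.

```r
plot_df <- data.frame(
  INC_TYPE = c("Avalanche", "Flood", "Heavy Snowfall", "Landslide and Mudflow"),
  DEAD     = c(189, 78, 16, 67)
)

# Sort rows by deaths, highest first, then fix the factor levels in that order
plot_df <- plot_df[order(plot_df$DEAD, decreasing = TRUE), ]
plot_df$INC_TYPE <- factor(plot_df$INC_TYPE, levels = plot_df$INC_TYPE)

barplot(plot_df$DEAD,
        names.arg = as.character(plot_df$INC_TYPE),
        ylab = "Number of deaths", las = 2, cex.names = 0.7)
```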