1. Download the file into your current working directory.
2. Read the content of the file into an object.
AfghanDisaster <- read.csv("D:/Data Science/R Programming/5 Reproducible Research/Handson/AfghanDisaster.csv", header = TRUE)
str(AfghanDisaster)
## 'data.frame': 630 obs. of 13 variables:
## $ INC_TYPE : Factor w/ 4 levels "Avalanche","Flood",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ INC_DATE : Factor w/ 65 levels "01-02-15","01-03-15",..: 11 24 26 26 31 41 44 52 52 52 ...
## $ PROV_CODE : int 15 15 15 15 15 14 32 13 13 13 ...
## $ PROV_NAME : Factor w/ 29 levels "Badakhshan","Badghis",..: 1 1 1 1 1 21 11 16 16 16 ...
## $ DIST_CODE : int 1527 1515 1525 1525 1515 1404 3201 1301 1314 1303 ...
## $ DIST_NAME : Factor w/ 168 levels "Abkamari","Achin",..: 168 41 142 142 41 116 96 11 115 110 ...
## $ DEAD : int 1 3 3 3 2 0 0 0 0 0 ...
## $ INJURED : int 0 0 5 5 1 0 0 0 0 0 ...
## $ MISSING : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AFF_FAM : int 1 0 8 0 0 8 6 2 1 3 ...
## $ AFF_IND : int 7 3 8 0 0 51 42 17 9 17 ...
## $ HOUSES_DAM: int 0 0 0 0 0 8 6 2 1 3 ...
## $ HOUSES_DES: int 0 0 0 0 0 0 0 0 0 0 ...
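Step 1 places the file in the working directory, so the read can use a path built with file.path() instead of an absolute drive path. The sketch below is only an illustration: it writes a tiny stand-in CSV (the file name AfghanDisaster_sample.csv and its three columns are hypothetical) to a temporary folder and reads it back. The Factor columns in the str() output above suggest strings were converted to factors on import; in R 4.0 and later that has to be requested with stringsAsFactors = TRUE.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV in a temp folder.
csv_path <- file.path(tempdir(), "AfghanDisaster_sample.csv")
writeLines(c("INC_TYPE,INC_DATE,DEAD",
             "Avalanche,01-02-15,1",
             "Flood,08-05-15,NA"), csv_path)

# Build the path with file.path() rather than hard-coding a drive letter.
# stringsAsFactors = TRUE reproduces the Factor columns on R >= 4.0.
sample_df <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)
str(sample_df)
```

With the real file in the working directory, the same call would simply take the bare file name.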
3. Examine the Afghan Disaster Data. Suggest pre-processing steps that can be taken to improve this dataset.
4. Change the data types of the columns in the AfghanDisaster csv into more suitable data types
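One possible answer to questions 3 and 4, sketched on two rows copied from the listing further down: parse INC_DATE as a Date and treat the province code as an identifier rather than a number. The day-month-year reading of values such as "26-04-15" is an assumption based on how the dates look, and the data frame name toy is only for illustration.

```r
# Two rows taken from the NA listing in this document; the real data has 13 columns.
toy <- data.frame(
  INC_TYPE  = c("Landslide and Mudflow", "Flood"),
  INC_DATE  = c("26-04-15", "08-05-15"),
  PROV_CODE = c(16L, 21L),
  stringsAsFactors = TRUE
)

# Assumption: dates are day-month-two-digit-year.
toy$INC_DATE <- as.Date(as.character(toy$INC_DATE), format = "%d-%m-%y")

# Codes are labels, not quantities, so store them as factors.
toy$PROV_CODE <- factor(toy$PROV_CODE)

str(toy)
```

The same pattern would apply to DIST_CODE, while the count columns (DEAD, INJURED and so on) can stay numeric.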
5. How many rows are there in the dataset? How many rows are there without NAs? List the rows with NAs.
Number of rows in the dataset
nrow(AfghanDisaster)
## [1] 630
Number of rows without NAs
nrow(na.omit(AfghanDisaster))
## [1] 614
Rows with NAs (the >= 2 filter below keeps only rows with at least two NAs: 11 of the 16 incomplete rows)
NA_AfghanDisaster <- apply(AfghanDisaster, 1, FUN=function (x) {sum(is.na(x))})
AfghanDisaster[NA_AfghanDisaster>=2,]
##                  INC_TYPE INC_DATE PROV_CODE  PROV_NAME DIST_CODE
## 471 Landslide and Mudflow 26-04-15        16     Takhar      1609
## 472 Landslide and Mudflow 26-04-15        15 Badakhshan      1502
## 477                 Flood 08-05-15        21       Ghor      2105
## 478                 Flood 08-05-15        21       Ghor      2101
## 479                 Flood 08-05-15        21       Ghor      2105
## 534 Landslide and Mudflow 08-05-15        15 Badakhshan      1509
## 538                 Flood 09-05-15        28     Faryab      2803
## 553                 Flood 10-05-15        17     Kunduz      1701
## 554                 Flood 10-05-15        17     Kunduz      1706
## 555                 Flood 10-05-15         9    Baghlan       901
## 556                 Flood 10-05-15        16     Takhar      1601
##          DIST_NAME DEAD INJURED MISSING AFF_FAM AFF_IND HOUSES_DAM
## 471         Rostaq   NA      NA      NA      80     560         NA
## 472 Yaftal-e-Sufla   NA      NA      NA     120     840         NA
## 477        Shahrak   NA      NA      NA       4      28          4
## 478    Chaghcharan   NA      NA      NA      33     231         33
## 479        Shahrak   NA      NA      NA       5      35          5
## 534        Teshkan   NA      NA      NA     216    1512         NA
## 538     Pashtunkot   NA      NA      NA      22     154         22
## 553         Kunduz    1      NA      NA     279    1953        180
## 554       Khanabad   NA      NA      NA      25     175         18
## 555  Pul-e- khumri   NA      NA      NA       7      49          5
## 556        Taloqan   NA      NA      NA      NA      NA         NA
##     HOUSES_DES
## 471         NA
## 472          0
## 477         NA
## 478         NA
## 479         NA
## 534        216
## 538          0
## 553         99
## 554          7
## 555          2
## 556         NA
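The row counts above (630 total, 614 complete) imply 16 rows with at least one NA, while a threshold of two NAs per row lists fewer. A minimal sketch of the difference, using a made-up three-row data frame (toy and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical rows: one complete, one with a single NA, one with two NAs.
toy <- data.frame(
  DEAD    = c(0, 1, NA),
  INJURED = c(0, NA, NA),
  MISSING = c(0, 0, 0)
)

# Count NAs per row; rowSums() on a logical matrix is a vectorised
# alternative to apply(..., 1, function(x) sum(is.na(x))).
na_per_row <- rowSums(is.na(toy))

any_na   <- toy[na_per_row >= 1, ]       # every incomplete row
same_set <- toy[!complete.cases(toy), ]  # equivalent, complements na.omit()
two_plus <- toy[na_per_row >= 2, ]       # only rows with two or more NAs

any_na
two_plus
```

Applied to AfghanDisaster, the first two forms should return as many rows as the difference between nrow() and nrow(na.omit()).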
6. Discuss a viable strategy to replace the data that is missing.
One way to handle values that are missing (recorded as NA) is to replace them with 0. This treats an unreported count as if nothing had been reported.
7. Implement the strategy you suggested: remove all rows with NA, replace all NAs with zero, or replace NAs with the mean.
new_AfghanDisaster <- read.csv("D:/Data Science/R Programming/5 Reproducible Research/Handson/AfghanDisaster.csv", header = TRUE)
new_AfghanDisaster[is.na(new_AfghanDisaster)] <- 0
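Question 7 names three strategies, and the code above applies the zero-fill one. For comparison, here is a minimal sketch of all three on a single made-up column (the values in houses are hypothetical, chosen only so the results are easy to check by hand):

```r
# Hypothetical HOUSES_DES values with two gaps.
houses <- data.frame(HOUSES_DES = c(0, 216, NA, 99, NA))

# Strategy 1: remove every row that contains an NA.
dropped <- na.omit(houses)

# Strategy 2: replace every NA with zero (the approach used in this document).
zeroed <- houses
zeroed[is.na(zeroed)] <- 0

# Strategy 3: replace NAs with the mean of the observed values.
imputed <- houses
col_mean <- mean(imputed$HOUSES_DES, na.rm = TRUE)
imputed$HOUSES_DES[is.na(imputed$HOUSES_DES)] <- col_mean

nrow(dropped)
zeroed$HOUSES_DES
imputed$HOUSES_DES
```

Dropping rows loses the other columns of those incidents, zero-filling pulls totals and averages down, and mean imputation keeps the average but invents counts; which trade-off is acceptable depends on what the NAs actually mean in the source data.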
8. Calculate the average value of HOUSES DESTROYED after you have normalized the data. What can you say about the data?
mean(new_AfghanDisaster$HOUSES_DES)
## [1] 4.684127
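When interpreting this average, it may help to remember that zero-filled NAs count as real zeros in the denominator, and that a few large incidents can dominate a column of mostly zeros. The sketch below uses made-up values (x is hypothetical) to show how the zero-filled mean, the mean of observed values, and the median can differ:

```r
# Hypothetical counts: mostly zeros, one large incident, one missing value.
x <- c(0, 0, 0, 216, NA)

mean_zero_filled <- mean(replace(x, is.na(x), 0))  # NA counted as 0
mean_observed    <- mean(x, na.rm = TRUE)          # NA ignored
median_observed  <- median(x, na.rm = TRUE)        # robust to the outlier

c(zero_filled = mean_zero_filled,
  observed    = mean_observed,
  median      = median_observed)
```

Running summary(new_AfghanDisaster$HOUSES_DES) on the real column would show whether the same pattern holds there.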
9. What type of disaster caused the most injuries?
aggregate(new_AfghanDisaster$INJURED, by = list(INC_TYPE = new_AfghanDisaster$INC_TYPE), FUN = sum)
## INC_TYPE x
## 1 Avalanche 52
## 2 Flood 87
## 3 Heavy Snowfall 11
## 4 Landslide and Mudflow 2
disasterSum <- tapply(new_AfghanDisaster$INJURED, new_AfghanDisaster$INC_TYPE, FUN=sum)
disasterSum
## Avalanche Flood Heavy Snowfall
## 52 87 11
## Landslide and Mudflow
## 2
10. What type of disaster caused the most deaths?
deathSum <- aggregate(DEAD ~ INC_TYPE, new_AfghanDisaster, sum)
deathSum
## INC_TYPE DEAD
## 1 Avalanche 189
## 2 Flood 78
## 3 Heavy Snowfall 16
## 4 Landslide and Mudflow 67
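The answers to questions 9 and 10 can be read off the tables, or extracted with which.max(). The sketch below re-enters the totals printed above by hand so that it runs on its own; the object names death_totals and injury_totals are illustrative only.

```r
# Totals copied from the aggregate() and tapply() output in this document.
death_totals <- data.frame(
  INC_TYPE = c("Avalanche", "Flood", "Heavy Snowfall", "Landslide and Mudflow"),
  DEAD     = c(189, 78, 16, 67)
)
injury_totals <- c("Avalanche" = 52, "Flood" = 87,
                   "Heavy Snowfall" = 11, "Landslide and Mudflow" = 2)

# which.max() returns the position of the largest value.
top_death  <- as.character(death_totals$INC_TYPE[which.max(death_totals$DEAD)])
top_injury <- names(which.max(injury_totals))

top_death   # "Avalanche"
top_injury  # "Flood"
```

On these zero-filled totals, avalanches account for the most deaths and floods for the most injuries.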
11. Assuming each house costs USD 1000, insert a new column to the dataframe you have been using which shows the economic loss.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
AfghanDisaster_loss <- mutate(new_AfghanDisaster, ECO_LOSS = (HOUSES_DAM + HOUSES_DES) * 1000)
str(AfghanDisaster_loss)
## 'data.frame': 630 obs. of 14 variables:
## $ INC_TYPE : Factor w/ 4 levels "Avalanche","Flood",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ INC_DATE : Factor w/ 65 levels "01-02-15","01-03-15",..: 11 24 26 26 31 41 44 52 52 52 ...
## $ PROV_CODE : int 15 15 15 15 15 14 32 13 13 13 ...
## $ PROV_NAME : Factor w/ 29 levels "Badakhshan","Badghis",..: 1 1 1 1 1 21 11 16 16 16 ...
## $ DIST_CODE : int 1527 1515 1525 1525 1515 1404 3201 1301 1314 1303 ...
## $ DIST_NAME : Factor w/ 168 levels "Abkamari","Achin",..: 168 41 142 142 41 116 96 11 115 110 ...
## $ DEAD : num 1 3 3 3 2 0 0 0 0 0 ...
## $ INJURED : num 0 0 5 5 1 0 0 0 0 0 ...
## $ MISSING : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AFF_FAM : num 1 0 8 0 0 8 6 2 1 3 ...
## $ AFF_IND : num 7 3 8 0 0 51 42 17 9 17 ...
## $ HOUSES_DAM: num 0 0 0 0 0 8 6 2 1 3 ...
## $ HOUSES_DES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ ECO_LOSS : num 0 0 0 0 0 8000 6000 2000 1000 3000 ...
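The same column can be computed in base R, with the per-house cost held in a named constant so the assumption is visible. This is a sketch on three rows only; HOUSE_COST_USD and toy are illustrative names, and the house counts are taken from rows shown in this document (the all-zero first row, the row with 8 damaged houses, and row 553 of the NA listing).

```r
HOUSE_COST_USD <- 1000  # assumption stated in question 11

toy <- data.frame(
  HOUSES_DAM = c(0, 8, 180),
  HOUSES_DES = c(0, 0, 99)
)

# Damaged and destroyed houses are costed identically, as in the mutate() call.
toy$ECO_LOSS <- (toy$HOUSES_DAM + toy$HOUSES_DES) * HOUSE_COST_USD
toy

# Without the zero fill, a single NA in either column makes the loss NA.
(NA + 5) * HOUSE_COST_USD
```

The second expression illustrates why the zero-filled data frame, not the raw one, is used for this step.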
12. Plot a graph to show the ordered list of natural disasters that cause the most deaths and the most economic damage.
ecoDamageSum <- aggregate(ECO_LOSS ~ INC_TYPE, AfghanDisaster_loss, sum)
Q12 <- data.frame(deathSum$INC_TYPE, deathSum$DEAD, ecoDamageSum$ECO_LOSS)
library(ggplot2)
plotDeath <- ggplot(data = Q12, aes(x = deathSum.INC_TYPE, y = deathSum.DEAD, color = factor(deathSum.INC_TYPE))) +
  geom_point() +
  labs(color = "Disaster Type", x = "Natural Disaster", y = "Number of Deaths")
plotEcoLoss <- ggplot(data = Q12, aes(x = deathSum.INC_TYPE, y = ecoDamageSum.ECO_LOSS, color = factor(deathSum.INC_TYPE))) +
  geom_point() +
  labs(color = "Disaster Type", x = "Natural Disaster", y = "Economic Loss (USD)") +
  coord_cartesian(ylim = c(0, 12000000))
Plot of the number of deaths for each disaster type
plotDeath
Plot of the economic loss for each disaster type
plotEcoLoss
The plots show that avalanches caused the most deaths, while floods caused the greatest economic loss.
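Question 12 asks for an ordered list, whereas a factor on the x axis is drawn in level order (alphabetical by default). One way to get a ranked axis, sketched here in base R with the death totals re-entered from the output of question 10, is to sort the summary and rebuild the factor with levels in that order; ggplot2 places discrete categories in level order. The names death_totals and ordered_deaths are illustrative.

```r
# Death totals copied from the deathSum output in this document.
death_totals <- data.frame(
  INC_TYPE = c("Avalanche", "Flood", "Heavy Snowfall", "Landslide and Mudflow"),
  DEAD     = c(189, 78, 16, 67),
  stringsAsFactors = FALSE
)

# Sort from most to fewest deaths.
ordered_deaths <- death_totals[order(death_totals$DEAD, decreasing = TRUE), ]

# Fix the factor levels in sorted order so a plot keeps this ranking.
ordered_deaths$INC_TYPE <- factor(ordered_deaths$INC_TYPE,
                                  levels = ordered_deaths$INC_TYPE)

ordered_deaths
levels(ordered_deaths$INC_TYPE)
```

The same steps on ecoDamageSum would give the ranking by economic loss.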