W.H.O.

This dataset contains recent statistics from the World Health Organization (WHO).

Reading the data file
* Make sure the file is saved in the working directory.

WHO <- read.csv("WHO.csv")

This reads the dataset in WHO.csv into the data frame WHO.
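If read.csv() cannot open the file, the file is most likely not in the working directory. A minimal check, assuming the file is named WHO.csv as above (the message text is only illustrative):

```r
# Show the directory that R resolves relative paths against.
getwd()

# TRUE only if WHO.csv sits in that directory.
file_found <- file.exists("WHO.csv")

if (!file_found) {
  # Either move the file there or point R at its folder with setwd().
  message("WHO.csv not found in ", getwd())
}
```

setwd() changes the working directory for the rest of the session, so it is usually called once at the top of a script.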

Now, let's look at our data.

str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...

As we can see, we have 194 observations of 13 variables, of type factor, int and num.
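Whether Country and Region show up as factors depends on how the file is read. From R 4.0.0 onward, read.csv() keeps text columns as character unless stringsAsFactors = TRUE is passed. The sketch below uses a small stand-in file built from the first five rows displayed by str(WHO), so it runs without WHO.csv; the temporary path and object names are assumptions for illustration.

```r
# Five rows as displayed by str(WHO); a stand-in so the example runs on its own.
sample_rows <- data.frame(
  Country    = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  Region     = c("Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa"),
  Population = c(29825, 3162, 38482, 78, 20821)
)
csv_path <- tempfile(fileext = ".csv")  # hypothetical path, not WHO.csv itself
write.csv(sample_rows, csv_path, row.names = FALSE)

# Default read: text columns are character in R >= 4.0.0.
default_read <- read.csv(csv_path)
class(default_read$Region)

# Explicitly asking for factors gives the Factor columns seen in str(WHO).
factor_read <- read.csv(csv_path, stringsAsFactors = TRUE)
class(factor_read$Region)
nlevels(factor_read$Region)
```

With factor columns, summary() and table() report counts per level; with character columns, summary() only reports the length and class.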

Variables are:
* Country - Name of the country.
* Region - Region the country is in.
* Population - Population in thousands.
* Under15 - Percentage of population under 15.
* Over60 - Percentage of population over 60.
* FertilityRate - Average number of children per woman.
* LifeExpectancy - Average life expectancy in years.
* ChildMortality - Number of children who die by age 5 per 1000 births.
* CellularSubscribers - Number of cellular subscribers per 100 population (can exceed 100).
* LiteracyRate - Literacy rate among adults aged 15 and over.
* GNI - Gross national income per capita.
* PrimarySchoolEnrollmentMale - Percentage of male children enrolled in primary school.
* PrimarySchoolEnrollmentFemale - Percentage of female children enrolled in primary school.

Another way to look at our data is through the summary function.

summary(WHO)
##                 Country                      Region     Population     
##  Afghanistan        :  1   Africa               :46   Min.   :      1  
##  Albania            :  1   Americas             :35   1st Qu.:   1696  
##  Algeria            :  1   Eastern Mediterranean:22   Median :   7790  
##  Andorra            :  1   Europe               :53   Mean   :  36360  
##  Angola             :  1   South-East Asia      :11   3rd Qu.:  24535  
##  Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000  
##  (Other)            :188                                               
##     Under15          Over60      FertilityRate   LifeExpectancy 
##  Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00  
##  1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00  
##  Median :28.65   Median : 8.53   Median :2.400   Median :72.50  
##  Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01  
##  3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00  
##  Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00  
##                                  NA's   :11                     
##  ChildMortality    CellularSubscribers  LiteracyRate        GNI       
##  Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340  
##  1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335  
##  Median : 18.600   Median : 97.75      Median :91.80   Median : 7870  
##  Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321  
##  3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558  
##  Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440  
##                    NA's   :10          NA's   :91      NA's   :32     
##  PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
##  Min.   : 37.20              Min.   : 32.50               
##  1st Qu.: 87.70              1st Qu.: 87.30               
##  Median : 94.70              Median : 95.10               
##  Mean   : 90.85              Mean   : 89.63               
##  3rd Qu.: 98.10              3rd Qu.: 97.90               
##  Max.   :100.00              Max.   :100.00               
##  NA's   :93                  NA's   :93

It gives a numeric summary (minimum, quartiles, mean, maximum and the number of NA's) for each numeric variable; for factor variables it gives the count of each level.
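The NA's rows in the summary matter because most numeric functions return NA when any value is missing. A short sketch, using the first five rows as displayed by str(WHO) as stand-in data (who_sample is a name introduced here for illustration):

```r
# First five rows as displayed by str(WHO); stand-in data for illustration.
who_sample <- data.frame(
  Country       = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  FertilityRate = c(5.4, 1.75, 2.83, NA, 6.1),
  LiteracyRate  = c(NA, NA, NA, NA, 70.1),
  GNI           = c(1140, 8820, 8310, NA, 5230)
)

# Count the missing values in every column at once.
na_counts <- colSums(is.na(who_sample))
na_counts

# A single NA makes the mean NA unless missing values are dropped.
mean_with_na <- mean(who_sample$GNI)
mean_without_na <- mean(who_sample$GNI, na.rm = TRUE)
```

Running colSums(is.na(WHO)) on the full data frame should give the same NA counts that summary(WHO) prints.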

Subsetting Data

Create a new data frame with only the countries in Europe

WHO_Europe <- subset(WHO, Region=="Europe")
str(WHO_Europe)
## 'data.frame':    53 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Population                   : int  3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
##  $ Under15                      : num  21.3 15.2 20.3 14.5 22.2 ...
##  $ Over60                       : num  14.93 22.86 14.06 23.52 8.24 ...
##  $ FertilityRate                : num  1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
##  $ LifeExpectancy               : int  74 82 71 81 71 71 80 76 74 77 ...
##  $ ChildMortality               : num  16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
##  $ CellularSubscribers          : num  96.4 75.5 103.6 154.8 108.8 ...
##  $ LiteracyRate                 : num  NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
##  $ GNI                          : num  8820 NA 6100 42050 8960 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
##  $ PrimarySchoolEnrollmentFemale: num  NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...

As we can see, all 53 observations from the Europe region are saved in the WHO_Europe data frame.
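Note that str(WHO_Europe) still reports 194 levels for Country and 6 for Region: subset() keeps the rows that match but does not discard unused factor levels. A minimal sketch of dropping them with droplevels(), on stand-in data taken from the first five rows displayed by str(WHO) (object names here are illustrative):

```r
# Stand-in data: first five countries and their regions.
who_sample <- data.frame(
  Country = factor(c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola")),
  Region  = factor(c("Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa"))
)

europe <- subset(who_sample, Region == "Europe")
levels_before <- nlevels(europe$Region)   # unused levels are still attached

europe <- droplevels(europe)
levels_after <- nlevels(europe$Region)    # only levels that actually occur

# Bracket indexing selects the same rows as subset().
europe_bracket <- who_sample[who_sample$Region == "Europe", ]
```

Dropping unused levels keeps later table() and boxplot() calls from showing empty categories.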

Now, let's save this new data frame, WHO_Europe, to a CSV file

write.csv(WHO_Europe, "WhoEurope.csv")
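By default write.csv() also writes the row names as an unnamed first column, which comes back as a column called X when the file is read again. A small sketch of the difference, using two rows as stand-in data and temporary file paths chosen only for illustration:

```r
# Stand-in for WHO_Europe: two rows are enough to show the effect.
europe <- data.frame(
  Country    = c("Albania", "Andorra"),
  Population = c(3162, 78)
)

with_rownames    <- tempfile(fileext = ".csv")  # hypothetical output paths
without_rownames <- tempfile(fileext = ".csv")

write.csv(europe, with_rownames)                        # default: row names included
write.csv(europe, without_rownames, row.names = FALSE)  # data columns only

cols_default <- names(read.csv(with_rownames))
cols_clean   <- names(read.csv(without_rownames))
cols_default
cols_clean
```

Passing row.names = FALSE is usually what is wanted when the row names are just row numbers.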

Exploring WHO dataset

Average percentage of the population under age 15

mean(WHO$Under15)
## [1] 28.73242

Standard deviation of the percentage of the population under age 15

sd(WHO$Under15)
## [1] 10.53457

Summary of a single variable

summary(WHO$Under15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.12   18.72   28.65   28.73   37.75   49.99

The output tells us that there is a country where only about 13% of the population is under age 15.
Let's find out which country that is.

which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe

So, Japan has the lowest percentage of population under age 15. The which.max function can be used in the same way to find the highest.
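The two steps (find the row index, then look up the country) can be combined into one expression. A sketch on the first five rows as displayed by str(WHO); the results therefore describe only these five rows, not the full dataset:

```r
# First five rows as displayed by str(WHO); stand-in data for illustration.
who_sample <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  Under15 = c(47.4, 21.3, 27.4, 15.2, 47.6)
)

# which.max()/which.min() return a row index, used here directly to pick the country.
highest_under15 <- who_sample$Country[which.max(who_sample$Under15)]
lowest_under15  <- who_sample$Country[which.min(who_sample$Under15)]
highest_under15
lowest_under15
```

On the full data frame the same pattern would be WHO$Country[which.max(WHO$Under15)].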

Now, let's make a table of the Region variable to count the countries in each region

table(WHO$Region)
## 
##                Africa              Americas Eastern Mediterranean 
##                    46                    35                    22 
##                Europe       South-East Asia       Western Pacific 
##                    53                    11                    27

Tables work well for variables with only a few possible values.

Now, let's explore numeric variables using the tapply function.
tapply splits the first argument into groups defined by the second argument, then applies the function given as the third argument to each group.

tapply(WHO$Over60, WHO$Region, mean, na.rm=TRUE)
##                Africa              Americas Eastern Mediterranean 
##              5.220652             10.943714              5.620000 
##                Europe       South-East Asia       Western Pacific 
##             19.774906              8.769091             10.162963

This shows the average percentage of the population over age 60 in each region.
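To make the split-then-apply behaviour concrete, here is a sketch on the first five rows as displayed by str(WHO), small enough to check by hand. aggregate() is shown as one possible alternative that returns a data frame; it is not part of the original walkthrough.

```r
# First five rows as displayed by str(WHO); stand-in data for illustration.
who_sample <- data.frame(
  Region = c("Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa"),
  Over60 = c(3.82, 14.93, 7.17, 22.86, 3.84)
)

# Split Over60 by Region, then take the mean of each group.
region_means <- tapply(who_sample$Over60, who_sample$Region, mean, na.rm = TRUE)
region_means

# The same grouping with aggregate(), which returns a data frame instead of a vector.
region_table <- aggregate(Over60 ~ Region, data = who_sample, FUN = mean)
region_table
```

Europe's value here is the mean of 14.93 and 22.86, and Africa's is the mean of 7.17 and 3.84.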

INVESTIGATING DATA THROUGH PLOTTING

Let's create a scatterplot of GNI vs FertilityRate

plot(WHO$GNI, WHO$FertilityRate, xlab = "Avg Gross Income per capita", ylab = "Avg Fertility Rate", main = "GNI vs FERTILITY RATE")

From the plot, we can see that in most countries FertilityRate tends to fall as GNI rises.
However, in a few countries both GNI and FertilityRate are high.
Let's identify those countries.

We will create a subset named 'Outliers' of the countries where GNI > 10000 and FertilityRate > 2.5

Outliers <- subset(WHO, GNI>10000 & FertilityRate>2.5)

To count the number of countries in the Outliers data frame, we can use the nrow() function

nrow(Outliers)
## [1] 7

We only want a few variables from Outliers (Country, GNI and FertilityRate),
so we pass a vector with the names of the variables we want in the output

Outliers[c("Country","GNI","FertilityRate" )]
##               Country   GNI FertilityRate
## 23           Botswana 14550          2.71
## 56  Equatorial Guinea 25620          5.04
## 63              Gabon 13740          4.18
## 83             Israel 27110          2.92
## 88         Kazakhstan 11250          2.52
## 131            Panama 14510          2.52
## 150      Saudi Arabia 24700          2.76

This shows that Equatorial Guinea stands out: it has the highest fertility rate of the seven (5.04) together with a high GNI per capita (25620).
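Rather than reading the table by eye, the rows can be ranked with order(). The sketch below rebuilds the seven rows printed above so that it runs on its own; the lowercase object names are introduced here for illustration.

```r
# The seven rows printed above, rebuilt so the example is self-contained.
outliers <- data.frame(
  Country = c("Botswana", "Equatorial Guinea", "Gabon", "Israel",
              "Kazakhstan", "Panama", "Saudi Arabia"),
  GNI = c(14550, 25620, 13740, 27110, 11250, 14510, 24700),
  FertilityRate = c(2.71, 5.04, 4.18, 2.92, 2.52, 2.52, 2.76)
)

# Sort by fertility rate, highest first.
by_fertility <- outliers[order(-outliers$FertilityRate), ]
by_fertility

# Country with the highest GNI among the seven.
richest <- outliers$Country[which.max(outliers$GNI)]
richest
```

Sorting makes it easier to see that the highest-fertility and highest-income countries in this subset are not the same one.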

Different types of plots in R and how to read them.

  • HISTOGRAM
hist(WHO$CellularSubscribers, xlab = "Cellular Subscribers", ylab = "Frequency(Count)", main = "Histogram of Cellular Subscribers")

The X axis shows the number of cellular subscribers per 100 population, and the Y axis shows the frequency (count) of countries in each bin.
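The counts behind the bars can be inspected without drawing anything by passing plot = FALSE. A sketch on the first five CellularSubscribers values as displayed by str(WHO); the bin width of 50 is an arbitrary choice for illustration, not the default R would pick.

```r
# First five CellularSubscribers values as displayed by str(WHO).
subscribers <- c(54.3, 96.4, 99, 75.5, 48.4)

# Bins of width 50 from 0 to 200; plot = FALSE returns the numbers instead of drawing.
h <- hist(subscribers, breaks = seq(0, 200, by = 50), plot = FALSE)

h$breaks  # bin edges
h$counts  # number of values falling in each bin
```

Each count corresponds to the height of one bar in the plotted histogram.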

  • BOXPLOT
    A boxplot is useful for understanding the statistical range of a variable.
    Boxplot of LifeExpectancy grouped by Region
boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab="", ylab="Life Expectancy", main="Life Expectancy of countries by Region")

This boxplot shows how life expectancy varies across the countries within each region.
* The box for each region spans the first to the third quartile.
* The line inside the box marks the median.
* The dashed lines (called whiskers) extend to the smallest and largest values that are not outliers.
* By default, the whisker limits are the first quartile minus 1.5 times the interquartile range and the third quartile plus 1.5 times the interquartile range.
* Outliers, plotted as circles, are the values lying outside those limits.
* INTERQUARTILE RANGE: the height of the box, i.e. the difference between the third quartile and the first quartile.
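The numbers a boxplot draws can be computed directly with boxplot.stats(), which uses a default multiplier of 1.5 times the box height. The sketch below uses the first ten LifeExpectancy values as displayed by str(WHO), pooled across regions purely to illustrate the arithmetic.

```r
# First ten LifeExpectancy values as displayed by str(WHO).
life <- c(60, 74, 73, 82, 51, 75, 76, 71, 82, 81)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
# out:   values that would be drawn as circles.
box <- boxplot.stats(life)
box$stats
box$out

# Reproduce the whisker limits by hand from the hinges.
hinges <- fivenum(life)[c(2, 4)]            # lower and upper hinge
box_height <- diff(hinges)                  # interquartile range as drawn
lower_limit <- hinges[1] - 1.5 * box_height
upper_limit <- hinges[2] + 1.5 * box_height
```

Here 51 falls below the lower limit of 56, so it is reported as an outlier and the lower whisker stops at 60. boxplot() uses hinges from fivenum(), which can differ slightly from the quartiles returned by quantile().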

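  • PIRATEPLOT
library(yarrr)  # pirateplot() is provided by the yarrr package
pirateplot(formula = LifeExpectancy ~ Region, data = WHO)  # LifeExpectancy by Region is an assumed choice of variables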

Functions Used

Ctrl + Alt + I - shortcut to create an R chunk
echo = TRUE - prints the code in the R Markdown output; echo = FALSE hides it
getwd()
setwd()
read.csv("___.csv")
str() - shows structure of data
subset()
summary()
write.csv()
mean()
which.min()
which.max()
table() - counts the number of observations for each value

PLOT FUNCTIONS
plot(xaxis, yaxis)
xlab, ylab - axis labels
main - plot title