This dataset contains recent statistics from the World Health Organization (WHO).
Reading the data file
* We need to make sure the file is saved in the working directory.
WHO <- read.csv("WHO.csv")
This reads the dataset in WHO.csv into the data frame WHO.
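Before calling read.csv, it can help to confirm that the file is where R expects it. The sketch below is only an illustration: toy_who.csv and its three rows are made up and stand in for the real WHO.csv. It also passes stringsAsFactors = TRUE, on the assumption that the Factor columns in the str output below came from an R version older than 4.0, where that was the default. In R 4.0 and later, text columns are read as character unless factors are requested.

```r
# Hypothetical toy file standing in for WHO.csv (values are made up)
toy <- data.frame(
  Country = c("A", "B", "C"),
  Region  = c("Europe", "Africa", "Europe"),
  Under15 = c(15.2, 47.4, 21.3)
)
toy_path <- file.path(tempdir(), "toy_who.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Check that the file exists before reading it
file.exists(toy_path)

# stringsAsFactors = TRUE converts text columns to factors
who_toy <- read.csv(toy_path, stringsAsFactors = TRUE)
str(who_toy)
```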
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
As we can see, we have 194 observations and 13 variables of type factor, int and num.
The variables are:
* Country - Name of the country.
* Region - Region the country is in.
* Population - Population in thousands.
* Under15 - Percentage of population under 15.
* Over60 - Percentage of population over 60.
* FertilityRate - Average number of children per woman.
* LifeExpectancy - Average life expectancy in years.
* ChildMortality - The number of children who die by age 5 per 1000 births.
* CellularSubscribers - Percentage of cellular subscribers, i.e. the number of cellular subscribers per 100 people.
* LiteracyRate - Literacy rate among adults aged 15 and over.
* GNI - Gross National Income per capita.
* PrimarySchoolEnrollmentMale - Percentage of male children enrolled in school.
* PrimarySchoolEnrollmentFemale - Percentage of female children enrolled in school.
Another way to look at our data is through the summary function.
summary(WHO)
## Country Region Population
## Afghanistan : 1 Africa :46 Min. : 1
## Albania : 1 Americas :35 1st Qu.: 1696
## Algeria : 1 Eastern Mediterranean:22 Median : 7790
## Andorra : 1 Europe :53 Mean : 36360
## Angola : 1 South-East Asia :11 3rd Qu.: 24535
## Antigua and Barbuda: 1 Western Pacific :27 Max. :1390000
## (Other) :188
## Under15 Over60 FertilityRate LifeExpectancy
## Min. :13.12 Min. : 0.81 Min. :1.260 Min. :47.00
## 1st Qu.:18.72 1st Qu.: 5.20 1st Qu.:1.835 1st Qu.:64.00
## Median :28.65 Median : 8.53 Median :2.400 Median :72.50
## Mean :28.73 Mean :11.16 Mean :2.941 Mean :70.01
## 3rd Qu.:37.75 3rd Qu.:16.69 3rd Qu.:3.905 3rd Qu.:76.00
## Max. :49.99 Max. :31.92 Max. :7.580 Max. :83.00
## NA's :11
## ChildMortality CellularSubscribers LiteracyRate GNI
## Min. : 2.200 Min. : 2.57 Min. :31.10 Min. : 340
## 1st Qu.: 8.425 1st Qu.: 63.57 1st Qu.:71.60 1st Qu.: 2335
## Median : 18.600 Median : 97.75 Median :91.80 Median : 7870
## Mean : 36.149 Mean : 93.64 Mean :83.71 Mean :13321
## 3rd Qu.: 55.975 3rd Qu.:120.81 3rd Qu.:97.85 3rd Qu.:17558
## Max. :181.600 Max. :196.41 Max. :99.80 Max. :86440
## NA's :10 NA's :91 NA's :32
## PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
## Min. : 37.20 Min. : 32.50
## 1st Qu.: 87.70 1st Qu.: 87.30
## Median : 94.70 Median : 95.10
## Mean : 90.85 Mean : 89.63
## 3rd Qu.: 98.10 3rd Qu.: 97.90
## Max. :100.00 Max. :100.00
## NA's :93 NA's :93
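It gives us a numeric summary of each variable. For variables of type factor, it gives the count of each level, and for variables with missing values it also reports the number of NA's.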
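The NA's entries in the summary output report how many values are missing in a column. If only those counts are needed, one possible shortcut is colSums(is.na(...)). The data frame below is a small made-up example, not the WHO data.

```r
# Made-up data frame with a few missing values (not the WHO data)
toy <- data.frame(
  Region        = factor(c("Europe", "Africa", "Europe", "Americas")),
  FertilityRate = c(1.75, 5.40, NA, 2.20),
  LiteracyRate  = c(NA, NA, 99.6, 97.8)
)

# is.na() marks each missing cell; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```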
Creating a new data frame with only the countries in Europe
WHO_Europe <- subset(WHO, Region=="Europe")
str(WHO_Europe)
## 'data.frame': 53 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Population : int 3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
## $ Under15 : num 21.3 15.2 20.3 14.5 22.2 ...
## $ Over60 : num 14.93 22.86 14.06 23.52 8.24 ...
## $ FertilityRate : num 1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
## $ LifeExpectancy : int 74 82 71 81 71 71 80 76 74 77 ...
## $ ChildMortality : num 16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
## $ CellularSubscribers : num 96.4 75.5 103.6 154.8 108.8 ...
## $ LiteracyRate : num NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
## $ GNI : num 8820 NA 6100 42050 8960 ...
## $ PrimarySchoolEnrollmentMale : num NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
## $ PrimarySchoolEnrollmentFemale: num NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
As we can see, all 53 observations for the Europe region are saved in the WHO_Europe data frame.
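Note that the str output above still reports 194 levels for Country and 6 levels for Region, because subset keeps the unused factor levels. If that matters (for example, in a later table call), droplevels can remove them. A minimal sketch on made-up data:

```r
# Made-up data frame (not the WHO data)
toy <- data.frame(
  Country = factor(c("A", "B", "C", "D")),
  Region  = factor(c("Europe", "Africa", "Europe", "Americas"))
)

toy_europe <- subset(toy, Region == "Europe")
nlevels(toy_europe$Region)   # unused levels are still attached

# droplevels() removes factor levels that no longer occur in the data
toy_europe <- droplevels(toy_europe)
nlevels(toy_europe$Region)
```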
Now, let's save this new data frame, i.e. WHO_Europe, to a CSV file.
write.csv(WHO_Europe, "WhoEurope.csv")
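By default, write.csv also writes the row names (here, the original row numbers) as an unnamed first column. If that column is not wanted, row.names = FALSE leaves it out. The sketch below uses a made-up data frame and a temporary file so that it can be run anywhere.

```r
# Made-up data frame whose row names are "2" and "4", as after a subset
toy <- data.frame(Country = c("A", "B"), Over60 = c(14.9, 22.9),
                  row.names = c("2", "4"))
path <- file.path(tempdir(), "toy_europe.csv")

# row.names = FALSE keeps the row names out of the file
write.csv(toy, path, row.names = FALSE)

# Reading the file back shows only the two real columns
names(read.csv(path))
```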
Average percentage of the population under age 15
mean(WHO$Under15)
## [1] 28.73242
Standard deviation of the percentage of the population under age 15
sd(WHO$Under15)
## [1] 10.53457
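Under15 has no missing values, so mean and sd return a number directly. For a column that contains NA, such as FertilityRate, both functions return NA unless na.rm = TRUE is passed. The vector below copies the first ten FertilityRate values from the str(WHO) output above.

```r
# First ten FertilityRate values as shown in the str(WHO) output
fertility <- c(5.4, 1.75, 2.83, NA, 6.1, 2.12, 2.2, 1.74, 1.89, 1.44)

mean(fertility)                 # NA, because one value is missing
mean(fertility, na.rm = TRUE)   # mean of the nine observed values
sd(fertility, na.rm = TRUE)     # standard deviation of the observed values
```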
Summary of a single variable
summary(WHO$Under15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.12 18.72 28.65 28.73 37.75 49.99
The output tells us that there is a country where only about 13% of the population is under age 15.
Let's find out which country that is.
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
So, Japan has the lowest percentage of population under age 15. In the same way, we can use the which.max function to find the country with the highest percentage.
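As a sketch of the which.max counterpart, the lookup can also be written in a single expression. The data frame here is made up for illustration; on the real data the same pattern would be WHO$Country[which.max(WHO$Under15)].

```r
# Made-up data frame (not the WHO data)
toy <- data.frame(
  Country = c("A", "B", "C"),
  Under15 = c(13.1, 49.9, 28.6)
)

# which.max() returns the row index of the largest value
idx <- which.max(toy$Under15)
toy$Country[idx]

# Same idea in a single step for the smallest value
toy$Country[which.min(toy$Under15)]
```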
Now, let's make a table of the Region variable to count the countries in each region.
table(WHO$Region)
##
## Africa Americas Eastern Mediterranean
## 46 35 22
## Europe South-East Asia Western Pacific
## 53 11 27
Tables work well for variables with only a few possible values.
Now, let's explore numeric variables using the tapply function.
The ‘tapply’ function splits the first argument into groups defined by the second argument, then applies the function given as the third argument to each group.
tapply(WHO$Over60, WHO$Region, mean, na.rm=TRUE)
## Africa Americas Eastern Mediterranean
## 5.220652 10.943714 5.620000
## Europe South-East Asia Western Pacific
## 19.774906 8.769091 10.162963
This shows us the average percentage of the population over age 60 in each region.
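To make the split-and-apply behaviour concrete, here is a small sketch on made-up numbers. Extra arguments after the function name, such as na.rm = TRUE, are passed through to that function.

```r
# Made-up values (not the WHO data)
over60 <- c(20, 22, 5, 7, NA, 10)
region <- c("Europe", "Europe", "Africa", "Africa", "Africa", "Americas")

# Split over60 by region, then take the mean of each group,
# ignoring missing values
group_means <- tapply(over60, region, mean, na.rm = TRUE)
group_means
```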
Let's create a scatterplot of GNI vs FertilityRate.
plot(WHO$GNI, WHO$FertilityRate, xlab = "Avg Gross Income per capita", ylab = "Avg Fertility Rate", main = "GNI vs FERTILITY RATE")
From the plot, we can see that in most countries FertilityRate tends to be lower when GNI is higher.
However, for a few countries both GNI and FertilityRate are high.
Let’s identify those countries.
We will create a subset named ‘Outliers’ for countries where GNI > 10000 and FertilityRate > 2.5.
Outliers <- subset(WHO, GNI>10000 & FertilityRate>2.5)
Now, to count the number of countries in the Outliers data frame, we can use the nrow() function.
nrow(Outliers)
## [1] 7
Since we only want a few variables from Outliers (i.e. Country, GNI, FertilityRate), we pass a vector with the names of the variables we want in the output.
Outliers[c("Country","GNI","FertilityRate" )]
## Country GNI FertilityRate
## 23 Botswana 14550 2.71
## 56 Equatorial Guinea 25620 5.04
## 63 Gabon 13740 4.18
## 83 Israel 27110 2.92
## 88 Kazakhstan 11250 2.52
## 131 Panama 14510 2.52
## 150 Saudi Arabia 24700 2.76
This shows us that Equatorial Guinea combines a high GNI per capita with the highest fertility rate among these countries.
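To rank the outliers rather than read them off by eye, the rows can be sorted with order. The sketch below retypes the seven rows printed above into a data frame (named outliers here, in lower case, to keep it separate from the real one) so that it runs on its own. With the real data, the same order call could be applied to Outliers directly.

```r
# The seven rows from the Outliers output above
outliers <- data.frame(
  Country = c("Botswana", "Equatorial Guinea", "Gabon", "Israel",
              "Kazakhstan", "Panama", "Saudi Arabia"),
  GNI = c(14550, 25620, 13740, 27110, 11250, 14510, 24700),
  FertilityRate = c(2.71, 5.04, 4.18, 2.92, 2.52, 2.52, 2.76)
)

# Sort by FertilityRate, highest first
by_fertility <- outliers[order(outliers$FertilityRate, decreasing = TRUE), ]
by_fertility

# Sort by GNI, highest first, and show the top two countries
by_gni <- outliers[order(outliers$GNI, decreasing = TRUE), ]
by_gni$Country[1:2]
```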
hist(WHO$CellularSubscribers, xlab = "Cellular Subscribers", ylab = "Frequency(Count)", main = "Histogram of Cellular Subscribers")
The histogram shows the value of cellular subscribers on the x-axis and the frequency (count) on the y-axis.
boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab="", ylab="Life Expectancy", main="Life Expectancy of countries by Region")
This boxplot shows how life expectancy varies across the countries in each region.
* The box for each region spans the range between the first and third quartiles.
* The middle line marks the median value.
* The dashed lines (called whiskers) show the range from the minimum to the maximum value, excluding any outliers.
* By default, the limits for the whiskers are set by adding 1.5 times the interquartile range to the third quartile and subtracting 1.5 times the interquartile range from the first quartile. Each whisker ends at the most extreme data point inside these limits.
* The outliers, plotted as circles, are the points lying outside these limits.
* Interquartile range (IQR): the height of the box, i.e. the difference between the third quartile and the first quartile.
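The whisker and outlier rules can be checked numerically with boxplot.stats, which boxplot uses to compute these numbers. The vector below is made up for illustration. Note that boxplot.stats works with hinges, which can differ slightly from the quartiles returned by quantile.

```r
# Made-up values with one unusually high observation
x <- c(50, 52, 54, 55, 56, 58, 60, 90)

bs <- boxplot.stats(x, coef = 1.5)   # coef = 1.5 is the default

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Limits: 1.5 times the box height beyond each hinge
box_height  <- bs$stats[4] - bs$stats[2]
upper_fence <- bs$stats[4] + 1.5 * box_height
lower_fence <- bs$stats[2] - 1.5 * box_height

# Points outside the limits are reported as outliers
bs$out
```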
PIRATEPLOT
pirateplot(WHO$)
Ctrl + Alt + I - shortcut to create an R chunk
echo = TRUE - prints the code in R Markdown; use echo = FALSE to hide it
getwd()
setwd()
read.csv("___.csv")
str() - shows structure of data
subset()
summary()
write.csv()
mean()
which.min()
which.max()
table() - counts the number of observations for each value
PLOT FUNCTIONS
plot(xaxis, yaxis)
xlab, ylab - for naming the axes
main - for the title