Education for all has been one of the most important goals of the World Health Organization, but not all regions offer equal schooling and educational opportunities. We would like to create elementary education programs in parts of the world with lower literacy rates.
Does gross national income affect literacy rates?
~Analyze which countries/regions have the lowest literacy rates
~Examine the differences between primary school enrollment of males and females
~Analyze GNI vs. literacy rates
I hope that my analytic models will reveal which regions have lower educational levels and literacy rates, and whether or not gross national income affects literacy rates.
Five of the attributes are Literacy Rate, Primary school enrollment male, Primary school enrollment female, GNI (Gross National Income) and Region. The first four are numeric, and Region is a factor with six levels. GNI is potentially relevant because a low literacy rate could be a consequence of a low gross national income: the government and citizens in those regions might not have enough money to send their children to school. Literacy rate will be used to determine which regions could potentially benefit from an educational program.
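As a rough check of the attribute types described above, the sketch below rebuilds a small data frame from the five sample rows printed further down in this report and inspects the class of each column. The name `sample_who` is made up for illustration, and with only five rows the Region factor has four levels rather than the six in the full data.

```r
# Five rows copied from the sample printed later in this report
# (rows 5, 10, 20, 15 and 120). `sample_who` is a hypothetical name.
sample_who <- data.frame(
  Country = c("Angola", "Austria", "Bhutan", "Barbados", "Swaziland"),
  Region = factor(c("Africa", "Europe", "South-East Asia", "Americas", "Africa")),
  LiteracyRate = c(70.1, NA, NA, NA, 87.4),
  GNI = c(5230, 42050, 5570, NA, 5930),
  PrimarySchoolEnrollmentMale = c(93.1, NA, 88.3, NA, NA),
  PrimarySchoolEnrollmentFemale = c(78.2, NA, 91.5, NA, NA)
)

# The five attributes used in this analysis
key_cols <- c("LiteracyRate", "PrimarySchoolEnrollmentMale",
              "PrimarySchoolEnrollmentFemale", "GNI", "Region")

# First class of each selected column
col_types <- sapply(sample_who[key_cols], function(x) class(x)[1])
print(col_types)
```

On the full data the same call would be `sapply(who[key_cols], class)`.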
Below I show five example instances from this dataset: Angola, Austria, Bhutan, Barbados and Swaziland. Each row shows the country name, region, population, population under 15, population over 60, fertility rate, life expectancy, child mortality, cellular subscribers, literacy rate, GNI, primary school enrollment male and primary school enrollment female. The data is in CSV (comma-separated values) format, which is easily imported into programs like R. The first line of the file is a header that names each field, and every following line is one row whose fields are separated by commas. For example, row number 5 is Angola: its value under the Country header is "Angola", and the rest of the row holds Angola's population, literacy rate, GNI and so on.
getwd()
## [1] "C:/Users/Toby/Desktop"
setwd("C:/Users/Toby/Desktop")
who <- read.table("WHO.csv", header = TRUE, sep = ",")
head(who, 5)
## Country Region Population Under15 Over60
## 1 Afghanistan Eastern Mediterranean 29825 47.42 3.82
## 2 Albania Europe 3162 21.33 14.93
## 3 Algeria Africa 38482 27.42 7.17
## 4 Andorra Europe 78 15.20 22.86
## 5 Angola Africa 20821 47.58 3.84
## FertilityRate LifeExpectancy ChildMortality CellularSubscribers
## 1 5.40 60 98.5 54.26
## 2 1.75 74 16.7 96.39
## 3 2.83 73 20.0 98.99
## 4 NA 82 3.2 75.49
## 5 6.10 51 163.5 48.38
## LiteracyRate GNI PrimarySchoolEnrollmentMale
## 1 NA 1140 NA
## 2 NA 8820 NA
## 3 NA 8310 98.2
## 4 NA NA 78.4
## 5 70.1 5230 93.1
## PrimarySchoolEnrollmentFemale
## 1 NA
## 2 NA
## 3 96.4
## 4 79.4
## 5 78.2
str(who)
## 'data.frame': 148 obs. of 13 variables:
## $ Country : Factor w/ 148 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
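One thing worth checking after an import like this is that the number of rows read matches the number of data lines in the file. By default `read.table` treats both `"` and `'` as quote characters, while `read.csv` only treats `"` as one, so country names containing apostrophes can be parsed differently by the two functions. The sketch below uses a made-up three-row file (not the WHO data) to compare them.

```r
# Hypothetical three-row CSV; names and values are for illustration only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country,Value",
             "Cote d'Ivoire,1",
             "Lao People's Democratic Republic,2",
             "Peru,3"), tmp)

# Data lines in the file, excluding the header
rows_in_file <- length(readLines(tmp)) - 1

# read.table: quote = "\"'" by default, so apostrophes act as quotes
with_table <- read.table(tmp, header = TRUE, sep = ",")

# read.csv: quote = "\"" by default, so apostrophes are ordinary characters
with_csv <- read.csv(tmp)

print(c(file = rows_in_file,
        read.table = nrow(with_table),
        read.csv = nrow(with_csv)))
```

If the counts differ on the real file, passing `quote = "\""` to `read.table` or switching to `read.csv` is one possible fix.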
rows <- who[c(5,10,20,15,120),]
rows
## Country Region Population Under15 Over60 FertilityRate
## 5 Angola Africa 20821 47.58 3.84 6.10
## 10 Austria Europe 8464 14.51 23.52 1.44
## 20 Bhutan South-East Asia 742 28.53 6.90 2.32
## 15 Barbados Americas 283 18.99 15.78 1.84
## 120 Swaziland Africa 1231 38.05 5.34 3.48
## LifeExpectancy ChildMortality CellularSubscribers LiteracyRate GNI
## 5 51 163.5 48.38 70.1 5230
## 10 81 4.0 154.78 NA 42050
## 20 67 44.6 65.58 NA 5570
## 15 78 18.4 127.01 NA NA
## 120 50 79.7 63.70 87.4 5930
## PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
## 5 93.1 78.2
## 10 NA NA
## 20 88.3 91.5
## 15 NA NA
## 120 NA NA
hist(who$LiteracyRate, breaks = 10, xlab = "Literacy Rate", main = "World Literacy Rates")
The range of literacy rates makes sense: some countries, such as first-world countries, are highly literate, while others are much less so. Is it reasonable to assume a bell curve for this graph? No, the distribution is not a bell curve. Literacy rate is capped at 100%, and a country where nearly everyone can read sits close to that cap, so the majority should not be expected in the center. There do not seem to be any anomalies in the graph.
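To move from the overall histogram to the question of which regions have the lowest literacy, the rates can be averaged by region. The sketch below does this on the five literacy values visible in the output above (rows 5 to 8 and row 120, with regions decoded from the factor codes shown by `str`); the name `lit` is made up, and five rows are far too few to draw conclusions from.

```r
# Literacy values visible in this report's output; `lit` is a hypothetical name.
lit <- data.frame(
  Region = c("Africa", "Americas", "Americas", "Europe", "Africa"),
  LiteracyRate = c(70.1, 99.0, 97.8, 99.6, 87.4)
)

# Mean literacy rate per region (the formula interface drops NA rows by default)
by_region <- aggregate(LiteracyRate ~ Region, data = lit, FUN = mean)

# Sort so the regions with the lowest literacy come first
by_region <- by_region[order(by_region$LiteracyRate), ]
print(by_region)
```

On the full data the equivalent call would be `aggregate(LiteracyRate ~ Region, data = who, FUN = mean)`.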
hist(who$GNI, breaks = 20, xlab = "Gross National Income", main = "World Gross National Income")
The range of GNI makes sense; I know that the American national income is in the 40000 - 50000 range. The range seems correct, but the distribution is what surprised me. It is not a bell curve; it is skewed to the right, with a long tail toward high incomes. A very high frequency of countries/regions have a very low income. Countries with no GNI value are not counted as zero: hist() leaves NA values out before binning, so the leftmost bars contain only countries with a reported low GNI. The countries with NA may still need to be removed from the data before modeling.
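The question of whether missing values end up in the histogram can be checked directly. The sketch below uses the first five GNI values from the `head(who, 5)` output above, one of which is NA, and compares the number of values binned by `hist` with the number of missing values.

```r
# First five GNI values from head(who, 5); the fourth is missing.
gni <- c(1140, 8820, 8310, NA, 5230)

# How many values are missing
n_missing <- sum(is.na(gni))

# Compute the histogram without drawing it
h <- hist(gni, plot = FALSE)

# Total count across all bins: only the non-missing values are binned
n_binned <- sum(h$counts)
print(c(missing = n_missing, binned = n_binned))

# Rows with a reported GNI, for later modeling
gni_complete <- gni[!is.na(gni)]
```

For the whole data frame, `colSums(is.na(who))` gives the number of missing values per column.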
hist(who$PrimarySchoolEnrollmentMale, breaks = 30, xlab = "Enrollment Rate", main = "Primary School Enrollment for Males")
The range of the variable makes sense. There are countries where primary school enrollment is mandatory. But the frequency is what surprised me: the distribution is skewed to the left, with most countries close to 100% and a tail of lower rates. This fits with many first-world countries having mandatory schooling for younger children.
hist(who$PrimarySchoolEnrollmentFemale, breaks = 20, xlab = "Enrollment Rate", main = "Primary School Enrollment for Females")
The range for this graph is a little confusing. I wonder what the values at the left end of this graph are. They cannot be NA, because hist() leaves NA values out, so they must be countries with a genuinely low reported enrollment. The distribution is not a bell curve and is skewed to the left. It is not as sharply concentrated near 100% as the distribution for males, perhaps because of countries where females are not encouraged to pursue educational opportunities. The anomalies are the few values at the far left of the range.
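Comparing the two histograms by eye only goes so far; a per-country difference makes the male/female comparison more direct. The sketch below uses the six countries whose male and female enrollment both appear in the output above (rows 3, 4, 5, 6, 9 and 20). The vector names are made up, and six countries are not enough to generalize from.

```r
# Enrollment pairs visible in this report's output (rows 3, 4, 5, 6, 9, 20).
male   <- c(98.2, 78.4, 93.1, 91.1, 96.9, 88.3)
female <- c(96.4, 79.4, 78.2, 84.5, 97.5, 91.5)

# Positive gap: male enrollment is higher; negative gap: female is higher
gap <- male - female

mean_gap <- mean(gap)
n_male_higher <- sum(gap > 0)
print(c(mean_gap = mean_gap, male_higher = n_male_higher, countries = length(gap)))
```

On the full data, `who$PrimarySchoolEnrollmentMale - who$PrimarySchoolEnrollmentFemale` gives the same gap, and a paired test such as `t.test(male, female, paired = TRUE)` is one way to judge whether the difference is more than noise.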
hist(log10(who$GNI), breaks = 10, xlab = "log10(Gross National Income)", main = "World Gross National Income (log10 scale)")
Yes, the logarithm of the GNI values is closer to a bell curve and looks more similar to a normal distribution, although the middle of the graph is a little flat. The values at the far left look out of line with the rest of the distribution.
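The effect of the log transformation can also be expressed as a number rather than judged from the histogram. The sketch below defines a simple moment-based skewness (a helper written here for illustration, not a base R function) and applies it to the seven GNI values visible in the output above, before and after taking log10.

```r
# Moment-based sample skewness; `skewness` is a helper defined here,
# not part of base R. Positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# GNI values visible in this report's output
gni <- c(1140, 8820, 8310, 5230, 42050, 5570, 5930)

skew_raw <- skewness(gni)
skew_log <- skewness(log10(gni))
print(c(raw = skew_raw, log10 = skew_log))
```

A value closer to zero after the transformation is consistent with the more symmetric shape seen in the log10 histogram; on the full data the call would be `skewness(who$GNI)` and `skewness(log10(who$GNI))`.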
Predictor Variable: GNI
To be Predicted: Literacy Rates
plot(who$GNI, who$LiteracyRate, xlab= "Gross National Income", ylab = "Literacy Rates", main = "GNI vs Literacy Rates")
myline <- lm(LiteracyRate ~ GNI, data = who)  # rows with NA in either variable are dropped
abline(myline, col = "red")  # draw the fitted line across the whole plot
Yes, the linear regression line captures the overall relation between Gross National Income and Literacy Rates. Although the line does not fit all of the points closely, it does suggest that as gross national income increases, literacy rates do as well.
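Since the GNI histogram looked more symmetric on a log scale, one possible follow-up is to compare a fit on raw GNI with a fit on log10(GNI). The sketch below does this on invented numbers: the `toy` data frame and all of its values are hypothetical and only illustrate the mechanics, so they say nothing about the WHO data itself.

```r
# Hypothetical data for illustration only; these are NOT values from WHO.csv.
toy <- data.frame(
  GNI = c(800, 1500, 3000, 6000, 12000, 25000, 50000, NA),
  LiteracyRate = c(45, 58, 70, 82, 90, 96, 99, 75)
)

# lm() drops rows with NA in any model variable by default
fit_raw <- lm(LiteracyRate ~ GNI, data = toy)
fit_log <- lm(LiteracyRate ~ log10(GNI), data = toy)

# Share of variance explained by each model
r2 <- c(raw = summary(fit_raw)$r.squared,
        log10 = summary(fit_log)$r.squared)
print(r2)

# Slope: change in literacy rate per tenfold increase in GNI
print(coef(fit_log))
```

On the real data the two models would be `lm(LiteracyRate ~ GNI, data = who)` and `lm(LiteracyRate ~ log10(GNI), data = who)`; which one fits better there is something to check rather than assume.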
I believe that this data exploration has helped to analyze the problem, although the initial analysis has only scratched the surface of resolving it. The data is of good quality, but it needs to be cleaned of fields with missing values and blanks. Because the histograms are skewed, a transformation may be needed for some variables; the log transformation already helped for GNI, and the primary enrollment variables for males and females should be checked as well. The linear model is a suitable starting point, but the scatter plot needs to be examined more closely. The data may also need more variables that focus on education, not just primary school enrollment.
~Gross National Income does seem to affect Literacy Rates
~Primary school enrollment appears to be higher for males than for females
~Many more countries and regions than expected have a very low gross national income