This is my first R Markdown document. Bear with me, and if you find any mistakes, don’t hesitate to contact me. My contact details are at the end of this document.

1 Singapore (SG)

Let’s talk for a moment about the Republic of Singapore / Lonely Planet.

A small nation in Southeast Asia, but a very, very powerful one: it has an economically stable, non-corrupt and business-friendly government. According to Wikipedia, the country’s area is only about 687 km^2, which ranks SG 190th in the world by area and so makes it one of the smallest countries. (It’s a bit bigger than the Isle of Man, but smaller than Bahrain.)

Singapore is probably best known for its food and its strict laws. Both are great, but sadly “unreachable” here in Europe.

1.1 SG’s education

The country’s education system makes the nation highly competitive worldwide. I therefore decided to investigate it a bit by analyzing some of its data. Most of the data were taken from data SG, and I would like to thank the government for making them public (something that many countries, for instance in Africa, often do not do).

My goal is to analyze the expenditure per (non-university) student and the country’s student-teacher ratio.

Many articles have been written about the student-teacher ratio (STR) [1,2]. Their results support the hypothesis that the smaller the ratio, the better for both students and teachers.

NOTE: The Ministry of Education (MOE) takes a slightly different view. Although I (mostly) accept their view, I believe in the above statement too.

While a smaller class size may be intuitively appealing, empirical evidence on the benefits of a smaller class size remains inconclusive. Studies have shown that teacher quality is the most important factor in achieving better student outcomes. Hence, MOE’s focus is on raising the quality of teachers, even as we increase our recruitment of teachers.

source

Now, let’s work with some real data.

Because I am not going to explain each data set (e.g. its source, its quality etc.), the reader is expected to look at the following files on my GitHub:

Gist

1.1.1 Expenditure per Student 1986 - 2012

source("ExpenditurePerStudent_Pupils.R") # load the data
summary(cleanedData.pupils)

##      Years        ExpednInSD  
##  Min.   :1986   Min.   :1404  
##  1st Qu.:1992   1st Qu.:2158  
##  Median :1999   Median :2960  
##  Mean   :1999   Mean   :3449  
##  3rd Qu.:2006   3rd Qu.:4032  
##  Max.   :2012   Max.   :7396

Both are numerical variables.

cor(cleanedData.pupils$Years, cleanedData.pupils$ExpednInSD)

## [1] 0.93823

This correlation indicates a strong, positive linear relationship: as the years go by, the money spent on students tends to increase. (A correlation of 0.94 describes the overall trend; it does not guarantee an increase in every single year.) How significant is this value? Very much so.
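For readers who want to see what cor() actually computes, here is a minimal sketch of Pearson's r written out by hand. The vectors years and spend are made-up illustration values (not the MOE data), and pearson_r is a helper name introduced only for this example.

```r
# Hypothetical illustration data (NOT the real MOE figures)
years <- c(2001, 2002, 2003, 2004, 2005)
spend <- c(3000, 3150, 3400, 3380, 3700)

# Pearson's r: the sum of cross-products of the centred values divided by
# the square root of the product of the two sums of squares
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(years, spend)
r_builtin <- cor(years, spend)  # base R; method = "pearson" is the default

print(c(manual = r_manual, builtin = r_builtin))
```

Both numbers should agree, which shows that the correlation only measures how closely the points follow a straight line.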

cor.test(cleanedData.pupils$Years, cleanedData.pupils$ExpednInSD)

## 
##  Pearson's product-moment correlation
## 
## data:  cleanedData.pupils$Years and cleanedData.pupils$ExpednInSD
## t = 13.5578, df = 25, p-value = 5.027e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8675231 0.9717690
## sample estimates:
##     cor 
## 0.93823

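(The confidence interval above is computed with the Fisher r-to-z transformation.)

If the data were a sample, we could also say that we are 95% confident that the true correlation between the two variables lies between 0.87 and 0.97. Here, however, the data are not a sample.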
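The numbers printed by cor.test() can be reproduced by hand. The sketch below starts only from the reported correlation and the number of observations (27 years, hence df = 25); the variable names are my own choice for this illustration.

```r
# Values reported by cor.test() above
r <- 0.93823   # sample correlation
n <- 27        # number of yearly observations (df = n - 2 = 25)

# t statistic for the null hypothesis "true correlation is 0"
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Fisher r-to-z: z = atanh(r) is approximately normal with
# standard error 1 / sqrt(n - 3)
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm(0.975)                      # two-sided 95% critical value
ci   <- tanh(z + c(-1, 1) * crit * se)    # back-transform to the r scale

print(round(t_stat, 2))
print(round(ci, 3))
```

Up to rounding of the input r, this should give the same t value and the same interval as the output above.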

library(ggplot2) # needed for ggplot(); harmless if a sourced script already loaded it

eps <- ggplot(data=cleanedData.pupils, aes(x=Years, y=ExpednInSD)) +
  geom_point() +
  coord_cartesian(xlim=c(1985, 2013), ylim=c(1104, 8000)) + # zoom 
  labs(x="Years", y= "Expenditure in Singapore dollars", title="") + # labels
  scale_x_continuous(breaks = seq(min(cleanedData.pupils$Years), 
                                  max(cleanedData.pupils$Years), by = 1)) +    
  scale_y_continuous(breaks = seq(min(cleanedData.pupils$ExpednInSD), 
                                  max(cleanedData.pupils$ExpednInSD)+1000, by = 500)) +
  geom_abline(intercept=-389762.30, slope=196.70, color="red")+ # regression line; values from the lm() fit below
  geom_smooth(se = FALSE) # loess smoother [3]

eps

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

summary(lm(cleanedData.pupils$ExpednInSD ~ cleanedData.pupils$Years))

## 
## Call:
## lm(formula = cleanedData.pupils$ExpednInSD ~ cleanedData.pupils$Years)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -857.3 -491.6   -9.0  360.1 1390.0 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -389762.30   29002.89  -13.44 6.11e-13 ***
## cleanedData.pupils$Years     196.70      14.51   13.56 5.03e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 587.2 on 25 degrees of freedom
## Multiple R-squared:  0.8803, Adjusted R-squared:  0.8755 
## F-statistic: 183.8 on 1 and 25 DF,  p-value: 5.029e-13

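OK, so we already know that the relationship is almost linear. The R^2 of 88% only supports that claim: the year alone explains about 88% of the variance in expenditure, and the slope says that spending rose by roughly 197 Singapore dollars per year on average.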
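As a small check of how these numbers hang together, the sketch below uses only values printed in the outputs above. predict_expenditure is a helper name introduced for this example, and because the coefficients are rounded in the printout, the fitted values are only approximate.

```r
# Values reported in the outputs above
r         <- 0.93823      # correlation between Years and ExpednInSD
intercept <- -389762.30   # lm() intercept
slope     <- 196.70       # lm() slope: dollars per additional year

# In a regression with a single predictor, R^2 equals r squared
r_squared <- r^2

# Fitted value of the regression line for a given year
predict_expenditure <- function(year) intercept + slope * year

print(round(r_squared, 4))
print(predict_expenditure(c(1986, 2012)))
```

Comparing the fitted value for 2012 with the observed maximum of 7396 shows where the largest positive residual in the summary comes from.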

1.1.2 Student-teacher ratio 1986-2012

Let’s stop there for a moment and switch to STR.

source("PupilsPerTeacherRatio.R") # load our data
summary(cleanedData.ratio)

##      Years      PupilsPerTeacher
##  Min.   :1986   Min.   :17.70   
##  1st Qu.:1992   1st Qu.:23.05   
##  Median :1999   Median :25.00   
##  Mean   :1999   Mean   :24.03   
##  3rd Qu.:2006   3rd Qu.:25.80   
##  Max.   :2012   Max.   :26.60

Both are again numerical variables. What is far more interesting, however, is the correlation.

cor(cleanedData.ratio$Years, cleanedData.ratio$PupilsPerTeacher)

## [1] -0.8142409

A correlation of -0.81 means that there is a fairly strong negative relationship between the years and the ratio. It also means that we expect the regression line to go down, not up as in the first part. Let’s take a closer look.

ppt <- ggplot(data=cleanedData.ratio, aes(x=Years, y=PupilsPerTeacher)) +
  geom_point() +
  coord_cartesian(xlim=c(1985, 2013), ylim=c(15, 30)) + # zoom 
  labs(x="Years", y= "# of Student per 1 teacher (our ratio)", title="") + # labels
  scale_x_continuous(breaks = seq(min(cleanedData.ratio$Years), 
                                  max(cleanedData.ratio$Years), by = 1)) +    
  scale_y_continuous(breaks = seq(min(cleanedData.ratio$PupilsPerTeacher), 
                                  max(cleanedData.ratio$PupilsPerTeacher)+5, by = 2)) +
  geom_abline(intercept=557.70279, slope=-0.26697, color="red")+
  geom_smooth(se = FALSE) #[3]

ppt

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
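The red lines in both plots use an intercept and a slope typed in by hand. A minimal sketch of how they could be taken from a fitted model instead is shown here; the data frame toy is hypothetical (it only has the same column names as cleanedData.ratio), so its numbers are not the real ones.

```r
# Hypothetical data shaped like cleanedData.ratio (NOT the real figures)
toy <- data.frame(
  Years            = 2001:2006,
  PupilsPerTeacher = c(25.1, 24.8, 24.0, 23.1, 22.5, 21.9)
)

# Formula plus data argument keeps the coefficient names short
fit <- lm(PupilsPerTeacher ~ Years, data = toy)
b   <- coef(fit)   # named vector: "(Intercept)" and "Years"

# For a single predictor, the slope equals r * sd(y) / sd(x),
# which is why a negative correlation gives a falling line
slope_check <- cor(toy$Years, toy$PupilsPerTeacher) *
  sd(toy$PupilsPerTeacher) / sd(toy$Years)

print(b)
print(slope_check)

# With ggplot2 loaded, the line could then be drawn without hard-coded numbers:
# geom_abline(intercept = b[["(Intercept)"]], slope = b[["Years"]], color = "red")
```

This way the plotted line cannot drift away from the model if the data are updated.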

1.1.3 Combining 2 measures in one graph

So far so good. BTW, we are lucky that the data can be combined in one graph: we put the years on the X-axis and the two measures on a left and a right Y-axis. You know, kind of three axes. Hadley doesn’t like that, and therefore he didn’t include the option in ggplot2. That’s sad.

One possible solution would be to rescale the values so that both measures can be plotted on a single Y-axis, for example with a log scale. Here, however, the difference is so big (expenditure runs into the thousands, while the ratio stays below 30) that a log scale helps only a bit.
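To illustrate the scale problem, the sketch below uses only the minimum, median and maximum values from the two summaries above. Min-max rescaling is my own suggestion here, not something the sourced scripts are known to do, and rescale01 is a helper name introduced for this example.

```r
# Minimum and maximum of each series, taken from the summaries above
expenditure <- c(min = 1404, max = 7396)   # dollars per student
ratio       <- c(min = 17.7, max = 26.6)   # pupils per teacher

# A log10 scale narrows the gap, but the two ranges still do not overlap
print(round(log10(expenditure), 2))
print(round(log10(ratio), 2))

# Min-max rescaling maps any series onto the interval [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

print(rescale01(c(1404, 2960, 7396)))  # min, median and max expenditure
print(rescale01(c(17.7, 25.0, 26.6)))  # min, median and max ratio
```

Rescaled series share one axis, but they lose their units, so the axis would have to be labelled as a relative scale.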

source("BigTable.R")

1.1.4 Number of classes and their average size

I have also decided to include one more statistic. Therefore, let’s look at our last measurement: the number of classes and the average number of students per class.

source("ClassSize.R")

noc <- ggplot(data=cleanedData.numberOfClases)
noc <- noc + geom_point(aes(x=Year, y=AllNumbersForPrimarySchools))+
  geom_point(aes(x=Year, y=AllNumbersForSecondarySchools))+
  coord_cartesian(xlim=c(min(cleanedData.numberOfClases$Year)-1, max(cleanedData.numberOfClases$Year)+1), 
                  ylim=c(4500, 9000)) + # zoom 
  labs(title="", x="Years", y="Number of classes for primary and secondary schools")+
  scale_x_continuous(breaks = seq(min(cleanedData.numberOfClases$Year)-1, 
                                  max(cleanedData.numberOfClases$Year)+1, by = 2))+
  annotate("text", label = "Primary are those up; secondary are those down", x = 1997, y = 6300, size = 5, colour = "red")

noc

What we can clearly see is that there were far more classes in primary schools than in secondary schools. Why? We cannot infer that from this graph; however, we may see it a bit later.

acs <- ggplot(data=cleanedData.avarageClassSize)
acs <- acs + geom_point(aes(x=Year, y=AllNumbersForPrimarySchools, colour="Primary schools"))+
  geom_point(aes(x=Year, y=AllNumbersForSecondarySchools, colour="Secondary schools"))+
  coord_cartesian(xlim=c(min(cleanedData.avarageClassSize$Year)-1, 
                         max(cleanedData.avarageClassSize$Year)+1), ylim=c(31, 39))+
  labs(title="", x="Years", y="Average class size for primary and secondary schools")+
  scale_x_continuous(breaks = seq(min(cleanedData.avarageClassSize$Year)-1,
                                  max(cleanedData.avarageClassSize$Year)+1, by = 2))+
  theme(legend.title=element_blank())+ # [4]
  guides(colour = guide_legend(override.aes = list(size=4)))

acs

mean(cleanedData.avarageClassSize$AllNumbersForPrimarySchools) # Primary

## [1] 35.97097

mean(cleanedData.avarageClassSize$AllNumbersForSecondarySchools) # Secondary

## [1] 34.66129

What can be inferred from this? It is actually very interesting: SG has decreased its class size in primary schools, but it is still higher than in secondary schools. Since 2004, however, there has been a very sharp decrease.

What is also very interesting is the following statement from MOE.

Most primary and secondary schools have classes of 40 students or fewer, while Primary 1 and 2 classes have 30 students or fewer. We plan on the basis of 30 students per class at primary 1 and 2 and 40 students per class at the other primary and secondary levels.

And their target? For Europe, that would be almost like something from a different world.

Thus, while the Ministry does not mandate targets for class size, we plan to improve the pupil teacher ratio from 18 and 15 at the primary and secondary levels to 16 and 13 by 2015 when the Education Service grows to 33,000 Education Officers.

What do the last two plots tell us? Looking only at primary schools, there may be a relationship between the number of classes and the average class size: because there are fewer students per class, more classes (and hours of teaching) are needed to educate the future generation. Furthermore, there is a notion that pupils do many activities outside school and therefore learn at school from professionals. This changes rapidly once they are in secondary school, where each student sits over homework until 01:00 AM.
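The arithmetic behind "fewer students per class means more classes" can be sketched in a few lines. The class sizes of 40 and 30 come from the MOE statement quoted above; the enrolment figure is purely hypothetical and chosen only to keep the numbers easy to follow.

```r
# Hypothetical enrolment, chosen only to make the arithmetic easy to follow
enrolment <- 240000

# Planning class sizes quoted by MOE above
class_size <- c(other_levels = 40, primary_1_2 = 30)

# Number of classes needed if every class were filled to the planning size
classes_needed <- ceiling(enrolment / class_size)

print(classes_needed)
```

For the same number of pupils, moving from 40 to 30 students per class requires a third more classes, which is the kind of link between the two plots suggested above.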

1.1.5 Thanks and let me know!

That’s all for this time. I hope that I didn’t make any mistakes; if I did, please let me know. This is my first time posting to the RPubs community. I hope you enjoyed reading it.

1.1.6 Bibliography

[1] Karen Akerhielm, Does class size matter?, Economics of Education Review, Volume 14, Issue 3, September 1995, Pages 229-241, ISSN 0272-7757, http://dx.doi.org/10.1016/0272-7757(95)00004-4

[2] McDonald, Gael. “Does Size Matter? The Impact Of Student-Staff Ratios.” Journal Of Higher Education Policy And Management 35.6 (2013): 652-667. ERIC. Web. 31 Oct. 2014.

[3] https://hopstat.wordpress.com/2014/10/30/my-commonly-done-ggplot2-graphs/

[4] http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/

1.1.7 About the Author

I am a student from Germany who is interested in Asia (and its small nations).

Analysis of Singapore’s Education using three datasets

Dmitrij Petrov

1.11.2014