This is my first R Markdown document. Please bear with me, and if you find any mistakes, don’t hesitate to contact me. My contact details are at the end of this document.
Let’s talk for a moment about the Republic of Singapore (see also Lonely Planet).
It is a small nation in Southeast Asia, but a very, very powerful one: it has an economically stable, non-corrupt and business-friendly government. According to Wikipedia, the country’s area is just about 687 km^2, which places SG at roughly 190th in the world by area, i.e. among the smallest countries. (It’s a bit bigger than the Isle of Man, but smaller than Bahrain.)
Singapore is probably best known for its food and its strict laws. Both are great and, sadly, “unreachable” here in Europe.
The country’s education system makes the nation highly competitive worldwide. Therefore, I decided to investigate it a bit by analyzing some of its data. Most of the data were taken from data SG, and I would like to thank the government for making them public (something that countries in Africa often oppose [sic!]).
My goal is to analyze expenditure on (non-university) students and the country’s student-teacher ratio.
Regarding the student-teacher ratio (STR), many articles have been written on the topic [1,2]. Their results support the hypothesis that the smaller the ratio, the better for both students and teachers.
NOTE: The Ministry of Education (MOE) holds a somewhat different view. Although I (mostly) accept their view, I believe in the above statement too.
While a smaller class size may be intuitively appealing, empirical evidence on the benefits of a smaller class size remains inconclusive. Studies have shown that teacher quality is the most important factor in achieving better student outcomes. Hence, MOE’s focus is on raising the quality of teachers, even as we increase our recruitment of teachers.
Now, let’s work with some real data.
Because I am not going to explain each data set (e.g. its source, quality etc.), the reader is expected to look at the following files on my GitHub:
source("ExpenditurePerStudent_Pupils.R") # load the data
summary(cleanedData.pupils)
## Years ExpednInSD
## Min. :1986 Min. :1404
## 1st Qu.:1992 1st Qu.:2158
## Median :1999 Median :2960
## Mean :1999 Mean :3449
## 3rd Qu.:2006 3rd Qu.:4032
## Max. :2012 Max. :7396
Both are numerical variables.
cor(cleanedData.pupils$Years, cleanedData.pupils$ExpednInSD)
## [1] 0.93823
This correlation indicates a strong, positive linear relationship: over this period, the money spent per student tended to increase from one year to the next. Correlation alone does not prove anything, so how significant is this value? Very much so.
cor.test(cleanedData.pupils$Years, cleanedData.pupils$ExpednInSD)
##
## Pearson's product-moment correlation
##
## data: cleanedData.pupils$Years and cleanedData.pupils$ExpednInSD
## t = 13.5578, df = 25, p-value = 5.027e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8675231 0.9717690
## sample estimates:
## cor
## 0.93823
(The confidence interval is based on the Fisher r-to-z transformation.)
If these data were a sample, we would also be 95% confident that the true correlation between the two variables lies between 0.87 and 0.97. Here, however, they are not a sample.
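To make the test output less of a black box, here is a minimal sketch that recomputes the t statistic and the confidence interval by hand. It uses only the numbers printed by cor.test() above (r = 0.93823, df = 25); the variable names are mine, and the approach (Fisher r-to-z with a normal approximation) is the textbook one, which I assume is what cor.test() does for Pearson correlations.

```r
# Values copied from the cor.test() output above
r  <- 0.93823   # Pearson correlation between Years and ExpednInSD
df <- 25        # degrees of freedom reported by cor.test()
n  <- df + 2    # number of yearly observations (27)

# t statistic for the null hypothesis "true correlation is 0"
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Fisher r-to-z: transform r, build a normal interval, transform back
z    <- atanh(r)                 # r-to-z
se_z <- 1 / sqrt(n - 3)          # standard error on the z scale
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(t_stat, 2)  # 13.56, matching t = 13.5578 above
round(ci, 3)      # 0.868 0.972, matching the reported interval
```

The interval is not symmetric around 0.94 because the back-transformation with tanh() squeezes values as they approach 1.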
library(ggplot2) # plotting package (may already be loaded by the sourced script)
eps <- ggplot(data=cleanedData.pupils, aes(x=Years, y=ExpednInSD)) +
  geom_point() +
  coord_cartesian(xlim=c(1985, 2013), ylim=c(1104, 8000)) + # zoom
  labs(x="Years", y= "Expenditure in Singapore dollars", title="") + # labels
  scale_x_continuous(breaks = seq(min(cleanedData.pupils$Years),
                                  max(cleanedData.pupils$Years), by = 1)) +
  scale_y_continuous(breaks = seq(min(cleanedData.pupils$ExpednInSD),
                                  max(cleanedData.pupils$ExpednInSD)+1000, by = 500)) +
  geom_abline(intercept=-389762.30, slope=196.70, color="red") + # linear fit; coefficients from the lm() summary below
  geom_smooth(se = FALSE) # loess smoother for comparison [3]
eps
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
summary(lm(cleanedData.pupils$ExpednInSD ~ cleanedData.pupils$Years))
##
## Call:
## lm(formula = cleanedData.pupils$ExpednInSD ~ cleanedData.pupils$Years)
##
## Residuals:
## Min 1Q Median 3Q Max
## -857.3 -491.6 -9.0 360.1 1390.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -389762.30 29002.89 -13.44 6.11e-13 ***
## cleanedData.pupils$Years 196.70 14.51 13.56 5.03e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 587.2 on 25 degrees of freedom
## Multiple R-squared: 0.8803, Adjusted R-squared: 0.8755
## F-statistic: 183.8 on 1 and 25 DF, p-value: 5.029e-13
OK, so we already know that the relationship is almost linear. An R^2 of 88% only supports that claim: a straight line accounts for about 88% of the variance in expenditure.
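As a quick sanity check, the regression output can be tied back to the correlation computed earlier. The sketch below uses only numbers printed above; predict_expenditure() is a hypothetical helper of mine, and because the coefficients are rounded in the printout, its results are approximate.

```r
# Values copied from the cor() and summary(lm()) outputs above
r         <- 0.93823      # correlation between Years and ExpednInSD
intercept <- -389762.30   # (Intercept) estimate
slope     <- 196.70       # estimate for Years: SGD per student per year

# With a single predictor, R-squared is simply the squared correlation
r_squared <- r^2

# Hypothetical helper: fitted expenditure per student for a given year
predict_expenditure <- function(year) intercept + slope * year

round(r_squared, 4)                                    # 0.8803
predict_expenditure(2012)                              # about 5998 SGD
predict_expenditure(2012) - predict_expenditure(2011)  # the slope, 196.7 SGD
```

In other words, the fitted line says that expenditure per student grew by roughly 197 SGD a year. The residuals (up to about 1390 SGD) show that individual years can still sit well away from the line.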
Let’s stop there for a moment and switch to STR.
source("PupilsPerTeacherRatio.R") # load our data
summary(cleanedData.ratio)
## Years PupilsPerTeacher
## Min. :1986 Min. :17.70
## 1st Qu.:1992 1st Qu.:23.05
## Median :1999 Median :25.00
## Mean :1999 Mean :24.03
## 3rd Qu.:2006 3rd Qu.:25.80
## Max. :2012 Max. :26.60
Again, both are numerical variables. What is far more interesting, however, is the correlation.
cor(cleanedData.ratio$Years, cleanedData.ratio$PupilsPerTeacher)
## [1] -0.8142409
A value of -0.81 means that there is a fairly strong negative relationship between the years and the ratio. It also means that we would expect our regression line to go down, not up as in the first part. Let’s take a closer look.
ppt <- ggplot(data=cleanedData.ratio, aes(x=Years, y=PupilsPerTeacher)) +
  geom_point() +
  coord_cartesian(xlim=c(1985, 2013), ylim=c(15, 30)) + # zoom
  labs(x="Years", y= "# of students per 1 teacher (our ratio)", title="") + # labels
  scale_x_continuous(breaks = seq(min(cleanedData.ratio$Years),
                                  max(cleanedData.ratio$Years), by = 1)) +
  scale_y_continuous(breaks = seq(min(cleanedData.ratio$PupilsPerTeacher),
                                  max(cleanedData.ratio$PupilsPerTeacher)+5, by = 2)) +
  geom_abline(intercept=557.70279, slope=-0.26697, color="red") + # linear fit (red line)
  geom_smooth(se = FALSE) # loess smoother for comparison [3]
ppt
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
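To put numbers on the red line, here is a small sketch that evaluates it at both ends of the period. The intercept and slope are copied from the geom_abline() call above; I am assuming they come from a linear model of PupilsPerTeacher on Years, which is not shown here, and fitted_ratio() is a hypothetical helper.

```r
# Coefficients copied from the geom_abline() call above
intercept <- 557.70279
slope     <- -0.26697   # change in pupils per teacher per year

# Hypothetical helper: value of the red line for a given year
fitted_ratio <- function(year) intercept + slope * year

fitted_ratio(c(1986, 2012))  # roughly 27.5 and 20.6 pupils per teacher
slope * (2012 - 1986)        # about -6.9 over the whole period
```

So the line drops by roughly one pupil per teacher every four years, which is the downward trend the correlation of -0.81 hinted at.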
So far so good. BTW, we are lucky, as the data can be combined in one graph: we could put the years on the X-axis and the two measures on a left and a right Y-axis. You know, kind of 3 axes. Hadley Wickham doesn’t like that, and therefore he didn’t include that option in ggplot2. That’s sad.
One possible solution would be to transform the Y-axis so that both measures can be plotted on a single Y-axis, for example with a log scale. Here, however, the difference between the two measures is so big that this helps only a bit.
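To see why a log scale helps only a little, here is a rough sketch that compares the two ranges before and after taking log10. It uses just the minima and maxima from the two summary() outputs above; the variable names are mine.

```r
# Ranges copied from the summary() outputs above
expenditure_range <- c(min = 1404,  max = 7396)   # SGD per student
ratio_range       <- c(min = 17.70, max = 26.60)  # pupils per teacher

# On a log10 scale each measure occupies a narrow band
log10(expenditure_range)  # about 3.15 to 3.87
log10(ratio_range)        # about 1.25 to 1.42

# Gap between the lowest expenditure and the highest ratio
gap_raw <- expenditure_range[["min"]] / ratio_range[["max"]]   # about 53 times
gap_log <- log10(expenditure_range[["min"]]) -
  log10(ratio_range[["max"]])                                  # about 1.7 log units
```

Even after the transformation, the two bands are about 1.7 units apart while the ratio itself spans less than 0.2 units, so its trend would look almost flat on a shared axis.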
source("BigTable.R")
I have also decided to include one more statistic. Therefore, let’s look at our last measurement: the number of classes and the average class size (the average number of students per class).
source("ClassSize.R")
noc <- ggplot(data=cleanedData.numberOfClases)
noc <- noc + geom_point(aes(x=Year, y=AllNumbersForPrimarySchools)) +
  geom_point(aes(x=Year, y=AllNumbersForSecondarySchools)) +
  coord_cartesian(xlim=c(min(cleanedData.numberOfClases$Year)-1, max(cleanedData.numberOfClases$Year)+1),
                  ylim=c(4500, 9000)) + # zoom
  labs(title="", x="Years", y="Number of classes for primary and secondary schools") +
  scale_x_continuous(breaks = seq(min(cleanedData.numberOfClases$Year)-1,
                                  max(cleanedData.numberOfClases$Year)+1, by = 2)) +
  annotate("text", label = "Upper points: primary; lower points: secondary", x = 1997, y = 6300, size = 5, colour = "red")
noc
What we can clearly see is that there were far more classes in primary schools than in secondary schools. Why? We cannot infer that from this graphic; however, we may see it a bit later.
acs <- ggplot(data=cleanedData.avarageClassSize)
acs <- acs + geom_point(aes(x=Year, y=AllNumbersForPrimarySchools, colour="Primary schools")) +
  geom_point(aes(x=Year, y=AllNumbersForSecondarySchools, colour="Secondary schools")) +
  coord_cartesian(xlim=c(min(cleanedData.avarageClassSize$Year)-1,
                         max(cleanedData.avarageClassSize$Year)+1), ylim=c(31, 39)) + # zoom
  labs(title="", x="Years", y="Average class size for primary and secondary schools") +
  scale_x_continuous(breaks = seq(min(cleanedData.avarageClassSize$Year)-1,
                                  max(cleanedData.avarageClassSize$Year)+1, by = 2)) +
  theme(legend.title=element_blank()) + # hide the legend title [4]
  guides(colour = guide_legend(override.aes = list(size=4))) # bigger points in the legend
acs
mean(cleanedData.avarageClassSize$AllNumbersForPrimarySchools) # Primary
## [1] 35.97097
mean(cleanedData.avarageClassSize$AllNumbersForSecondarySchools) # Secondary
## [1] 34.66129
What can be inferred from this? It is actually very interesting: SG has decreased its class size in primary schools, but on average it is still higher than in secondary schools. Since 2004, however, the decrease has been very sharp.
What is also very interesting is the following statement from MOE.
Most primary and secondary schools have classes of 40 students or fewer, while Primary 1 and 2 classes have 30 students or fewer. We plan on the basis of 30 students per class at primary 1 and 2 and 40 students per class at the other primary and secondary levels.
And their target? For Europe, that would be almost like something from a different world.
Thus, while the Ministry does not mandate targets for class size, we plan to improve the pupil teacher ratio from 18 and 15 at the primary and secondary levels to 16 and 13 by 2015 when the Education Service grows to 33,000 Education Officers.
What do the last two plots tell us? Looking only at primary schools, there may be a relationship between the number of classes and the average class size: because there are fewer students per class, more classes (and more teaching hours) are needed to educate the future generation. Furthermore, there is a notion that pupils do many activities outside school and therefore learn at school from professionals. This changes rapidly once they reach secondary school, where each student sits over homework until 01:00 AM.
That’s all for this time. I hope that I didn’t make any mistakes; if I did, please let me know. This is my first time posting to the RPubs community. I hope you enjoyed reading it.
[1] Karen Akerhielm, Does class size matter?, Economics of Education Review, Volume 14, Issue 3, September 1995, Pages 229-241, ISSN 0272-7757, http://dx.doi.org/10.1016/0272-7757(95)00004-4
[2] McDonald, Gael. “Does Size Matter? The Impact Of Student-Staff Ratios.” Journal Of Higher Education Policy And Management 35.6 (2013): 652-667. ERIC. Web. 31 Oct. 2014.
[3] https://hopstat.wordpress.com/2014/10/30/my-commonly-done-ggplot2-graphs/
[4] http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/