Introduction
Hi, my name is Nguyen Huy Tu Quan. In this portfolio, I create a
scenario similar to those FUV may often face. The scenario lets me
illustrate my ability to ask the right question, collect and process
data, and then conduct data analysis to help inform the decisions made
by the University.
I created this portfolio using R Markdown, and the source code can
be found here.
The scenario
Motivated by disappointment with the poor quality of the current
education system in Colombia, especially in the technology and
engineering sector, professors and scholars from around the country
convened in the capital city, Bogotá, to discuss solutions. Together,
they decided to create a new non-profit university dedicated to
excellent teaching in technology and engineering. They named it after
the city where the unthinkable aspiration started: the Bogotá University
of Technology and Science (BUST).
With the participation of great talents from all over the country,
BUST is undoubtedly a university of great potential. But as every great
project starts with a small but firm step, the University first needed
to determine its enrollment and scholarship policies. After discussion,
the Board of Trustees resolved that the University would focus on high
school students with the most potential to perform well academically and
practically before entering the labour market. They then assigned the
University’s provost to propose a detailed enrollment plan to achieve
this goal.
The question
Supporting the provost was a team of data analysts, who understood
well the importance of asking the right question from the start. They
first stated a broad question as follows:
What characteristics of high school students are associated with
academic and practical competency?
Because the first enrollment at BUST had yet to start, the University
did not have any data about its own students, so the team decided to
look outside for data. After some brief research, they found a data set
containing scores from SABER Pro, a widely taken exam administered
during the senior year of university in Colombia. They also learned that
SABER Pro is a comprehensive test evaluating a range of students’ skills
and knowledge, including reading, writing, quantitative reasoning,
English, and citizenship competency. Therefore, the SABER Pro score
could be used as an indicator of the “academic and practical competency”
of students.
In this context, the question was refined as follows:
What characteristics of high school students are associated with a
high SABER Pro score?
The data set
The data set used, “Data set of academic performance evolution for
engineering students”, was compiled by Delahoz-Dominguez et al. (2020)
by cross-referencing the databases of the Colombian Institute for the
Evaluation of Education.
The data set contains academic performance information of 12,411
engineering students studying at different Colombian universities.
Specifically, it provides students’ scores in the high school graduation
exam, SABER 11. In addition, it also provides the scores in a
professional exam conducted in the last year of university, SABER
Pro.
The data set also includes information about the social and economic
background of the students, such as parents’ education and occupation,
household income, number of people living in the household, the
availability of internet and a computer at home, the ownership type of
students’ high schools (private or public), and the socioeconomic level
of students’ place of residence.
The data set includes 44 different variables. However, for
simplicity, I use only the 10 variables described below:
| No. | Variable | Description |
|---|---|---|
| 1 | sc_pro | Student’s global SABER Pro score. In the original data, this variable was named as global_sc. |
| 2 | sc_s11 | Student’s average high school graduation (SABER 11) score. The number 11 means the exam is conducted at 11th grade. |
| 3 | female | Is the student a female person? (1 for female, 0 for male). |
| 4 | edu_father | Education level of student’s father. |
| 5 | edu_mother | Education level of student’s mother. |
| 6 | income | Family’s income. |
| 7 | internet | Internet availability at home (1 for available, 0 for unavailable). |
| 8 | sel | Socioeconomic level of student’s place of residence (1 is lowest and 4 is highest). |
| 9 | sel_ihe | Socioeconomic level of university’s campus. |
| 10 | private_high_school | Did the student come from private high school (also known as “upper high school”) (1 for Yes, 0 for No). |
The data set can be accessed at Mendeley Data, and a detailed
description of the data set can be found here.
Descriptive Statistics
First, we carry out a univariate analysis to better understand the
key variables in the data set.
The response variable: SABER Pro score
The SABER Pro score is the variable that this data analysis aims to
explain. The graph below shows the distribution of the SABER Pro score,
which ranges from 37 to 247 points. Of the 12,411 students in the data
set, half had scores between 147 (25th percentile) and 179 (75th
percentile). The median score is 163.
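The summary figures above (minimum, maximum, quartiles, and median) can be reproduced with a few lines of code. The portfolio itself was built in R Markdown; the sketch below uses Python purely for illustration, and the `scores` list is made-up data standing in for the `sc_pro` column.

```python
import statistics

# Hypothetical SABER Pro scores; the real analysis would use all 12,411 sc_pro values.
scores = [120, 135, 147, 150, 158, 163, 168, 171, 179, 190, 205]

# method="inclusive" treats the list as the whole population, not a sample from it.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(f"range: {min(scores)} to {max(scores)}")
print(f"25th percentile: {q1}, median: {median}, 75th percentile: {q3}")
```

Half of the observations always fall between the 25th and 75th percentiles, which is why that interval is a convenient description of the "typical" student.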
Predictors
Full statistics table
Two-variables analysis
SABER 11 as a predictor
This section examines the relationships between the SABER Pro score
and the predictor variables. One of the strongest indicators of present
academic performance (SABER Pro score) is its predecessor (SABER 11
score). The scatter plot on the left shows a strong positive correlation
between the two: students with higher SABER 11 scores tend to have
higher scores on the SABER Pro examination several years later.
Furthermore, the contour chart on the right shows the density of data
points by colour (blue represents low density, yellow represents high
density) and thereby helps us see the trend more clearly. In this chart,
the contour lines align well with the direction of the orange trend
line.
Note: In the following scatter plots, R is the correlation
coefficient. The closer R is to 1 or -1, the stronger the correlation.
The p-value is the probability of seeing a correlation at least this
strong if the two scores were in fact unrelated, so a p-value close to 0
(zero) indicates strong evidence that the correlation is real.
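To make the meaning of R concrete, here is a minimal sketch of how the correlation coefficient is computed. It is written in Python for illustration only (the charts in this portfolio were produced in R), and the score pairs are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of x and y around their means, scaled by their spreads.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (sc_s11, sc_pro) pairs for seven students.
sc_s11 = [45, 52, 58, 61, 66, 70, 75]
sc_pro = [120, 140, 150, 149, 170, 176, 190]

r = pearson_r(sc_s11, sc_pro)
print(f"R = {r:.2f}")
```

A perfectly increasing straight-line relationship gives R = 1, a perfectly decreasing one gives R = -1, and unrelated variables give values near 0.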

But we may wonder whether this positive association still holds
consistently in different subsets of the data set. Therefore, it is
necessary to recreate the above scatter plots for different subsets of
the sample to see if the trend is still there.
Income subsets
Across all income groups, students with high SABER 11 scores also
tend to have high SABER Pro scores. The correlation coefficient R ranges
from 0.65 to 0.76, suggesting a strong correlation. Furthermore, the
computed p-value is very small, at less than \(2.2 \times 10^{-6}\),
showing a high level of statistical significance.

Gender and high school subsets
Similarly, a strong correlation between SABER 11 and SABER Pro scores
appears consistently across different combinations of gender and high
school ownership type.

Socioeconomic level subsets
We also find the same trend across the different socioeconomic level
groups.

Parents’ highest education subsets
Similarly, the SABER 11 score remains an important indicator of the
SABER Pro score regardless of the parents’ highest education level.

Other predictors
We now know that the SABER 11 score is an important indicator of a
student’s academic competency and thus can be used to predict the SABER
Pro score.
However, there is likely more than one predictor of students’ SABER
Pro scores. In other words, students who have the same SABER 11 scores
but different demographic characteristics may perform differently on the
SABER Pro examination. Therefore, in the following sections, we inspect
whether there is any association between these demographic factors and
the SABER Pro score. The method is to divide the sample into subsets
according to each factor (gender, family income, and so on) and then
compare the score distributions of the subsets to see if there is any
systematic difference between them.
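The subset comparison described above can be sketched as follows. This is an illustrative Python example rather than the R code behind the charts; the function name `summarize_by_group` and the records are hypothetical, with only the column names `internet` and `sc_pro` taken from the variable list.

```python
import statistics
from collections import defaultdict

def summarize_by_group(records, group_key, score_key="sc_pro"):
    """Split records by group_key and report quartiles of score_key per group."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row[score_key])
    summary = {}
    for name, values in groups.items():
        q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[name] = {"n": len(values), "q1": q1, "median": med, "q3": q3}
    return summary

# Hypothetical students: five with internet access at home, five without.
records = (
    [{"internet": 1, "sc_pro": s} for s in (150, 160, 170, 180, 190)]
    + [{"internet": 0, "sc_pro": s} for s in (140, 150, 160, 170, 180)]
)

summary = summarize_by_group(records, "internet")
for group, stats in sorted(summary.items()):
    print(group, stats)
```

If the quartiles of one group sit consistently above those of another, as they do in this made-up data, the factor is worth a closer look in the later regression.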
Gender
By zooming in on the chart below, we can see that the box plot for
male students sits slightly higher than that for female students,
indicating that male students perform slightly better on the SABER Pro
examination. Specifically, the median score of male students is 164
points, whereas the median for female students is 2 points lower. For
female students, the 25th percentile score is 146 and the 75th
percentile score is 177. The corresponding figures for male students are
147 and 180, both higher than those of their female peers.
Family income
We can see from the chart below that the score distributions of
students from wealthier families are shifted systematically to the right
compared with those of students from lower-income families. This pattern
suggests that coming from a richer family is associated with higher
SABER Pro scores.

Students’ SEL
Similarly, students from areas with better socioeconomic conditions,
as demonstrated by higher socioeconomic levels, tend to score higher on
the SABER Pro examination.

University’s SEL
The figure below shows that the score distributions of students from
universities with high socioeconomic levels tend to be shifted to the
right. One possible explanation is as follows: the socioeconomic level
reflects the economic development, availability of infrastructure, and
population density of the area around the campus. Universities located
in areas with high socioeconomic levels are likely to have better access
to talented lecturers and skilled staff and tend to have larger budgets.
To some extent, the socioeconomic level of a university is an indicator
of its education quality and may therefore have a positive impact on
students’ performance.

Parent education
Students whose parents had a tertiary or upper high-school education
seem to perform slightly better than students in the other group.
However, if parents’ education does have an effect on students’
performance, it cannot be seen clearly in this chart.

High school
For students from private high schools, the middle 50% of scores (the
interquartile range, IQR) runs from 153 to 185 points. For students from
public high schools, the corresponding range runs from 142 to 172
points, which is lower. As such, students from private schools tend to
score higher.
Internet
The chart below shows the SABER Pro score distributions of two groups
of students, depending on whether they have internet access at home. We
can see that the density curve of the group with internet access at home
is shifted to the right compared with the curve of the other group. In
short, internet availability is associated with higher scores in the
SABER Pro examination.
Multiple-Variables Analysis
Bivariate analysis’s risk of error
So far, we have examined the relationships between the SABER Pro
score and different factors and discovered some associations between
them, as outlined above. However, these patterns were identified using
simple bivariate analysis, which carries a risk of error because it
ignores every other factor.
For example, we have found that students from wealthier families also
tend to have higher SABER Pro scores. However, if we compare SABER Pro
scores within groups of students with similar past academic achievement
(SABER 11 scores), the trend between family wealth and achievement
disappears.
The figure below illustrates this phenomenon. When we use the whole
data set for analysis, we see that the SABER Pro score distributions of
the wealthier groups shift rightward. But when we compare only within a
group of students with similar past academic performance, say, students
whose SABER 11 scores lie between the 45th and 55th percentiles, the
rightward shift is no longer apparent. The score distributions of richer
students even tend to shift to the left, implying that when a poor
student gets the same SABER 11 score as a rich student, they can perform
as well as the rich one in the SABER Pro examination, or even better. As
such, family income may add little to the prediction of the SABER Pro
score once we take SABER 11 scores into account.
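The within-band comparison described above can be sketched in code. This Python example is illustrative only: the helper names are hypothetical, and the records are deliberately constructed so that high-income students have higher SABER 11 scores overall but score 2 points lower than low-income students with the same SABER 11 score. It demonstrates the mechanism, not the actual data.

```python
import statistics

def median_by_group(records, group_key, score_key):
    """Median of score_key for each value of group_key."""
    groups = {}
    for row in records:
        groups.setdefault(row[group_key], []).append(row[score_key])
    return {name: statistics.median(vals) for name, vals in groups.items()}

def within_percentile_band(records, key, lower_pct, upper_pct):
    """Keep records whose `key` lies between two percentiles of the whole sample."""
    cuts = statistics.quantiles([r[key] for r in records], n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [r for r in records if lo <= r[key] <= hi]

# Constructed data (not real): income is correlated with sc_s11, and at equal
# sc_s11 the high-income group scores slightly lower on sc_pro.
records = (
    [{"income": "low", "sc_s11": s, "sc_pro": 100 + s} for s in range(40, 61)]
    + [{"income": "high", "sc_s11": s, "sc_pro": 98 + s} for s in range(50, 71)]
)

overall = median_by_group(records, "income", "sc_pro")
band = within_percentile_band(records, "sc_s11", 45, 55)
matched = median_by_group(band, "income", "sc_pro")

print("whole sample:", overall)
print("45th-55th percentile of sc_s11:", matched)
```

In the whole sample the high-income median is higher, yet inside the band the ordering reverses. The raw gap came from the difference in SABER 11 scores, not from income itself.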


The regression model
To minimize the risk of error, it is necessary to assess the impact
of multiple variables simultaneously. This method is called multiple
regression analysis (often loosely referred to as multivariate
regression). In the section below, I perform this kind of analysis,
using the following variables to explain students’ SABER Pro score
(sc_pro):
- SABER 11 score (sc_s11),
- Gender (female),
- Family income (income),
- Highest parental education (highest_edu_parent),
- Availability of the internet at home (internet),
- Ownership type of students’ high school (private_high_school),
- The socioeconomic level of the place where the student lives (sel), and
- The socioeconomic level of the place where the university is located (sel_ihe).
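For readers curious about what a regression of this kind does under the hood, below is a minimal ordinary least squares sketch in Python. It is not the code used for this portfolio (which was written in R); `fit_ols` and the `income_above_5` indicator are hypothetical names, and the data are generated from known coefficients so we can check that the fit recovers them. Categorical predictors such as income enter the model as 0/1 indicator columns measured against a reference level.

```python
def fit_ols(rows, y):
    """Least-squares fit of y on the columns of rows (an intercept is added).

    Solves the normal equations (X'X) b = X'y by Gaussian elimination.
    Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    aug = [xtx[i] + [xty[i]] for i in range(k)]  # augmented matrix [X'X | X'y]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        known = sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (aug[i][k] - known) / aug[i][i]
    return beta

# Hypothetical predictors per student: (sc_s11, internet, income_above_5).
rows = [(50, 0, 0), (55, 1, 0), (60, 0, 1), (65, 1, 1),
        (70, 0, 0), (75, 1, 1), (58, 1, 0), (62, 0, 1)]
# Noise-free outcome built from chosen coefficients, for checking the fit.
y = [44 + 1.8 * s + 1.2 * net - 3.0 * rich for s, net, rich in rows]

beta = fit_ols(rows, y)
print([round(b, 3) for b in beta])
```

Each coefficient is the expected change in the outcome for a one-unit change in that predictor while the other predictors are held fixed, which is the "all other things being equal" reading used when interpreting the results.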
Interpreting regression results
Below is a chart showing the estimated impact of different factors on
the SABER Pro score. Specifically, the sc_s11 coefficient of 1.8 means
that, on average, a student who scores one point higher than another in
the SABER 11 exam tends to score 1.8 points higher in the SABER Pro
examination. Also, the p-value corresponding to sc_s11 is very small,
showing that we have a high level of confidence in this result.
Note: Confidence level = \((1 - p\text{-value}) \times 100\%\)
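As a worked illustration of the note above, the following Python sketch converts a t value into a two-sided p-value and then into the confidence level as defined in this portfolio. It assumes a normal approximation to the t distribution, which is reasonable here because the model has more than 11,000 residual degrees of freedom; the function names are hypothetical.

```python
import math

def two_sided_p_from_t(t_value):
    """Two-sided p-value under a normal approximation to the t distribution."""
    return math.erfc(abs(t_value) / math.sqrt(2))

def confidence_level(p_value):
    """Confidence level in percent, as defined in the note: (1 - p) * 100."""
    return (1 - p_value) * 100

# t values taken from the regression results table (female and internet).
for name, t in [("female", 1.94), ("internet", 3.13)]:
    p = two_sided_p_from_t(t)
    print(f"{name}: p = {p:.4f}, confidence level = {confidence_level(p):.1f}%")
```

A larger absolute t value gives a smaller p-value and therefore a higher confidence level under this definition.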
We can interpret the other coefficients and p-values similarly. On
average, all other things being equal:
- Female students have a higher SABER Pro score than male students by 0.53 points. We can be confident about this statement at about the 95 per cent level.
- Students with a family income of more than five minimum monthly wages (MMWs) have lower SABER Pro scores than students with a family income of less than one MMW, by 1.64 to 3.27 points. The confidence level is 95.1% or above.
- Students from areas with socioeconomic levels 3 and 4 have higher scores than those from areas with socioeconomic level 1, by 1.06 and 1.40 points, respectively. The confidence levels are 93.1% and 98.9%, respectively.
- Students who can access the internet at home tend to score 1.19 points higher than their counterparts without access.
- Students of universities in areas with socioeconomic levels 2, 3, and 4 score higher than those of universities in areas with socioeconomic level 1 by 3.01, 6.85, and 7.07 points, respectively.
The model also estimates the impact of the other factors, including
ownership type of high school and parents’ highest education level.
However, these estimates come with confidence levels of less than 90%,
a threshold below which we should not accept them as statistically
significant. Thus, we should not draw conclusions about how well these
factors predict the SABER Pro score.
Regression results table

|  |  |
|---|---|
| F(20,11229) | 919.46 |
| R² | 0.62 |
| Adj. R² | 0.62 |

|  | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 44.44 | 1.35 | 32.95 | 0.00 |
| sc_s11 | 1.80 | 0.02 | 112.85 | 0.00 |
| female | 0.53 | 0.28 | 1.94 | 0.05 |
| income1-2 | -0.29 | 0.55 | -0.52 | 0.60 |
| income2-3 | -0.08 | 0.61 | -0.13 | 0.89 |
| income3-5 | -0.80 | 0.66 | -1.22 | 0.22 |
| income5-7 | -1.64 | 0.78 | -2.11 | 0.03 |
| income7-10 | -3.27 | 0.92 | -3.54 | 0.00 |
| incomeAbove 10 | -1.74 | 0.88 | -1.97 | 0.05 |
| highest_edu_parent1. Primary | 1.39 | 0.94 | 1.48 | 0.14 |
| highest_edu_parent2a. Lower Secondary | 1.21 | 0.90 | 1.35 | 0.18 |
| highest_edu_parent2b. Upper Secondary | 0.87 | 0.92 | 0.94 | 0.35 |
| highest_edu_parent3. Tertiary | 1.26 | 0.92 | 1.37 | 0.17 |
| internet | 1.19 | 0.38 | 3.13 | 0.00 |
| private_high_school | -0.36 | 0.33 | -1.08 | 0.28 |
| factor(sel)2 | 0.07 | 0.43 | 0.17 | 0.87 |
| factor(sel)3 | 1.06 | 0.58 | 1.82 | 0.07 |
| factor(sel)4 | 1.40 | 0.55 | 2.54 | 0.01 |
| factor(sel_ihe)2 | 3.01 | 0.49 | 6.18 | 0.00 |
| factor(sel_ihe)3 | 6.85 | 0.70 | 9.73 | 0.00 |
| factor(sel_ihe)4 | 7.07 | 0.61 | 11.67 | 0.00 |

Standard errors: OLS
Regression results chart
Note: The chart below shows the coefficients estimated by the
regression model (the small white circle in the middle of each line).
The blue line shows the 90% confidence interval of each coefficient: if
we were to collect a new sample and repeat the analysis many times,
about 90% of the intervals constructed this way would contain the true
coefficient.
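A 90% confidence interval like the ones drawn in the chart can be approximated from an estimate and its standard error. The Python sketch below assumes a normal approximation (appropriate for a sample of this size) and uses the hypothetical helper `confidence_interval`; the two example rows use the estimates and standard errors reported in the regression results table.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, level=0.90):
    """Normal-approximation confidence interval: estimate +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% interval
    return estimate - z * std_error, estimate + z * std_error

# (estimate, standard error) pairs from the regression results table.
coefficients = {"internet": (1.19, 0.38), "private_high_school": (-0.36, 0.33)}

for name, (est, se) in coefficients.items():
    low, high = confidence_interval(est, se)
    verdict = "excludes zero" if low > 0 or high < 0 else "includes zero"
    print(f"{name}: [{low:.2f}, {high:.2f}] {verdict}")
```

An interval that includes zero means the data are compatible with the factor having no effect at this confidence level, which matches the decision not to draw conclusions about high school ownership type.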

Main findings and Implications
We have outlined some characteristics of students likely to perform
better on the SABER Pro examination, including having a high SABER 11
score, being female (which is a slight advantage), coming from an area
with a high socioeconomic level, and having internet access at home.
With all other conditions being equal, students originating from wealthy
families tend to score lower in the SABER Pro examination than their
poorer peers.
However, we should interpret these findings with caution. While the
trends mentioned hold in general, applying the model rigidly to every
individual student would be a mistake. Indeed, although the regression
model has a relatively high R-squared of 0.62, meaning that it explains
about 62% of the variation in SABER Pro scores, this number also warns
us that the remaining 38% depends on other factors that are not
represented in the data, such as the motivation of each individual
student. Therefore, if we rely too much on these trends for enrollment
decisions, we risk losing talented students who do not possess the
above-mentioned characteristics.
In light of this understanding, I present some implications for the
University as follows:
- The characteristics discovered above are helpful for pre-screening applicants. At the same time, admissions committees still need to use other methods to assess students’ abilities and potential comprehensively, such as requiring personal essays and personal interviews.
- The University should also consider measures to support students in learning. For example, we found that internet access is associated with better performance in the SABER Pro examination. Therefore, the University should consider digitalizing more of its materials to amplify the positive impact of the internet. It should also consider providing computers with internet access as part of student scholarships or installing more computers on campus.
- With other factors held equal, students from low-income families (under one MMW) achieved higher scores than those from families with income above 5 MMWs. Therefore, the University should prioritise the former group over the latter when allocating scholarships and financial aid.
Thank you for reading.