1 Introduction

Hi, my name is Nguyen Huy Tu Quan. In this portfolio, I will create a scenario which is similar to those FUV may often face. This scenario allows me to illustrate my abilities to ask the right question, collect and process data, and then conduct data analysis to help inform the decisions made by the University.

I created this portfolio using R Markdown, and the source code can be found here.

1.1 The scenario

Motivated by disappointment with the poor quality of the current education system in Colombia, especially in the technology and engineering sector, professors and scholars from around the country convened in the capital city, Bogotá, to discuss solutions. Together, they decided to create a new non-profit university aiming to teach technology and engineering to a high standard. They named it after the city where this bold aspiration started: the Bogotá University of Technology and Science (BUST).

With the participation of great talents from all over the country, BUST is undoubtedly a University of great potential. But as every great project starts with a small but firm step, the University first needs to determine its enrollment and scholarship policies. After discussion, the Board of Trustees resolved that the University would focus on high school students with the most potential to perform well academically and practically before entering the labour market. They then assigned the University’s provost to propose a detailed enrollment plan to achieve this goal.

1.2 The question

Supporting the provost was a team of data analysts, who understood well the importance of asking the right question from the start. They first stated a broad question as follows:

What characteristics of high school students are associated with academic and practical competency?

Because the first enrollment of BUST had yet to start, the University did not have any data about its own students, so the team decided to look outside for data. After a brief search, they found a data set containing the SABER Pro score, the result of a popular exam taken during the senior year at university in Colombia. They also learned that SABER Pro was a comprehensive test evaluating a range of students’ skills and knowledge, including reading, writing, quantitative, English skills, and citizen competency. Therefore, the SABER Pro score could be used as an indicator of the “academic and practical competency” of students.

In this context, the question was refined as follows:

What characteristics of high school students are associated with a high SABER Pro score?

1.3 The data set

The data set used, “Data set of academic performance evolution for engineering students”, was compiled by Delahoz-Dominguez et al. (2020) by cross-referencing the databases of the Colombian Institute for the Evaluation of Education.

The data set contains academic performance information of 12,411 engineering students studying at different Colombian universities. Specifically, it provides students’ scores in the high school graduation exam, SABER 11. In addition, it also provides the scores in a professional exam conducted in the last year of university, SABER Pro.

The data set also includes information about the social and economic background of the students, such as parents’ education and occupation, household income, number of people living in the household, the availability of internet and a computer, ownership type of students’ high schools (private or public), and the socioeconomic level of students’ place of residence.

The data set includes 44 different variables. However, for simplicity, I am going to use only the 10 variables described below:

No. Variable name Description
1 sc_pro Student’s global SABER Pro score. In the original data, this variable is named global_sc.
2 sc_s11 Student’s average high school graduation (SABER 11) score. The number 11 means the exam is conducted at 11th grade.
3 female Is the student a female person? (1 for female, 0 for male).
4 edu_father Education level of student’s father.
5 edu_mother Education level of student’s mother.
6 income Family’s income.
7 internet Internet availability at home (1 for available, 0 for unavailable).
8 sel Socioeconomic level of student’s place of residence (1 is lowest and 4 is highest).
9 sel_ihe Socioeconomic level of the university’s campus location.
10 private_high_school Whether the student came from a private high school (also known as “upper high school”) (1 for yes, 0 for no).

The data set can be accessed at Mendeley Data, and a detailed description of the data set can be found here.

2 Descriptive Statistics

First, we will carry out univariate analysis to better understand the key variables in the data set.

2.1 The response variable: SABER Pro score

The SABER Pro score is the variable that this data analysis aims to explain. The graph below shows the distribution of the SABER Pro score, which ranges from 37 to 247 points. Of the 12,411 students in the data set, half had scores between 147 (25th percentile) and 179 (75th percentile). The median score is 163.
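The quartiles and median quoted above are standard order statistics. As a minimal sketch of how such a summary could be computed, the snippet below uses Python's `statistics` module on a small list of invented scores. The function name `five_number_summary` and the sample values are assumptions made for illustration; they are not taken from the data set or from the portfolio's own source code.

```python
from statistics import median, quantiles


def five_number_summary(scores):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    # method="inclusive" treats the list as the whole population of interest.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, median(scores), q3, max(scores)


# Hypothetical scores, invented for this example only.
sample_scores = [120, 135, 147, 150, 163, 170, 179, 190, 210]
summary = five_number_summary(sample_scores)
print(summary)
```

Half of the values always lie between the second and fourth entries of the summary, which is how the "147 to 179" range in the text should be read.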

2.2 Predictors

2.2.1 Notable figures

From the statistics table on the right tab, we find some notable figures as follows:

  • SABER 11 scores range from 35.8 to 95.6 points. The average is 62.3 points, with a standard deviation of 9.6 points. The median score is 61.8 points.

  • Among the students in the data set, 40.6% are female, and 59.4% are male. Regarding parents’ education, approximately half of the students have a father who completed either upper secondary school or university.

  • Regarding family income, nearly one-third of the students come from families whose income ranges from 1 to 2 minimum monthly wages (MMWs). In second place, 23 per cent of students’ families earn between 2 and 3 MMWs. A further 18.5 per cent of students have a family income of 3 to 5 MMWs. Each remaining group accounts for less than 10 per cent of the total sample.

  • About 53 per cent of students come from private high schools.

  • Internet access is common among engineering students, with 78.6 per cent having an internet connection at home.

  • Regarding the socioeconomic levels of students’ place of residence, most students live in level 2 (38.2 per cent) and level 4 (32.6 per cent) areas. As for universities, more than 60% of students attend a university located in an area with socioeconomic level 2.
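The percentages in this list are shares of the valid observations for each variable. As a small worked example, the sketch below recomputes two of them from the counts shown in the full statistics table (for `female` and `internet`). The helper name `percent_shares` is an assumption introduced here for illustration.

```python
def percent_shares(counts):
    """Convert a {label: count} mapping into percentages rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts copied from the full statistics table (0 = no, 1 = yes).
female_counts = {0: 7368, 1: 5043}
internet_counts = {0: 2659, 1: 9752}

female_shares = percent_shares(female_counts)
internet_shares = percent_shares(internet_counts)
print(female_shares)    # share of students with female = 0 and female = 1
print(internet_shares)  # share of students without and with internet at home
```

Because `female` is coded 1 for female students, the share attached to the label 1 is the share of female students.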

2.2.2 Full statistics table


Variable [type], Stats / Values, Freqs (% of Valid)

sc_s11 [numeric]
Mean (sd): 62.3 (9.6)
min ≤ med ≤ max: 35.8 ≤ 61.8 ≤ 95.6
IQR (CV): 13.8 (0.2)
280 distinct values

female [numeric]
Min: 0, Mean: 0.4, Max: 1
0: 7368 (59.4%)
1: 5043 (40.6%)

income [character]
1. 0-1: 1037 (8.5%)
2. 1-2: 3873 (31.9%)
3. 2-3: 2783 (22.9%)
4. 3-5: 2239 (18.5%)
5. 5-7: 973 (8.0%)
6. 7-10: 509 (4.2%)
7. Above 10: 718 (5.9%)

edu_father [character]
1. 0. Under Primary: 858 (7.4%)
2. 1. Primary: 1915 (16.5%)
3. 2a. Lower Secondary: 3268 (28.1%)
4. 2b. Upper Secondary: 3293 (28.4%)
5. 3. Tertiary: 2279 (19.6%)

internet [numeric]
Min: 0, Mean: 0.8, Max: 1
0: 2659 (21.4%)
1: 9752 (78.6%)

private_high_school [numeric]
Min: 0, Mean: 0.5, Max: 1
0: 5846 (47.1%)
1: 6565 (52.9%)

sel [numeric]
Mean (sd): 2.6 (1.1)
min ≤ med ≤ max: 1 ≤ 2 ≤ 4
IQR (CV): 2 (0.4)
1: 2138 (17.2%)
2: 4742 (38.2%)
3: 1491 (12.0%)
4: 4040 (32.6%)

sel_ihe [numeric]
Mean (sd): 2.4 (0.9)
min ≤ med ≤ max: 1 ≤ 2 ≤ 4
IQR (CV): 1 (0.4)
1: 1137 (9.2%)
2: 7748 (62.4%)
3: 834 (6.7%)
4: 2692 (21.7%)

Generated by summarytools 1.0.1 (R version 4.2.3)
2023-04-22

3 Two-variable analysis

3.1 SABER 11 as a predictor

This section examines the relations between the SABER Pro score and the predictor variables. One of the strongest indicators of present academic performance (SABER Pro score) is its predecessor (SABER 11 score). The scatter plot on the left shows a strong correlation between the two: students with higher SABER 11 scores tend to have higher scores on the SABER Pro examination several years later. Furthermore, the contour chart on the right presents the density of data points using colours (blue represents low density, yellow represents high density) and thereby helps us see the trend more clearly. From this chart, we can see that the contour lines align well with the direction of the trend line in orange.

Note: In the following scatter plots, R denotes the correlation coefficient. The closer R is to 1 or -1, the stronger the correlation. The p-value is the probability of observing a correlation at least this strong if the two variables were in fact uncorrelated. A p-value close to 0 (zero) indicates strong evidence that the correlation is real.
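For readers who want to see how R is obtained, here is a minimal sketch of the Pearson correlation coefficient written in plain Python. The function name `pearson_r` and the toy inputs are assumptions for illustration; the charts in this portfolio were produced in R, not with this code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy inputs: a perfectly increasing and a perfectly decreasing relationship.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

The two toy cases sit at the extremes of the scale described in the note: values near 1 or -1 indicate points lying close to a straight line.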

But we may wonder whether this positive association still holds consistently in different subsets of the data set. Therefore, it is necessary to recreate the above scatter plots for different subsets of the sample to see if the trend is still there.

3.1.1 Income subsets

Across all income groups, students with high SABER 11 scores also tend to have high SABER Pro scores. The correlation coefficient R ranges from 0.65 to 0.76, suggesting a strong correlation. Furthermore, the computed p-value is very small, at less than \(2.2 \times 10^{-6}\), showing a high level of statistical significance.
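The subset check described here amounts to computing the same correlation separately inside each group. The sketch below shows one way this could be done, assuming rows of the form (income group, SABER 11 score, SABER Pro score). The rows are invented for the example, and the helper names `pearson_r` and `correlation_by_group` are assumptions rather than the code used for the charts.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_group(rows):
    """rows: iterable of (group, x, y). Returns {group: correlation of x and y}."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {group: pearson_r(xs, ys) for group, (xs, ys) in grouped.items()}


# Hypothetical rows: (income group, SABER 11 score, SABER Pro score).
rows = [
    ("0-1", 50, 140), ("0-1", 60, 155), ("0-1", 70, 175),
    ("1-2", 55, 150), ("1-2", 65, 160), ("1-2", 75, 185),
]
r_by_income = correlation_by_group(rows)
print(r_by_income)
```

If the association were driven by only one group, the per-group coefficients would differ sharply; similar values across groups suggest the trend is consistent.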

3.1.2 Gender and high school subsets

Similarly, a strong correlation between SABER 11 and SABER Pro scores appears consistently across different combinations of gender and high school ownership type.

3.1.3 Socioeconomic levels subsets

Also, we can find the same trend in different socioeconomic level groups.

3.1.4 Parents’ highest education subsets

Similarly, the SABER 11 score remains an important indicator of the SABER Pro score regardless of the parents’ highest educational level.

3.2 Other predictors

So we already know that the SABER 11 score is an important indicator of a student’s academic competency and thus can be used to predict the SABER Pro score.

However, there is likely more than one predictor of students’ SABER Pro scores. In other words, students with the same SABER 11 scores but different demographic features may perform differently on the SABER Pro examination. Therefore, in the following sections, we will inspect whether there is any association between these demographic factors and the SABER Pro score. The method is to divide the sample into subsets according to each factor (gender, family income, and so on) and then compare the score distribution of each subset to see if there is any systematic difference across subsets.
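The "split, then compare distributions" method described above can be sketched as follows. The example groups hypothetical (group, score) rows and reports the quartiles of each group, which is the same information a box plot displays. All values are invented, and `quartiles_by_group` is a name assumed for this illustration.

```python
from collections import defaultdict
from statistics import quantiles


def quartiles_by_group(rows):
    """rows: iterable of (group, score). Returns {group: (Q1, median, Q3)}."""
    grouped = defaultdict(list)
    for group, score in rows:
        grouped[group].append(score)
    return {
        group: tuple(quantiles(scores, n=4, method="inclusive"))
        for group, scores in grouped.items()
    }


# Hypothetical rows, invented for this example only.
rows = [
    ("public", 120), ("private", 135), ("public", 140), ("private", 150),
    ("public", 155), ("private", 165), ("public", 170), ("private", 180),
    ("public", 195), ("private", 205),
]
group_quartiles = quartiles_by_group(rows)
print(group_quartiles)
```

A systematic difference shows up when all three quartiles of one group sit above those of the other, as they do in this toy data.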

3.2.1 Gender

By zooming in on the chart below, we can see that the box plot of male students’ scores sits higher than that of female students, indicating that male students perform slightly better on the SABER Pro examination. Specifically, the median score of male students is 164 points, whereas the median for female students is 2 points lower. For female students, the 25th percentile score is 146 and the 75th percentile score is 177. The corresponding figures for male students are 147 and 180, both higher than those of their female peers.

3.2.2 Family income

We can see from the chart below that the score distributions of wealthier students shift systematically to the right compared to those of students from lower-income families. This pattern suggests that coming from a richer family is associated with a higher SABER Pro score.

3.2.3 Students’ SEL

Similarly, students from areas with better socioeconomic conditions, as demonstrated by higher socioeconomic levels, tend to score higher on the SABER Pro examination.

3.2.4 University’s SEL

The figure below shows that the score distributions of students from universities at high socioeconomic levels tend to shift to the right. One possible explanation is as follows: socioeconomic levels represent the level of economic development, availability of infrastructure, and population density of the campus’s area. Universities located in areas with high socioeconomic levels are likely to have better access to talented lecturers and skilled staff and tend to have larger budgets. To some extent, the socioeconomic level of a university is an indicator of its education quality and, therefore, can affect students’ performance positively.

3.2.5 Parent education

Students whose parents had a tertiary or upper secondary education seem to perform slightly better than students in the other groups. However, if parents’ education does have an effect on students’ performance, it cannot be seen clearly from this chart.

3.2.6 High school

The scores of students from private high schools have an interquartile range (IQR) spanning 153 to 185 points. The corresponding range for public high schools is 142 to 172 points, which is lower. As such, students from private schools tend to score higher.

3.2.7 Internet

The chart below shows the SABER Pro score distribution of two groups of students, depending on whether they have internet access at home. We can see that the density curve of the group with internet access at home is shifted to the right compared to the curve of the other group. In short, internet availability is associated with higher scores in the SABER Pro examination.

4 Multiple-Variable Analysis

4.1 Bivariate analysis’s risk of error

So far, we have examined the relations between the SABER Pro score and different factors and discovered some associations between them, as outlined above. However, these patterns were identified through simple bivariate analysis, which carries certain risks of error.

For example, we have found that students from wealthier families also tend to have higher SABER Pro scores. However, if we compare SABER Pro scores within groups of students with similar academic achievements in the past (SABER 11 scores), the trend mentioned before between family wealth and achievement disappears.

The figure below illustrates this phenomenon. When we use the whole data set for analysis, we see that the SABER Pro score distributions of the wealthier groups shift rightward. But when comparing only within a particular group of students with the same academic performance in the past, say, students whose SABER 11 scores ranged from the 45th percentile to the 55th percentile, the rightward shift is no longer apparent. The score distributions of wealthier students even tend to shift to the left, implying that a poor student who gets the same SABER 11 score as a rich student can perform as well as the rich one in the SABER Pro examination, or even better. As such, family income may be irrelevant when predicting the SABER Pro score once we take SABER 11 scores into account.
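To make the comparison concrete, the following sketch builds a deliberately artificial data set in which the SABER Pro score depends only on the SABER 11 score, while the wealthier group simply has higher SABER 11 scores on average. It then compares group medians on the whole sample and inside the 45th to 55th percentile band of SABER 11. The data, the formula `50 + 2 * sc_s11`, and the helper names are all assumptions made for illustration; they are not estimates from the real data.

```python
from collections import defaultdict
from statistics import median, quantiles


def median_by_group(rows):
    """rows: iterable of (group, sc_s11, sc_pro). Returns {group: median sc_pro}."""
    grouped = defaultdict(list)
    for group, _sc_s11, sc_pro in rows:
        grouped[group].append(sc_pro)
    return {group: median(scores) for group, scores in grouped.items()}


def within_percentile_band(rows, lower_pct=45, upper_pct=55):
    """Keep rows whose sc_s11 lies between two percentiles of the whole sample."""
    cuts = quantiles([sc_s11 for _, sc_s11, _ in rows], n=100)
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [row for row in rows if lo <= row[1] <= hi]


# Artificial data: sc_pro is a function of sc_s11 only (no income effect).
low_income = [("low income", s, 50 + 2 * s) for s in range(40, 71)]
high_income = [("high income", s, 50 + 2 * s) for s in range(55, 86)]
rows = low_income + high_income

overall = median_by_group(rows)
banded = median_by_group(within_percentile_band(rows))
print(overall)  # the high-income median is higher on the whole sample
print(banded)   # the gap vanishes among students with similar sc_s11
```

In this toy setting the overall gap between income groups is produced entirely by the difference in SABER 11 scores, which is the kind of confounding the bivariate charts cannot reveal.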

4.2 The regression model

To minimize the risk of error, it is necessary to assess the impact of multiple variables simultaneously. This method of analysis is called multiple regression analysis. In the section below, I will perform this kind of analysis, using the following variables to explain students’ SABER Pro score (sc_pro):

  • SABER 11 score (sc_s11),

  • Gender (female),

  • Family income (income),

  • Highest parental education (highest_edu_parent),

  • Availability of the internet at home (internet),

  • Ownership type of students’ high school (private_high_school),

  • The socioeconomic level of the place where the student lives (sel), and

  • The socioeconomic level of the place where the university is located (sel_ihe).
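As a rough illustration of what fitting such a model involves, the sketch below estimates an ordinary least squares regression by solving the normal equations in plain Python. It uses only two of the predictors and a tiny noiseless data set generated from assumed coefficients (intercept 40, slope 2 for `sc_s11`, effect 1 for `internet`), so it is a toy under stated assumptions and not the model estimated in this portfolio. The helper names are likewise hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_ols(predictor_rows, y):
    """Return [intercept, coef_1, ...] minimising the squared prediction error."""
    design = [[1.0] + list(row) for row in predictor_rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data: (sc_s11, internet) with sc_pro generated from assumed coefficients.
predictors = [(50, 0), (60, 1), (70, 0), (80, 1), (55, 1), (65, 0)]
sc_pro = [40 + 2 * s11 + 1 * net for s11, net in predictors]
coefficients = fit_ols(predictors, sc_pro)
print(coefficients)  # approximately [40, 2, 1]
```

Each coefficient is the estimated change in the response when that predictor increases by one unit while the others are held fixed, which is why the model can separate the effect of income from that of the SABER 11 score.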

4.3 Interpreting regression results

Below is a chart showing the estimated impact of different factors on the SABER Pro score. Specifically, the sc_s11 coefficient of 1.8 means that, on average, a student who scores one point higher in the SABER 11 exam tends to score 1.8 points higher in the SABER Pro examination. Also, the p-value corresponding to sc_s11 is very small, showing that we can have a high level of confidence in this result.

Note: Confidence level = \((1 - p\text{-value}) \times 100\%\)

We can interpret the other coefficients and p-values similarly. On average, all other things being equal:

  • Female students have a higher SABER Pro score than male students by 0.53 points. We can be confident about this statement at roughly the 95 per cent level.

  • Students with a family income of more than five minimum monthly wages (MMWs) score between 1.64 and 3.27 points lower on the SABER Pro test than students with a family income of less than one MMW. The confidence level is 95.1% or above.

  • Students from areas with socioeconomic levels 3 and 4 score 1.06 and 1.40 points higher, respectively, than those from level 1 areas. The confidence levels are 93.1% and 98.9%, respectively.

  • Students who can access the internet at home tend to score 1.19 points higher than those who cannot.

  • Students of universities in areas with socioeconomic levels 2, 3, and 4 score higher than those of universities in level 1 areas by 3.01, 6.85, and 7.07 points, respectively.

The model also estimates the impact of the other factors, including ownership type of high school, parents’ highest education level, etc. However, these estimates come with confidence levels of less than 90%, the threshold below which we should not accept them as statistically significant. Thus, we should not draw conclusions about their power to predict the SABER Pro score.
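The confidence levels quoted above can be recovered, at least approximately, from the t values in the regression results table. The sketch below assumes a normal approximation to the t distribution, which is reasonable here because the model has more than eleven thousand residual degrees of freedom. The function names are hypothetical, and this is not necessarily how the original figures were computed.

```python
from math import erfc, sqrt


def two_sided_p_value(t_value):
    """Two-sided p-value under a standard normal approximation."""
    return erfc(abs(t_value) / sqrt(2))


def confidence_level(p_value):
    """Confidence level in per cent, using the note's formula (1 - p) * 100."""
    return (1 - p_value) * 100


# t values copied from the regression results table.
for label, t_value in [("factor(sel)3", 1.82), ("factor(sel)4", 2.54),
                       ("incomeAbove 10", 1.97)]:
    p = two_sided_p_value(t_value)
    print(label, round(p, 4), round(confidence_level(p), 1))
```

The unrounded p-values explain why the text can report 93.1% for socioeconomic level 3 even though the table displays its p-value as 0.07.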

4.3.1 Regression results table

F(20,11229) 919.46
R² 0.62
Adj. R² 0.62
Est. S.E. t val. p
(Intercept) 44.44 1.35 32.95 0.00
sc_s11 1.80 0.02 112.85 0.00
female 0.53 0.28 1.94 0.05
income1-2 -0.29 0.55 -0.52 0.60
income2-3 -0.08 0.61 -0.13 0.89
income3-5 -0.80 0.66 -1.22 0.22
income5-7 -1.64 0.78 -2.11 0.03
income7-10 -3.27 0.92 -3.54 0.00
incomeAbove 10 -1.74 0.88 -1.97 0.05
highest_edu_parent1. Primary 1.39 0.94 1.48 0.14
highest_edu_parent2a. Lower Secondary 1.21 0.90 1.35 0.18
highest_edu_parent2b. Upper Secondary 0.87 0.92 0.94 0.35
highest_edu_parent3. Tertiary 1.26 0.92 1.37 0.17
internet 1.19 0.38 3.13 0.00
private_high_school -0.36 0.33 -1.08 0.28
factor(sel)2 0.07 0.43 0.17 0.87
factor(sel)3 1.06 0.58 1.82 0.07
factor(sel)4 1.40 0.55 2.54 0.01
factor(sel_ihe)2 3.01 0.49 6.18 0.00
factor(sel_ihe)3 6.85 0.70 9.73 0.00
factor(sel_ihe)4 7.07 0.61 11.67 0.00
Standard errors: OLS
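One way to read the table is to plug values into the fitted equation. The sketch below does this for a subset of the coefficients, copied from the table above. It assumes the student belongs to the reference category of every factor that is not passed in (income 0-1 MMW, the lowest parental education level, and `sel` = 1), so those terms contribute zero. The function name `predict_sc_pro` is hypothetical.

```python
# Coefficients copied from the regression results table.
INTERCEPT = 44.44
COEF_SC_S11 = 1.80
COEF_FEMALE = 0.53
COEF_INTERNET = 1.19
COEF_PRIVATE_HIGH_SCHOOL = -0.36
COEF_SEL_IHE = {1: 0.0, 2: 3.01, 3: 6.85, 4: 7.07}  # level 1 is the reference


def predict_sc_pro(sc_s11, female=0, internet=0, private_high_school=0, sel_ihe=1):
    """Predicted SABER Pro score for a student in the reference income,
    parental education, and sel categories (an assumption of this sketch)."""
    return (INTERCEPT
            + COEF_SC_S11 * sc_s11
            + COEF_FEMALE * female
            + COEF_INTERNET * internet
            + COEF_PRIVATE_HIGH_SCHOOL * private_high_school
            + COEF_SEL_IHE[sel_ihe])


baseline = predict_sc_pro(62.3)          # average SABER 11 score
one_point_more = predict_sc_pro(63.3)    # same student, one more SABER 11 point
print(round(baseline, 2), round(one_point_more - baseline, 2))
```

The difference between the two predictions equals the `sc_s11` coefficient, which is the "1.8 points per SABER 11 point" interpretation given earlier.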

4.3.2 Regression results chart

Note: The chart below shows the coefficients estimated by the regression model (the small white circle in the middle of each line). The blue line shows the 90% confidence interval of each coefficient: if we repeated the data collection 100 times and rebuilt the interval each time, about 90 of those intervals would contain the true coefficient.
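A 90% interval of this kind can be approximated from the estimate and its standard error. The sketch below assumes a normal approximation (again justified by the large sample) and uses two rows of the regression results table as inputs. The function name `confidence_interval` is an assumption, and the chart itself may have been drawn with slightly different critical values.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.90):
    """Symmetric interval: estimate plus or minus z times the standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% interval
    return estimate - z * std_error, estimate + z * std_error


# Estimates and standard errors copied from the regression results table.
internet_ci = confidence_interval(1.19, 0.38)
private_ci = confidence_interval(-0.36, 0.33)
print(internet_ci)  # lies entirely above zero
print(private_ci)   # straddles zero
```

An interval that excludes zero corresponds to a coefficient that is significant at the matching level, which is why `internet` is reported as a finding while `private_high_school` is not.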

5 Main findings and Implications

We have outlined some characteristics of students likely to perform better on the SABER Pro examination, including having a high SABER 11 score, being female (which is a slight advantage), coming from an area with a high socioeconomic level, and having internet access at home. With all other conditions being equal, students originating from wealthy families tend to score lower in the SABER Pro examination than their poorer peers.

However, we should interpret these findings with caution. While the trends mentioned hold in general, applying the model strictly to every individual student would be a mistake. Indeed, although the regression model has a relatively high R-squared of 0.62, meaning that the model explains about 62% of the variation in SABER Pro scores, this number also warns us that the remaining 38% depends on other factors that are not represented in the data, such as the motivation of each individual student. Therefore, if we rely too much on these trends for enrollment decisions, we risk losing talented students who do not possess the above-mentioned characteristics.
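The R-squared figure measures how much of the spread in observed scores the model's predictions account for. A minimal sketch of the calculation is shown below on four invented observations; the function name `r_squared` and the numbers are assumptions for illustration only.

```python
def r_squared(observed, predicted):
    """Share of the variation in `observed` explained by `predicted`."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Hypothetical observed and predicted SABER Pro scores.
observed = [150, 160, 170, 180]
predicted = [155, 158, 172, 175]
r2 = r_squared(observed, predicted)
print(round(r2, 3))
```

A value of 1 would mean every score is predicted exactly, while a value of 0 would mean the model does no better than predicting the average score for everyone.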

In light of this understanding, I will present some implications for the University as follows:

  • The characteristics discovered above are helpful for pre-screening applicants. At the same time, admissions committees still need to use different methods to comprehensively assess students’ abilities and potential, such as requiring personal essays and personal interviews.

  • The University should also consider measures to support students in learning. For example, we found that internet access is associated with better performance in the SABER Pro examination. Therefore, the University should consider digitalizing more of its materials to amplify the positive impact of the internet. It should also consider providing computers with internet access as part of student scholarships or installing more computers on campus.

  • Students originating from low-income families (under one MMW) achieved higher scores than those from families with income above 5 MMWs. Therefore, the school should prioritise the former group of students over the latter group when allocating scholarships and financial aid.

Thank you for reading.