Student Data Modeling for Top 5 and Bottom 5 Countries

Nirmal Ghimire, Ph.D. (https://www.linkedin.com/in/nirmal-ghimire-5b96a034/), Watson College of Education, University of North Carolina Wilmington (https://www.uncw.edu/ed/)
2024-09-13

A. Data Structure and Dimensions

[1] 67946  1122
[1] 67946    14
Classes 'data.table' and 'data.frame':  67946 obs. of  14 variables:
 $ cntryid       : Factor w/ 80 levels "8","31","32",..: 20 20 20 20 20 20 20 20 20 20 ...
 $ cnt           : Factor w/ 80 levels "ALB","ARE","ARG",..: 20 20 20 20 20 20 20 20 20 20 ...
 $ cntschid      : num  21400001 21400001 21400001 21400001 21400001 ...
 $ cntstuid      : num  21400089 21400095 21400268 21400581 21400680 ...
 $ oecd          : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ icthome       : num  6 4 2 0 4 2 1 1 2 11 ...
 $ ictsch        : num  6 2 7 1 3 2 1 2 4 10 ...
 $ grade         : num  -1 -1 -2 -1 0 0 0 1 -1 -1 ...
 $ paredint      : num  16 14.5 12 9 12 6 9 16 12 16 ...
 $ wealth        : num  -1.222 -2.089 -4.318 NA -0.869 ...
 $ reading_score : num  272 345 255 249 343 ...
 $ math_score    : num  298 317 246 294 320 ...
 $ science_score : num  276 330 260 291 304 ...
 $ student_gender: Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 1 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr> 

B. Variable Descriptions

C. Missing Value Analysis

The missing value analysis shows that ictsch (ICT availability at school) and icthome (ICT availability at home) have substantial missing data: 22.83% and 21.77% of values, respectively. Missingness at this level could materially affect the reliability of any analysis involving these variables. By comparison, paredint (highest parental education) is missing 4.17% of values and wealth (family wealth index) 2.44%; both are less concerning but still worth noting.
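The percentages reported here are the share of missing entries in each column. The report's output appears to come from R; the block below is only a minimal Python sketch of the same calculation, run on a few made-up rows. The helper name `percent_missing` and the toy values are illustrative assumptions, not the study's code or data.

```python
def percent_missing(rows, columns):
    """Return {column: percent of rows whose value is None}, rounded to 2 decimals."""
    n = len(rows)
    if n == 0:
        return {col: 0.0 for col in columns}
    return {
        col: round(100 * sum(row.get(col) is None for row in rows) / n, 2)
        for col in columns
    }

# Toy records (hypothetical values, not the study data); None marks a missing entry.
rows = [
    {"icthome": 6, "ictsch": 6, "wealth": -1.222},
    {"icthome": 4, "ictsch": None, "wealth": -2.089},
    {"icthome": None, "ictsch": None, "wealth": None},
    {"icthome": 0, "ictsch": 1, "wealth": -0.869},
]

missing = percent_missing(rows, ["icthome", "ictsch", "wealth"])
for column, pct in missing.items():
    print(f"{column:8} {pct:6.2f}% missing")
```

Applied to the full table, the same ratio (missing count divided by the number of rows, times 100) would yield the per-variable figures quoted in this section.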

D. Descriptive Analysis of Key Variables

i. Summary Statistics for Students’ Reading Scores, Math Scores, Science Scores, Parental Education, and Wealth

# A tibble: 5 × 8
  Variable    mean  median     sd    min    max skewness kurtosis
  <chr>      <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1 reading  456.    454.    113.   147.   810.     0.0405   -0.739
2 math     461.    463.    112.   100.   825.     0.0122   -0.726
3 science  458.    456.    108.   114.   843.     0.0660   -0.733
4 paredint  13.0    14.5     3.30   3     16     -1.22      1.13 
5 wealth    -0.637  -0.535   1.33  -7.33   4.63  -0.444     1.81 
# A tibble: 50 × 7
   cnt   Variable    mean  median      sd    min    max
   <fct> <chr>      <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
 1 DOM   reading  345.    338.     78.4   169.   647.  
 2 DOM   math     328.    323.     64.6   100.   594.  
 3 DOM   science  338.    329.     64.9   177.   603.  
 4 DOM   paredint  12.9    12       3.32    3     16   
 5 DOM   wealth    -1.69   -1.72    1.20   -6.81   4.13
 6 GBR   reading  500.    503.     94.6   185.   795.  
 7 GBR   math     495.    496.     81.9   182.   754.  
 8 GBR   science  497.    498.     89.3   205.   780.  
 9 GBR   paredint  14.1    14.5     2.16    3     16   
10 GBR   wealth     0.444   0.397   0.891  -6.79   4.11
11 HKG   reading  527.    537.     93.7   195.   783.  
12 HKG   math     554.    560.     85.0   237.   792.  
13 HKG   science  519.    525.     79.0   247.   759.  
14 HKG   paredint  12.2    12       2.79    3     16   
15 HKG   wealth    -0.467  -0.455   0.831  -5.27   4.04
16 KOR   reading  516.    525.     97.9   147.   785.  
17 KOR   math     528.    532.     93.2   182.   825.  
18 KOR   science  520.    527.     92.0   184.   843.  
19 KOR   paredint  14.8    16       1.91    6     16   
20 KOR   wealth    -0.442  -0.448   0.562  -2.87   4.02
21 MAC   reading  525.    532.     88.4   192.   749.  
22 MAC   math     558.    561.     72.4   274.   757.  
23 MAC   science  544.    548.     77.3   220.   757.  
24 MAC   paredint  12.1    12       3.11    3     16   
25 MAC   wealth    -0.542  -0.585   0.856  -6.88   3.11
26 MAR   reading  358.    353.     71.2   162.   604.  
27 MAR   math     367.    360.     67.9   162.   625.  
28 MAR   science  375.    369.     61.5   200.   605.  
29 MAR   paredint   9.42    9       4.82    3     16   
30 MAR   wealth    -1.90   -1.88    1.38   -7.33   4.63
31 PAN   reading  378.    373.     83.3   156.   656.  
32 PAN   math     355.    351.     70.3   146.   611.  
33 PAN   science  366.    360.     80.4   114.   656.  
34 PAN   paredint  12.0    12       3.86    3     16   
35 PAN   wealth    -1.55   -1.50    1.58   -6.93   4.34
36 QAZ   reading  389.    389.     70.4   178.   671.  
37 QAZ   math     420.    418.     80.5   149.   704.  
38 QAZ   science  398.    395.     67.4   200.   655.  
39 QAZ   paredint  13.4    14.5     2.17    3     16   
40 QAZ   wealth    -1.16   -1.19    1.02   -6.84   4.14
41 TAP   reading  498.    505.     99.0   150.   772.  
42 TAP   math     527.    533.     93.9   201.   780.  
43 TAP   science  511.    516.     94.9   124.   777.  
44 TAP   paredint  13.5    14.5     2.27    3     16   
45 TAP   wealth    -0.506  -0.540   0.893  -6.99   4.42
46 USA   reading  501.    505.    105.    157.   810.  
47 USA   math     473.    474.     86.5   197.   750.  
48 USA   science  498.    499.     94.6   195.   795.  
49 USA   paredint  14.0    16       2.58    3     16   
50 USA   wealth     0.411   0.351   1.04   -6.85   4.26
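
The skewness and kurtosis columns in the first summary table can be read as moment-based shape measures. Because the kurtosis values for the three test scores are negative, that column must be excess kurtosis (raw kurtosis can never fall below 1), so a normal distribution would score 0. The report's output appears to come from R and does not say which estimator was used; the Python sketch below uses plain population moments as an assumption, and the function name `shape_stats` and the sample values are hypothetical.

```python
import statistics

def shape_stats(values):
    """Population-moment skewness and excess kurtosis (normal distribution = 0)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

# A symmetric toy sample has zero skewness and, here, a flatter-than-normal shape.
skew_sym, kurt_sym = shape_stats([1, 2, 3, 4, 5])

# A toy sample bunched near an upper cap of 16 (as paredint is) is left-skewed.
skew_left, _ = shape_stats([3, 12, 14.5, 16, 16])

print(skew_sym, kurt_sym, skew_left)
```

Under this reading, the near-zero skewness and negative kurtosis of the score variables describe roughly symmetric, somewhat flat distributions, while the negative skewness of paredint reflects values piling up at the 16-year maximum.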

ii. Distribution of Students by Country, Country Code, and OECD Membership

# A tibble: 10 × 3
   cntryid cnt   Observations
   <fct>   <fct>        <int>
 1 31      QAZ           6827
 2 158     TAP           7243
 3 214     DOM           5674
 4 344     HKG           6037
 5 410     KOR           6650
 6 446     MAC           3775
 7 504     MAR           6814
 8 591     PAN           6270
 9 826     GBR          13818
10 840     USA           4838
[1] "Number of unique countries: 10"
[1] "The OECD variable shows that there are 42640 observations from non-OECD countries and 25306 from OECD countries."
[1] "The dataset includes 33591 female students and 34355 male students."

iii. Cross-tabulation of Variables

     
      Female Male
  DOM   2890 2784
  GBR   6996 6822
  HKG   2955 3082
  KOR   3191 3459
  MAC   1862 1913
  MAR   3262 3552
  PAN   3173 3097
  QAZ   3262 3565
  TAP   3624 3619
  USA   2376 2462
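
A cross-tabulation like the one above counts how many records fall into each combination of two categorical variables, here country code and gender. The report's output appears to come from R; the following is a minimal Python sketch of the same idea using only the standard library. The `records` list is made-up toy data, not the study's.

```python
from collections import Counter

# Toy (country, gender) pairs; hypothetical, not the study data.
records = [
    ("DOM", "Female"), ("DOM", "Male"), ("DOM", "Female"),
    ("USA", "Male"), ("USA", "Male"), ("USA", "Female"),
]

# One count per (country, gender) cell; absent cells read as 0.
counts = Counter(records)
countries = sorted({country for country, _ in records})
genders = sorted({gender for _, gender in records})

# Print a table with countries as rows and genders as columns.
print(f"{'':5}" + "".join(f"{g:>8}" for g in genders))
for c in countries:
    print(f"{c:5}" + "".join(f"{counts[(c, g)]:>8}" for g in genders))
```

Summing a column of such a table gives the overall count for that gender, which is how the per-country cells relate to the dataset-wide female and male totals.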

iv. Average Scores by Subjects for Top 6 and Bottom 4 Countries

v. Calculate Mean Scores by Group and Gender

vi. Calculate Mean Scores by Group, Gender, and Wealth (Quartiles)
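
One way to approach this step is to cut the wealth index into quartiles and then average scores within each group, gender, and quartile cell. The report's output appears to come from R and does not show how the quartiles were formed, so the Python sketch below rests on stated assumptions: quartiles are computed on the pooled sample, cut points use linear interpolation that includes the sample extremes (`method="inclusive"`), and rows with missing wealth have already been dropped. The group labels, the helper `quartile`, and all values are hypothetical.

```python
import statistics
from bisect import bisect_left
from collections import defaultdict

# Toy (group, gender, wealth, reading_score) rows; hypothetical values.
rows = [
    ("Bottom", "Female", -2.0, 350.0), ("Bottom", "Female", -1.0, 400.0),
    ("Top", "Female", 0.0, 450.0), ("Top", "Female", 1.0, 500.0),
    ("Bottom", "Male", -2.5, 340.0), ("Bottom", "Male", -0.5, 420.0),
    ("Top", "Male", 0.5, 460.0), ("Top", "Male", 1.5, 510.0),
]

# Three cut points split the pooled wealth values into four quartiles.
cuts = statistics.quantiles([w for _, _, w, _ in rows], n=4, method="inclusive")

def quartile(wealth):
    """Map a wealth value to its quartile number (1 = lowest, 4 = highest)."""
    return bisect_left(cuts, wealth) + 1

# Collect scores per (group, gender, quartile) cell, then average each cell.
cells = defaultdict(list)
for group, gender, wealth, score in rows:
    cells[(group, gender, quartile(wealth))].append(score)

means = {key: statistics.fmean(scores) for key, scores in sorted(cells.items())}
for (group, gender, q), mean in means.items():
    print(f"{group:6} {gender:6} Q{q}: {mean:.1f}")
```

Computing the cut points within each country instead of on the pooled sample would answer a different question (relative wealth within a country), so the choice should match the comparison being made.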

vii. Scores by Group, Gender, and Parental Education