Introduction

Project author: Lamova Tamara

The main goal of this research project is to understand what matters more for the achievement of Russian 8th-grade students today: books or computers.

Internet sources now play a major role in education, offering a huge selection of disciplines to study (Linn, 2004). Anyone can look up what they did not understand in class and learn new material through a variety of online schools, platforms and so on. However, in most Russian schools much attention is still paid to books; teachers often prefer to teach from older editions that students can borrow from the library. Students in the 21st century continue to carry textbooks in their bags and do homework from books issued at school. At the same time, the role of computers and other electronic devices in academic success is an important aspect. Book learning and online learning are closely intertwined, and it is difficult to understand what really determines academic success (Turel, 2015). This project tries to analyze these sources of success in mathematics among Russian 8th-grade students.

Data description

TIMSS is an international assessment of mathematics and science at the fourth and eighth grades. In 2015, 57 countries participated in TIMSS. Questionnaires were completed by students, parents, teachers, school principals, and curriculum specialists, and the questionnaire data are publicly available.


The next step is to describe the variables used in the analysis; a sketch of how they could be prepared follows the list.

  1. BSMMAT01 - 1ST PLAUSIBLE VALUE MATHEMATICS

  2. BSBG04 - About how many books are there in your home?

1: 0–10 books; 2: 11–25 books; 3: 26–100 books; 4: 101–200 books; 5: More than 200

  3. Do you have any of these things at your home?

BSBG06A - A computer or tablet of your own 1: Yes; 2: No

BSBG06B - A computer or tablet that is shared with other people at home 1: Yes; 2: No

Control variables:

  1. BSBG01 - Sex of student

1: Girl; 2: Boy

  2. BSBG07A and BSBG07B - What is the highest level of education completed by your mother (or stepmother or female guardian)? / What is the highest level of education completed by your father (or stepfather or male guardian)?

1: Some Primary or Lower secondary or did not go to school; 2: Lower secondary; 3: Upper secondary; 4: Post-secondary, non-tertiary; 5: Short-cycle tertiary; 6: Bachelor’s or equivalent; 7: Postgraduate degree; 8: Don’t know

  3. BSBG10A - Were you born in country? 1: Yes; 2: No
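Below is a minimal sketch of how these variables could be prepared; the file name and the recoding into factors are illustrative assumptions, not the exact code used for the output that follows.

library(haven)   # read_sav(), zap_labels()
library(dplyr)

# Hypothetical path to the TIMSS 2015 Russian 8th-grade student file
dat <- read_sav("BSGRUSM6.sav")

timss <- dat %>%
  transmute(math      = as.numeric(BSMMAT01),       # 1st plausible value, mathematics
            books     = factor(zap_labels(BSBG04)),
            tabletown = factor(zap_labels(BSBG06A)),
            tabletsh  = factor(zap_labels(BSBG06B)),
            sex       = factor(zap_labels(BSBG01)),
            edum      = factor(zap_labels(BSBG07A)),
            edup      = factor(zap_labels(BSBG07B)),
            born      = factor(zap_labels(BSBG10A))) %>%
  na.omit()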

Descriptive statistics

About how many books are there in your home?


Few students report having only 0-10 books; most have between 11 and 100 books at home.
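A minimal sketch of how such a bar chart could be drawn, assuming the recoded timss data frame from the sketch above:

library(ggplot2)

# Distribution of the number of books at home
ggplot(timss, aes(x = books)) +
  geom_bar() +
  labs(x = "About how many books are there in your home?",
       y = "Number of students")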

Do you have a computer or tablet of your own?

Most students have a computer or tablet of their own.

Do you have a computer or tablet that is shared with other people at home?

Most students have a computer or tablet that is shared with other people at home.

What is the highest level of education completed by your mother (or stepmother or female guardian)?


The most common answer is that the mother's highest level of education is a Bachelor's degree or equivalent.

What is the highest level of education completed by your father (or stepfather or male guardian)?


Many students report that their father's highest level of education is a Bachelor's degree or equivalent, but interestingly, the most frequent answer is that the student does not know the father's highest level of education.

Were you born in country?

Most students were born in the country of the test.

Bivariate statistics

Before conducting regression analysis, we need to look at the bivariate relationships.

  1. Correlation matrix
##                  math       books   tabletown    tabletsh
## math       1.00000000  0.16199978  0.05760517 -0.11490335
## books      0.16199978  1.00000000 -0.02219123 -0.07301553
## tabletown  0.05760517 -0.02219123  1.00000000 -0.03153575
## tabletsh  -0.11490335 -0.07301553 -0.03153575  1.00000000

There is some correlation between our variables. The closer a coefficient is to zero, the weaker the association; ±1 indicates a perfect (positive or negative) correlation, and values around ±0.50 are usually considered strong. In our results none of the correlations are strong, but books and math show a weak positive correlation, while tabletsh and math show a weak negative correlation.

Another way to present the correlation matrix is a correlation plot displaying positive (blue) and negative (red) correlations between our variables. Some correlations are visible, so the variables can be grouped.
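A sketch of how the matrix and the colored correlation plot could be produced; converting the factors back to numeric codes is an assumption made for illustration.

library(corrplot)

# Numeric copies of the four variables of interest
num <- data.frame(math      = as.numeric(timss$math),
                  books     = as.numeric(timss$books),
                  tabletown = as.numeric(timss$tabletown),
                  tabletsh  = as.numeric(timss$tabletsh))

cor_mat <- cor(num, use = "pairwise.complete.obs")
round(cor_mat, 2)

corrplot(cor_mat)   # blue = positive, red = negative correlations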

  2. ANOVA

H0: the means of the groups having different number of books in home are the same in math achievement.

H1: the means of the groups having different number of books in home differ in math achievement.

##               Df   Sum Sq Mean Sq F value Pr(>F)    
## timss$books    4   802722  200680   32.79 <2e-16 ***
## Residuals   4535 27756579    6121                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F(4, 4535) = 32.79, p-value < .01

As the p-value is less than the significance level of 0.01, we can conclude that there are significant differences in math achievement between the book-ownership groups (marked with "***" in the model summary).
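A sketch of the one-way ANOVA behind this table; the same pattern is repeated for the two computer/tablet items below.

# Math achievement across the five book-ownership groups
summary(aov(timss$math ~ timss$books))

# Analogous tests for the computer/tablet items
summary(aov(timss$math ~ timss$tabletown))
summary(aov(timss$math ~ timss$tabletsh))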

H0: the means of the groups having a computer or tablet of their own or not are the same in math achievement.

H1: the means of the groups having a computer or tablet of their own or not differ in math achievement.

##                   Df   Sum Sq Mean Sq F value   Pr(>F)    
## timss$tabletown    1    94770   94770   15.11 0.000103 ***
## Residuals       4538 28464531    6272                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F(1, 4538) = 15.11, p-value < .01

As the p-value is less than the significance level of 0.01, we can conclude that there are significant differences in math achievement between students who do and do not have a computer or tablet of their own.

H0: the means of the groups having a computer or tablet that is shared or not are the same in math achievement.

H1: the means of the groups having a computer or tablet that is shared or not differ in math achievement.

##                  Df   Sum Sq Mean Sq F value   Pr(>F)    
## timss$tabletsh    1   377062  377062   60.72 8.12e-15 ***
## Residuals      4538 28182238    6210                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F(1, 4538) = 60.72, p-value < .01

As the p-value is less than the significance level of 0.01, we can conclude that there are significant differences in math achievement between students who do and do not have a shared computer or tablet at home.

Regression analysis

First, I need to build a model containing all the predictors and look at its significance. I will predict math achievement from the variables explained previously, controlling for gender, parental education and whether the student was born in the country or abroad.
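A sketch of the call producing the full model summarized below; the object name model0 matches the model-comparison section further down and is otherwise an assumption.

# Full model: books and computer/tablet availability plus all control variables
model0 <- lm(timss$math ~ timss$books + timss$tabletown + timss$tabletsh +
             timss$sex + timss$edum + timss$edup + timss$born)
summary(model0)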

## 
## Call:
## lm(formula = timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum + timss$edup + timss$born)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -285.152  -49.831    1.768   51.315  248.498 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      502.3979    29.9323  16.784  < 2e-16 ***
## timss$books2       5.3039     4.8573   1.092 0.274917    
## timss$books3      12.5625     4.8076   2.613 0.009003 ** 
## timss$books4      23.6516     5.3224   4.444 9.05e-06 ***
## timss$books5      18.7577     5.8264   3.219 0.001294 ** 
## timss$tabletown2  14.3612     3.1190   4.604 4.25e-06 ***
## timss$tabletsh2  -18.6903     3.0344  -6.160 7.93e-10 ***
## timss$sex2         8.5098     2.2542   3.775 0.000162 ***
## timss$edum2       -1.6119    29.2839  -0.055 0.956107    
## timss$edum3       -3.1572    29.3231  -0.108 0.914262    
## timss$edum4       28.1009    29.3430   0.958 0.338280    
## timss$edum5       25.0972    29.3839   0.854 0.393087    
## timss$edum6       31.2254    29.2769   1.067 0.286230    
## timss$edum7       36.2490    29.3464   1.235 0.216816    
## timss$edum8        7.4631    29.3515   0.254 0.799300    
## timss$edup2      -17.1017    16.4612  -1.039 0.298901    
## timss$edup3       -8.0405    16.4879  -0.488 0.625814    
## timss$edup4        6.1440    16.4748   0.373 0.709214    
## timss$edup5        3.4069    16.6469   0.205 0.837848    
## timss$edup6       14.0510    16.4862   0.852 0.394099    
## timss$edup7       21.4955    16.7890   1.280 0.200494    
## timss$edup8       -3.1354    16.3602  -0.192 0.848027    
## timss$born2        0.1395     6.2442   0.022 0.982175    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.71 on 4517 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.1129 
## F-statistic: 27.27 on 22 and 4517 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 12% of the variability of the response data around its mean (R² = 0.117).

We can notice that the control variables (parental education and country of birth) are not significant, while several levels of books, as well as tabletown, tabletsh and sex, are significant.

The intercept is 502.40, which means that when all predictors are at their reference levels, the predicted math achievement is 502.40. The positive coefficient for sex2 indicates that being a boy is associated with an increase in math achievement of about 8.5 points relative to girls. The positive coefficient for tabletown2 indicates that not having a computer or tablet of one's own is associated with an increase of about 14.4 points relative to having one. The negative coefficient for tabletsh2 indicates that not having a shared computer or tablet is associated with a decrease of about 18.7 points relative to having one. Compared with having 0-10 books, having 26-100 books is associated with an average increase of about 12.6 points, having 101-200 books with about 23.7 points, and having more than 200 books with about 18.8 points in math achievement.

Backward method

The next step of our analysis is finding the best model. Here we use the backward method: removing insignificant predictors until we arrive at a good model. We do not have many predictors, so this step is quite simple.
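A sketch of the backward steps, assuming the full model is stored as model0; each reduced model drops one block of predictors, and the names model1-model4 correspond to the models compared in the next section (the names are assumptions).

# Step 1: drop the clearly insignificant control (country of birth)
model1 <- update(model0, . ~ . - timss$born)
summary(model1)

# Alternative reduced models tried below
model2 <- update(model1, . ~ . - timss$edup)               # also drop father's education
model3 <- update(model0, . ~ . - timss$edum - timss$edup)  # keep only 'born' as control
model4 <- update(model1, . ~ . - timss$edum)               # drop mother's education instead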

## 
## Call:
## lm(formula = timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum + timss$edup)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -285.155  -49.797    1.763   51.353  248.495 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       502.411     29.923  16.790  < 2e-16 ***
## timss$books2        5.305      4.856   1.092 0.274698    
## timss$books3       12.562      4.807   2.613 0.008998 ** 
## timss$books4       23.649      5.320   4.445 9.00e-06 ***
## timss$books5       18.754      5.824   3.220 0.001290 ** 
## timss$tabletown2   14.363      3.118   4.606 4.21e-06 ***
## timss$tabletsh2   -18.692      3.033  -6.164 7.72e-10 ***
## timss$sex2          8.510      2.254   3.775 0.000162 ***
## timss$edum2        -1.599     29.275  -0.055 0.956447    
## timss$edum3        -3.142     29.312  -0.107 0.914642    
## timss$edum4        28.115     29.333   0.958 0.337884    
## timss$edum5        25.109     29.376   0.855 0.392727    
## timss$edum6        31.237     29.269   1.067 0.285922    
## timss$edum7        36.262     29.338   1.236 0.216517    
## timss$edum8         7.476     29.342   0.255 0.798897    
## timss$edup2       -17.123     16.431  -1.042 0.297403    
## timss$edup3        -8.060     16.462  -0.490 0.624415    
## timss$edup4         6.123     16.446   0.372 0.709682    
## timss$edup5         3.390     16.627   0.204 0.838469    
## timss$edup6        14.031     16.461   0.852 0.394033    
## timss$edup7        21.476     16.764   1.281 0.200241    
## timss$edup8        -3.157     16.331  -0.193 0.846745    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.7 on 4518 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.1131 
## F-statistic: 28.57 on 21 and 4518 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 12% of the variability of the response data around its mean (R² = 0.117).

We can notice that the control variables are still not significant, while some levels of books, as well as tabletown, tabletsh and sex, remain significant. The results are the same as in the previous model.

## 
## Call:
## lm(formula = timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -275.860  -50.217    2.034   51.681  254.444 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       504.939     28.657  17.620  < 2e-16 ***
## timss$books2        5.043      4.877   1.034 0.301208    
## timss$books3       13.815      4.824   2.863 0.004210 ** 
## timss$books4       26.312      5.334   4.933 8.38e-07 ***
## timss$books5       21.580      5.839   3.696 0.000222 ***
## timss$tabletown2   14.169      3.135   4.520 6.34e-06 ***
## timss$tabletsh2   -18.946      3.049  -6.214 5.64e-10 ***
## timss$sex2          8.709      2.262   3.851 0.000119 ***
## timss$edum2       -12.045     28.676  -0.420 0.674479    
## timss$edum3       -11.810     28.649  -0.412 0.680175    
## timss$edum4        27.071     28.650   0.945 0.344770    
## timss$edum5        24.161     28.722   0.841 0.400286    
## timss$edum6        34.129     28.591   1.194 0.232670    
## timss$edum7        43.052     28.715   1.499 0.133879    
## timss$edum8         1.882     28.632   0.066 0.947585    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.17 on 4525 degrees of freedom
## Multiple R-squared:  0.1047, Adjusted R-squared:  0.1019 
## F-statistic:  37.8 on 14 and 4525 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 10% of the variability of the response data around its mean (R² = 0.105). The explanatory power is not very strong.

We can notice that the remaining control variable (mother's education) is still not significant.

The intercept is 504.94, which means that when all predictors are at their reference levels, the predicted math achievement is 504.94. The positive coefficient for sex2 indicates that being a boy is associated with an increase of about 8.7 points relative to girls. The positive coefficient for tabletown2 indicates that not having a computer or tablet of one's own is associated with an increase of about 14.2 points relative to having one. The negative coefficient for tabletsh2 indicates that not having a shared computer or tablet is associated with a decrease of about 18.9 points relative to having one. Compared with having 0-10 books, having 26-100 books is associated with an average increase of about 13.8 points, having 101-200 books with about 26.3 points, and having more than 200 books with about 21.6 points.

## 
## Call:
## lm(formula = timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$born)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -255.576  -54.196    2.617   54.604  266.684 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       511.001      4.836 105.664  < 2e-16 ***
## timss$books2       10.855      5.006   2.168   0.0302 *  
## timss$books3       26.630      4.901   5.434 5.81e-08 ***
## timss$books4       42.139      5.398   7.806 7.28e-15 ***
## timss$books5       41.838      5.873   7.124 1.22e-12 ***
## timss$tabletown2   13.664      3.234   4.226 2.43e-05 ***
## timss$tabletsh2   -22.470      3.138  -7.161 9.32e-13 ***
## timss$sex2          9.588      2.321   4.131 3.68e-05 ***
## timss$born2        -1.216      6.460  -0.188   0.8507    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.56 on 4531 degrees of freedom
## Multiple R-squared:  0.04562,    Adjusted R-squared:  0.04394 
## F-statistic: 27.07 on 8 and 4531 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains only about 4.6% of the variability of the response data around its mean (R² = 0.046). The explanatory power is weak.

We can notice that the control variable (country of birth) is not significant.

The intercept is 511.00, which means that when all predictors are at their reference levels, the predicted math achievement is 511.00. The positive coefficient for sex2 indicates that being a boy is associated with an increase of about 9.6 points relative to girls. The positive coefficient for tabletown2 indicates that not having a computer or tablet of one's own is associated with an increase of about 13.7 points relative to having one. The negative coefficient for tabletsh2 indicates that not having a shared computer or tablet is associated with a decrease of about 22.5 points relative to having one. Compared with having 0-10 books, having 11-25 books is associated with an average increase of about 10.9 points, 26-100 books with about 26.6 points, 101-200 books with about 42.1 points, and more than 200 books with about 41.8 points.

## 
## Call:
## lm(formula = timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edup)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -281.42  -51.24    2.32   52.37  251.00 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       506.275     16.188  31.275  < 2e-16 ***
## timss$books2        8.256      4.898   1.685  0.09197 .  
## timss$books3       18.228      4.824   3.779  0.00016 ***
## timss$books4       30.026      5.335   5.628 1.93e-08 ***
## timss$books5       27.065      5.820   4.650 3.41e-06 ***
## timss$tabletown2   14.301      3.156   4.532 6.00e-06 ***
## timss$tabletsh2   -20.372      3.064  -6.649 3.31e-11 ***
## timss$sex2          8.501      2.268   3.749  0.00018 ***
## timss$edup2       -16.595     16.206  -1.024  0.30588    
## timss$edup3        -7.417     16.169  -0.459  0.64648    
## timss$edup4        21.922     16.099   1.362  0.17337    
## timss$edup5        18.887     16.305   1.158  0.24678    
## timss$edup6        32.520     16.091   2.021  0.04333 *  
## timss$edup7        43.764     16.386   2.671  0.00759 ** 
## timss$edup8         2.546     15.972   0.159  0.87336    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.63 on 4525 degrees of freedom
## Multiple R-squared:  0.09369,    Adjusted R-squared:  0.09089 
## F-statistic: 33.41 on 14 and 4525 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 9.4% of the variability of the response data around its mean (R² = 0.094).

We can notice that most control levels are not significant, but some levels of edup (father's education) now are.

The intercept is 506.28, which means that when all predictors are at their reference levels, the predicted math achievement is 506.28. The positive coefficient for sex2 indicates that being a boy is associated with an increase of about 8.5 points relative to girls. The positive coefficient for tabletown2 indicates that not having a computer or tablet of one's own is associated with an increase of about 14.3 points relative to having one. The negative coefficient for tabletsh2 indicates that not having a shared computer or tablet is associated with a decrease of about 20.4 points relative to having one. Compared with having 0-10 books, having 26-100 books is associated with an average increase of about 18.2 points, 101-200 books with about 30.0 points, and more than 200 books with about 27.1 points. The positive coefficient for edup6 indicates that a father with a Bachelor's degree or equivalent is associated with an increase of about 32.5 points relative to the reference category (some primary or lower secondary education, or did not go to school). Likewise, the positive coefficient for edup7 indicates that a father with a postgraduate degree is associated with an increase of about 43.8 points relative to the same reference category.

Comparing models

Now since we have several models, we need to compare them to find the best one.

For nested models ANOVA is usually used; for non-nested models, AIC. Since our models are nested, we use ANOVA.
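A sketch of the nested comparisons reported below, assuming the model objects defined earlier (model0 is the full model):

# In each call the reduced model comes first, the richer model second
anova(model1, model0)   # is 'born' worth keeping?
anova(model2, model0)   # are 'edup' and 'born' worth keeping?
anova(model3, model0)   # are the parental-education controls worth keeping?
anova(model4, model0)   # are 'edum' and 'born' worth keeping?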

## Analysis of Variance Table
## 
## Model 1: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum + timss$edup
## Model 2: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum + timss$edup + timss$born
##   Res.Df      RSS Df Sum of Sq     F Pr(>F)
## 1   4518 25211299                          
## 2   4517 25211297  1    2.7864 5e-04 0.9822

The reduced model is preferred over the full one because the p-value > 0.05: adding the country-of-birth variable does not significantly improve the fit.

## Analysis of Variance Table
## 
## Model 1: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum
## Model 2: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum + timss$edup + timss$born
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1   4525 25569288                                  
## 2   4517 25211297  8    357991 8.0175 8.596e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the full model is better than the reduced one because the p-value < 0.05.

## Analysis of Variance Table
## 
## Model 1: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$born
## Model 2: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum + timss$edup + timss$born
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1   4531 27256363                                  
## 2   4517 25211297 14   2045066 26.172 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here again the full model is better than the reduced one because the p-value < 0.05.

## Analysis of Variance Table
## 
## Model 1: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edup
## Model 2: timss$math ~ timss$books + timss$tabletown + timss$tabletsh + 
##     timss$sex + timss$edum + timss$edup + timss$born
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1   4525 25883503                                  
## 2   4517 25211297  8    672207 15.055 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The last ANOVA test also shows that the full model is better than the reduced one because the p-value < 0.05.

Therefore, model0 (the full model) is better than models 2, 3 and 4, while model1 (the full model without country of birth) is better than model0. So model1 is our best model.

Adding interaction effects

We should also try to add an interaction effect. It will be added to our best model.

  1. tabletsh * books

Hypothesis: tabletsh and books jointly influence math achievement. The more books a student has at home, and if they have a shared computer or tablet, the higher their math achievement.
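A sketch of the two interaction models fitted below; the data frame reg is assumed to contain numeric versions of books, sex, edum and edup (they enter the output as single slopes) and factor versions of tabletsh and tabletown.

# Interaction between the number of books and having a shared computer/tablet
model_int1 <- lm(reg$math ~ books * tabletsh + tabletown + sex + edup + edum,
                 data = reg)
summary(model_int1)

# Second interaction tried further below: books * sex
model_int2 <- lm(reg$math ~ books * sex + tabletsh + tabletown + edup + edum,
                 data = reg)
summary(model_int2)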

## 
## Call:
## lm(formula = reg$math ~ books * tabletsh + tabletown + sex + 
##     edup + edum, data = reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -257.99  -52.83    2.56   53.81  268.14 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     467.7306     6.0848  76.868  < 2e-16 ***
## books            11.2778     1.2220   9.229  < 2e-16 ***
## tabletsh2        -7.9372     8.9368  -0.888 0.374509    
## tabletown2       14.1790     3.2115   4.415 1.03e-05 ***
## sex               8.3774     2.3150   3.619 0.000299 ***
## edup              0.7470     0.6191   1.207 0.227643    
## edum              4.3305     0.6905   6.271 3.91e-10 ***
## books:tabletsh2  -5.0295     3.0094  -1.671 0.094740 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.08 on 4532 degrees of freedom
## Multiple R-squared:  0.05729,    Adjusted R-squared:  0.05583 
## F-statistic: 39.35 on 7 and 4532 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 5.7% of the variability of the response data around its mean (R² = 0.057).

We can notice that edup (father's education) is not significant here, but edum (mother's education) is.

The intercept is 467.73, which means that when all predictors are equal to zero (or at their reference levels), the predicted math achievement is 467.73. The model shows that books, tabletown, sex and edum have significant effects on math achievement. A one-level increase in books is associated with an increase of about 11.3 points (for students in the reference tabletsh group). Changing sex from girl to boy is associated with an increase of about 8.4 points, tabletown2 with an increase of about 14.2 points, and each additional level of edum with an increase of about 4.3 points.

Interpretation:

The interaction term is only marginally significant (p ≈ 0.09). The two lines in the interaction plot differ in slope, which points towards the hypothesis: as the number of books at home increases, students who have a shared computer gain more in math achievement than those who do not.

  2. books * sex

Hypothesis: sex and the number of books jointly influence math achievement. The more books a student has at home, and if the student is a boy, the higher their math achievement.

## 
## Call:
## lm(formula = reg$math ~ books * sex + tabletsh + tabletown + 
##     edup + edum, data = reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -257.918  -53.022    2.273   53.737  269.658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 476.4771    11.5231  41.350  < 2e-16 ***
## books         8.3309     3.5264   2.362   0.0182 *  
## sex           4.1702     6.8625   0.608   0.5434    
## tabletsh2   -21.9522     3.1202  -7.036 2.28e-12 ***
## tabletown2   14.2655     3.2126   4.440 9.19e-06 ***
## edup          0.7246     0.6191   1.170   0.2420    
## edum          4.3674     0.6906   6.324 2.79e-10 ***
## books:sex     1.4105     2.2011   0.641   0.5217    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.1 on 4532 degrees of freedom
## Multiple R-squared:  0.05679,    Adjusted R-squared:  0.05534 
## F-statistic: 38.98 on 7 and 4532 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 5.7% of the variability of the response data around its mean (R² = 0.057).

We can again notice that edup is not significant, but edum is.

The intercept is 476.48, which means that when all predictors are equal to zero (or at their reference levels), the predicted math achievement is 476.48. The model shows that books, tabletsh, tabletown and edum have significant effects on math achievement. Interestingly, sex is not significant here. A one-level increase in books is associated with an increase of about 8.3 points, tabletsh2 with a decrease of about 22.0 points, tabletown2 with an increase of about 14.3 points, and each additional level of edum with an increase of about 4.4 points.

Interpretation:

The interaction effect is not significant. The two lines are practically parallel, which indicates that the hypothesis cannot be confirmed.

## [1] 52344.64
## [1] 52347.03
## [1] 52074.39

Comparing the AIC values, our model without interaction effects is better (has the lowest AIC) than the models with interaction effects.
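A sketch of how the three AIC values above could be obtained; the order of the calls (the two interaction models first, then the best backward model) is an assumption based on the conclusion drawn from them.

# Lower AIC is better
AIC(model_int1)   # model with the books * tabletsh interaction
AIC(model_int2)   # model with the books * sex interaction
AIC(model1)       # best backward model, no interactions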

Model diagnostics

Model diagnostics should also be presented. First, we check the model for multicollinearity, which occurs when predictors are strongly correlated with each other. If it is present, the estimates of the regression model become unstable. Therefore we use the VIF (variance inflation factor), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
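A sketch of the VIF check on the best model (model1 from the backward step):

library(car)   # vif()

# Generalized VIFs for the predictors of model1; values well below 5
# indicate no problematic multicollinearity
vif(model1)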

##                     GVIF Df GVIF^(1/(2*Df))
## timss$books     1.153297  4        1.017988
## timss$tabletown 1.010914  1        1.005442
## timss$tabletsh  1.017483  1        1.008704
## timss$sex       1.032104  1        1.015925
## timss$edum      3.284157  7        1.088648
## timss$edup      3.192506  7        1.086449

The test shows that the VIF scores for all predictor variables are less than 5, which is acceptable (at most moderately correlated). There is no problematic multicollinearity in our model.

The next step in model diagnostics is to look at residuals and leverage.

The first plot is a scatter plot of the residuals on the y axis against the fitted values (estimated responses) on the x axis. The residuals and the fitted values appear uncorrelated, as they should be in a homoscedastic linear model with normally distributed errors, so there is no sign of heteroscedasticity.

The normal Q-Q plot shows that the distributions match reasonably well: the residuals appear normally distributed because the points follow the dotted line closely, except for observations 4622, 4061 and 4610. That is acceptable; the model residuals pass this visual check of normality.

The scale-location plot shows the spread of the residuals across the range of predicted values. A horizontal red line is ideal and would indicate that the residuals have uniform variance across the range. For our model the line is not perfectly horizontal, so the result is not ideal.

The last plot shows that we have outliers but no high-leverage points. Outliers are data points whose response y does not follow the general trend of the rest of the data; a data point has high leverage if it has extreme predictor x values. No points fall beyond the Cook's distance threshold, which means there are no influential observations, and that is good.
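A sketch of how the four diagnostic plots discussed above can be produced for the best model:

# Residuals vs fitted, normal Q-Q, scale-location and residuals vs leverage
par(mfrow = c(2, 2))
plot(model1)
par(mfrow = c(1, 1))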

Other tests to assess the adequacy of our model

## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferonni p
## 3960 -3.831935         0.00012887      0.58506

The Bonferroni-adjusted p-values do not flag any observation as an outlier at the 0.05 level; observation 3960 has the largest studentized residual.

qqPlot(model1, main="QQ Plot")

## [1] 2197 3960

The qqPlot() function is another way to present the Q-Q plot; it also suggests an approximately normal distribution of the residuals.

And here is another way to display the leverage plot.


The distribution of the studentized residuals looks approximately normal.

Exploratory factor analysis (EFA)

The correlation matrix plot displays the positive (blue) and negative (red) correlations between the eleven home-possession items. The items are correlated, so we can try to group them into factors.

## 'data.frame':    4540 obs. of  11 variables:
##  $ dataa.dat.BSBG06A: num  1 1 1 1 1 1 1 1 2 1 ...
##  $ dataa.dat.BSBG06B: num  1 1 1 1 1 1 2 1 1 1 ...
##  $ dataa.dat.BSBG06C: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ dataa.dat.BSBG06D: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ dataa.dat.BSBG06E: num  1 1 2 1 1 1 1 1 1 1 ...
##  $ dataa.dat.BSBG06F: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ dataa.dat.BSBG06G: num  1 2 2 2 1 2 2 1 2 1 ...
##  $ dataa.dat.BSBG06H: num  2 2 2 2 1 2 2 1 1 2 ...
##  $ dataa.dat.BSBG06I: num  2 2 2 1 1 1 1 1 1 1 ...
##  $ dataa.dat.BSBG06J: num  1 2 2 1 1 2 2 2 2 1 ...
##  $ dataa.dat.BSBG06K: num  2 2 2 2 1 2 2 2 1 2 ...

Our variables were recoded as numeric for the next step of the analysis.

Since factor analysis (FA) does not work with missing values, we have already excluded them from our data. Now we can conduct the FA.

How many factors should be extracted?


## Parallel analysis suggests that the number of factors =  5  and the number of components =  4

The scree plot itself is hard to interpret, but the parallel analysis gives the number of factors directly: it suggests 5 factors and 4 components.
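A sketch of the parallel analysis and of the factor solution reported below; efa is the data frame with the eleven BSBG06 items shown above.

library(psych)

# Scree plot with parallel analysis (maximum likelihood factoring)
fa.parallel(efa, fm = "ml")

# Four-factor solution with varimax rotation, as in the output below
fa_4 <- fa(efa, nfactors = 4, rotate = "varimax", fm = "ml")
print(fa_4)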

## Factor Analysis using method =  ml
## Call: fa(r = efa, nfactors = 4, rotate = "varimax", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                     ML3   ML2   ML1   ML4    h2   u2 com
## dataa.dat.BSBG06A  0.32  0.08  0.17  0.02 0.139 0.86 1.7
## dataa.dat.BSBG06B -0.10  0.33 -0.06  0.13 0.141 0.86 1.6
## dataa.dat.BSBG06C  0.51  0.13  0.01  0.06 0.280 0.72 1.2
## dataa.dat.BSBG06D  0.43 -0.03  0.02  0.30 0.273 0.73 1.8
## dataa.dat.BSBG06E  0.17  0.48  0.06  0.00 0.260 0.74 1.3
## dataa.dat.BSBG06F  0.13  0.37  0.07 -0.02 0.157 0.84 1.3
## dataa.dat.BSBG06G  0.06  0.00  0.57  0.06 0.330 0.67 1.0
## dataa.dat.BSBG06H  0.02  0.14  0.13  0.13 0.052 0.95 3.0
## dataa.dat.BSBG06I  0.05  0.20  0.19  0.39 0.228 0.77 2.0
## dataa.dat.BSBG06J  0.13 -0.02  0.08  0.47 0.248 0.75 1.2
## dataa.dat.BSBG06K  0.08  0.06  0.37  0.22 0.194 0.81 1.8
## 
##                        ML3  ML2  ML1  ML4
## SS loadings           0.63 0.56 0.56 0.55
## Proportion Var        0.06 0.05 0.05 0.05
## Cumulative Var        0.06 0.11 0.16 0.21
## Proportion Explained  0.28 0.24 0.24 0.24
## Cumulative Proportion 0.28 0.52 0.76 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  0.5 with Chi Square of  2258.71
## The degrees of freedom for the model are 17  and the objective function was  0.01 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  4540 with the empirical chi square  95.1  with prob <  7.1e-13 
## The total number of observations was  4540  with Likelihood Chi Square =  67.89  with prob <  5e-08 
## 
## Tucker Lewis Index of factoring reliability =  0.925
## RMSEA index =  0.026  and the 90 % confidence intervals are  0.019 0.032
## BIC =  -75.27
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy             
##                                                     ML3   ML2   ML1   ML4
## Correlation of (regression) scores with factors    0.65  0.63  0.64  0.62
## Multiple R square of scores with factors           0.42  0.40  0.41  0.38
## Minimum correlation of possible factor scores     -0.15 -0.21 -0.18 -0.24

We tried FA with different numbers of factors (3, 4, 5, 6), with and without rotation.

FA with 3 factors and varimax or oblimin rotation left some variables unassigned. Without rotation, ML3 had no variables, while ML2 had only one variable.

FA with 4 factors without rotation showed that ML3 and ML4 had no variables. Only the varimax rotation produced a solution in which every factor had more than one variable.

FA with 5 factors and oblimin rotation left ML5 without variables; with varimax rotation ML5 had only one variable, and without rotation ML2, ML3 and ML5 had no variables.

FA with 6 factors without rotation showed that ML5 and ML6 had no variables; with oblimin rotation ML1 and ML6 had only one variable each, and with varimax rotation ML1, ML2 and ML6 had one variable each.

Hence, the best FA solution is the one with 4 factors and varimax rotation.

We see that the communality (h2), the proportion of each variable's variance that can be explained by the factors, is lower than the uniqueness (u2) for most variables. The higher the communality and the lower the uniqueness, the better, so for our analysis this is not a good result: the variables do not tell much about the factors because their uniqueness exceeds their communality.

For Proportion Var, each factor should explain at least 10% of the variance; if it explains less, the factor does not explain much. By this criterion our EFA results are poor (each factor explains only 5-6%).

For Proportion Explained, the factors should each explain roughly the same share, although the first factor typically explains somewhat more than the following ones. This holds for our model (0.28, 0.24, 0.24, 0.24).

RMSR = 0.01 and should be close to 0, so for our FA this is a good result. The RMSEA index = 0.026, which is acceptable since it is < 0.08. The Tucker-Lewis Index = 0.925, which is acceptable since it is > 0.9.

The plot shows our 4 factors. Each of them has two or more variables, which is good, since factors with fewer variables do not make much sense. There are no negative loadings.

Accordingly, we conclude that the results of the FA are not particularly good.

Let us name the factors.

Do you have any of these things at your home?

BSBG06A - A computer or tablet of your own
BSBG06B - A computer or tablet that is shared with other people at home
BSBG06C - Study desk/table for your use
BSBG06D - Your own room
BSBG06E - Internet connection
BSBG06F - Your own mobile phone
BSBG06G - A gaming system (e.g., PlayStation®, Wii®, XBox®)
BSBG06H - BSBG06I - BSBG06J - BSBG06K -

ML1 - BSBG06G, BSBG06K - nothing obvious in common
ML2 - BSBG06E, BSBG06F, BSBG06B - electronic access
ML3 - BSBG06C, BSBG06D, BSBG06A - personal things of the student
ML4 - BSBG06J, BSBG06I - country-specific indicators

Internal consistency: Cronbach’s alpha

## 
## Reliability analysis   
## Call: alpha(x = ML1, check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.37      0.37    0.23      0.23 0.59 0.019  1.8 0.33     0.23
## 
##  lower alpha upper     95% confidence boundaries
## 0.33 0.37 0.41 
## 
##  Reliability if an item is dropped:
##                   raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r
## dataa.dat.BSBG06G     0.227      0.23   0.052      0.23  NA       NA 0.227
## dataa.dat.BSBG06K     0.052      0.23      NA        NA  NA       NA 0.052
##                   med.r
## dataa.dat.BSBG06G  0.23
## dataa.dat.BSBG06K  0.23
## 
##  Item statistics 
##                      n raw.r std.r r.cor r.drop mean   sd
## dataa.dat.BSBG06G 4540  0.80  0.78  0.37   0.23  1.7 0.44
## dataa.dat.BSBG06K 4540  0.76  0.78  0.37   0.23  1.8 0.40
## 
## Non missing response frequency for each item
##                      1    2 miss
## dataa.dat.BSBG06G 0.26 0.74    0
## dataa.dat.BSBG06K 0.21 0.79    0

Cronbach's alpha = 0.37, which is well below 0.8, so this is a poor result. The scale is not reliable and cannot be used as a single variable in further analysis.
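A sketch of how these reliability checks can be run; each MLx object is assumed to be a data frame holding only the items that load on that factor (names follow the loadings table above).

library(psych)

# Items grouped by factor
ML1 <- efa[, c("dataa.dat.BSBG06G", "dataa.dat.BSBG06K")]
ML2 <- efa[, c("dataa.dat.BSBG06E", "dataa.dat.BSBG06F", "dataa.dat.BSBG06B")]

# check.keys = TRUE reverses items that correlate negatively with the total score;
# ML3 and ML4 are checked in the same way
alpha(ML1, check.keys = TRUE)
alpha(ML2, check.keys = TRUE)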

## 
## Reliability analysis   
## Call: alpha(x = ML2, check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.26      0.35    0.26      0.15 0.53 0.016  1.1 0.16     0.14
## 
##  lower alpha upper     95% confidence boundaries
## 0.23 0.26 0.29 
## 
##  Reliability if an item is dropped:
##                   raw_alpha std.alpha G6(smc) average_r  S/N alpha se
## dataa.dat.BSBG06E      0.13      0.18    0.10      0.10 0.23    0.017
## dataa.dat.BSBG06F      0.20      0.24    0.14      0.14 0.32    0.019
## dataa.dat.BSBG06B      0.34      0.35    0.21      0.21 0.53    0.019
##                   var.r med.r
## dataa.dat.BSBG06E    NA  0.10
## dataa.dat.BSBG06F    NA  0.14
## dataa.dat.BSBG06B    NA  0.21
## 
##  Item statistics 
##                      n raw.r std.r r.cor r.drop mean   sd
## dataa.dat.BSBG06E 4540  0.55  0.68  0.40   0.20  1.0 0.18
## dataa.dat.BSBG06F 4540  0.45  0.66  0.36   0.17  1.0 0.14
## dataa.dat.BSBG06B 4540  0.85  0.63  0.26   0.16  1.2 0.37
## 
## Non missing response frequency for each item
##                      1    2 miss
## dataa.dat.BSBG06E 0.97 0.03    0
## dataa.dat.BSBG06F 0.98 0.02    0
## dataa.dat.BSBG06B 0.84 0.16    0

Cronbach's alpha = 0.26 (0.35 standardized), which is well below 0.8, so this is a poor result. The scale is not reliable and cannot be used as a single variable in further analysis.

## 
## Reliability analysis   
## Call: alpha(x = ML3, check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.38       0.4    0.31      0.18 0.68 0.015  1.2 0.25     0.18
## 
##  lower alpha upper     95% confidence boundaries
## 0.35 0.38 0.41 
## 
##  Reliability if an item is dropped:
##                   raw_alpha std.alpha G6(smc) average_r  S/N alpha se
## dataa.dat.BSBG06C      0.23      0.24    0.14      0.14 0.31    0.022
## dataa.dat.BSBG06D      0.29      0.30    0.18      0.18 0.43    0.020
## dataa.dat.BSBG06A      0.35      0.39    0.24      0.24 0.64    0.017
##                   var.r med.r
## dataa.dat.BSBG06C    NA  0.14
## dataa.dat.BSBG06D    NA  0.18
## dataa.dat.BSBG06A    NA  0.24
## 
##  Item statistics 
##                      n raw.r std.r r.cor r.drop mean   sd
## dataa.dat.BSBG06C 4540  0.60  0.70  0.44   0.28  1.1 0.28
## dataa.dat.BSBG06D 4540  0.78  0.68  0.39   0.24  1.3 0.47
## dataa.dat.BSBG06A 4540  0.62  0.65  0.31   0.19  1.2 0.36
## 
## Non missing response frequency for each item
##                      1    2 miss
## dataa.dat.BSBG06C 0.92 0.08    0
## dataa.dat.BSBG06D 0.68 0.32    0
## dataa.dat.BSBG06A 0.85 0.15    0

Cronbach's alpha = 0.38 (0.40 standardized), which is well below 0.8, so this is a poor result. The scale is not reliable and cannot be used as a single variable in further analysis.

## 
## Reliability analysis   
## Call: alpha(x = ML4, check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.35      0.35    0.21      0.21 0.54 0.019  1.4 0.38     0.21
## 
##  lower alpha upper     95% confidence boundaries
## 0.31 0.35 0.39 
## 
##  Reliability if an item is dropped:
##                   raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r
## dataa.dat.BSBG06J     0.212      0.21   0.045      0.21  NA       NA 0.212
## dataa.dat.BSBG06I     0.045      0.21      NA        NA  NA       NA 0.045
##                   med.r
## dataa.dat.BSBG06J  0.21
## dataa.dat.BSBG06I  0.21
## 
##  Item statistics 
##                      n raw.r std.r r.cor r.drop mean   sd
## dataa.dat.BSBG06J 4540  0.80  0.78  0.36   0.21  1.5 0.50
## dataa.dat.BSBG06I 4540  0.76  0.78  0.36   0.21  1.3 0.47
## 
## Non missing response frequency for each item
##                      1    2 miss
## dataa.dat.BSBG06J 0.51 0.49    0
## dataa.dat.BSBG06I 0.68 0.32    0

Cronbach's alpha = 0.35, which is well below 0.8, so this is a poor result. The scale is not reliable and cannot be used as a single variable in further analysis.

Regression Analysis based on FA results

Now we can save the factor scores for our regression and add the control variables.

##  [1] "dataa.dat.BSBG06A" "dataa.dat.BSBG06B" "dataa.dat.BSBG06C"
##  [4] "dataa.dat.BSBG06D" "dataa.dat.BSBG06E" "dataa.dat.BSBG06F"
##  [7] "dataa.dat.BSBG06G" "dataa.dat.BSBG06H" "dataa.dat.BSBG06I"
## [10] "dataa.dat.BSBG06J" "dataa.dat.BSBG06K" "ML3"              
## [13] "ML2"               "ML1"               "ML4"

We will predict math achievement from the factors obtained in the previous steps, controlling for gender and parental education.
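A minimal sketch of how the factor scores can be attached and the first of these models fitted; the merging step assumes that the item data and the recoded TIMSS variables refer to the same complete cases.

# Attach the four factor scores (columns ML3, ML2, ML1, ML4) to the item data
regression <- cbind(efa, as.data.frame(fa_4$scores))

# Add the outcome and control variables (assuming aligned rows)
regression$math <- timss$math
regression$sex  <- timss$sex
regression$edum <- timss$edum
regression$edup <- timss$edup

# Model 1: math achievement predicted by ML1, controlling for sex and parental education
summary(lm(regression$math ~ regression$sex + regression$edum +
             regression$edup + regression$ML1))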

Model 1 - by the ML1.

## 
## Call:
## lm(formula = regression$math ~ regression$sex + regression$edum + 
##     regression$edup + regression$ML1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -269.668  -51.485    1.725   51.459  265.522 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       504.102     30.051  16.775  < 2e-16 ***
## regression$sex2     7.262      2.288   3.173 0.001518 ** 
## regression$edum2    3.462     29.499   0.117 0.906573    
## regression$edum3    3.312     29.527   0.112 0.910702    
## regression$edum4   36.812     29.539   1.246 0.212756    
## regression$edum5   33.675     29.588   1.138 0.255133    
## regression$edum6   42.543     29.471   1.444 0.148936    
## regression$edum7   47.673     29.543   1.614 0.106673    
## regression$edum8   15.875     29.555   0.537 0.591204    
## regression$edup2  -17.926     16.543  -1.084 0.278610    
## regression$edup3   -7.244     16.582  -0.437 0.662214    
## regression$edup4    6.498     16.554   0.393 0.694656    
## regression$edup5    5.600     16.745   0.334 0.738088    
## regression$edup6   17.024     16.577   1.027 0.304497    
## regression$edup7   26.368     16.881   1.562 0.118352    
## regression$edup8   -3.148     16.445  -0.191 0.848202    
## regression$ML1      6.260      1.815   3.450 0.000567 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.37 on 4523 degrees of freedom
## Multiple R-squared:  0.1002, Adjusted R-squared:  0.09706 
## F-statistic: 31.49 on 16 and 4523 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 10% of the variability of the response data around its mean (R² = 0.100). Gender and ML1 are significant here, while parental education is not. The explanatory power is not very strong.

The intercept is 504.10, which means that when all predictors are at zero (or their reference levels), the predicted math achievement is 504.10. The model shows that ML1 has a significant effect on math achievement: a one-unit increase in the ML1 factor score is associated with an increase of about 6.3 points.

Model 2 - by the ML2.

## 
## Call:
## lm(formula = regression$math ~ regression$sex + regression$edum + 
##     regression$edup + regression$ML2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -278.474  -50.738    1.791   51.575  253.342 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       515.866     29.943  17.228  < 2e-16 ***
## regression$sex2     7.309      2.254   3.242   0.0012 ** 
## regression$edum2    3.299     29.379   0.112   0.9106    
## regression$edum3    1.625     29.408   0.055   0.9559    
## regression$edum4   33.904     29.421   1.152   0.2492    
## regression$edum5   30.742     29.470   1.043   0.2969    
## regression$edum6   38.158     29.353   1.300   0.1937    
## regression$edum7   43.180     29.426   1.467   0.1423    
## regression$edum8   12.780     29.436   0.434   0.6642    
## regression$edup2  -25.263     16.496  -1.532   0.1257    
## regression$edup3  -15.591     16.534  -0.943   0.3458    
## regression$edup4   -1.822     16.514  -0.110   0.9122    
## regression$edup5   -4.092     16.708  -0.245   0.8066    
## regression$edup6    6.937     16.530   0.420   0.6747    
## regression$edup7   15.793     16.836   0.938   0.3483    
## regression$edup8  -11.470     16.405  -0.699   0.4845    
## regression$ML2    -12.667      1.809  -7.004 2.86e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.07 on 4523 degrees of freedom
## Multiple R-squared:  0.1075, Adjusted R-squared:  0.1044 
## F-statistic: 34.07 on 16 and 4523 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 11% of the variability of the response data around its mean (R² = 0.108). Gender and ML2 are significant here, while parental education is not. The explanatory power is not very strong.

The intercept is 515.87, which means that when all predictors are at zero (or their reference levels), the predicted math achievement is 515.87. The model shows that ML2 has a significant effect on math achievement: a one-unit increase in the ML2 factor score is associated with a decrease of about 12.7 points. Therefore, the more electronic access a student has, the lower their math achievement.

Model 3 - by the ML3.

## 
## Call:
## lm(formula = regression$math ~ regression$sex + regression$edum + 
##     regression$edup + regression$ML3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -278.207  -50.982    1.849   51.791  254.838 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      503.1903    30.0717  16.733  < 2e-16 ***
## regression$sex2    5.9893     2.2555   2.655  0.00795 ** 
## regression$edum2   3.0157    29.5070   0.102  0.91860    
## regression$edum3   3.1027    29.5348   0.105  0.91634    
## regression$edum4  36.6036    29.5465   1.239  0.21547    
## regression$edum5  33.0490    29.5955   1.117  0.26419    
## regression$edum6  41.4680    29.4770   1.407  0.15956    
## regression$edum7  46.7887    29.5494   1.583  0.11340    
## regression$edum8  14.8694    29.5618   0.503  0.61499    
## regression$edup2 -15.5751    16.5833  -0.939  0.34768    
## regression$edup3  -5.2584    16.6215  -0.316  0.75174    
## regression$edup4   9.1285    16.5997   0.550  0.58240    
## regression$edup5   7.7817    16.7897   0.463  0.64304    
## regression$edup6  18.8414    16.6229   1.133  0.25708    
## regression$edup7  28.1107    16.9245   1.661  0.09679 .  
## regression$edup8  -0.5606    16.4914  -0.034  0.97288    
## regression$ML3     5.3731     1.7337   3.099  0.00195 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.39 on 4523 degrees of freedom
## Multiple R-squared:  0.09978,    Adjusted R-squared:  0.0966 
## F-statistic: 31.33 on 16 and 4523 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 10% of the variability of the response data around its mean (R² = 0.100). Gender and ML3 are significant here, while parental education is not. The explanatory power is not very strong.

The intercept is 503.19, which means that when all predictors are at zero (or their reference levels), the predicted math achievement is 503.19. The model shows that ML3 has a significant effect on math achievement: a one-unit increase in the ML3 factor score is associated with an increase of about 5.4 points. Therefore, the more personal things a student has, the higher their math achievement.

Model 4 - by the ML4.

## 
## Call:
## lm(formula = regression$math ~ regression$sex + regression$edum + 
##     regression$edup + regression$ML4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -276.160  -51.012    1.759   51.881  257.198 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       507.319     30.081  16.865  < 2e-16 ***
## regression$sex2     5.875      2.258   2.602  0.00931 ** 
## regression$edum2    3.099     29.542   0.105  0.91645    
## regression$edum3    2.986     29.568   0.101  0.91956    
## regression$edum4   36.241     29.580   1.225  0.22058    
## regression$edum5   32.913     29.631   1.111  0.26673    
## regression$edum6   41.232     29.509   1.397  0.16240    
## regression$edum7   46.489     29.581   1.572  0.11611    
## regression$edum8   14.834     29.594   0.501  0.61623    
## regression$edup2  -19.178     16.561  -1.158  0.24692    
## regression$edup3   -8.940     16.596  -0.539  0.59014    
## regression$edup4    5.317     16.571   0.321  0.74834    
## regression$edup5    3.891     16.759   0.232  0.81642    
## regression$edup6   14.684     16.583   0.885  0.37595    
## regression$edup7   24.085     16.890   1.426  0.15393    
## regression$edup8   -4.584     16.465  -0.278  0.78069    
## regression$ML4      1.189      1.861   0.639  0.52281    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.47 on 4523 degrees of freedom
## Multiple R-squared:  0.09795,    Adjusted R-squared:  0.09476 
## F-statistic:  30.7 on 16 and 4523 DF,  p-value: < 2.2e-16

The model's p-value is less than 0.05, which is good. The model explains about 9.8% of the variability of the response data around its mean (R² = 0.098). Gender is significant here, while parental education is not. The explanatory power is not very strong.

The intercept is 507.32, which means that when all predictors are at zero (or their reference levels), the predicted math achievement is 507.32. The model shows that ML4 does not have a significant effect on math achievement.

Conclusion

The analysis showed that boys' achievement in mathematics is higher than girls' among 8th-grade students in Russia. The control variables used were mostly unimportant, although some models showed that a father's education affects teenagers' success in math. The indicator of whether the student was born in the country was not significant anywhere.

A model with an interaction effect showed that the number of books in a student's home and the student's access to electronic resources affect success in school: namely, more books in the house together with the presence of a shared computer is associated with higher achievement in mathematics. The hypothesis concerning personal access and the number of books could not be confirmed. This may be because, when students have a shared computer, they cannot spend much time playing games and communicating online: their access to electronics is controlled by other family members, so the student uses the computer mainly for learning purposes and thus achieves more. By contrast, students who have access to a personal computer or phone tend to play or text and may not pay proper attention to their studies.

Factor analysis did not show the best results, since some factors were irrelevant and hard to interpret. However, 3 out of 4 factors were significant in predicting success in mathematics. More precisely, the more personal items a student has (their own room, computer, desk), the higher their success in mathematics, while the more electronic access a student has (mobile phone, Internet), the lower their success in mathematics.

References

Linn, M. C., Davis, E. A., & Bell, P. E. (2004). Internet environments for science education. Lawrence Erlbaum Associates Publishers.

Turel, Y., & Toraman, M. (2015). The relationship between Internet addiction and academic success of secondary school students.