About the Project
The main purpose of this work is to find out which matters more for the achievement of Russian students today: books or computers. The data come from the TIMSS 2015 dataset for 8th graders, provided by the TIMSS & PIRLS International Study Center, USA. The methods are linear regression with an interaction effect and exploratory factor analysis (EFA).
Data Pre-processing
Firstly, the variables for analysis are selected, renamed, and recoded where needed. The control variables are (1) whether a student was born in the country or not, (2) his or her gender, and (3) the education received by the parents. Income and nationality variables could extend this set, but they are absent from the dataset. There is no strong reason to use age as a control variable, because all students are in the 8th grade and their ages range only from 14 to 15.5, which is too narrow a spread to be substantial.
Overall, there are 16 variables. The next step is to visualize them and look for patterns.
Patterns in Data
1. Numeric variable
There is only one numeric variable in the prepared subset of the data: the scores students received for their achievement in mathematics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 273.1 484.4 540.5 538.3 593.9 819.8
The distribution is bell-shaped and close to normal: the mean (538.3) and the median (540.5) almost coincide. The minimum value is 273.1, the maximum is 819.8.
2. Categorical variables
The rest of the selected variables are factors; there are 15 of them. Although some characteristics vary, girls and boys are represented in almost equal proportions. The other patterns can be seen in the plots below.
Linear Regression
The linear regression is conducted to explain the math achievement score by the number of books at home and the availability of a tablet/PC. Being born in the country, the student’s gender, and the education of the parents are included as control variables. Because all predictors are categorical by meaning and coding, the dependent variable is explained by a set of factors.
A set of box plots shows the distributions of math scores across the categories of several variables. In plots A, B, C, and D there are no remarkable differences between categories: all groups have outliers, and the medians are visually almost equal. Meanwhile, math scores do vary across the categories of parental education and of the number of books at home (plots E and F). One trend can be described as “the more books a student has at home, the better his or her math scores tend to be”. As for parental education, the “University or Higher” category stands out because it has many outliers. Still, a general pattern can be traced: a student’s math scores are related to the level of education of his or her parents.
Now, let’s create a linear regression model.
| MathScore | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 489.55 | 471.95 – 507.15 | <0.001 |
| Gender [Girl] | -8.39 | -12.80 – -3.97 | <0.001 |
| BornInOut [Yes] | -0.28 | -12.51 – 11.94 | 0.964 |
| ParentEduc [PostSecNotUni] | 37.36 | 28.20 – 46.51 | <0.001 |
| ParentEduc [SomePrLowSecNoSch] | 87.78 | 21.09 – 154.47 | 0.010 |
| ParentEduc [UniOrHigher] | 53.14 | 44.27 – 62.02 | <0.001 |
| ParentEduc [Unknow] | 21.25 | 11.13 – 31.36 | <0.001 |
| ParentEduc [UpperSec] | 3.35 | -6.68 – 13.38 | 0.513 |
| BooksHome101-200 | 27.62 | 17.18 – 38.06 | <0.001 |
| BooksHome11-25 | 6.27 | -3.29 – 15.82 | 0.198 |
| BooksHome26-100 | 14.82 | 5.38 – 24.27 | 0.002 |
| BooksHome [More than 200] | 23.12 | 11.70 – 34.55 | <0.001 |
| CompOwn [Yes] | -14.29 | -20.44 – -8.15 | <0.001 |
| CompShare [Yes] | 19.45 | 13.48 – 25.42 | <0.001 |
| Observations | 4574 | ||
| R2 / R2 adjusted | 0.102 / 0.100 | ||
The model explains about 10% of the variance in math scores (R-squared is 0.10). According to the results, most coefficients are significant. However, being born inside or outside the country has no significant effect on math scores. The same can be said about having 11-25 books at home and the “Upper Secondary” level of parental education.
The intercept is 489.55. This is the expected math score for a student in the reference category of every factor: a boy with BornInOut = “No”, parents with “Lower-Secondary” education, no more than 10 books at home, and neither an own nor a shared computer.
Girls score 8.39 points lower in math than boys.
As for parental education, compared with students whose parents have “Lower-Secondary” education, a student’s expected math score is higher by 37.36 points for “Post-secondary but not University”, by 87.78 for “Some Primary, Lower Secondary or No School”, by 53.14 for “University or Higher”, by 21.25 if the student does not know the parents’ level of education, and by 3.35 for “Upper Secondary” (this last difference is not significant). The confidence interval for “Some Primary, Lower Secondary or No School” is very wide (21.09 to 154.47), so this estimate is imprecise.
Having books at home is another point for consideration. The general trend is that the more books a student has at home, the higher his or her math scores are, although the estimate for more than 200 books is slightly below the one for 101-200 books. Precisely, compared with students who have no more than 10 books at home, the expected math score is higher by 6.27 points for 11-25 books (not significant), by 14.82 for 26-100 books, by 27.62 for 101-200 books, and by 23.12 for more than 200 books.
Furthermore, students who have a personal PC or tablet score 14.29 points lower in math. A hypothetical explanation is that a personal computer is not used only for educational purposes and, as a result, acts as a distraction from studying.
However, students who have to share a computer or tablet with someone else score 19.45 points higher. This might be evidence that a shared device is used more purposefully, for solving specific tasks, with less room for entertainment.
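To make the reading of the coefficients concrete, here is a minimal sketch that adds them up for a hypothetical student profile. The coefficients are copied from the table above; the function name `predict_score` and the example profiles are illustrative assumptions and are not part of the original analysis.

```python
# Coefficients copied from the regression table above (points on the math score scale).
INTERCEPT = 489.55
COEF = {
    ("Gender", "Girl"): -8.39,
    ("BornInOut", "Yes"): -0.28,
    ("ParentEduc", "PostSecNotUni"): 37.36,
    ("ParentEduc", "SomePrLowSecNoSch"): 87.78,
    ("ParentEduc", "UniOrHigher"): 53.14,
    ("ParentEduc", "Unknow"): 21.25,
    ("ParentEduc", "UpperSec"): 3.35,
    ("BooksHome", "101-200"): 27.62,
    ("BooksHome", "11-25"): 6.27,
    ("BooksHome", "26-100"): 14.82,
    ("BooksHome", "More than 200"): 23.12,
    ("CompOwn", "Yes"): -14.29,
    ("CompShare", "Yes"): 19.45,
}


def predict_score(profile):
    """Add the intercept and the coefficient of every non-reference level.

    `profile` maps a variable name to a level. A (variable, level) pair
    without a coefficient is treated as the reference category and adds 0.
    """
    return INTERCEPT + sum(COEF.get(item, 0.0) for item in profile.items())


# Hypothetical student with all reference levels: the prediction is the intercept.
baseline = predict_score({})

# Hypothetical student: girl, university-educated parents, 101-200 books, shared computer.
example = predict_score({
    "Gender": "Girl",
    "ParentEduc": "UniOrHigher",
    "BooksHome": "101-200",
    "CompShare": "Yes",
})

print(round(baseline, 2), round(example, 2))
```

Each coefficient is a difference from the reference category of its own variable, holding the other variables fixed, so the sums describe expected scores for a profile rather than causal effects.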
Interaction Effect
To make the previous model slightly more complex, one interaction effect is included. All possible combinations were checked, but none of the interactions was significant. The lowest p-value, 0.089, belongs to the interaction between gender (girl) and the number of books at home (101-200), which is still not enough to speak of a strong dependency. This case is therefore taken as an example of a model with an interaction effect.
| MathScore | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 483.89 | 464.66 – 503.11 | <0.001 |
| Gender [Girl] | 3.88 | -13.55 – 21.30 | 0.663 |
| BornInOut [Yes] | -0.11 | -12.35 – 12.12 | 0.985 |
| ParentEduc [PostSecNotUni] | 37.48 | 28.32 – 46.63 | <0.001 |
| ParentEduc [SomePrLowSecNoSch] | 86.20 | 19.46 – 152.94 | 0.011 |
| ParentEduc [UniOrHigher] | 53.11 | 44.23 – 61.99 | <0.001 |
| ParentEduc [Unknow] | 21.26 | 11.15 – 31.38 | <0.001 |
| ParentEduc [UpperSec] | 3.32 | -6.71 – 13.35 | 0.517 |
| BooksHome101-200 | 35.68 | 21.74 – 49.61 | <0.001 |
| BooksHome11-25 | 11.50 | -1.06 – 24.05 | 0.073 |
| BooksHome26-100 | 20.64 | 8.13 – 33.14 | 0.001 |
| BooksHome [More than 200] | 24.94 | 9.34 – 40.53 | 0.002 |
| CompOwn [Yes] | -14.11 | -20.26 – -7.96 | <0.001 |
| CompShare [Yes] | 19.61 | 13.64 – 25.58 | <0.001 |
| GenderGirl:BooksHome101-200 | -17.93 | -38.57 – 2.71 | 0.089 |
| GenderGirl:BooksHome11-25 | -12.33 | -31.57 – 6.90 | 0.209 |
| GenderGirl:BooksHome26-100 | -13.46 | -32.24 – 5.32 | 0.160 |
| Gender [Girl] * BooksHome [More than 200] | -6.04 | -28.48 – 16.40 | 0.598 |
| Observations | 4574 | ||
| R2 / R2 adjusted | 0.103 / 0.100 | ||
Based on the results and the plot, the relation between the number of books at home and math scores does not differ significantly between girls and boys. The interaction effect also does not improve the model (R-squared is still about 0.10).
According to the plot above, the interaction effect is not meaningful in this case (the bars overlap).
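How the interaction terms combine with the main effects can be shown with a short worked calculation. The sketch below uses the estimates for the 101-200 books level from the interaction table above; the variable names are illustrative, and the resulting differences are not statistically significant, as noted in the text.

```python
# Estimates copied from the interaction model table above.
gender_girl = 3.88             # girl vs. boy at the reference book level
books_101_200 = 35.68          # 101-200 books vs. reference level, for boys
girl_x_books_101_200 = -17.93  # interaction term (p = 0.089)

# Effect of having 101-200 books, separately for each gender.
books_effect_boys = books_101_200
books_effect_girls = books_101_200 + girl_x_books_101_200

# Gender gap (girl minus boy) at the reference level and at 101-200 books.
gap_at_reference = gender_girl
gap_at_101_200 = gender_girl + girl_x_books_101_200

print(round(books_effect_boys, 2), round(books_effect_girls, 2))
print(round(gap_at_reference, 2), round(gap_at_101_200, 2))
```

In a model with an interaction, the main effect of one variable applies only at the reference level of the other variable, which is why the Gender [Girl] estimate changes sign between the two tables.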
Model Comparison
The two models are nested, which is why an ANOVA test is applied to choose the better one.
## Analysis of Variance Table
##
## Model 1: MathScore ~ Gender + BornInOut + ParentEduc + BooksHome + CompOwn +
## CompShare
## Model 2: MathScore ~ Gender + BornInOut + ParentEduc + BooksHome + CompOwn +
## CompShare + Gender * BooksHome
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4560 25874730
## 2 4556 25853340 4 21390 0.9424 0.4382
The test shows that the interaction effect does not significantly improve the fit (p-value is 0.44). Therefore, the model without the interaction is chosen for further analysis.
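The F statistic in the table can be recomputed by hand from the two residual sums of squares. This is a minimal sketch using the rounded values printed above; the function name `nested_f` is an illustrative assumption.

```python
def nested_f(rss_small, rss_large, df_small, df_large):
    """F statistic for comparing a smaller model with a larger nested one.

    rss_* are residual sums of squares, df_* are residual degrees of freedom.
    """
    extra_params = df_small - df_large
    numerator = (rss_small - rss_large) / extra_params
    denominator = rss_large / df_large
    return numerator / denominator


# Values from the analysis of variance table above.
f_stat = nested_f(25874730, 25853340, 4560, 4556)
print(round(f_stat, 4))
```

The result agrees with the F value of 0.9424 in the table up to rounding. An F statistic close to 1 means that the four extra interaction terms reduce the residual variance about as much as pure noise would.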
Diagnostics
- Multicollinearity
## GVIF Df GVIF^(1/(2*Df))
## Gender 1.022314 1 1.011095
## BornInOut 1.007450 1 1.003718
## ParentEduc 1.112564 5 1.010724
## BooksHome 1.115898 4 1.013802
## CompOwn 1.010958 1 1.005464
## CompShare 1.015339 1 1.007640
According to the variance inflation factors, all values are below 5, so multicollinearity is not a problem. In other words, no two or more explanatory variables in the model are highly linearly related, which is good.
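The last column of the output adjusts the generalized VIF for the number of coefficients each term contributes. A minimal sketch that recomputes it from the first two columns (the function name `adjusted_gvif` is illustrative):

```python
# GVIF and degrees of freedom copied from the output above.
gvif = {
    "Gender": (1.022314, 1),
    "BornInOut": (1.007450, 1),
    "ParentEduc": (1.112564, 5),
    "BooksHome": (1.115898, 4),
    "CompOwn": (1.010958, 1),
    "CompShare": (1.015339, 1),
}


def adjusted_gvif(value, df):
    """GVIF^(1/(2*Df)): makes terms with different Df comparable."""
    return value ** (1.0 / (2 * df))


adjusted = {name: adjusted_gvif(v, df) for name, (v, df) in gvif.items()}
for name, value in adjusted.items():
    print(f"{name:<11}{value:.6f}")
```

All adjusted values are very close to 1, which supports the conclusion that the predictors are nearly uncorrelated with each other.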
- Outliers
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 3992 -3.566067 0.0003661 NA
## [1] 1636 3992
The Bonferroni test above flags no studentized residual at p < 0.05. Still, the plot marks two observations (1636 and 3992) as the most extreme, and they are removed as being “out of scope”, so the following steps use 4572 observations.
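The Bonferroni column can be checked with one multiplication. The sketch below assumes the usual Bonferroni adjustment (unadjusted p-value times the number of observations tested); the function name is illustrative.

```python
def bonferroni(p_unadjusted, n_tests):
    """Bonferroni adjustment: multiply the p-value by the number of tests."""
    return p_unadjusted * n_tests


# Largest studentized residual (observation 3992) and the sample size from above.
raw_bonferroni = bonferroni(0.0003661, 4574)
print(round(raw_bonferroni, 3))
```

The product is greater than 1, so it is not a valid probability; this is presumably why the output prints NA in the Bonferroni column and reports no significant outliers.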
- Leverage
As for leverage, no high-leverage observations are detected in the model.
- Studentized residuals
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.566754 -0.676211 0.037246 -0.000012 0.682041 3.366918
The distribution of studentized residuals is bell-shaped and looks approximately normal.
- Homoscedasticity
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 9.749616, Df = 1, p = 0.0017936
There is a sign of heteroscedasticity: the test rejects constant variance (p = 0.002). Let’s try to address it.
## Box-Cox Transformation
##
## 4572 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 273.1 484.5 540.5 538.3 593.9 819.8
##
## Largest/Smallest: 3
## Sample Skewness: -0.126
##
## Estimated Lambda: 1.3
After the transformation, some changes are observed in the distribution of the variable.
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3.340332, Df = 1, p = 0.067601
The new test results show an improvement: the p-value is now 0.07, so constant variance is no longer rejected at the 0.05 level.
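For reference, here is a minimal sketch of the Box-Cox transformation with the estimated lambda of 1.3. It assumes the standard one-parameter form, (y^lambda - 1) / lambda; the report does not show which exact form was applied, and the function names are illustrative.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


LAMBDA = 1.3  # estimated lambda reported above

# Minimum, median, and maximum of the math scores from the summary above.
for score in (273.1, 540.5, 819.8):
    print(score, round(box_cox(score, LAMBDA), 1))
```

With a lambda above 1 the transformation stretches the upper part of the scale, so values and regression coefficients on the transformed scale are much larger than the original score points and are not directly comparable with them.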
Let’s look at the regression model after the corrections made during the diagnostics.
| MathScore_1 | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 2429.25 | 2313.62 – 2544.87 | <0.001 |
| Gender [Girl] | -55.10 | -84.12 – -26.08 | <0.001 |
| BornInOut [Yes] | -1.61 | -81.92 – 78.70 | 0.969 |
| ParentEduc [PostSecNotUni] | 241.64 | 181.50 – 301.78 | <0.001 |
| ParentEduc [SomePrLowSecNoSch] | 576.16 | 138.15 – 1014.18 | 0.010 |
| ParentEduc [UniOrHigher] | 345.93 | 287.62 – 404.24 | <0.001 |
| ParentEduc [Unknow] | 136.03 | 69.59 – 202.46 | <0.001 |
| ParentEduc [UpperSec] | 19.68 | -46.20 – 85.57 | 0.558 |
| BooksHome101-200 | 180.09 | 111.52 – 248.66 | <0.001 |
| BooksHome11-25 | 38.25 | -24.51 – 101.01 | 0.232 |
| BooksHome26-100 | 94.85 | 32.81 – 156.88 | 0.003 |
| BooksHome [More than 200] | 150.18 | 75.15 – 225.21 | <0.001 |
| CompOwn [Yes] | -95.44 | -135.80 – -55.08 | <0.001 |
| CompShare [Yes] | 126.91 | 87.69 – 166.13 | <0.001 |
| Observations | 4572 | ||
| R2 / R2 adjusted | 0.102 / 0.099 | ||
Many general tendencies remain the same. However, the estimates changed and the intercept increased after the Box-Cox power transformation, because the dependent variable is now measured on the transformed scale rather than in original score points. In detail, the intercept is 2429.25.
On this scale, girls score 55.10 units lower in math than boys.
As for parental education, compared with students whose parents have “Lower-Secondary” education, the transformed math score is higher by 241.64 units for “Post-secondary but not University”, by 576.16 for “Some Primary, Lower Secondary or No School”, by 345.93 for “University or Higher”, by 136.03 if the student does not know the parents’ level of education, and by 19.68 for “Upper Secondary” (this last difference is not significant).
As for books, compared with students who have no more than 10 books at home, the transformed math score is higher by 38.25 units for 11-25 books (not significant), by 94.85 for 26-100 books, by 180.09 for 101-200 books, and by 150.18 for more than 200 books.
In addition, students who have a personal PC or tablet score 95.44 units lower on the transformed scale. This trend remains the same as before the transformation. Similarly, students who have to share a computer or tablet with someone else score 126.91 units higher.
Overall, the transformation changed the scale of all estimates while leaving the general trends the same. Roughly speaking, the more books a student has and the better educated his or her parents are, the more likely he or she is to receive higher math scores.
Exploratory Factor Analysis, EFA
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn 1 Polychoric Polychoric Polychoric Polychoric Polychoric
## CompShare -0.07312 1 Polychoric Polychoric Polychoric Polychoric
## StudyDesk 0.3785 -0.008696 1 Polychoric Polychoric Polychoric
## RoomOwn 0.2547 0.002198 0.5027 1 Polychoric Polychoric
## InternetCon 0.3291 0.372 0.3931 0.1609 1 Polychoric
## MobileOwn 0.2582 0.3336 0.3585 0.1727 0.5743 1
## GamingSys 0.2804 -0.05849 0.06853 0.1032 0.1569 0.2284
## MusicInst -0.0003254 0.1113 0.169 0.05956 0.2099 0.2623
## Car 0.1831 0.2058 0.1886 0.1923 0.3169 0.2407
## HomeAtLeast4 0.1314 0.02927 0.1642 0.3308 0.09625 0.04024
## Dishwasher 0.1893 0.04221 0.2343 0.2099 0.2213 0.3684
## GamingSys MusicInst Car HomeAtLeast4 Dishwasher
## CompOwn Polychoric Polychoric Polychoric Polychoric Polychoric
## CompShare Polychoric Polychoric Polychoric Polychoric Polychoric
## StudyDesk Polychoric Polychoric Polychoric Polychoric Polychoric
## RoomOwn Polychoric Polychoric Polychoric Polychoric Polychoric
## InternetCon Polychoric Polychoric Polychoric Polychoric Polychoric
## MobileOwn Polychoric Polychoric Polychoric Polychoric Polychoric
## GamingSys 1 Polychoric Polychoric Polychoric Polychoric
## MusicInst 0.117 1 Polychoric Polychoric Polychoric
## Car 0.2356 0.1726 1 Polychoric Polychoric
## HomeAtLeast4 0.1378 0.07935 0.3366 1 Polychoric
## Dishwasher 0.3875 0.2259 0.3166 0.2459 1
##
## Standard Errors:
## CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn
## CompShare 0.03451
## StudyDesk 0.03323 0.04021
## RoomOwn 0.02757 0.02892 0.02751
## InternetCon 0.04458 0.04263 0.04691 0.04435
## MobileOwn 0.05615 0.05259 0.05734 0.0535 0.05207
## GamingSys 0.03051 0.02963 0.03644 0.02595 0.05007 0.06345
## MusicInst 0.02896 0.02842 0.03432 0.02464 0.0465 0.05778
## Car 0.02834 0.0276 0.03319 0.02427 0.04165 0.05222
## HomeAtLeast4 0.02805 0.02781 0.03294 0.02228 0.04437 0.05407
## Dishwasher 0.03275 0.03171 0.03979 0.02703 0.05513 0.07743
## GamingSys MusicInst Car HomeAtLeast4
## CompOwn
## CompShare
## StudyDesk
## RoomOwn
## InternetCon
## MobileOwn
## GamingSys
## MusicInst 0.02514
## Car 0.02539 0.02428
## HomeAtLeast4 0.02467 0.02361 0.02221
## Dishwasher 0.02441 0.0256 0.0263 0.02527
##
## n = 4574
Polychoric correlations were used. However, most of the correlations are weak to moderate; the largest are 0.57 (InternetCon and MobileOwn) and 0.50 (StudyDesk and RoomOwn).
Converting the binary variables to numeric ones for EFA…
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn 1 Pearson Pearson Pearson Pearson Pearson
## CompShare -0.03062 1 Pearson Pearson Pearson Pearson
## StudyDesk 0.1755 -0.003188 1 Pearson Pearson Pearson
## RoomOwn 0.1357 0.001124 0.2424 1 Pearson Pearson
## InternetCon 0.1174 0.1372 0.1409 0.05442 1 Pearson
## MobileOwn 0.07369 0.1012 0.1069 0.0485 0.2098 1
## GamingSys 0.1229 -0.02942 0.0273 0.05793 0.04376 0.04854
## MusicInst -8.71e-05 0.05678 0.06994 0.03558 0.0628 0.06159
## Car 0.09508 0.1103 0.08462 0.1161 0.1113 0.06882
## HomeAtLeast4 0.06857 0.01555 0.07241 0.206 0.03189 0.011
## Dishwasher 0.08001 0.01948 0.07895 0.1092 0.05379 0.06016
## GamingSys MusicInst Car HomeAtLeast4 Dishwasher
## CompOwn Pearson Pearson Pearson Pearson Pearson
## CompShare Pearson Pearson Pearson Pearson Pearson
## StudyDesk Pearson Pearson Pearson Pearson Pearson
## RoomOwn Pearson Pearson Pearson Pearson Pearson
## InternetCon Pearson Pearson Pearson Pearson Pearson
## MobileOwn Pearson Pearson Pearson Pearson Pearson
## GamingSys 1 Pearson Pearson Pearson Pearson
## MusicInst 0.06875 1 Pearson Pearson Pearson
## Car 0.1301 0.1025 1 Pearson Pearson
## HomeAtLeast4 0.08161 0.0495 0.2097 1 Pearson
## Dishwasher 0.2256 0.1288 0.1612 0.1385 1
##
## Standard Errors:
## CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn
## CompShare 0.01477
## StudyDesk 0.01433 0.01479
## RoomOwn 0.01452 0.01479 0.01392
## InternetCon 0.01458 0.01451 0.01449 0.01474
## MobileOwn 0.01471 0.01464 0.01462 0.01475 0.01414
## GamingSys 0.01456 0.01477 0.01478 0.01474 0.01476 0.01475
## MusicInst 0.01479 0.01474 0.01472 0.01477 0.01473 0.01473
## Car 0.01465 0.01461 0.01468 0.01459 0.0146 0.01472
## HomeAtLeast4 0.01472 0.01478 0.01471 0.01416 0.01477 0.01479
## Dishwasher 0.01469 0.01478 0.0147 0.01461 0.01474 0.01473
## GamingSys MusicInst Car HomeAtLeast4
## CompOwn
## CompShare
## StudyDesk
## RoomOwn
## InternetCon
## MobileOwn
## GamingSys
## MusicInst 0.01472
## Car 0.01454 0.01463
## HomeAtLeast4 0.01469 0.01475 0.01414
## Dishwasher 0.01404 0.01454 0.0144 0.0145
##
## n = 4574
##
## P-values for Tests of Bivariate Normality:
## CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn
## CompShare 0
## StudyDesk 0 0
## RoomOwn 0 0 0
## InternetCon 0 0 0 0
## MobileOwn 0 0 0 0 0
## GamingSys 0 0 0 0 0 0
## MusicInst 0 0 0 0 0 0
## Car 0 0 0 0 0 0
## HomeAtLeast4 0 0 0 0 0 0
## Dishwasher 0 0 0 0 0 0
## GamingSys MusicInst Car HomeAtLeast4
## CompOwn
## CompShare
## StudyDesk
## RoomOwn
## InternetCon
## MobileOwn
## GamingSys
## MusicInst 0
## Car 0 0
## HomeAtLeast4 0 0 0
## Dishwasher 0 0 0 0
## CompOwn CompShare StudyDesk RoomOwn InternetCon
## CompOwn 0.000000e+00 3.454333e-01 2.827048e-31 1.392336e-18 6.668991e-14
## CompShare 3.838148e-02 0.000000e+00 1.000000e+00 1.000000e+00 5.371914e-19
## StudyDesk 5.654096e-33 8.293448e-01 0.000000e+00 1.933944e-60 4.764962e-20
## RoomOwn 3.094081e-20 9.394134e-01 3.516261e-62 0.000000e+00 3.925592e-03
## InternetCon 1.626583e-15 1.167807e-20 9.927004e-22 2.309172e-04 0.000000e+00
## MobileOwn 6.069895e-07 7.005627e-12 4.126251e-13 1.034738e-03 1.138183e-46
## GamingSys 7.198714e-17 4.664933e-02 6.490567e-02 8.847064e-05 3.076208e-03
## MusicInst 9.953012e-01 1.218446e-04 2.192299e-06 1.611820e-02 2.137414e-05
## Car 1.170932e-10 7.574369e-14 9.941703e-09 3.346403e-15 4.382150e-14
## HomeAtLeast4 3.459793e-06 2.929141e-01 9.459965e-07 5.117905e-45 3.101187e-02
## Dishwasher 6.010529e-08 1.877588e-01 8.985477e-08 1.287644e-13 2.731890e-04
## MobileOwn GamingSys MusicInst Car HomeAtLeast4
## CompOwn 1.699571e-05 3.023460e-15 1.000000e+00 3.864074e-09 7.964984e-05
## CompShare 2.381913e-10 3.731947e-01 2.193203e-03 2.878260e-12 1.000000e+00
## StudyDesk 1.485450e-11 4.543397e-01 5.699978e-05 3.181345e-07 2.554190e-05
## RoomOwn 1.432162e-02 1.680942e-03 1.773002e-01 1.338561e-13 2.610131e-43
## InternetCon 6.032370e-45 3.691450e-02 4.702312e-04 1.709039e-12 3.101187e-01
## MobileOwn 0.000000e+00 1.432162e-02 6.450884e-04 7.964984e-05 1.000000e+00
## GamingSys 1.022973e-03 0.000000e+00 7.964984e-05 4.524732e-17 1.009208e-06
## MusicInst 3.071850e-05 3.252775e-06 0.000000e+00 1.289028e-10 1.215871e-02
## Car 3.185994e-06 1.028348e-18 3.682937e-12 0.000000e+00 6.346882e-45
## HomeAtLeast4 4.571685e-01 3.255510e-08 8.105804e-04 1.220554e-46 0.000000e+00
## Dishwasher 4.680435e-05 7.302184e-54 2.168822e-18 5.169642e-28 5.118890e-21
## Dishwasher
## CompOwn 1.803159e-06
## CompShare 1.000000e+00
## StudyDesk 2.605788e-06
## RoomOwn 4.764283e-12
## InternetCon 4.371023e-03
## MobileOwn 9.360870e-04
## GamingSys 3.943179e-52
## MusicInst 9.325934e-17
## Car 2.533125e-26
## HomeAtLeast4 2.405878e-19
## Dishwasher 0.000000e+00
The Pearson correlations are weaker than the polychoric ones (for example, InternetCon and MobileOwn: 0.21 versus 0.57). The last table presents the p-values of the correlation tests.
Kaiser-Meyer-Olkin factor adequacy
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = efa_data2)
## Overall MSA = 0.66
## MSA for each item =
## CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## 0.69 0.57 0.65 0.65 0.65 0.66
## GamingSys MusicInst Car HomeAtLeast4 Dishwasher
## 0.64 0.69 0.70 0.66 0.68
##
## Bartlett test of homogeneity of variances
##
## data: efa_data2
## Bartlett's K-squared = 11896, df = 10, p-value < 2.2e-16
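The overall KMO is 0.66, which is more than 0.5, so the sample can be treated as adequate for EFA. As for Bartlett’s test, the p-value is less than 0.05. Note that the output above is Bartlett’s test of homogeneity of variances, which compares the variances of the variables; Bartlett’s test of sphericity, which tests whether the correlation matrix differs from an identity matrix, would be the direct check of relations among the variables. Given the significant pairwise correlations reported earlier, EFA can be performed on this data.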
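Since the text refers to Bartlett’s test of sphericity, the statistic of that test can be sketched as follows. This is an illustration only: it uses a 3x3 sub-matrix of the Pearson correlations reported above (CompOwn, StudyDesk, RoomOwn) instead of all 11 variables, and the helper names `det3` and `bartlett_sphericity` are assumptions, not functions used in the original analysis.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def bartlett_sphericity(det_r, n_obs, n_vars):
    """Chi-square statistic and degrees of freedom of Bartlett's sphericity test."""
    chi2 = -((n_obs - 1) - (2 * n_vars + 5) / 6.0) * math.log(det_r)
    df = n_vars * (n_vars - 1) // 2
    return chi2, df


# Pearson correlations of CompOwn, StudyDesk, RoomOwn copied from the matrix above.
R = [
    [1.0, 0.1755, 0.1357],
    [0.1755, 1.0, 0.2424],
    [0.1357, 0.2424, 1.0],
]
chi2, df = bartlett_sphericity(det3(R), n_obs=4574, n_vars=3)
print(round(chi2, 1), df)
```

The determinant of an identity matrix is 1, which gives a statistic of 0; the further the determinant falls below 1, the stronger the evidence that the variables are correlated. For the full set of 11 variables the degrees of freedom would be 11 * 10 / 2 = 55.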
## Parallel analysis suggests that the number of factors = 5 and the number of components = 4
## [1] 1.02636489 0.31527128 0.26993778 0.14698541 0.03687583 -0.05018421
## [7] -0.10235980 -0.12685946 -0.14058901 -0.15685137 -0.19222728
According to the Parallel Analysis Scree Plot, 5 factors should be used in EFA.
Model Creation
During model creation, the parameters are set mainly by hand, and many different options were tried. The models presented and compared below are the best versions (judged visually) that were found. These models contain 2 or 3 factors. Following the recommendation of the parallel analysis scree plot and setting 4 or 5 factors leaves some factors with fewer than 3 variables, which is undesirable.
Model 1
Settings:
- Number of factors: 3
- Rotation: varimax, orthogonal
- Factoring method: ml, maximum likelihood factor analysis
- Method of finding correlation: mixed
To improve the visual clarity of the model, the variable for having a musical instrument is removed.
In detail…
## Factor Analysis using method = ml
## Call: fa(r = efa_data3, nfactors = 3, rotate = "varimax", fm = "ml",
## cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 ML3 h2 u2 com
## CompOwn 0.36 0.16 0.29 0.24 0.761 2.3
## CompShare -0.11 0.56 -0.03 0.32 0.678 1.1
## StudyDesk 0.93 0.17 0.06 0.90 0.098 1.1
## RoomOwn 0.53 -0.01 0.24 0.33 0.668 1.4
## InternetCon 0.28 0.73 0.17 0.64 0.359 1.4
## MobileOwn 0.25 0.64 0.27 0.54 0.460 1.7
## GamingSys 0.03 0.06 0.58 0.34 0.657 1.0
## Car 0.13 0.26 0.43 0.27 0.731 1.8
## HomeAtLeast4 0.17 -0.04 0.38 0.18 0.825 1.4
## Dishwasher 0.18 0.15 0.60 0.42 0.582 1.3
##
## ML1 ML2 ML3
## SS loadings 1.50 1.40 1.28
## Proportion Var 0.15 0.14 0.13
## Cumulative Var 0.15 0.29 0.42
## Proportion Explained 0.36 0.34 0.31
## Cumulative Proportion 0.36 0.69 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 3 factors are sufficient.
##
## The degrees of freedom for the null model are 45 and the objective function was 2.27 with Chi Square of 10353.9
## The degrees of freedom for the model are 18 and the objective function was 0.27
##
## The root mean square of the residuals (RMSR) is 0.06
## The df corrected root mean square of the residuals is 0.09
##
## The harmonic number of observations is 4574 with the empirical chi square 1271.71 with prob < 4.8e-259
## The total number of observations was 4574 with Likelihood Chi Square = 1247.15 with prob < 8.8e-254
##
## Tucker Lewis Index of factoring reliability = 0.702
## RMSEA index = 0.122 and the 90 % confidence intervals are 0.116 0.128
## BIC = 1095.45
## Fit based upon off diagonal values = 0.95
## Measures of factor score adequacy
## ML1 ML2 ML3
## Correlation of (regression) scores with factors 0.94 0.84 0.79
## Multiple R square of scores with factors 0.89 0.71 0.62
## Minimum correlation of possible factor scores 0.78 0.42 0.24
Parameters:
- Proportion Explained: 0.36, 0.34, 0.31 (acceptable)
- Cum. Var.: 42%
- TLI: 0.707 (should be at least .90)
- RMSEA: 0.121 (hardly acceptable)
- RMSR: 0.06
- BIC: 1071.82
With small loadings suppressed, the table shows that no variable loads on more than one factor at the same time.
##
## Loadings:
## ML1 ML2 ML3
## CompOwn 0.362
## CompShare 0.557
## StudyDesk 0.932
## RoomOwn 0.526
## InternetCon 0.733
## MobileOwn 0.637
## GamingSys 0.582
## Car 0.432
## HomeAtLeast4 0.382
## Dishwasher 0.602
##
## ML1 ML2 ML3
## SS loadings 1.502 1.402 1.278
## Proportion Var 0.150 0.140 0.128
## Cumulative Var 0.150 0.290 0.418
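The summary rows of the factor output follow directly from the loadings. Below is a minimal sketch using numbers printed for Model 1: the rounded loadings of StudyDesk and the SS loadings of the three factors. The variable names are illustrative.

```python
# Rotated loadings of StudyDesk on ML1, ML2, ML3 from the Model 1 output.
study_desk = (0.93, 0.17, 0.06)
communality = sum(x ** 2 for x in study_desk)  # h2: variance explained by the factors
uniqueness = 1.0 - communality                 # u2: variance left unexplained

# SS loadings per factor and the number of variables in Model 1.
ss_loadings = (1.502, 1.402, 1.278)
n_variables = 10

proportion_var = [ss / n_variables for ss in ss_loadings]
proportion_explained = [ss / sum(ss_loadings) for ss in ss_loadings]
cumulative_var = sum(proportion_var)

print(round(communality, 2), round(cumulative_var, 3))
print([round(p, 2) for p in proportion_explained])
```

“Proportion Var” relates each factor to the total variance of the 10 variables, while “Proportion Explained” relates it only to the variance captured by the three factors together, which is why the latter sums to 1.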
Model 2
Settings:
- Number of factors: 2
- Rotation: varimax, orthogonal
- Factoring method: ml, maximum likelihood factor analysis
- Method of finding correlation: mixed
To improve the visual clarity of the model, the variables for having a musical instrument and a gaming system are removed.
In detail…
## Factor Analysis using method = ml
## Call: fa(r = efa_data4, nfactors = 2, rotate = "varimax", fm = "ml",
## cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML2 ML1 h2 u2 com
## CompOwn 0.43 0.23 0.24 0.76 1.5
## CompShare -0.10 0.50 0.26 0.74 1.1
## StudyDesk 0.67 0.25 0.52 0.48 1.3
## RoomOwn 0.71 -0.01 0.50 0.50 1.0
## InternetCon 0.26 0.76 0.65 0.35 1.2
## MobileOwn 0.25 0.69 0.54 0.46 1.3
## Car 0.28 0.31 0.18 0.82 2.0
## HomeAtLeast4 0.39 0.00 0.15 0.85 1.0
## Dishwasher 0.33 0.26 0.18 0.82 1.9
##
## ML2 ML1
## SS loadings 1.63 1.58
## Proportion Var 0.18 0.18
## Cumulative Var 0.18 0.36
## Proportion Explained 0.51 0.49
## Cumulative Proportion 0.51 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 36 and the objective function was 2 with Chi Square of 9136.11
## The degrees of freedom for the model are 19 and the objective function was 0.34
##
## The root mean square of the residuals (RMSR) is 0.07
## The df corrected root mean square of the residuals is 0.1
##
## The harmonic number of observations is 4574 with the empirical chi square 1663.95 with prob < 0
## The total number of observations was 4574 with Likelihood Chi Square = 1564.48 with prob < 6.299337e-321
##
## Tucker Lewis Index of factoring reliability = 0.678
## RMSEA index = 0.133 and the 90 % confidence intervals are 0.128 0.139
## BIC = 1404.34
## Fit based upon off diagonal values = 0.93
## Measures of factor score adequacy
## ML2 ML1
## Correlation of (regression) scores with factors 0.84 0.86
## Multiple R square of scores with factors 0.71 0.74
## Minimum correlation of possible factor scores 0.41 0.47
Parameters:
- Proportion Explained: 0.50, 0.50 (equal, which is good)
- Cum. Var.: 36%
- TLI: 0.684 (should be at least .90)
- RMSEA: 0.132 (hardly acceptable)
- RMSR: 0.07
- BIC: 1373.07
With small loadings suppressed, the table shows that no variable loads on more than one factor at the same time.
##
## Loadings:
## ML2 ML1
## CompOwn 0.431
## CompShare 0.501
## StudyDesk 0.675
## RoomOwn 0.706
## InternetCon 0.760
## MobileOwn 0.688
## Car 0.311
## HomeAtLeast4 0.387
## Dishwasher 0.333
##
## ML2 ML1
## SS loadings 1.626 1.579
## Proportion Var 0.181 0.175
## Cumulative Var 0.181 0.356
Model 3
Settings:
- Number of factors: 2
- Rotation: oblimin, oblique transformation
- Factoring method: ml, maximum likelihood factor analysis
- Method of finding correlation: mixed
To improve the visual clarity of the model, the variables for having a musical instrument, a gaming system, a car, and a dishwasher are removed.
In detail…
## Factor Analysis using method = ml
## Call: fa(r = efa_data5, nfactors = 2, rotate = "oblimin", fm = "ml",
## cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 h2 u2 com
## CompOwn 0.17 0.40 0.243 0.76 1.4
## CompShare 0.56 -0.26 0.267 0.73 1.4
## StudyDesk 0.15 0.71 0.606 0.39 1.1
## RoomOwn -0.10 0.70 0.449 0.55 1.0
## InternetCon 0.81 0.07 0.701 0.30 1.0
## MobileOwn 0.66 0.09 0.492 0.51 1.0
## HomeAtLeast4 -0.05 0.33 0.099 0.90 1.0
##
## ML1 ML2
## SS loadings 1.49 1.37
## Proportion Var 0.21 0.20
## Cumulative Var 0.21 0.41
## Proportion Explained 0.52 0.48
## Cumulative Proportion 0.52 1.00
##
## With factor correlations of
## ML1 ML2
## ML1 1.00 0.38
## ML2 0.38 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 21 and the objective function was 1.49 with Chi Square of 6810.31
## The degrees of freedom for the model are 8 and the objective function was 0.12
##
## The root mean square of the residuals (RMSR) is 0.05
## The df corrected root mean square of the residuals is 0.08
##
## The harmonic number of observations is 4574 with the empirical chi square 514.07 with prob < 6.7e-106
## The total number of observations was 4574 with Likelihood Chi Square = 559.52 with prob < 1.2e-115
##
## Tucker Lewis Index of factoring reliability = 0.787
## RMSEA index = 0.123 and the 90 % confidence intervals are 0.114 0.132
## BIC = 492.09
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## ML1 ML2
## Correlation of (regression) scores with factors 0.89 0.86
## Multiple R square of scores with factors 0.79 0.74
## Minimum correlation of possible factor scores 0.58 0.47
Parameters:
- Proportion Explained: 0.52, 0.48 (almost equal, which is good)
- Cum. Var.: 41%
- TLI: 0.788 (should be at least .90)
- RMSEA: 0.123 (hardly acceptable)
- RMSR: 0.05
- BIC: 489.47
With small loadings suppressed, the table shows that no variable loads on more than one factor at the same time.
##
## Loadings:
## ML1 ML2
## CompOwn 0.400
## CompShare 0.556
## StudyDesk 0.710
## RoomOwn 0.702
## InternetCon 0.810
## MobileOwn 0.664
## HomeAtLeast4 0.331
##
## ML1 ML2
## SS loadings 1.471 1.347
## Proportion Var 0.210 0.192
## Cumulative Var 0.210 0.403
Summing up
| Models | NumFactors | PropExp | CumVar | ValueTLI | ValueRMSEA | ValueRMSR | ValueBIC |
|---|---|---|---|---|---|---|---|
| 1 | 3 | 0.36, 0.34, 0.31 | 0.42 | 0.707 | 0.121 | 0.06 | 1071.82 |
| 2 | 2 | 0.50, 0.50 | 0.36 | 0.684 | 0.132 | 0.07 | 1373.07 |
| 3 | 2 | 0.52, 0.48 | 0.41 | 0.788 | 0.123 | 0.05 | 489.47 |
In all three models, the proportion of explained variance is distributed relatively equally among the factors. Although none of the RMSEA values is satisfactory, the 3rd EFA model can be named the best because it has the highest TLI (0.788), its RMSR is the closest to 0, and its BIC is the lowest. As a result, there are two factors. The first one can be treated as access to gadgets (internet connection, personal mobile phone, shared computer), while the second one relates to personal possessions and individual conditions for studying (personal room, personal study desk, personal PC/tablet, home with at least 4 rooms).
However, the first EFA model also has relatively good estimates. Moreover, it has 3 factors that can also be interpreted rather meaningfully: (1) personal settings for studying, (2) technical issues, and (3) home items.
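The fit indices used in this comparison can be recomputed from the chi-square statistics printed in the three factor analysis outputs. The sketch below uses common definitions of TLI, RMSEA, and BIC that reproduce the printed output values up to rounding; whether the software uses exactly these formulas is an assumption, and the function name `fit_indices` is illustrative.

```python
import math


def fit_indices(chi2_model, df_model, chi2_null, df_null, n_obs):
    """TLI, RMSEA and BIC computed from chi-square statistics."""
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    tli = (ratio_null - ratio_model) / (ratio_null - 1.0)
    rmsea = math.sqrt(max(chi2_model / (df_model * n_obs) - 1.0 / (n_obs - 1), 0.0))
    bic = chi2_model - df_model * math.log(n_obs)
    return tli, rmsea, bic


# (likelihood chi-square, df, null chi-square, null df) from the three outputs.
models = {
    1: (1247.15, 18, 10353.9, 45),
    2: (1564.48, 19, 9136.11, 36),
    3: (559.52, 8, 6810.31, 21),
}
results = {k: fit_indices(*v, n_obs=4574) for k, v in models.items()}
for k, (tli, rmsea, bic) in results.items():
    print(k, round(tli, 3), round(rmsea, 3), round(bic, 2))
```

All three indices depend on the ratio of the model chi-square to its degrees of freedom, so a model with fewer variables, such as the 3rd one, can show a much lower BIC simply because its chi-square is smaller.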
Data Consistency
Cronbach’s Alpha
- For the 1st EFA model
## raw_alpha std.alpha G6(smc) average_r S/N ase mean
## 0.378326 0.4043881 0.3151827 0.184549 0.6789457 0.01525911 1.185104
## sd median_r
## 0.2504463 0.1755361
## raw_alpha std.alpha G6(smc) average_r S/N ase mean
## 0.258524 0.34506 0.2637287 0.1493844 0.5268574 0.01574247 1.071855
## sd median_r
## 0.1587677 0.1371855
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd
## 0.4238019 0.4283606 0.3661256 0.1577803 0.7493545 0.01389692 1.585101 0.27524
## median_r
## 0.1498401
Cronbach’s alpha is unacceptably low for all three factors (standardized alpha: 1 - .40, 2 - .35, 3 - .43).
- For the 3rd EFA model
## raw_alpha std.alpha G6(smc) average_r S/N ase mean
## 0.258524 0.34506 0.2637287 0.1493844 0.5268574 0.01574247 1.071855
## sd median_r
## 0.1587677 0.1371855
## raw_alpha std.alpha G6(smc) average_r S/N ase mean
## 0.3951876 0.4139894 0.3565513 0.1501032 0.7064537 0.01408232 1.261095
## sd median_r
## 0.2443319 0.1556028
Here, Cronbach’s alpha is also unacceptably low for both factors (standardized alpha: 1 - .35, 2 - .41).
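The `std.alpha` column can be reproduced from the `average_r` column alone, which makes it clear why the values are so low: each factor has few items, and they correlate weakly. The sketch below uses the standard formula for standardized alpha; the item counts per factor (3, 3, and 4 for the 1st EFA model) are an assumption taken from the factor descriptions in this report, and the function name is illustrative.

```python
def standardized_alpha(n_items: int, average_r: float) -> float:
    """Standardized Cronbach's alpha from the number of items
    and the mean inter-item correlation."""
    return n_items * average_r / (1 + (n_items - 1) * average_r)

# (assumed number of items, average_r from the output) for the 1st EFA model.
factors_model_1 = [(3, 0.184549), (3, 0.1493844), (4, 0.1577803)]

alphas = [round(standardized_alpha(k, r), 3) for k, r in factors_model_1]
print(alphas)
```

With an average inter-item correlation of about .15 to .18, a scale would need far more than 3 or 4 items to reach a conventionally acceptable alpha.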
Split-half Reliability
- For the 1st EFA model
## Split half reliabilities
## Call: splitHalf(r = efa_data3)
##
## Maximum split half reliability (lambda 4) = 0.58
## Guttman lambda 6 = 0.5
## Average split half reliability = 0.5
## Guttman lambda 3 (alpha) = 0.5
## Guttman lambda 2 = 0.51
## Minimum split half reliability (beta) = 0.36
## Average interitem r = 0.09 with median = 0.08
- For the 3rd EFA model
## Split half reliabilities
## Call: splitHalf(r = efa_data5)
##
## Maximum split half reliability (lambda 4) = 0.48
## Guttman lambda 6 = 0.4
## Average split half reliability = 0.42
## Guttman lambda 3 (alpha) = 0.41
## Guttman lambda 2 = 0.43
## Minimum split half reliability (beta) = 0.3
## Average interitem r = 0.09 with median = 0.07
Overall, the 1st model has slightly higher internal consistency than the 3rd one.
Taking everything into account, the first EFA model is chosen for further analysis due to its relative complexity, interpretability, and slightly better statistical estimates.
- Factor 1: Personal Settings for Studying
This variable consists of having (1) a study desk, (2) a personal room, and (3) a personal computer.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0656 -0.5154 -0.3712 0.0000 0.1548 5.0524
- Factor 2: Technical Issues
This variable consists of (1) having an internet connection, (2) having a personal mobile phone, and (3) sharing a computer with someone else.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.85999 -0.47568 -0.29091 0.00000 -0.02974 7.69651
- Factor 3: Home Items
This variable consists of (1) having a dishwasher, (2) having a gaming system, (3) having a car, and (4) living in a flat/house with at least 4 rooms.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.51874 -0.53387 0.06417 0.00000 0.59631 3.25450
Linear Regression 2.0
The EFA yields 3 new variables, which can now be used to explain math achievement scores.
Plots A, B, and C above show how the values of the new factors are distributed with respect to math scores. Rather interestingly, they look identical. As for plots D, E, and F, the relations are hardly linear, especially in D and E.
Let’s add those three variables to the previously created linear regression model that was marked as the best one (before diagnostics).
| MathScore | | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 496.64 | 480.20 – 513.08 | <0.001 |
| BornInOut [Yes] | -1.49 | -13.72 – 10.74 | 0.811 |
| Gender [Girl] | -9.11 | -13.57 – -4.65 | <0.001 |
| ParentEduc [PostSecNotUni] | 35.59 | 26.39 – 44.80 | <0.001 |
| ParentEduc [SomePrLowSecNoSch] | 94.31 | 27.53 – 161.09 | 0.006 |
| ParentEduc [UniOrHigher] | 52.71 | 43.73 – 61.69 | <0.001 |
| ParentEduc [Unknow] | 19.01 | 8.85 – 29.17 | <0.001 |
| ParentEduc [UpperSec] | 2.28 | -7.77 – 12.33 | 0.656 |
| BooksHome101-200 | 27.33 | 16.86 – 37.81 | <0.001 |
| BooksHome11-25 | 5.79 | -3.78 – 15.35 | 0.236 |
| BooksHome26-100 | 14.24 | 4.77 – 23.72 | 0.003 |
| BooksHome [More than 200] | 23.37 | 11.92 – 34.82 | <0.001 |
| PersonalSettings | 1.90 | -0.37 – 4.16 | 0.100 |
| TechIssue | -8.71 | -11.07 – -6.34 | <0.001 |
| HomeItems | 5.78 | 3.09 – 8.46 | <0.001 |
| Observations | 4574 | | |
| R2 / R2 adjusted | 0.102 / 0.099 | | |
The model explains about 10% of the variance in math scores (R-squared is 0.10), as it did previously. Being born inside or outside the country and the “Upper-Secondary” level of parents’ education are still insignificant. As for the new variables, two of them are significant, while personal settings are not.
The intercept is now 496.64, whereas before it was 489.55. In what follows, values in parentheses refer to the previous model.
Girls score 9.11 (8.39) points lower in math than boys.
As for education obtained by parents, a student’s math score goes up by 35.59 (37.36) in case of “Post-secondary but not University”, by 94.31 (87.78) in case of “Some Primary, Lower Secondary or No School”, by 52.71 (53.14) in case of “University or Higher”, by 19.01 (21.25) if the student does not know the level of his/her parents’ education, and by 2.28 (3.35) in case of “Upper Secondary” (this one is still insignificant). All of these are in comparison with students whose parents have “Lower-Secondary” education.
As for having books at home, the general trend is that the more books a student has at home, the higher his or her math score tends to be. Precisely, a student’s math score increases by 5.79 (6.27) in case of 11-25 books (this one is insignificant), by 14.24 (14.82) in case of 26-100 books, by 27.33 (27.62) in case of 101-200 books, and by 23.37 (23.12) in case of more than 200 books. All of these are in comparison with students who have no more than 10 books at home.
Talking about the newly added variables, personal settings for studying (having a study desk, a personal room, and a personal computer) are not significantly related to math scores. Meanwhile, home items, which are more closely related to well-being and welfare, are associated with an increase of 5.78 points in math scores per one-unit increase in the factor score. At the same time, technical issues such as having an internet connection, having a personal mobile phone, and sharing a computer with someone else are associated with a decrease of 8.71 points per one-unit increase in the factor score.
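To make the interpretation of the coefficients concrete, the sketch below computes the predicted math score for two hypothetical students from the estimates in the regression table. The student profiles and the helper function are illustrative assumptions; only the coefficients come from the table.

```python
# Estimates copied from the regression table above (subset used in the example).
coef = {
    "(Intercept)": 496.64,
    "Gender [Girl]": -9.11,
    "ParentEduc [UniOrHigher]": 52.71,
    "BooksHome101-200": 27.33,
    "PersonalSettings": 1.90,
    "TechIssue": -8.71,
    "HomeItems": 5.78,
}

def predicted_score(active_dummies, factor_scores):
    """Intercept + coefficients of the active categories
    + coefficient * value for each continuous factor score."""
    score = coef["(Intercept)"]
    score += sum(coef[name] for name in active_dummies)
    score += sum(coef[name] * value for name, value in factor_scores.items())
    return score

# Hypothetical student 1: all reference categories, all factor scores at 0.
baseline = predicted_score([], {})

# Hypothetical student 2: a girl, parents with university education,
# 101-200 books at home, one unit above the mean on TechIssue and HomeItems.
student = predicted_score(
    ["Gender [Girl]", "ParentEduc [UniOrHigher]", "BooksHome101-200"],
    {"TechIssue": 1.0, "HomeItems": 1.0},
)

print(round(baseline, 2), round(student, 2))
```

The reference student's prediction equals the intercept, and every other profile is obtained by adding the relevant coefficients to it.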
Taking everything into account, it can be concluded that books matter more for 8th-grade student achievement in Russia nowadays than computers and technical items/issues.
Diagnostics 2.0
As a finishing touch, let’s check the model and perform simple diagnostics.
- Multicollinearity
## GVIF Df GVIF^(1/(2*Df))
## BornInOut 1.006785 1 1.003387
## Gender 1.040444 1 1.020021
## ParentEduc 1.156237 5 1.014623
## BooksHome 1.125877 4 1.014931
## PersonalSettings 1.122668 1 1.059560
## TechIssue 1.171008 1 1.082131
## HomeItems 1.201389 1 1.096079
According to the variance inflation factors, no value exceeds 5, so there is no sign of problematic multicollinearity.
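The last column of the output is derived from the first two: it rescales the generalized VIF by the degrees of freedom so that predictors with several categories are comparable with single-coefficient ones. The sketch below reproduces that column from the printed values; treating the squared adjusted value as comparable to an ordinary VIF threshold is a common rule of thumb and is stated here as an assumption, not as the procedure used in the original analysis.

```python
# (GVIF, Df) pairs copied from the output above.
gvif = {
    "BornInOut": (1.006785, 1),
    "Gender": (1.040444, 1),
    "ParentEduc": (1.156237, 5),
    "BooksHome": (1.125877, 4),
    "PersonalSettings": (1.122668, 1),
    "TechIssue": (1.171008, 1),
    "HomeItems": (1.201389, 1),
}

# GVIF^(1/(2*Df)): the adjusted value printed in the third column.
adjusted = {name: g ** (1 / (2 * df)) for name, (g, df) in gvif.items()}

# Assumption: squaring the adjusted value puts it on the usual VIF scale,
# so it can be compared with the threshold of 5 used in the text.
threshold = 5
flagged = [name for name, value in adjusted.items() if value ** 2 > threshold]

for name, value in adjusted.items():
    print(f"{name:17s} {value:.6f}")
print("flagged:", flagged)
```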
- Outliers
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 3992 -3.507572 0.00045658 NA
## [1] 1636 3992
The Bonferroni test does not flag any studentized residual at p < 0.05. Nevertheless, a few observations (rows 1636 and 3992, see the plot and the output above) lie far from the rest and are removed as being “out of the scope”.
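The `NA` in the Bonferroni column can be understood from the correction itself: the unadjusted p-value is multiplied by the number of tested residuals. The sketch below assumes that the number of tests equals the 4574 observations reported in the regression table (the test may in fact use slightly fewer rows); a product above 1 is not a valid probability, which is consistent with the value being shown as `NA`.

```python
unadjusted_p = 0.00045658  # largest |rstudent|, from the output above
n_tests = 4574             # assumption: number of observations in the model

# Bonferroni correction: multiply the p-value by the number of tests.
bonferroni_p = unadjusted_p * n_tests

# A value above 1 cannot be a probability, so no residual is flagged at 0.05.
is_outlier = bonferroni_p < 0.05

print(round(bonferroni_p, 4), is_outlier)
```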
- Leverage
As for leverage, a high-leverage observation is detected: row 1061 has to be removed.
- Studentized residuals
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3.200286 -0.688121 0.024220 0.000037 0.696024 3.109911 2
The distribution of studentized residuals is bell-shaped and looks approximately normal.
- Homoscedasticity
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 12.72465, Df = 1, p = 0.00036087
The p-value is below 0.05, which is a sign of heteroscedasticity. Let’s try to solve it with a Box-Cox transformation of the math score.
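The p-value of the score test can be recovered from the chi-square statistic. For 1 degree of freedom the upper-tail probability has a closed form through the complementary error function, so a short stdlib-only sketch is enough; the function name is illustrative.

```python
import math

def chi2_upper_tail_df1(statistic: float) -> float:
    """P(X > statistic) for a chi-square variable with 1 degree of freedom.
    Uses the identity P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(statistic / 2))

# Chi-square statistic from the Non-constant Variance Score Test above.
p_value = chi2_upper_tail_df1(12.72465)
print(f"{p_value:.8f}")
```

The identity holds only for 1 degree of freedom, which matches the `Df = 1` reported by the test.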
## Box-Cox Transformation
##
## 4569 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 273.1 484.5 540.5 538.2 593.9 786.6
##
## Largest/Smallest: 2.88
## Sample Skewness: -0.131
##
## Estimated Lambda: 1.3
The distribution of the variable changes somewhat after the transformation.
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 161.7732, Df = 1, p = < 2.22e-16
The new results of the Non-constant Variance Score Test do not show any improvement (the p-value is still below 0.05). The transformation has not helped and, as a result, another method has to be applied in order to deal with the heteroscedasticity problem.
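For reference, the transformation tried above can be sketched with the textbook Box-Cox formula and the estimated lambda of 1.3. This is only an illustration of the technique: the package used in the original analysis may apply a variant of the formula, and the function name is an assumption.

```python
import math

def box_cox(y: float, lam: float) -> float:
    """Textbook Box-Cox transform: (y^lam - 1) / lam, or log(y) when lam == 0.
    Defined for strictly positive y only."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

estimated_lambda = 1.3  # from the Box-Cox output above

# Minimum and maximum math scores from the input data summary above.
low = box_cox(273.1, estimated_lambda)
high = box_cox(786.6, estimated_lambda)
print(round(low, 1), round(high, 1))
```

A lambda above 1 stretches the upper end of the scale more than the lower end, so it does not necessarily stabilise the variance of the residuals.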