About project

The main purpose of this work is to find out which matters more for the achievement of Russian students today: books or computers. The data used are the TIMSS 2015 dataset for 8th graders, provided by the TIMSS & PIRLS International Study Center, USA. The methods are linear regression with an interaction effect and exploratory factor analysis (EFA).

Data Pre-processing

First, the variables for analysis are selected, renamed, and recoded where needed. The control variables are (1) whether a student was born in the country or not, (2) his or her gender, and (3) the education received by the parents. Income and nationality could extend this set, but they are absent from the dataset. There is no strong reason to use age as a control, because all students are in the 8th grade and their ages range only from 14 to 15.5, which is too narrow to matter much.

Overall, there are 16 variables. The next step is to visualize them to look at patterns.

Patterns in Data

1. Numeric variable

There is only one numeric variable in the prepared subset of the data: the scores students received for their achievement in mathematics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   273.1   484.4   540.5   538.3   593.9   819.8

The distribution is bell-shaped and close to normal: the mean (538.3) and the median (540.5) nearly coincide. The minimum value is 273.1 and the maximum is 819.8.

2. Categorical variables

The remaining selected variables are factors; there are 15 of them. Although some characteristics vary, girls and boys are represented in almost equal proportions. The other patterns can be seen in the plots below.

Linear Regression

Linear regression is used to explain the math achievement score by the number of books at home and the availability of a tablet/PC. Being born in the country, the student’s gender, and parental education are included as control variables. Given the types of the variables, which follow from their meanings and codings, the dependent variable is explained by a set of factors (categorical predictors).

A set of box plots shows the distributions of math scores across the categories of some variables. In plots A, B, C, and D there are no remarkable differences: all groups have outliers, the medians are visually equal, etc. Meanwhile, math scores vary across the categories of parental education and of the number of books at home (plots E and F). One trend can be described as “the more books a student has at home, the better his or her math scores tend to be”. As for parental education, the “University or Higher” category is notable for its large number of outliers. Still, a general pattern can be traced: student’s math scores are related to the level of education of his/her parents.

For now, let’s create a linear regression model.

	MathScore
Predictors	Estimates	CI	p
(Intercept)	489.55	471.95 – 507.15	<0.001
Gender [Girl]	-8.39	-12.80 – -3.97	<0.001
BornInOut [Yes]	-0.28	-12.51 – 11.94	0.964
ParentEduc [PostSecNotUni]	37.36	28.20 – 46.51	<0.001
ParentEduc [SomePrLowSecNoSch]	87.78	21.09 – 154.47	0.010
ParentEduc [UniOrHigher]	53.14	44.27 – 62.02	<0.001
ParentEduc [Unknow]	21.25	11.13 – 31.36	<0.001
ParentEduc [UpperSec]	3.35	-6.68 – 13.38	0.513
BooksHome [101-200]	27.62	17.18 – 38.06	<0.001
BooksHome [11-25]	6.27	-3.29 – 15.82	0.198
BooksHome [26-100]	14.82	5.38 – 24.27	0.002
BooksHome [More than 200]	23.12	11.70 – 34.55	<0.001
CompOwn [Yes]	-14.29	-20.44 – -8.15	<0.001
CompShare [Yes]	19.45	13.48 – 25.42	<0.001
Observations	4574
R² / R² adjusted	0.102 / 0.100

The model explains about 10% of the variance in math scores (R-squared is 0.10). According to the results, most levels of the variables are statistically significant. However, being born inside or outside the country has no significant effect on math scores (p = 0.964). The same holds for having 11-25 books at home and for the “Upper-Secondary” level of parental education.

The intercept is 489.55. This is the predicted math score for a student in the reference category of every predictor in the regression; all other estimates are differences from this baseline.

Girls score 8.39 points lower in math than boys, other predictors being equal.

As for parental education, student’s math scores are higher by 37.36 points for “Post-secondary but not University”, by 87.78 for “Some Primary, Lower Secondary or No School” (with a very wide confidence interval, 21.09 to 154.47), by 53.14 for “University or Higher”, by 21.25 if a student does not know his/her parents’ level of education, and by 3.35 for “Upper Secondary” (not significant). All of these are in comparison with students whose parents have “Lower-Secondary” education.

Having books at home is another point for consideration. The general trend is that the more books a student has at home, the higher his or her math scores are, although the estimate for more than 200 books (23.12) is slightly below the one for 101-200 books (27.62). Precisely, student’s math scores are higher by 6.27 points for 11-25 books (not significant), by 14.82 for 26-100 books, by 27.62 for 101-200 books, and by 23.12 for more than 200 books, all in comparison with students who have no more than 10 books at home.

Furthermore, if a student has a personal PC or tablet, his or her math scores are 14.29 points lower. A hypothetical explanation is that a personal computer is not used purely for educational purposes and therefore acts as a distraction from studying.

However, if a student shares a computer or tablet with someone else, his or her math scores are 19.45 points higher. This might be evidence that a shared device is used more purposefully, for solving tasks rather than for entertainment.
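
To make the reading of these estimates concrete, the sketch below adds up coefficients from the table above to get a predicted score for one student profile. It is plain Python arithmetic on the reported estimates, not the code of the original analysis; the `predict_score` helper and the chosen profile are illustrative assumptions.

```python
# Estimates copied from the first regression table above.
COEF = {
    "(Intercept)": 489.55,
    "Gender [Girl]": -8.39,
    "BornInOut [Yes]": -0.28,
    "ParentEduc [UniOrHigher]": 53.14,
    "BooksHome [101-200]": 27.62,
    "CompOwn [Yes]": -14.29,
    "CompShare [Yes]": 19.45,
}

def predict_score(active_terms):
    """Hypothetical helper: intercept plus the estimates of the dummy terms that apply."""
    return COEF["(Intercept)"] + sum(COEF[term] for term in active_terms)

# Reference student: every predictor at its reference category.
baseline = predict_score([])

# Illustrative profile: a girl, parents with university education,
# 101-200 books at home, and a computer of her own.
profile = predict_score([
    "Gender [Girl]",
    "ParentEduc [UniOrHigher]",
    "BooksHome [101-200]",
    "CompOwn [Yes]",
])

print(round(baseline, 2), round(profile, 2))
```

Each estimate is a shift relative to the reference student, so a profile is scored by summing only the terms that apply to it.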

Interaction Effect

To make the previous model slightly more complex, one interaction effect is included. All possible combinations were checked, but no significant interaction was detected. The closest case is the interaction between gender (female) and the number of books at home (101-200), with a p-value of 0.089, which is still not enough to speak of a strong dependency. This case is therefore taken as an example of a model with an interaction effect.

	MathScore
Predictors	Estimates	CI	p
(Intercept)	483.89	464.66 – 503.11	<0.001
Gender [Girl]	3.88	-13.55 – 21.30	0.663
BornInOut [Yes]	-0.11	-12.35 – 12.12	0.985
ParentEduc [PostSecNotUni]	37.48	28.32 – 46.63	<0.001
ParentEduc [SomePrLowSecNoSch]	86.20	19.46 – 152.94	0.011
ParentEduc [UniOrHigher]	53.11	44.23 – 61.99	<0.001
ParentEduc [Unknow]	21.26	11.15 – 31.38	<0.001
ParentEduc [UpperSec]	3.32	-6.71 – 13.35	0.517
BooksHome [101-200]	35.68	21.74 – 49.61	<0.001
BooksHome [11-25]	11.50	-1.06 – 24.05	0.073
BooksHome [26-100]	20.64	8.13 – 33.14	0.001
BooksHome [More than 200]	24.94	9.34 – 40.53	0.002
CompOwn [Yes]	-14.11	-20.26 – -7.96	<0.001
CompShare [Yes]	19.61	13.64 – 25.58	<0.001
Gender [Girl] * BooksHome [101-200]	-17.93	-38.57 – 2.71	0.089
Gender [Girl] * BooksHome [11-25]	-12.33	-31.57 – 6.90	0.209
Gender [Girl] * BooksHome [26-100]	-13.46	-32.24 – 5.32	0.160
Gender [Girl] * BooksHome [More than 200]	-6.04	-28.48 – 16.40	0.598
Observations	4574
R² / R² adjusted	0.103 / 0.100

Based on the results and the plot, there is no significant interaction between gender and the number of books at home. The interaction effect also does not improve the model (e.g., R-squared is still 0.10).

According to the plot above, the interaction effect is not meaningful in this case (the bars overlap).
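
In a model with an interaction, the estimate for Gender [Girl] applies only at the reference book category; for the other categories the interaction term has to be added. The sketch below does this arithmetic in Python with the estimates from the table above. The variable names are illustrative, and the "0-10" label for the reference category is taken from the text ("no more than 10 books").

```python
# Estimates copied from the interaction model table above.
girl_main = 3.88  # Gender [Girl] at the reference book category (0-10 books)
interaction = {
    "11-25": -12.33,
    "26-100": -13.46,
    "101-200": -17.93,
    "More than 200": -6.04,
}

# Girl-boy gap within each book category = main effect + interaction term.
gap = {"0-10": girl_main}
for books, estimate in interaction.items():
    gap[books] = round(girl_main + estimate, 2)

for books, value in gap.items():
    print(f"{books:>14}: {value:+.2f}")
```

None of the interaction terms is significant, so these category-specific gaps should be read as descriptive only.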

Model Comparison

The two models are nested, which is why an ANOVA (F) test is applied to choose between them.

## Analysis of Variance Table
## 
## Model 1: MathScore ~ Gender + BornInOut + ParentEduc + BooksHome + CompOwn + 
##     CompShare
## Model 2: MathScore ~ Gender + BornInOut + ParentEduc + BooksHome + CompOwn + 
##     CompShare + Gender * BooksHome
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1   4560 25874730                           
## 2   4556 25853340  4     21390 0.9424 0.4382

The test shows that the interaction effect does not significantly improve the model (p-value is 0.44). Therefore, the model without the interaction is chosen for further analysis.
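
The F statistic in the ANOVA table can be reproduced by hand from the residual sums of squares and degrees of freedom. The sketch below is Python arithmetic on the printed values, with illustrative variable names; the p-value itself is not recomputed here.

```python
# Values copied from the ANOVA table above.
rss_1, df_1 = 25874730, 4560  # Model 1: without the interaction
rss_2, df_2 = 25853340, 4556  # Model 2: with Gender * BooksHome

extra_terms = df_1 - df_2     # parameters added by the interaction
ss_diff = rss_1 - rss_2       # reduction in residual sum of squares

# F = (reduction in RSS per added parameter) / (residual variance of the larger model)
f_stat = (ss_diff / extra_terms) / (rss_2 / df_2)

print(extra_terms, ss_diff, round(f_stat, 4))
```

An F value below 1 means the added terms reduce the residual sum of squares by less, per parameter, than the residual variance of the larger model.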

Diagnostics

Multicollinearity

##                GVIF Df GVIF^(1/(2*Df))
## Gender     1.022314  1        1.011095
## BornInOut  1.007450  1        1.003718
## ParentEduc 1.112564  5        1.010724
## BooksHome  1.115898  4        1.013802
## CompOwn    1.010958  1        1.005464
## CompShare  1.015339  1        1.007640

According to the (generalized) Variance Inflation Factors, all values are well below 5, so there is no multicollinearity problem. In other words, no two or more explanatory variables in the multiple regression model are highly linearly related, which is good.
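
The last column of the output, GVIF^(1/(2*Df)), rescales the generalized VIF so that terms with different numbers of dummy variables are comparable. The sketch below recomputes it in Python from the first two columns; the dictionary layout is an illustrative choice.

```python
# term: (GVIF, Df), copied from the output above
gvif = {
    "Gender": (1.022314, 1),
    "BornInOut": (1.007450, 1),
    "ParentEduc": (1.112564, 5),
    "BooksHome": (1.115898, 4),
    "CompOwn": (1.010958, 1),
    "CompShare": (1.015339, 1),
}

# GVIF^(1/(2*Df)); for a 1-df term this is simply sqrt(VIF).
adjusted = {term: g ** (1.0 / (2 * df)) for term, (g, df) in gvif.items()}

for term, value in adjusted.items():
    # Squaring the adjusted value puts it back on the usual VIF scale.
    print(f"{term:<10} {value:.6f}  (squared: {value ** 2:.4f})")
```

Because the adjusted value is on the square-root scale, it is its square that should be compared with a VIF threshold such as 5.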

Outliers

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferroni p
## 3992 -3.566067          0.0003661           NA

## [1] 1636 3992

The Bonferroni test flags no studentized residual as significant, but two observations (1636 and 3992; see the plot and the output above) stand out as “out of the scope” and are removed from the data.
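
The Bonferroni p-value in the output is shown as NA. A plausible reading, assumed here rather than confirmed by the source, is that the unadjusted p-value multiplied by the number of observations exceeds 1 and is therefore not reported. The arithmetic in Python:

```python
n_obs = 4574               # observations in the model
p_unadjusted = 0.0003661   # largest |rstudent| (observation 3992)

# Bonferroni correction: multiply by the number of tests (one per observation).
p_bonferroni = p_unadjusted * n_obs

print(round(p_bonferroni, 3), p_bonferroni > 1)
```

A corrected value above 1 cannot be significant at any level, which is consistent with the message that no residual passes the Bonferroni threshold.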

Leverage

No high-leverage observations are detected in the model.

Studentized residuals

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.566754 -0.676211  0.037246 -0.000012  0.682041  3.366918

The distribution of studentized residuals is bell-shaped and looks approximately normal.

Homoscedasticity

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 9.749616, Df = 1, p = 0.0017936

There is a sign of heteroscedasticity (p-value is 0.002). Let’s try to solve it.
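
The p-value of the score test follows from the chi-square statistic with 1 degree of freedom. For that special case the upper-tail probability has a closed form through the complementary error function, so it can be checked with the Python standard library alone; the helper name `chi2_sf_df1` is illustrative.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # If Z is standard normal, Z**2 is chi-square(1), so
    # P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

p_before = chi2_sf_df1(9.749616)  # statistic from the output above
print(f"{p_before:.7f}")
```

A p-value below 0.05 rejects constant variance, which is why a transformation of the dependent variable is tried next.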

## Box-Cox Transformation
## 
## 4572 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   273.1   484.5   540.5   538.3   593.9   819.8 
## 
## Largest/Smallest: 3 
## Sample Skewness: -0.126 
## 
## Estimated Lambda: 1.3

Some changes are observed in the distribution of the variable before and after its transformation.

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 3.340332, Df = 1, p = 0.067601

The new test results show an improvement: the p-value is now 0.07, so constant variance is no longer rejected at the 0.05 level.

Let’s look at the regression model after the corrections made during the diagnostics.

	MathScore_1
Predictors	Estimates	CI	p
(Intercept)	2429.25	2313.62 – 2544.87	<0.001
Gender [Girl]	-55.10	-84.12 – -26.08	<0.001
BornInOut [Yes]	-1.61	-81.92 – 78.70	0.969
ParentEduc [PostSecNotUni]	241.64	181.50 – 301.78	<0.001
ParentEduc [SomePrLowSecNoSch]	576.16	138.15 – 1014.18	0.010
ParentEduc [UniOrHigher]	345.93	287.62 – 404.24	<0.001
ParentEduc [Unknow]	136.03	69.59 – 202.46	<0.001
ParentEduc [UpperSec]	19.68	-46.20 – 85.57	0.558
BooksHome [101-200]	180.09	111.52 – 248.66	<0.001
BooksHome [11-25]	38.25	-24.51 – 101.01	0.232
BooksHome [26-100]	94.85	32.81 – 156.88	0.003
BooksHome [More than 200]	150.18	75.15 – 225.21	<0.001
CompOwn [Yes]	-95.44	-135.80 – -55.08	<0.001
CompShare [Yes]	126.91	87.69 – 166.13	<0.001
Observations	4572
R² / R² adjusted	0.102 / 0.099

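Many general tendencies remain the same. However, the estimated values change and the intercept increases after the Box-Cox power transformation (estimated lambda is 1.3), because the dependent variable is now on a transformed scale rather than in score points. In detail, the intercept is 2429.25.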

On the transformed scale, girls score 55.10 units lower in math than boys.

As for parental education, on the transformed scale student’s math scores are higher by 241.64 for “Post-secondary but not University”, by 576.16 for “Some Primary, Lower Secondary or No School”, by 345.93 for “University or Higher”, by 136.03 if a student does not know his/her parents’ level of education, and by 19.68 for “Upper Secondary” (not significant). All of these are in comparison with students whose parents have “Lower-Secondary” education.

As for books, student’s math scores are higher by 38.25 for 11-25 books (not significant), by 94.85 for 26-100 books, by 180.09 for 101-200 books, and by 150.18 for more than 200 books, all in comparison with students who have no more than 10 books at home.

In addition, if a student has a personal PC or tablet, his or her math scores are lower by 95.44 units on the transformed scale. This trend remains the same as it was before the transformation. Similarly, if a student shares a computer or tablet with someone else, his or her math scores are higher by 126.91.

Overall, the transformation changed all estimated values considerably while leaving the general trends the same. Roughly speaking, the more books a student has and the better educated his/her parents are, the more likely he or she is to receive higher scores in math.
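
Estimates from the transformed model are not in score points, so they cannot be compared directly with the first model. The sketch below shows one way to map fitted values back to points. It assumes the transformed variable was computed with the standard Box-Cox form (y^lambda - 1) / lambda and lambda = 1.3; the source does not state the exact formula, so this is an assumption, and the function names are illustrative.

```python
LAMBDA = 1.3  # estimated lambda from the Box-Cox output above

def box_cox(y, lam=LAMBDA):
    """Standard Box-Cox transform for lam != 0 (assumed form, see lead-in)."""
    return (y ** lam - 1.0) / lam

def inv_box_cox(z, lam=LAMBDA):
    """Inverse of box_cox: maps a transformed value back to score points."""
    return (lam * z + 1.0) ** (1.0 / lam)

intercept_t = 2429.25  # intercept of the transformed model
girl_t = -55.10        # Gender [Girl] estimate of the transformed model

# Back-transform fitted values, not the coefficients themselves.
baseline_points = inv_box_cox(intercept_t)
girl_points = inv_box_cox(intercept_t + girl_t)

print(round(baseline_points, 1), round(girl_points - baseline_points, 1))
```

Because the transformation is nonlinear, the gap in points depends on the baseline at which it is evaluated, so a single back-transformed "effect" holds only for that profile.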

Exploratory Factor Analysis (EFA)

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                 CompOwn  CompShare  StudyDesk    RoomOwn InternetCon  MobileOwn
## CompOwn               1 Polychoric Polychoric Polychoric  Polychoric Polychoric
## CompShare      -0.07312          1 Polychoric Polychoric  Polychoric Polychoric
## StudyDesk        0.3785  -0.008696          1 Polychoric  Polychoric Polychoric
## RoomOwn          0.2547   0.002198     0.5027          1  Polychoric Polychoric
## InternetCon      0.3291      0.372     0.3931     0.1609           1 Polychoric
## MobileOwn        0.2582     0.3336     0.3585     0.1727      0.5743          1
## GamingSys        0.2804   -0.05849    0.06853     0.1032      0.1569     0.2284
## MusicInst    -0.0003254     0.1113      0.169    0.05956      0.2099     0.2623
## Car              0.1831     0.2058     0.1886     0.1923      0.3169     0.2407
## HomeAtLeast4     0.1314    0.02927     0.1642     0.3308     0.09625    0.04024
## Dishwasher       0.1893    0.04221     0.2343     0.2099      0.2213     0.3684
##               GamingSys  MusicInst        Car HomeAtLeast4 Dishwasher
## CompOwn      Polychoric Polychoric Polychoric   Polychoric Polychoric
## CompShare    Polychoric Polychoric Polychoric   Polychoric Polychoric
## StudyDesk    Polychoric Polychoric Polychoric   Polychoric Polychoric
## RoomOwn      Polychoric Polychoric Polychoric   Polychoric Polychoric
## InternetCon  Polychoric Polychoric Polychoric   Polychoric Polychoric
## MobileOwn    Polychoric Polychoric Polychoric   Polychoric Polychoric
## GamingSys             1 Polychoric Polychoric   Polychoric Polychoric
## MusicInst         0.117          1 Polychoric   Polychoric Polychoric
## Car              0.2356     0.1726          1   Polychoric Polychoric
## HomeAtLeast4     0.1378    0.07935     0.3366            1 Polychoric
## Dishwasher       0.3875     0.2259     0.3166       0.2459          1
## 
## Standard Errors:
##              CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn                                                               
## CompShare    0.03451                                                  
## StudyDesk    0.03323   0.04021                                        
## RoomOwn      0.02757   0.02892   0.02751                              
## InternetCon  0.04458   0.04263   0.04691 0.04435                      
## MobileOwn    0.05615   0.05259   0.05734  0.0535     0.05207          
## GamingSys    0.03051   0.02963   0.03644 0.02595     0.05007   0.06345
## MusicInst    0.02896   0.02842   0.03432 0.02464      0.0465   0.05778
## Car          0.02834    0.0276   0.03319 0.02427     0.04165   0.05222
## HomeAtLeast4 0.02805   0.02781   0.03294 0.02228     0.04437   0.05407
## Dishwasher   0.03275   0.03171   0.03979 0.02703     0.05513   0.07743
##              GamingSys MusicInst     Car HomeAtLeast4
## CompOwn                                              
## CompShare                                            
## StudyDesk                                            
## RoomOwn                                              
## InternetCon                                          
## MobileOwn                                            
## GamingSys                                            
## MusicInst      0.02514                               
## Car            0.02539   0.02428                     
## HomeAtLeast4   0.02467   0.02361 0.02221             
## Dishwasher     0.02441    0.0256  0.0263      0.02527
## 
## n = 4574

The polychoric method was used. However, the values indicate that most correlations are weak to moderate.

Converting binary variables to numeric ones for EFA…

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn              1   Pearson   Pearson Pearson     Pearson   Pearson
## CompShare     -0.03062         1   Pearson Pearson     Pearson   Pearson
## StudyDesk       0.1755 -0.003188         1 Pearson     Pearson   Pearson
## RoomOwn         0.1357  0.001124    0.2424       1     Pearson   Pearson
## InternetCon     0.1174    0.1372    0.1409 0.05442           1   Pearson
## MobileOwn      0.07369    0.1012    0.1069  0.0485      0.2098         1
## GamingSys       0.1229  -0.02942    0.0273 0.05793     0.04376   0.04854
## MusicInst    -8.71e-05   0.05678   0.06994 0.03558      0.0628   0.06159
## Car            0.09508    0.1103   0.08462  0.1161      0.1113   0.06882
## HomeAtLeast4   0.06857   0.01555   0.07241   0.206     0.03189     0.011
## Dishwasher     0.08001   0.01948   0.07895  0.1092     0.05379   0.06016
##              GamingSys MusicInst     Car HomeAtLeast4 Dishwasher
## CompOwn        Pearson   Pearson Pearson      Pearson    Pearson
## CompShare      Pearson   Pearson Pearson      Pearson    Pearson
## StudyDesk      Pearson   Pearson Pearson      Pearson    Pearson
## RoomOwn        Pearson   Pearson Pearson      Pearson    Pearson
## InternetCon    Pearson   Pearson Pearson      Pearson    Pearson
## MobileOwn      Pearson   Pearson Pearson      Pearson    Pearson
## GamingSys            1   Pearson Pearson      Pearson    Pearson
## MusicInst      0.06875         1 Pearson      Pearson    Pearson
## Car             0.1301    0.1025       1      Pearson    Pearson
## HomeAtLeast4   0.08161    0.0495  0.2097            1    Pearson
## Dishwasher      0.2256    0.1288  0.1612       0.1385          1
## 
## Standard Errors:
##              CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn                                                               
## CompShare    0.01477                                                  
## StudyDesk    0.01433   0.01479                                        
## RoomOwn      0.01452   0.01479   0.01392                              
## InternetCon  0.01458   0.01451   0.01449 0.01474                      
## MobileOwn    0.01471   0.01464   0.01462 0.01475     0.01414          
## GamingSys    0.01456   0.01477   0.01478 0.01474     0.01476   0.01475
## MusicInst    0.01479   0.01474   0.01472 0.01477     0.01473   0.01473
## Car          0.01465   0.01461   0.01468 0.01459      0.0146   0.01472
## HomeAtLeast4 0.01472   0.01478   0.01471 0.01416     0.01477   0.01479
## Dishwasher   0.01469   0.01478    0.0147 0.01461     0.01474   0.01473
##              GamingSys MusicInst     Car HomeAtLeast4
## CompOwn                                              
## CompShare                                            
## StudyDesk                                            
## RoomOwn                                              
## InternetCon                                          
## MobileOwn                                            
## GamingSys                                            
## MusicInst      0.01472                               
## Car            0.01454   0.01463                     
## HomeAtLeast4   0.01469   0.01475 0.01414             
## Dishwasher     0.01404   0.01454  0.0144       0.0145
## 
## n = 4574 
## 
## P-values for Tests of Bivariate Normality:
##              CompOwn CompShare StudyDesk RoomOwn InternetCon MobileOwn
## CompOwn                                                               
## CompShare          0                                                  
## StudyDesk          0         0                                        
## RoomOwn            0         0         0                              
## InternetCon        0         0         0       0                      
## MobileOwn          0         0         0       0           0          
## GamingSys          0         0         0       0           0         0
## MusicInst          0         0         0       0           0         0
## Car                0         0         0       0           0         0
## HomeAtLeast4       0         0         0       0           0         0
## Dishwasher         0         0         0       0           0         0
##              GamingSys MusicInst Car HomeAtLeast4
## CompOwn                                          
## CompShare                                        
## StudyDesk                                        
## RoomOwn                                          
## InternetCon                                      
## MobileOwn                                        
## GamingSys                                        
## MusicInst            0                           
## Car                  0         0                 
## HomeAtLeast4         0         0   0             
## Dishwasher           0         0   0            0

##                   CompOwn    CompShare    StudyDesk      RoomOwn  InternetCon
## CompOwn      0.000000e+00 3.454333e-01 2.827048e-31 1.392336e-18 6.668991e-14
## CompShare    3.838148e-02 0.000000e+00 1.000000e+00 1.000000e+00 5.371914e-19
## StudyDesk    5.654096e-33 8.293448e-01 0.000000e+00 1.933944e-60 4.764962e-20
## RoomOwn      3.094081e-20 9.394134e-01 3.516261e-62 0.000000e+00 3.925592e-03
## InternetCon  1.626583e-15 1.167807e-20 9.927004e-22 2.309172e-04 0.000000e+00
## MobileOwn    6.069895e-07 7.005627e-12 4.126251e-13 1.034738e-03 1.138183e-46
## GamingSys    7.198714e-17 4.664933e-02 6.490567e-02 8.847064e-05 3.076208e-03
## MusicInst    9.953012e-01 1.218446e-04 2.192299e-06 1.611820e-02 2.137414e-05
## Car          1.170932e-10 7.574369e-14 9.941703e-09 3.346403e-15 4.382150e-14
## HomeAtLeast4 3.459793e-06 2.929141e-01 9.459965e-07 5.117905e-45 3.101187e-02
## Dishwasher   6.010529e-08 1.877588e-01 8.985477e-08 1.287644e-13 2.731890e-04
##                 MobileOwn    GamingSys    MusicInst          Car HomeAtLeast4
## CompOwn      1.699571e-05 3.023460e-15 1.000000e+00 3.864074e-09 7.964984e-05
## CompShare    2.381913e-10 3.731947e-01 2.193203e-03 2.878260e-12 1.000000e+00
## StudyDesk    1.485450e-11 4.543397e-01 5.699978e-05 3.181345e-07 2.554190e-05
## RoomOwn      1.432162e-02 1.680942e-03 1.773002e-01 1.338561e-13 2.610131e-43
## InternetCon  6.032370e-45 3.691450e-02 4.702312e-04 1.709039e-12 3.101187e-01
## MobileOwn    0.000000e+00 1.432162e-02 6.450884e-04 7.964984e-05 1.000000e+00
## GamingSys    1.022973e-03 0.000000e+00 7.964984e-05 4.524732e-17 1.009208e-06
## MusicInst    3.071850e-05 3.252775e-06 0.000000e+00 1.289028e-10 1.215871e-02
## Car          3.185994e-06 1.028348e-18 3.682937e-12 0.000000e+00 6.346882e-45
## HomeAtLeast4 4.571685e-01 3.255510e-08 8.105804e-04 1.220554e-46 0.000000e+00
## Dishwasher   4.680435e-05 7.302184e-54 2.168822e-18 5.169642e-28 5.118890e-21
##                Dishwasher
## CompOwn      1.803159e-06
## CompShare    1.000000e+00
## StudyDesk    2.605788e-06
## RoomOwn      4.764283e-12
## InternetCon  4.371023e-03
## MobileOwn    9.360870e-04
## GamingSys    3.943179e-52
## MusicInst    9.325934e-17
## Car          2.533125e-26
## HomeAtLeast4 2.405878e-19
## Dishwasher   0.000000e+00

The correlations have become weaker than they were. The table above presents the p-values of the correlation tests.

Kaiser-Meyer-Olkin factor adequacy

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = efa_data2)
## Overall MSA =  0.66
## MSA for each item = 
##      CompOwn    CompShare    StudyDesk      RoomOwn  InternetCon    MobileOwn 
##         0.69         0.57         0.65         0.65         0.65         0.66 
##    GamingSys    MusicInst          Car HomeAtLeast4   Dishwasher 
##         0.64         0.69         0.70         0.66         0.68

## 
##  Bartlett test of homogeneity of variances
## 
## data:  efa_data2
## Bartlett's K-squared = 11896, df = 10, p-value < 2.2e-16

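KMO is above 0.5 (overall MSA is 0.66), so the sample can be treated as adequate for EFA. As for Bartlett’s test, the p-value is less than 0.05. Note that the output above is labelled as Bartlett’s test of homogeneity of variances; it is read here in place of Bartlett’s test of sphericity, which checks whether the correlation matrix differs from the identity matrix. On this reading, there might be statistically significant relations among the variables in the data. Overall, EFA can be performed on this data.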
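
For reference, Bartlett’s test of sphericity is computed from the determinant of the correlation matrix: chi-square = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom. The sketch below implements it in plain Python and applies it, for illustration only, to three of the Pearson correlations reported above (CompOwn, StudyDesk, RoomOwn). It is not the result for the full 11-variable matrix, and the function names are illustrative.

```python
import math

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def bartlett_sphericity(corr, n_obs):
    """Return (chi-square, df) for H0: the correlation matrix is the identity."""
    p = len(corr)
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6.0) * math.log(determinant(corr))
    df = p * (p - 1) // 2
    return chi2, df

# Pearson correlations among CompOwn, StudyDesk, RoomOwn (from the table above).
corr = [
    [1.0,    0.1755, 0.1357],
    [0.1755, 1.0,    0.2424],
    [0.1357, 0.2424, 1.0],
]
chi2, df = bartlett_sphericity(corr, n_obs=4574)
print(round(chi2, 1), df)
```

With uncorrelated variables the determinant equals 1 and the statistic is 0; the stronger the correlations, the smaller the determinant and the larger the statistic.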

## Parallel analysis suggests that the number of factors =  5  and the number of components =  4

##  [1]  1.02636489  0.31527128  0.26993778  0.14698541  0.03687583 -0.05018421
##  [7] -0.10235980 -0.12685946 -0.14058901 -0.15685137 -0.19222728

According to the Parallel Analysis Scree Plot, 5 factors should be used in EFA.

Model Creation

During model creation, the parameters are set mainly by hand, and many different options have been tried. The models presented and compared below are the best versions (judged visually) that were picked. These models contain 2 or 3 factors. When the recommendation of the Parallel Analysis Scree Plot is followed and 4 or 5 factors are set, some of the resulting factors have fewer than 3 variables, which is undesirable in the analysis.

Model 1

Settings:

Number of factors: 3
Rotation: varimax, orthogonal
Factoring method: ml, maximum likelihood factor analysis
Method of finding correlation: mixed

To improve the visual appearance of the model, the variable related to having a musical instrument is removed.

In detail…

## Factor Analysis using method =  ml
## Call: fa(r = efa_data3, nfactors = 3, rotate = "varimax", fm = "ml", 
##     cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                ML1   ML2   ML3   h2    u2 com
## CompOwn       0.36  0.16  0.29 0.24 0.761 2.3
## CompShare    -0.11  0.56 -0.03 0.32 0.678 1.1
## StudyDesk     0.93  0.17  0.06 0.90 0.098 1.1
## RoomOwn       0.53 -0.01  0.24 0.33 0.668 1.4
## InternetCon   0.28  0.73  0.17 0.64 0.359 1.4
## MobileOwn     0.25  0.64  0.27 0.54 0.460 1.7
## GamingSys     0.03  0.06  0.58 0.34 0.657 1.0
## Car           0.13  0.26  0.43 0.27 0.731 1.8
## HomeAtLeast4  0.17 -0.04  0.38 0.18 0.825 1.4
## Dishwasher    0.18  0.15  0.60 0.42 0.582 1.3
## 
##                        ML1  ML2  ML3
## SS loadings           1.50 1.40 1.28
## Proportion Var        0.15 0.14 0.13
## Cumulative Var        0.15 0.29 0.42
## Proportion Explained  0.36 0.34 0.31
## Cumulative Proportion 0.36 0.69 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  45  and the objective function was  2.27 with Chi Square of  10353.9
## The degrees of freedom for the model are 18  and the objective function was  0.27 
## 
## The root mean square of the residuals (RMSR) is  0.06 
## The df corrected root mean square of the residuals is  0.09 
## 
## The harmonic number of observations is  4574 with the empirical chi square  1271.71  with prob <  4.8e-259 
## The total number of observations was  4574  with Likelihood Chi Square =  1247.15  with prob <  8.8e-254 
## 
## Tucker Lewis Index of factoring reliability =  0.702
## RMSEA index =  0.122  and the 90 % confidence intervals are  0.116 0.128
## BIC =  1095.45
## Fit based upon off diagonal values = 0.95
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3
## Correlation of (regression) scores with factors   0.94 0.84 0.79
## Multiple R square of scores with factors          0.89 0.71 0.62
## Minimum correlation of possible factor scores     0.78 0.42 0.24

Parameters:

Proportion Explained: 0.36, 0.34, 0.31 (acceptable)
Cum. Var.: 42%
TLI: 0.702 (should be at least .90)
RMSEA: 0.122 (above 0.10, poor fit)
RMSR: 0.06
BIC: 1095.45

The loadings show that no variable is related to several factors at the same time.

## 
## Loadings:
##              ML1    ML2    ML3   
## CompOwn       0.362              
## CompShare            0.557       
## StudyDesk     0.932              
## RoomOwn       0.526              
## InternetCon          0.733       
## MobileOwn            0.637       
## GamingSys                   0.582
## Car                         0.432
## HomeAtLeast4                0.382
## Dishwasher                  0.602
## 
##                  ML1   ML2   ML3
## SS loadings    1.502 1.402 1.278
## Proportion Var 0.150 0.140 0.128
## Cumulative Var 0.150 0.290 0.418
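
The h2 (communality), SS loadings, and Proportion Var values in the output all derive from the loading matrix: a communality is the sum of squared loadings across factors for one variable, and SS loadings is the sum of squared loadings across variables for one factor. The sketch below recomputes them in Python from the rounded loadings printed above, so small rounding differences are expected.

```python
# variable: (ML1, ML2, ML3), copied from the Model 1 output above
loadings = {
    "CompOwn":      (0.36, 0.16, 0.29),
    "CompShare":    (-0.11, 0.56, -0.03),
    "StudyDesk":    (0.93, 0.17, 0.06),
    "RoomOwn":      (0.53, -0.01, 0.24),
    "InternetCon":  (0.28, 0.73, 0.17),
    "MobileOwn":    (0.25, 0.64, 0.27),
    "GamingSys":    (0.03, 0.06, 0.58),
    "Car":          (0.13, 0.26, 0.43),
    "HomeAtLeast4": (0.17, -0.04, 0.38),
    "Dishwasher":   (0.18, 0.15, 0.60),
}

# Communality: share of a variable's variance explained by all factors together.
h2 = {var: sum(l * l for l in row) for var, row in loadings.items()}

# SS loadings: variance accounted for by each factor, summed over variables.
ss_loadings = [sum(row[k] ** 2 for row in loadings.values()) for k in range(3)]

# Proportion Var: SS loadings divided by the number of variables.
prop_var = [ss / len(loadings) for ss in ss_loadings]

print({var: round(value, 2) for var, value in h2.items()})
print([round(ss, 2) for ss in ss_loadings], [round(p, 2) for p in prop_var])
```

A low communality, as for HomeAtLeast4, means the factors explain little of that variable, which helps explain the weak overall fit.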

Model 2

Settings:

Number of factors: 2
Rotation: varimax, orthogonal
Factoring method: ml, maximum likelihood factor analysis
Method of finding correlation: mixed

To improve the visual appearance of the model, the variables related to having a musical instrument and a gaming system are removed.

In detail…

## Factor Analysis using method =  ml
## Call: fa(r = efa_data4, nfactors = 2, rotate = "varimax", fm = "ml", 
##     cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                ML2   ML1   h2   u2 com
## CompOwn       0.43  0.23 0.24 0.76 1.5
## CompShare    -0.10  0.50 0.26 0.74 1.1
## StudyDesk     0.67  0.25 0.52 0.48 1.3
## RoomOwn       0.71 -0.01 0.50 0.50 1.0
## InternetCon   0.26  0.76 0.65 0.35 1.2
## MobileOwn     0.25  0.69 0.54 0.46 1.3
## Car           0.28  0.31 0.18 0.82 2.0
## HomeAtLeast4  0.39  0.00 0.15 0.85 1.0
## Dishwasher    0.33  0.26 0.18 0.82 1.9
## 
##                        ML2  ML1
## SS loadings           1.63 1.58
## Proportion Var        0.18 0.18
## Cumulative Var        0.18 0.36
## Proportion Explained  0.51 0.49
## Cumulative Proportion 0.51 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  36  and the objective function was  2 with Chi Square of  9136.11
## The degrees of freedom for the model are 19  and the objective function was  0.34 
## 
## The root mean square of the residuals (RMSR) is  0.07 
## The df corrected root mean square of the residuals is  0.1 
## 
## The harmonic number of observations is  4574 with the empirical chi square  1663.95  with prob <  0 
## The total number of observations was  4574  with Likelihood Chi Square =  1564.48  with prob <  6.299337e-321 
## 
## Tucker Lewis Index of factoring reliability =  0.678
## RMSEA index =  0.133  and the 90 % confidence intervals are  0.128 0.139
## BIC =  1404.34
## Fit based upon off diagonal values = 0.93
## Measures of factor score adequacy             
##                                                    ML2  ML1
## Correlation of (regression) scores with factors   0.84 0.86
## Multiple R square of scores with factors          0.71 0.74
## Minimum correlation of possible factor scores     0.41 0.47

Parameters:

Proportion Explained: 0.51, 0.49 (nearly equal, which is good)
Cum. Var.: 36%
TLI: 0.678 (should be at least .90)
RMSEA: 0.133 (above 0.10, poor fit)
RMSR: 0.07
BIC: 1404.34

The loadings show that no variable is related to several factors at the same time.

## 
## Loadings:
##              ML2    ML1   
## CompOwn       0.431       
## CompShare            0.501
## StudyDesk     0.675       
## RoomOwn       0.706       
## InternetCon          0.760
## MobileOwn            0.688
## Car                  0.311
## HomeAtLeast4  0.387       
## Dishwasher    0.333       
## 
##                  ML2   ML1
## SS loadings    1.626 1.579
## Proportion Var 0.181 0.175
## Cumulative Var 0.181 0.356

Model 3

Settings:

Number of factors: 2
Rotation: oblimin, oblique transformation
Factoring method: ml, maximum likelihood factor analysis
Method of finding correlation: mixed

To make the model easier to read, the variables related to having a musical instrument, a gaming system, a car, and a dishwasher are removed.

In detail…

## Factor Analysis using method =  ml
## Call: fa(r = efa_data5, nfactors = 2, rotate = "oblimin", fm = "ml", 
##     cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                ML1   ML2    h2   u2 com
## CompOwn       0.17  0.40 0.243 0.76 1.4
## CompShare     0.56 -0.26 0.267 0.73 1.4
## StudyDesk     0.15  0.71 0.606 0.39 1.1
## RoomOwn      -0.10  0.70 0.449 0.55 1.0
## InternetCon   0.81  0.07 0.701 0.30 1.0
## MobileOwn     0.66  0.09 0.492 0.51 1.0
## HomeAtLeast4 -0.05  0.33 0.099 0.90 1.0
## 
##                        ML1  ML2
## SS loadings           1.49 1.37
## Proportion Var        0.21 0.20
## Cumulative Var        0.21 0.41
## Proportion Explained  0.52 0.48
## Cumulative Proportion 0.52 1.00
## 
##  With factor correlations of 
##      ML1  ML2
## ML1 1.00 0.38
## ML2 0.38 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  21  and the objective function was  1.49 with Chi Square of  6810.31
## The degrees of freedom for the model are 8  and the objective function was  0.12 
## 
## The root mean square of the residuals (RMSR) is  0.05 
## The df corrected root mean square of the residuals is  0.08 
## 
## The harmonic number of observations is  4574 with the empirical chi square  514.07  with prob <  6.7e-106 
## The total number of observations was  4574  with Likelihood Chi Square =  559.52  with prob <  1.2e-115 
## 
## Tucker Lewis Index of factoring reliability =  0.787
## RMSEA index =  0.123  and the 90 % confidence intervals are  0.114 0.132
## BIC =  492.09
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    ML1  ML2
## Correlation of (regression) scores with factors   0.89 0.86
## Multiple R square of scores with factors          0.79 0.74
## Minimum correlation of possible factor scores     0.58 0.47

Parameters:

Proportion Explained: 0.52, 0.48 (almost equal, which is good)
Cum. Var.: 41%
TLI: 0.787 (should be at least .90)
RMSEA: 0.123 (unsatisfactory)
RMSR: 0.05
BIC: 492.09
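
The degrees of freedom reported in the output follow directly from the number of items and factors. The sketch below, in Python, uses the standard counting formulas for an exploratory factor model; the function names are illustrative and not taken from the original analysis.

```python
def efa_model_df(n_items, n_factors):
    """Degrees of freedom of an exploratory factor model."""
    return ((n_items - n_factors) ** 2 - (n_items + n_factors)) // 2

def null_model_df(n_items):
    """Degrees of freedom of the null model (no common factors)."""
    return n_items * (n_items - 1) // 2

# Model 2 uses 9 items, Model 3 uses 7 items; both extract 2 factors.
for label, n_items in [("Model 2", 9), ("Model 3", 7)]:
    print(label, null_model_df(n_items), efa_model_df(n_items, 2))
```

This gives 36 and 19 for Model 2 and 21 and 8 for Model 3, matching the printed output. Removing items therefore leaves fewer degrees of freedom for testing the same two-factor structure.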

In the loadings table below, where small loadings are suppressed, no variable loads on more than one factor at the same time.

## 
## Loadings:
##              ML1    ML2   
## CompOwn              0.400
## CompShare     0.556       
## StudyDesk            0.710
## RoomOwn              0.702
## InternetCon   0.810       
## MobileOwn     0.664       
## HomeAtLeast4         0.331
## 
##                  ML1   ML2
## SS loadings    1.471 1.347
## Proportion Var 0.210 0.192
## Cumulative Var 0.210 0.403
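
The `SS loadings` row here (1.471, 1.347) differs slightly from the one in the detailed output (1.49, 1.37). One reading that reproduces both sets of numbers is sketched below in Python: the loadings table sums squared pattern loadings column by column, while the detailed output also takes the factor correlation (0.38) into account. This is an assumption about how the two printouts are computed, and the inputs are the two-digit rounded loadings, so agreement is only approximate.

```python
# Pattern loadings (ML1, ML2) copied from the Model 3 output, rounded to 2 digits.
loadings = [
    (0.17, 0.40),   # CompOwn
    (0.56, -0.26),  # CompShare
    (0.15, 0.71),   # StudyDesk
    (-0.10, 0.70),  # RoomOwn
    (0.81, 0.07),   # InternetCon
    (0.66, 0.09),   # MobileOwn
    (-0.05, 0.33),  # HomeAtLeast4
]
phi = 0.38  # correlation between ML1 and ML2 from the output

# Plain column sums of squared loadings (ignores the factor correlation).
ss_plain = [sum(row[j] ** 2 for row in loadings) for j in range(2)]

# Cross-product of the two loading columns.
cross = sum(a * b for a, b in loadings)

# Sums of squares that also credit each factor with the correlated overlap.
ss_oblique = [ss + phi * cross for ss in ss_plain]

# Communality of one item under an oblique solution:
# l1^2 + l2^2 + 2 * phi * l1 * l2 (here for InternetCon).
l1, l2 = loadings[4]
h2_internet = l1 ** 2 + l2 ** 2 + 2 * phi * l1 * l2

print([round(v, 2) for v in ss_plain], [round(v, 2) for v in ss_oblique])
```

The same correction explains why the communality of `InternetCon` (about 0.70) is slightly larger than the plain sum of its squared loadings.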

Summing up

Models	NumFactors	PropExp	CumVar	ValueTLI	ValueRMSEA	ValueRMSR	ValueBIC
1	3	0.36, 0.34, 0.31	0.42	0.707	0.121	0.06	1071.82
2	2	0.51, 0.49	0.36	0.678	0.133	0.07	1404.34
3	2	0.52, 0.48	0.41	0.787	0.123	0.05	492.09

In all three models the explained variance is distributed fairly evenly among the factors. Although none of the RMSEA values is satisfactory, the 3rd EFA model can be called the best in terms of fit: it has the highest TLI (about 0.79), its RMSR is the closest to 0, and its BIC is the lowest. This model has two factors. The first one can be treated as access to gadgets (an internet connection, a personal mobile phone, a shared computer), while the second one relates to possessions that create individual conditions for studying (a personal room, a personal study desk, a personal PC/tablet, and a home with at least 4 rooms).

However, the first EFA model also has relatively good estimates. Moreover, its 3 factors can also be interpreted rather meaningfully: (1) personal settings for studying, (2) technical issues, and (3) home items.

Data Consistency

Cronbach’s alpha is computed separately for the items of each factor.

For the 1st EFA model

##  raw_alpha std.alpha   G6(smc) average_r       S/N        ase     mean
##   0.378326 0.4043881 0.3151827  0.184549 0.6789457 0.01525911 1.185104
##         sd  median_r
##  0.2504463 0.1755361

##  raw_alpha std.alpha   G6(smc) average_r       S/N        ase     mean
##   0.258524   0.34506 0.2637287 0.1493844 0.5268574 0.01574247 1.071855
##         sd  median_r
##  0.1587677 0.1371855

##  raw_alpha std.alpha   G6(smc) average_r       S/N        ase     mean      sd
##  0.4238019 0.4283606 0.3661256 0.1577803 0.7493545 0.01389692 1.585101 0.27524
##   median_r
##  0.1498401

Cronbach’s alpha is unacceptably low for all three factors (standardized alpha: factor 1 = .40, factor 2 = .35, factor 3 = .43).

For the 3rd EFA model

##  raw_alpha std.alpha   G6(smc) average_r       S/N        ase     mean
##   0.258524   0.34506 0.2637287 0.1493844 0.5268574 0.01574247 1.071855
##         sd  median_r
##  0.1587677 0.1371855

##  raw_alpha std.alpha   G6(smc) average_r       S/N        ase     mean
##  0.3951876 0.4139894 0.3565513 0.1501032 0.7064537 0.01408232 1.261095
##         sd  median_r
##  0.2443319 0.1556028

Here, Cronbach’s alpha is also unacceptably low (standardized alpha: factor 1 = .35, factor 2 = .41).
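
The `std.alpha` values in the outputs above can be reproduced from the number of items and the reported `average_r`. The Python sketch below uses the usual formula for standardized alpha; the item counts (3, 3, 4 for the 1st model and 3, 4 for the 3rd) are read off the factor descriptions and loadings tables, and the function name is illustrative.

```python
def standardized_alpha(n_items, average_r):
    """Standardized Cronbach's alpha from the average inter-item correlation."""
    return n_items * average_r / (1 + (n_items - 1) * average_r)

# (number of items, average_r) for each factor, copied from the outputs.
scales = {
    "Model 1, factor 1": (3, 0.184549),
    "Model 1, factor 2": (3, 0.1493844),
    "Model 1, factor 3": (4, 0.1577803),
    "Model 3, factor 1": (3, 0.1493844),
    "Model 3, factor 2": (4, 0.1501032),
}

alphas = {name: standardized_alpha(k, r) for name, (k, r) in scales.items()}
for name, value in alphas.items():
    print(f"{name}: {value:.4f}")
```

The formula shows why alpha stays low here: with only three or four items and average correlations around 0.15 to 0.18, alpha cannot get much above 0.4.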

Split-half Reliability

For the 1st EFA model

## Split half reliabilities  
## Call: splitHalf(r = efa_data3)
## 
## Maximum split half reliability (lambda 4) =  0.58
## Guttman lambda 6                          =  0.5
## Average split half reliability            =  0.5
## Guttman lambda 3 (alpha)                  =  0.5
## Guttman lambda 2                          =  0.51
## Minimum split half reliability  (beta)    =  0.36
## Average interitem r =  0.09  with median =  0.08

For the 3rd EFA model

## Split half reliabilities  
## Call: splitHalf(r = efa_data5)
## 
## Maximum split half reliability (lambda 4) =  0.48
## Guttman lambda 6                          =  0.4
## Average split half reliability            =  0.42
## Guttman lambda 3 (alpha)                  =  0.41
## Guttman lambda 2                          =  0.43
## Minimum split half reliability  (beta)    =  0.3
## Average interitem r =  0.09  with median =  0.07

Overall, the 1st model has slightly higher internal consistency than the 3rd one.

Taking everything into account, the first EFA model is chosen for further analysis due to its relative complexity, interpretability, and slightly better internal consistency.

Factor 1: Personal Settings for Studying

This variable consists of having (1) a study desk, (2) a personal room, and (3) a personal computer.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0656 -0.5154 -0.3712  0.0000  0.1548  5.0524

Factor 2: Technical Issues

This variable consists of (1) having an internet connection, (2) having a personal mobile phone, and (3) sharing a computer with someone else.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.85999 -0.47568 -0.29091  0.00000 -0.02974  7.69651

Factor 3: Home Items

This variable consists of (1) having a dishwasher, (2) having a gaming system, (3) having a car, and (4) having at least 4 rooms in the flat/house in which the student resides.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.51874 -0.53387  0.06417  0.00000  0.59631  3.25450

Linear Regression 2.0

The EFA yields 3 new variables (factor scores), which can now be used to explain math achievement scores.

Plots A, B, and C show how the values of the new factors are distributed against math scores. Interestingly, they look almost identical. In plots D, E, and F, linear relations can hardly be observed, especially in D and E.

Let’s add those three variables to the previously created linear regression model that was marked as the best one (before diagnostics).

	MathScore
Predictors	Estimates	CI	p
(Intercept)	496.64	480.20 – 513.08	<0.001
BornInOut [Yes]	-1.49	-13.72 – 10.74	0.811
Gender [Girl]	-9.11	-13.57 – -4.65	<0.001
ParentEduc [PostSecNotUni]	35.59	26.39 – 44.80	<0.001
ParentEduc [SomePrLowSecNoSch]	94.31	27.53 – 161.09	0.006
ParentEduc [UniOrHigher]	52.71	43.73 – 61.69	<0.001
ParentEduc [Unknow]	19.01	8.85 – 29.17	<0.001
ParentEduc [UpperSec]	2.28	-7.77 – 12.33	0.656
BooksHome101-200	27.33	16.86 – 37.81	<0.001
BooksHome11-25	5.79	-3.78 – 15.35	0.236
BooksHome26-100	14.24	4.77 – 23.72	0.003
BooksHome [More than 200]	23.37	11.92 – 34.82	<0.001
PersonalSettings	1.90	-0.37 – 4.16	0.100
TechIssue	-8.71	-11.07 – -6.34	<0.001
HomeItems	5.78	3.09 – 8.46	<0.001
Observations	4574
R² / R² adjusted	0.102 / 0.099

The model explains about 10% of the variance in math scores (R-squared is 0.10), as before. Being born inside or outside the country and the “Upper Secondary” level of parents’ education are still insignificant. Of the new variables, two are significant, while personal settings are not (p = 0.100).
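
The gap between R² and adjusted R² in the table can be checked by hand. The Python sketch below applies the standard adjustment formula, assuming 14 predictors (all coefficient rows except the intercept) and the 4574 observations shown in the table; the function name is illustrative.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared and sample size from the regression table; 14 slope coefficients.
adj = adjusted_r2(0.102, 4574, 14)
print(round(adj, 3))
```

With a sample this large, the penalty for 14 predictors is small, which is why the two values differ only in the third decimal place.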

The intercept is now 496.64, while before it was 489.55.

Girls score 9.11 points lower in math than boys (before: 8.39).

As for parents’ education, the comparison is with students whose parents have “Lower Secondary” education. A student’s math score is higher by 35.59 points (before: 37.36) for “Post-secondary but not University”, by 94.31 (87.78) for “Some Primary, Lower Secondary or No School”, by 52.71 (53.14) for “University or Higher”, by 19.01 (21.25) if the student does not know the parents’ level of education, and by 2.28 (3.35) for “Upper Secondary” (this last difference is still insignificant). The estimate for “Some Primary, Lower Secondary or No School” has a very wide confidence interval (27.53 to 161.09), so it should be read with caution.

As for books at home, the general trend is that the more books a student has at home, the higher his or her math score tends to be. More precisely, compared with students who have no more than 10 books at home, the score is higher by 5.79 points (before: 6.27) for 11-25 books (insignificant), by 14.24 (14.82) for 26-100 books, by 27.33 (27.62) for 101-200 books, and by 23.37 (23.12) for more than 200 books.

Turning to the newly added variables, personal settings for studying (having a study desk, a personal room, and a personal computer) are insignificant for math scores. Meanwhile, home items, which are more closely related to well-being and welfare, are associated with an increase of 5.78 points in math scores per one-unit increase in the factor score. At the same time, technical issues (having an internet connection, having a personal mobile phone, and sharing a computer with someone else) are associated with a decrease of 8.71 points per one-unit increase.
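
To see how the coefficients combine, the expected score for a given student profile can be computed directly from the table. The Python sketch below does this for a hypothetical student chosen only for illustration (a girl whose parents have a university degree, with 101-200 books at home and all three factor scores at their mean of 0); the `predict` helper is illustrative and not part of the original analysis.

```python
# Coefficients copied from the regression table above.
coef = {
    "(Intercept)": 496.64,
    "BornInOut [Yes]": -1.49,
    "Gender [Girl]": -9.11,
    "ParentEduc [PostSecNotUni]": 35.59,
    "ParentEduc [SomePrLowSecNoSch]": 94.31,
    "ParentEduc [UniOrHigher]": 52.71,
    "ParentEduc [Unknow]": 19.01,
    "ParentEduc [UpperSec]": 2.28,
    "BooksHome101-200": 27.33,
    "BooksHome11-25": 5.79,
    "BooksHome26-100": 14.24,
    "BooksHome [More than 200]": 23.37,
    "PersonalSettings": 1.90,
    "TechIssue": -8.71,
    "HomeItems": 5.78,
}

def predict(profile):
    """Expected math score; profile maps coefficient names to values
    (1 for a dummy that applies, a factor score for the EFA variables)."""
    return coef["(Intercept)"] + sum(coef[k] * v for k, v in profile.items())

# Hypothetical student: reference category for BornInOut, factor scores at 0.
student = {
    "Gender [Girl]": 1,
    "ParentEduc [UniOrHigher]": 1,
    "BooksHome101-200": 1,
    "PersonalSettings": 0.0,
    "TechIssue": 0.0,
    "HomeItems": 0.0,
}
baseline = predict(student)

# Same student with a TechIssue factor score one unit higher.
higher_tech = predict({**student, "TechIssue": 1.0})

print(round(baseline, 2), round(higher_tech, 2))
```

For this profile the expected score is 567.57, and raising the technical-issues score by one unit lowers it by 8.71 points, which is smaller than the gap between the lowest and the 101-200 books categories (27.33).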

Taking everything into account, it can be concluded that books matter more for 8th-grade student achievement in Russia nowadays than computers and technical items/issues.

Diagnostics 2.0

As a finishing touch, let’s check the model and perform simple diagnostics.

Multicollinearity

##                      GVIF Df GVIF^(1/(2*Df))
## BornInOut        1.006785  1        1.003387
## Gender           1.040444  1        1.020021
## ParentEduc       1.156237  5        1.014623
## BooksHome        1.125877  4        1.014931
## PersonalSettings 1.122668  1        1.059560
## TechIssue        1.171008  1        1.082131
## HomeItems        1.201389  1        1.096079

All (generalized) variance inflation factors are far below 5, so there is no sign of multicollinearity.
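
The last column of the output, `GVIF^(1/(2*Df))`, makes predictors with different numbers of coefficients comparable. The Python sketch below recomputes it from the first two columns; the values are copied from the output, and the variable names are illustrative.

```python
# (GVIF, Df) per term, copied from the output above.
gvif = {
    "BornInOut": (1.006785, 1),
    "Gender": (1.040444, 1),
    "ParentEduc": (1.156237, 5),
    "BooksHome": (1.125877, 4),
    "PersonalSettings": (1.122668, 1),
    "TechIssue": (1.171008, 1),
    "HomeItems": (1.201389, 1),
}

# For a term with Df coefficients, GVIF is rescaled by the power 1 / (2 * Df).
adjusted = {name: value ** (1 / (2 * df)) for name, (value, df) in gvif.items()}

for name, value in adjusted.items():
    print(f"{name}: {value:.6f}")
```

For a term with one coefficient the adjusted value is simply the square root of the ordinary VIF, so it is on a different scale from the raw values. Here both the raw and the adjusted values are close to 1, so the conclusion does not depend on which column is used.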

Outliers

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferroni p
## 3992 -3.507572         0.00045658           NA

## [1] 1636 3992

The Bonferroni test does not flag any studentized residual as significant. Still, a few observations stand out (1636 and 3992; see the plot and the output above) and have to be removed because they are “out of scope”.
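
The `NA` in the Bonferroni column can be understood with a short calculation. The Python sketch below multiplies the unadjusted p-value by the number of observations; using 4574 as the number of tests is an assumption based on the sample size reported in the regression table.

```python
unadjusted_p = 0.00045658  # largest |rstudent|, observation 3992
n_obs = 4574               # assumed number of residuals tested

# Bonferroni correction: multiply the p-value by the number of tests.
bonferroni_p = unadjusted_p * n_obs

# The residual counts as an outlier only if the corrected p-value is below 0.05.
is_outlier = bonferroni_p < 0.05

print(round(bonferroni_p, 2), is_outlier)
```

The product is greater than 1, which is not a valid probability and is consistent with the `NA` in the output: even the most extreme residual is not significant after the correction.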

Leverage

A high-leverage observation is detected: row 1061 has to be removed.

Studentized residuals

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -3.200286 -0.688121  0.024220  0.000037  0.696024  3.109911         2

The distribution of studentized residuals is bell-shaped and looks approximately normal.

Homoscedasticity

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 12.72465, Df = 1, p = 0.00036087

The test is significant (p < 0.001), which is a sign of heteroscedasticity. Let’s try to solve it with a Box-Cox transformation.

## Box-Cox Transformation
## 
## 4569 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   273.1   484.5   540.5   538.2   593.9   786.6 
## 
## Largest/Smallest: 2.88 
## Sample Skewness: -0.131 
## 
## Estimated Lambda: 1.3

The distribution of the variable changes somewhat after the transformation.
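
For reference, the Box-Cox transformation with the estimated lambda of 1.3 can be written out explicitly. The Python sketch below uses the textbook form of the transform and applies it to three values from the input summary; it is an assumption that the original analysis uses this exact parameterization, and the function name is illustrative.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform of a single positive value."""
    if y <= 0:
        raise ValueError("Box-Cox is defined for positive values only")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

lam = 1.3  # estimated lambda from the output above
summary = {"Min.": 273.1, "Median": 540.5, "Max.": 786.6}

transformed = {name: box_cox(value, lam) for name, value in summary.items()}
largest_to_smallest = summary["Max."] / summary["Min."]

for name, value in transformed.items():
    print(f"{name}: {value:.1f}")
```

Because lambda is greater than 1, the transform stretches high scores more than low ones, so it should not be expected to shrink the spread of the residuals at higher fitted values.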

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 161.7732, Df = 1, p = < 2.22e-16

The new results of the Non-constant Variance Score Test show no improvement: the p-value is still below 0.05, and the chi-square statistic has grown from 12.72 to 161.77. The transformation has not helped, so another method has to be used to deal with the heteroscedasticity problem.
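
The p-values of both score tests can be recomputed from their chi-square statistics. The Python sketch below relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper tail can be expressed through the complementary error function; the function name is illustrative.

```python
import math

def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Chi-square statistics from the two Non-constant Variance Score Tests.
p_before = chi2_upper_tail_df1(12.72465)   # before the transformation
p_after = chi2_upper_tail_df1(161.7732)    # after the transformation

print(f"{p_before:.5f}", p_after < 2.22e-16)
```

The first value matches the reported p of about 0.00036, and the second falls below the reported bound of 2.22e-16, so the evidence of heteroscedasticity is stronger after the transformation, not weaker.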

Project - TIMSS 2015

Data Analysis in Sociology, 3rd Course
