Principal component analysis, Scree plot and more

Question 1

This section provides results on analysis of interventions that try to promote growth mindsets in children learning to read. There are two groups in the analysis, namely a control group (control), and an intervention group (growth, or growth_mindset). The control group receives the usual classroom activities, whereas the growth mindset group spends an hour each week of the year doing activities aimed at promoting a growth mindset. Each child is tested at the beginning of the program (January), and then halfway through the program (June), and again at the end of the program (December). The analysis will provide descriptive statistics of the results between the 2 groups.

1.1 Explore the data descriptively, creating appropriate tables and figures, where needed.

Loading essential libraries

Loading the RDS files with the command below

Following tutorial 1, we can use the clean_names command from the janitor package to clean the column names with the code below
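The loading and name-cleaning steps might look roughly like the sketch below. The file path and the toy data are placeholders, since the real file names are not given here, and the `to_snake_case()` helper is a hypothetical base-R approximation of what `janitor::clean_names()` does, used only so the example runs without extra packages.

```r
# Toy stand-in for one of the data files; the real file names are not given in the text.
raw <- data.frame(
  "Participant ID" = c(1, 1, 1),
  "Time"           = c("January", "June", "December"),
  "Reading Score"  = c(9, 14, 22),
  check.names = FALSE
)

# Round-trip through an RDS file to mimic the loading step.
rds_path <- tempfile(fileext = ".rds")
saveRDS(raw, rds_path)
control_raw <- readRDS(rds_path)

# Hypothetical helper: lower-case snake_case names, similar in spirit
# to janitor::clean_names().
to_snake_case <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

control <- control_raw
names(control) <- to_snake_case(names(control))
names(control)
```

With janitor installed, the last three lines would typically reduce to a single call to `janitor::clean_names(control_raw)`.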

Getting summary statistics of the individual datasets.

participant_id	time	reading_score
Min. :1.0	Length:15	Min. : 7.00
1st Qu.:4.0	Class :character	1st Qu.: 9.00
Median :6.0	Mode :character	Median :14.00
Mean :5.4	NA	Mean :14.33
3rd Qu.:7.0	NA	3rd Qu.:17.50
Max. :9.0	NA	Max. :22.00

participant_id	time	reading_score
Min. : 2.0	Length:15	Min. : 9.00
1st Qu.: 3.0	Class :character	1st Qu.:11.00
Median : 5.0	Mode :character	Median :19.50
Mean : 5.6	NA	Mean :18.93
3rd Qu.: 8.0	NA	3rd Qu.:26.00
Max. :10.0	NA	Max. :32.00
NA	NA	NA’s :1

The individual summaries show that one reading score is missing in the growth group, whereas the control group has complete data. The quartiles reported here are examined with box plots later. Next, the two datasets are bound row-wise.

Joining the two datasets and viewing data types in the columns

## Classes 'data.table' and 'data.frame':   30 obs. of  3 variables:
##  $ participant_id: int  1 1 1 4 4 4 6 6 6 7 ...
##  $ time          : chr  "January" "June" "December" "January" ...
##  $ reading_score : num  9 14 22 9 13 18 7 9 16 9 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The column participant_id is of integer type, time is a character vector, and reading_score is numeric in both groups. Knowing these types helps when visualising the data with box plots, density plots and similar figures.

The following table provides descriptive statistics for the combined data. The mean is the arithmetic average, while the standard deviation measures how far the scores typically deviate from the mean. Notably, the standard deviation of reading_score is about 7 and the scores span a wide range, from a minimum of 7 to a maximum of 32 across both groups. The row for time is marked with an asterisk because it is a categorical variable that was converted to the codes 1 to 3 for the summary, so its statistics only reflect the balanced design of three measurement occasions. The median reading score is 16, which can also be seen in the box plots below.

## Warning in describeBy(df): no grouping variable requested

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
participant_id	1	30	5.50000	2.9213837	5.5	5.50	3.7065	1	10	9	0.0000000	-1.340653	0.5333693
time*	2	30	2.00000	0.8304548	2.0	2.00	1.4826	1	3	2	0.0000000	-1.598333	0.1516196
reading_score	3	29	16.55172	6.9825610	16.0	16.08	8.8956	7	32	25	0.5130776	-0.827273	1.2966290
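Because the combined table above pools both groups, a per-group breakdown can be useful. The following is a minimal sketch on made-up scores; the `group` column is an assumption added here for illustration (the bound data shown above has only three columns), and `aggregate()` drops the missing score by default.

```r
# Toy scores in the same long layout as the report (values are made up).
control <- data.frame(
  participant_id = rep(c(1, 4), each = 3),
  time           = rep(c("January", "June", "December"), times = 2),
  reading_score  = c(9, 14, 22, 9, 13, 18)
)
growth <- data.frame(
  participant_id = rep(c(2, 3), each = 3),
  time           = rep(c("January", "June", "December"), times = 2),
  reading_score  = c(10, 20, 30, 11, NA, 28)
)

# Label each data set before row-binding so the group is not lost.
control$group <- "control"
growth$group  <- "growth"
df <- rbind(control, growth)

# Keep the months in calendar order rather than alphabetical order.
df$time <- factor(df$time, levels = c("January", "June", "December"))

# Mean and standard deviation per group and month; rows with a
# missing score are omitted by the formula interface.
group_summary <- aggregate(
  reading_score ~ group + time,
  data = df,
  FUN  = function(x) c(mean = mean(x), sd = sd(x))
)
group_summary
```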

## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).

The box plot illustrates the combined data by month. The median, shown by the black line in the middle of each of the three boxes, is not at the same level across months, which shows that typical scores differ considerably over time. Scores in January are much lower than in the other two months for both groups. There is one outlier in January across the two groups. The individual groups can be examined with the same kind of plot:

## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).

The box plot above shows that the median scores of the growth group are higher in all months. The same holds for the first and third quartiles. The relationship between the scores of the two groups can be checked further with a correlation test. Correlation for reading scores:

## 
##  Pearson's product-moment correlation
## 
## data:  control$reading_score and growth$reading_score
## t = 4.3859, df = 12, p-value = 0.0008871
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4354933 0.9286596
## sample estimates:
##       cor 
## 0.7847463

The correlation test gives a coefficient of 0.78. This indicates a positive correlation between the reading scores of the two groups, and because the value lies between 0.7 and 0.9 the relationship can be described as strong. The relationship is statistically significant, since the p-value (0.00089) is much smaller than the significance level of 0.05. Note that the test pairs the two score vectors row by row and uses the 14 complete pairs (df = 12). A correlation coefficient can be misleading if the variables are not linearly related, which can be checked with the scatter plot below.
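The link between the reported coefficient, the t statistic and the p-value can be reproduced by hand. This short check uses only the numbers printed in the output above and the standard formula for testing a Pearson correlation against zero.

```r
# Values reported by cor.test() in the output above.
r  <- 0.7847463
df <- 12          # n - 2, so n = 14 complete pairs

# t statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance the two score vectors have in common.
r_squared <- r^2

round(c(t = t_stat, p = p_value, r_squared = r_squared), 4)
```

The squared correlation is the same quantity that appears as the multiple R-squared when one score vector is regressed on the other.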

The scatter plot confirms that as the reading scores of one group increase, the scores of the other group also increase, in line with the correlation test. The normality of the reading scores in both groups can be visualised with the density plot below, which is useful if ANOVA-type tests are considered.

## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

Histogram of reading scores of individual groups.

The histograms indicate that the highest scores are observed in the growth group. Similarly, the number of participants with high reading scores is larger in the growth group than in the control group. Neither group's scores appear to be normally distributed.

1.2 Mixed Linear Model analyses

The correlation coefficient of 0.78 indicates that the scores of the two groups vary together strongly, which motivates a regression analysis. According to the given condition, the two hypotheses are:

Null hypothesis: The growth group will improve more than the control group does over time.

Alternative hypothesis: Both groups will improve over time.

These hypotheses can be checked with a mixed linear regression model that shows the differences in scores between the control and intervention groups. A standard one-way ANOVA is not suitable here, not because only two groups are compared (with two groups it is equivalent to a t-test), but because each child is measured three times, so the observations are not independent. A t-test comparing the group means would be another option.
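One common way to express such a mixed model is a random intercept per child with a group-by-time interaction. The sketch below runs on simulated data and uses the nlme package, which ships with standard R installations; the `group` column, the effect sizes and the object names are illustrative assumptions, and this is not necessarily the model fitted in this report.

```r
library(nlme)  # recommended package shipped with standard R installations

set.seed(1)

# Simulated long-format data: 10 children, each measured three times.
toy <- expand.grid(
  time           = c("January", "June", "December"),
  participant_id = 1:10,
  stringsAsFactors = FALSE
)
toy$time  <- factor(toy$time, levels = c("January", "June", "December"))
toy$group <- factor(
  ifelse(toy$participant_id %in% c(1, 4, 6, 7, 9), "control", "growth"),
  levels = c("control", "growth")
)

# Made-up effects: everyone improves, the growth group improves faster.
child_effect <- rnorm(10, mean = 0, sd = 2)
occasion     <- as.integer(toy$time) - 1
toy$reading_score <- 9 + 4 * occasion +
  3 * occasion * (toy$group == "growth") +
  child_effect[toy$participant_id] +
  rnorm(nrow(toy), mean = 0, sd = 1.5)

# Random intercept per child; the group:time terms test whether
# improvement over time differs between the groups.
fit <- lme(reading_score ~ group * time,
           random = ~ 1 | participant_id,
           data = toy)
fixed_effects <- fixef(fit)
round(fixed_effects, 2)
```

In a model of this form, the time coefficients describe improvement in the control group, and the interaction coefficients describe any additional improvement in the growth group.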

We check the first hypothesis defined above and compare the reading scores of the two groups.

## 
## Call:
## lm(formula = growth$reading_score ~ control$reading_score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6663 -3.5473 -0.7782  4.0801  8.4385 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.533      4.206   0.365 0.721818    
## control$reading_score    1.224      0.279   4.386 0.000887 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.238 on 12 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6158, Adjusted R-squared:  0.5838 
## F-statistic: 19.24 on 1 and 12 DF,  p-value: 0.0008871

The linear regression model fits the data: the F-statistic is significant at the 5% level, with a p-value far below 0.05. The multiple $R^2$, the explanatory power of the model, is 0.62, meaning that the control scores explain about 62% of the variance in the growth scores. The adjusted $R^2$ is 0.58, so after correcting for the number of predictors the model explains about 58% of the variance in reading scores. This is good explanatory power for a model with a single predictor. The fitted regression equation is growth$reading_score = 1.533 + 1.224*control$reading_score. The t-value of the slope is large and positive, and the three asterisks next to Pr(>|t|) indicate that the coefficient of the control group's reading score is significant in the model. The QQ plot shows outliers at both ends of the combined dataset, and the points do not lie exactly on the straight reference line, so the residuals are not perfectly normally distributed.

In view of these results, we reject the hypothesis that the growth group will improve more than the control group over time and accept the alternative hypothesis that both groups will improve over time. Note that this model is an ordinary linear regression of one group's scores on the other's and contains no time or group term, so it supports this conclusion only indirectly.
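The adjusted $R^2$ and the F-statistic in the summary can be recovered from the other reported values. The check below uses only numbers from the output above, with the sample size inferred from the 12 residual degrees of freedom.

```r
# Values taken from the lm() summary above.
r_squared <- 0.6158   # Multiple R-squared
n         <- 14       # 12 residual df + 2 estimated coefficients
p         <- 1        # number of predictors
t_slope   <- 4.386    # t value of the slope

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# With a single predictor, the overall F statistic is the squared t value.
f_stat <- t_slope^2

round(c(adj_r_squared = adj_r_squared, f_stat = f_stat), 4)
```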

Question 2

2.1 Descriptive results

Before running a factor analysis, it is important to explore the distribution of the data through the mean scores and the standard deviation, which measures how far scores typically deviate from the mean. In this section we run descriptive summaries.

## Rows: 500 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (23): CNTSTUID, AGE, RESILIENCE, COMPETE, GFOFAIL, HISEI, HEDRES, PARED,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Warning in describeBy(pisa): no grouping variable requested

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
CNTSTUID	1	500	5.540405e+07	2439.6899810	55404165.0000	5.540406e+07	3300.2676000	55400007.0000	5.540812e+07	8113.0000	-0.0353487	-1.3215724	109.1062528
AGE	2	500	1.578112e+01	0.2796970	15.8300	1.578450e+01	0.3706500	15.2500	1.633000e+01	1.0800	-0.0855295	-1.0633293	0.0125084
RESILIENCE	3	481	-3.750350e-02	0.8491172	-0.0614	-9.818680e-02	0.6763621	-2.5400	2.369300e+00	4.9093	0.6925647	0.7358690	0.0387164
COMPETE	4	485	6.569710e-02	0.9270388	0.1956	4.544420e-02	0.8769579	-2.3450	2.005400e+00	4.3504	0.0589005	0.2644566	0.0420947
GFOFAIL	5	483	2.307613e-01	1.0117618	0.1097	2.492607e-01	1.0541286	-1.8939	1.890500e+00	3.7844	-0.0296047	-0.4881378	0.0460368
HISEI	6	472	5.738426e+01	20.6198941	62.3900	5.878032e+01	20.9046600	14.2100	8.870000e+01	74.4900	-0.5670280	-0.9092040	0.9491076
HEDRES	7	493	-2.132900e-02	1.0544390	0.0477	7.008910e-02	1.6314530	-4.4106	1.179300e+00	5.5899	-0.5743895	0.2289931	0.0474895
PARED	8	477	1.312264e+01	2.1589055	14.0000	1.334726e+01	1.4826000	3.0000	1.500000e+01	12.0000	-1.0608398	1.1561274	0.0988495
SCREADCOMP	9	488	6.256450e-02	1.0106729	0.1222	4.666940e-02	0.8557567	-2.4403	1.883900e+00	4.3242	0.0725264	-0.1001044	0.0457510
SCREADDIFF	10	491	1.506204e-01	1.0475901	0.3059	1.703288e-01	0.8904496	-1.8876	2.775200e+00	4.6628	-0.0218754	0.0332926	0.0472771
JOYREAD	11	493	-5.987650e-02	1.1559570	-0.1358	-6.934530e-02	0.9246976	-2.7316	2.613100e+00	5.3447	0.0756668	0.2492023	0.0520617
STIMREAD	12	489	1.362789e-01	0.9594024	0.2432	1.401313e-01	0.8501228	-2.3003	2.087100e+00	4.3874	-0.1390130	0.3585063	0.0433857
TMINS	13	363	1.571915e+03	338.9679834	1500.0000	1.521065e+03	185.3250000	500.0000	3.000000e+03	2500.0000	1.6577159	4.3127298	17.7912051
ST188Q01HA	14	480	3.106250e+00	0.4996476	3.0000	3.098958e+00	0.0000000	1.0000	4.000000e+00	3.0000	0.0051398	1.7035229	0.0228057
ST188Q02HA	15	480	3.231250e+00	0.6118715	3.0000	3.273438e+00	0.0000000	1.0000	4.000000e+00	3.0000	-0.4487851	0.7416076	0.0279280
ST188Q03HA	16	479	2.778706e+00	0.7550982	3.0000	2.781818e+00	0.0000000	1.0000	4.000000e+00	3.0000	-0.2515310	-0.2203333	0.0345013
ST188Q06HA	17	479	2.759917e+00	0.7966856	3.0000	2.789610e+00	0.0000000	1.0000	4.000000e+00	3.0000	-0.3874875	-0.1894055	0.0364015
ST188Q07HA	18	478	3.031381e+00	0.6117804	3.0000	3.062500e+00	0.0000000	1.0000	4.000000e+00	3.0000	-0.5100236	1.3750783	0.0279822
ST183Q01HA	19	484	2.766529e+00	0.9265106	3.0000	2.832474e+00	1.4826000	1.0000	4.000000e+00	3.0000	-0.3492380	-0.7165204	0.0421141
ST183Q02HA	20	485	2.723711e+00	0.9033653	3.0000	2.776350e+00	1.4826000	1.0000	4.000000e+00	3.0000	-0.2192383	-0.7534035	0.0410197
ST183Q03HA	21	484	2.828512e+00	0.9421865	3.0000	2.907217e+00	1.4826000	1.0000	4.000000e+00	3.0000	-0.3507155	-0.8145897	0.0428267
reading_fluency	22	500	1.610200e+01	2.7111430	17.0000	1.649750e+01	1.4826000	0.0000	1.900000e+01	19.0000	-1.7352602	4.5304120	0.1212460
ST184Q01HA	23	484	2.154959e+00	0.9001703	2.0000	2.095361e+00	1.4826000	1.0000	4.000000e+00	3.0000	0.3378031	-0.7078958	0.0409168

The n column shows that many variables contain missing values (NA), which need to be removed. The Likert items (the ST188, ST183 and ST184 variables) are scored from 1 to 4, and their means lie between about 2.2 and 3.2. As a first step we remove the missing values from the dataset.
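Counting and removing missing values can be done in base R as sketched below. The data frame is a small made-up stand-in that borrows three column names from the dataset; it is not the real data.

```r
# Made-up rows using three of the dataset's column names.
pisa_toy <- data.frame(
  RESILIENCE      = c(0.2, NA, -0.5, 1.1, 0.0),
  GFOFAIL         = c(0.1, 0.4, NA, -0.3, 0.9),
  reading_fluency = c(17, 16, 18, 12, 19)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(pisa_toy))
missing_per_column

# Keep only rows with no missing value in any column (listwise deletion).
pisa_complete <- na.omit(pisa_toy)
nrow(pisa_complete)
```

Listwise deletion removes a whole row when any single value is missing, which explains why a sample can shrink considerably once many columns are involved.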

As required by the question, the resilience variable can be inspected with a frequency bar chart.

Similarly for fear of failure.

The charts indicate that the resilience scores are approximately normally distributed, while the fear-of-failure scores are right-skewed.

We can now remove the columns not needed for the analysis. As the question asks for a two-factor structure, only the items of the resilience and fear-of-failure scales are kept.

Computing the Summary statistics of remaining columns.

## Warning in describeBy(pisa): no grouping variable requested

##            vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## ST188Q01HA    1 324 3.17 0.50      3    3.16 0.00   2   4     2  0.32     0.32
## ST188Q02HA    2 324 3.28 0.63      3    3.33 0.00   1   4     3 -0.59     0.76
## ST188Q03HA    3 324 2.86 0.77      3    2.88 0.00   1   4     3 -0.31    -0.26
## ST188Q06HA    4 324 2.75 0.83      3    2.78 1.48   1   4     3 -0.35    -0.36
## ST188Q07HA    5 324 3.04 0.62      3    3.07 0.00   1   4     3 -0.49     1.23
## ST183Q01HA    6 324 2.75 0.95      3    2.81 1.48   1   4     3 -0.31    -0.83
## ST183Q02HA    7 324 2.71 0.91      3    2.77 1.48   1   4     3 -0.19    -0.80
## ST183Q03HA    8 324 2.84 0.96      3    2.91 1.48   1   4     3 -0.30    -0.96
##              se
## ST188Q01HA 0.03
## ST188Q02HA 0.04
## ST188Q03HA 0.04
## ST188Q06HA 0.05
## ST188Q07HA 0.03
## ST183Q01HA 0.05
## ST183Q02HA 0.05
## ST183Q03HA 0.05

After selecting the Likert-scale items, the mean values range from 2.71 to 3.28, so there is little variability between items. The standard errors are also very small (0.03 to 0.05), although some items are slightly skewed.

Before running the principal component analysis we need to inspect the correlation matrix. To get a clear picture for the factor analysis, we look at the correlations among the variables.

The correlation coefficients among the resilience and fear-of-failure Likert items reach up to 0.68, so the item pairs range from uncorrelated to moderately strongly correlated. Some coefficients are negative, indicating item pairs that are negatively correlated. As a diagnostic we run the Bartlett test.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  pisa
## Bartlett's K-squared = 232.83, df = 7, p-value < 2.2e-16

The KMO required for the factor analysis is >0.6. We can check it by running KMO command below.

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pisa)
## Overall MSA =  0.69
## MSA for each item = 
## ST188Q01HA ST188Q02HA ST188Q03HA ST188Q06HA ST188Q07HA ST183Q01HA ST183Q02HA 
##       0.60       0.67       0.81       0.70       0.65       0.73       0.67 
## ST183Q03HA 
##       0.69

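The results of the two tests suggest that factor analysis is possible. The overall MSA of 0.69 is above the threshold of 0.6, and every item has an MSA of at least 0.6, so sampling adequacy is sufficient. The Bartlett output shown above is the test of homogeneity of variances (df = 7 for 8 items); its small p-value is taken here as further support, although the check usually reported for factorability is Bartlett's test of sphericity on the correlation matrix.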
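For comparison, Bartlett's test of sphericity asks whether the correlation matrix differs from an identity matrix. A minimal base-R sketch on simulated two-factor data follows; the `bartlett_sphericity()` helper and the toy items are illustrative assumptions, and the psych package offers `cortest.bartlett()` for the same purpose.

```r
set.seed(42)

# Simulated items: three driven by one latent factor, three by another.
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
items <- cbind(
  a1 = f1 + rnorm(n), a2 = f1 + rnorm(n), a3 = f1 + rnorm(n),
  b1 = f2 + rnorm(n), b2 = f2 + rnorm(n), b3 = f2 + rnorm(n)
)

# Hypothetical helper implementing the standard chi-square approximation.
bartlett_sphericity <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  R <- cor(x)
  chi_sq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chi_sq = chi_sq, df = df,
       p_value = pchisq(chi_sq, df, lower.tail = FALSE))
}

res <- bartlett_sphericity(items)
res
```

A small p-value here means the items are correlated enough for a factor or component analysis to be worthwhile. With 8 items this test has 28 degrees of freedom.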

Factor model: Principal component analysis

We now need to determine how many factors will be appropriate to use. We will run a PCA and analyse the eigenvalues. We will then use the Kaiser criterion to decide number of potential components.

## [1] 2.7401522 1.8338042 0.8166770 0.8023333 0.6212019 0.5785285 0.3531824
## [8] 0.2541204

Eigenvalues indicate how much of the variance in the original dataset each component explains. Applying Kaiser's rule of retaining components with eigenvalues greater than 1, two components qualify (2.74 and 1.83); the third eigenvalue (0.82) falls below 1. Let's check whether the scree plot supports this.
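The Kaiser count and the proportion of variance per component follow directly from the eigenvalues printed above, as this short check shows.

```r
# Eigenvalues copied from the output above.
eigenvalues <- c(2.7401522, 1.8338042, 0.8166770, 0.8023333,
                 0.6212019, 0.5785285, 0.3531824, 0.2541204)

# Kaiser criterion: keep components with an eigenvalue above 1.
n_kaiser <- sum(eigenvalues > 1)

# For a correlation matrix the eigenvalues sum to the number of items,
# so each eigenvalue divided by that sum is a proportion of variance.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

n_kaiser
round(cum_var, 4)
```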

The scree plot shows that there is at least one component before the steepness of the eigenvalue line (the amount of variance accounted for) levels out.

The two rules of thumb may lead to different conclusions. Horn’s parallel analysis is a better alternative:
* it generates many random samples of uncorrelated data of the same size as the study sample (i.e. ‘nonsense’ data)
* it computes a correlation matrix for each of the random samples
* these matrices are then factor analysed and eigenvalues are generated
* if an eigenvalue from the study sample is greater than the corresponding eigenvalue from the random, uncorrelated data, we can conclude there is a ‘true’ component explaining the variance in the variables
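The steps of parallel analysis can be sketched in base R as follows. This is a simplified version for principal components on simulated two-factor data; the 95th percentile threshold, the number of simulations and all object names are assumptions made for illustration, and `psych::fa.parallel()` is the usual packaged implementation.

```r
set.seed(123)

# Simulated study sample with two underlying factors.
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
items <- cbind(
  a1 = f1 + rnorm(n), a2 = f1 + rnorm(n), a3 = f1 + rnorm(n),
  b1 = f2 + rnorm(n), b2 = f2 + rnorm(n), b3 = f2 + rnorm(n)
)
p <- ncol(items)

# Eigenvalues of the observed correlation matrix.
observed <- eigen(cor(items), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from many samples of uncorrelated data of the same size.
n_sim <- 200
random_eigs <- replicate(n_sim, {
  noise <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
})

# One threshold per component position (rows index the position).
threshold <- apply(random_eigs, 1, quantile, probs = 0.95)

# Retain leading components whose eigenvalue beats the random threshold.
n_components <- sum(cumprod(observed > threshold))
n_components
```

The `cumprod()` step stops counting at the first component that fails the comparison, so a later eigenvalue that happens to exceed its threshold is not retained on its own.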

We use the fa() function with a rotation method that does not require the factors to be correlated, as is the case here.

The overall KMO reported above is 0.69, which is above 0.6, so the condition for a two-factor structure analysis is met.

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

Parallel analysis therefore suggests two components; no number of factors is reported (NA).

We therefore consider the two-factor solution, as required by the question.

## [1] 2.7401522 1.8338042 0.8166770 0.8023333 0.6212019 0.5785285 0.3531824
## [8] 0.2541204

## 
## Loadings:
##            PC1    PC2   
## ST188Q01HA  0.343  0.556
## ST188Q02HA  0.218  0.565
## ST188Q03HA  0.556  0.442
## ST188Q06HA  0.632  0.303
## ST188Q07HA  0.625  0.441
## ST183Q01HA -0.646  0.537
## ST183Q02HA -0.756  0.482
## ST183Q03HA -0.698  0.452
## 
##                  PC1   PC2
## SS loadings    2.740 1.834
## Proportion Var 0.343 0.229
## Cumulative Var 0.343 0.572

The fear-of-failure items (ST183Q01HA to ST183Q03HA) load negatively on PC1, in the opposite direction to the resilience items, which needs to be kept in mind when interpreting the components. Next we compute the cumulative proportion of variance explained by the components.

x
0.3425190
0.5717446
0.6738292
0.7741208
0.8517711
0.9240872
0.9682349
1.0000000

## Principal Components Analysis
## Call: principal(r = pisa, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              PC1  PC2   h2   u2 com
## ST188Q01HA  0.34 0.56 0.43 0.57 1.7
## ST188Q02HA  0.22 0.56 0.37 0.63 1.3
## ST188Q03HA  0.56 0.44 0.50 0.50 1.9
## ST188Q06HA  0.63 0.30 0.49 0.51 1.4
## ST188Q07HA  0.63 0.44 0.59 0.41 1.8
## ST183Q01HA -0.65 0.54 0.70 0.30 1.9
## ST183Q02HA -0.76 0.48 0.80 0.20 1.7
## ST183Q03HA -0.70 0.45 0.69 0.31 1.7
## 
##                        PC1  PC2
## SS loadings           2.74 1.83
## Proportion Var        0.34 0.23
## Cumulative Var        0.34 0.57
## Proportion Explained  0.60 0.40
## Cumulative Proportion 0.60 1.00
## 
## Mean item complexity =  1.7
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  192.14  with prob <  5.6e-34 
## 
## Fit based upon off diagonal values = 0.88

There were two clear components with eigenvalues of 2.74 and 1.83, which together accounted for 57% of the total variance. Horn’s parallel analysis also pointed to a two-component solution.
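The SS loadings and the communalities (h2) in the principal() output are sums of squared loadings, which can be verified from the printed loading matrix. Small differences from the output are due to rounding of the loadings.

```r
# Unrotated loadings copied from the output above.
loadings <- matrix(
  c( 0.343, 0.556,
     0.218, 0.565,
     0.556, 0.442,
     0.632, 0.303,
     0.625, 0.441,
    -0.646, 0.537,
    -0.756, 0.482,
    -0.698, 0.452),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("ST188Q01HA", "ST188Q02HA", "ST188Q03HA", "ST188Q06HA",
      "ST188Q07HA", "ST183Q01HA", "ST183Q02HA", "ST183Q03HA"),
    c("PC1", "PC2")
  )
)

# Column sums of squared loadings give the SS loadings (the eigenvalues).
ss_loadings <- colSums(loadings^2)

# Row sums of squared loadings give each item's communality (h2).
communality <- rowSums(loadings^2)

# Proportion of total variance explained by each component.
prop_var <- ss_loadings / nrow(loadings)

round(ss_loadings, 3)
round(communality, 2)
round(prop_var, 3)
```

Squaring removes the sign, so the negative PC1 loadings of the ST183 items contribute to the explained variance in the same way as positive ones.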

The reproduced correlation matrix and its residuals are given by

The residuals are small (the RMSR reported above is 0.1), showing only minor differences between the original and the reproduced correlations. This suggests the data are suitable for factor analysis.

Internal reliability

ωt: total common variance
ωg: variance due to one factor
ωh: used when there are 3 or more factors

## 
## Three factors are required for identification -- general factor loadings set to be equal. 
## Proceed with caution. 
## Think about redoing the analysis with alternative values of the 'option' setting.

## [1] 0.706547

##         total   general     group
## g   0.7884379 0.3467895 0.4587300
## F1* 0.8374006 0.2403100 0.5970906
## F2* 0.6934474 0.1844601 0.5089873

The raw alpha gives a Cronbach’s α of 0.71 for the overall computed scale, which is only just above the conventional 0.70 threshold for acceptable internal consistency. This could be because most respondents were biased toward agreement.
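Cronbach's α compares the sum of the item variances with the variance of the total score. A minimal base-R sketch on simulated items is shown below; the `cronbach_alpha()` helper and the toy data are illustrative assumptions, whereas the report's value comes from the psych package output above.

```r
set.seed(7)

# Simulated scale: four items measuring the same underlying trait.
n     <- 300
trait <- rnorm(n)
items <- data.frame(
  q1 = trait + rnorm(n),
  q2 = trait + rnorm(n),
  q3 = trait + rnorm(n),
  q4 = trait + rnorm(n)
)

# Hypothetical helper implementing the raw (unstandardised) alpha formula.
cronbach_alpha <- function(items) {
  items     <- as.matrix(items)
  k         <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

alpha_hat <- cronbach_alpha(items)
round(alpha_hat, 2)
```

In this simulation the items correlate at about 0.5 with each other, which corresponds to a population α of 0.8 for four items.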

Question 3

It should also be noted that while a high value of Cronbach’s alpha indicates good internal consistency of the items in a scale, it does not mean that the scale is unidimensional. Factor analysis is a method for determining the dimensionality of a scale; in this case a principal component analysis (PCA) was applied. For this question, reading fluency is the dependent variable, and the independent variable or variables are ours to choose. We proceed by reloading the dataset and removing missing values.

## Rows: 500 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (23): CNTSTUID, AGE, RESILIENCE, COMPETE, GFOFAIL, HISEI, HEDRES, PARED,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Viewing the column names to decide about predictor variables for linear regression model

##  [1] "CNTSTUID"        "AGE"             "RESILIENCE"      "COMPETE"        
##  [5] "GFOFAIL"         "HISEI"           "HEDRES"          "PARED"          
##  [9] "SCREADCOMP"      "SCREADDIFF"      "JOYREAD"         "STIMREAD"       
## [13] "TMINS"           "ST188Q01HA"      "ST188Q02HA"      "ST188Q03HA"     
## [17] "ST188Q06HA"      "ST188Q07HA"      "ST183Q01HA"      "ST183Q02HA"     
## [21] "ST183Q03HA"      "reading_fluency" "ST184Q01HA"

Many variables could affect reading fluency. For example, reading fluency may depend on the parents’ occupational status (HISEI), competitiveness (COMPETE), enjoyment of reading (JOYREAD) and so on. We can include all variables at once or analyse them one by one.

We start the regression analysis with two variables, COMPETE and STIMREAD. Here teachers’ stimulation of reading engagement (STIMREAD) serves as a control variable, which means it is held constant when the effect of COMPETE is estimated.

## 
## Call:
## lm(formula = reading_fluency ~ COMPETE + STIMREAD, data = pisa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.545  -1.152   0.556   1.652   3.225 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.3510     0.1359 120.331   <2e-16 ***
## COMPETE       0.2186     0.1465   1.492    0.137    
## STIMREAD      0.1913     0.1311   1.460    0.145    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.411 on 321 degrees of freedom
## Multiple R-squared:  0.01335,    Adjusted R-squared:  0.007203 
## F-statistic: 2.172 on 2 and 321 DF,  p-value: 0.1157

The results indicate that neither of the two chosen variables is a good predictor of reading_fluency: neither coefficient carries a significance marker, and the p-values (0.137 and 0.145) are well above the 5% significance level. The overall F-test is also not significant (p = 0.1157), which indicates that the independent variables do not have a significant influence on the dependent variable. Recall that the two hypotheses are:

Null hypothesis $H_0$: Competition and teachers’ stimulation of reading engagement do not have a significant effect on reading fluency.

Alternative hypothesis $H_A$: Competition and teachers’ stimulation of reading engagement have a significant effect on reading fluency.

Hence we fail to reject the null hypothesis.
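The decision for the two-predictor model can be reproduced from the F-statistic and its degrees of freedom as printed in the summary. This check uses only those reported values.

```r
# Overall F test of the model reading_fluency ~ COMPETE + STIMREAD.
f_stat <- 2.172
df1    <- 2      # number of predictors
df2    <- 321    # residual degrees of freedom

# Upper-tail probability of the F distribution.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# Decision at the 5% significance level.
reject_null <- p_value < 0.05

round(p_value, 4)
reject_null
```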

Now we will proceed with the just 1 predictor variable.

The two hypotheses are now:

Null hypothesis $H_0$: Learning time per week does not have a significant effect on reading fluency.

Alternative hypothesis $H_A$: Learning time per week has a significant effect on reading fluency.

## 
## Call:
## lm(formula = reading_fluency ~ TMINS, data = pisa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4748  -0.9511   0.4497   1.5252   3.6576 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.362208   0.649018  28.292  < 2e-16 ***
## TMINS       -0.001258   0.000407  -3.092  0.00217 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.389 on 322 degrees of freedom
## Multiple R-squared:  0.02883,    Adjusted R-squared:  0.02581 
## F-statistic: 9.557 on 1 and 322 DF,  p-value: 0.002165

The new result indicates that reading fluency is related to learning time, since the p-value (0.002) is less than 0.05. The coefficient is negative, so more learning time per week is associated with slightly lower reading fluency. The F-statistic (9.557) is also much higher than in the previous two-variable model (2.172). Note, however, that $R^2$ is very low both with and without control variables (0.029 here and 0.013 before). The regression diagnostics can be plotted as follows:
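The size of the TMINS effect is easier to judge with a little arithmetic on the reported values. The sketch below assumes TMINS is measured in minutes per week, as the variable name suggests; that unit is an assumption and is not stated in the output.

```r
# Values taken from the lm() summary for reading_fluency ~ TMINS.
estimate  <- -0.001258
std_error <- 0.000407
f_stat    <- 9.557
df_resid  <- 322

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error

# With one predictor, R-squared can be recovered from F and the residual df.
r_squared <- f_stat / (f_stat + df_resid)

# Predicted change in reading fluency for 60 additional units of TMINS
# (one extra hour per week if TMINS is in minutes).
change_per_60 <- estimate * 60

round(c(t_value = t_value, r_squared = r_squared,
        change_per_60 = change_per_60), 4)
```

The effect is statistically significant but small in practical terms, which fits the very low $R^2$.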

The regression plots show that the residuals are still widely spread around the fitted line, which does not indicate a strong relationship between the two variables in the model. The points in the QQ plot lie well away from the straight reference line, indicating a number of outliers in the subset.

In conclusion, we reject the null hypothesis that learning time per week does not have a significant effect on reading fluency.

Neither of the two models above predicts the dependent variable well, so we fit a further model that adds JOYREAD to COMPETE and STIMREAD to see the final relationship.

## 
## Call:
## lm(formula = reading_fluency ~ JOYREAD + COMPETE + STIMREAD, 
##     data = pisa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5409  -1.0951   0.5088   1.5929   3.4748 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.3547     0.1356 120.648   <2e-16 ***
## JOYREAD       0.1816     0.1118   1.624    0.105    
## COMPETE       0.2199     0.1461   1.505    0.133    
## STIMREAD      0.1628     0.1319   1.234    0.218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.405 on 320 degrees of freedom
## Multiple R-squared:  0.02142,    Adjusted R-squared:  0.01224 
## F-statistic: 2.335 on 3 and 320 DF,  p-value: 0.07387

In this case the overall p-value (0.074) is still greater than the significance level $\alpha$ = 0.05. According to this result, reading fluency does not depend significantly on the students' enjoyment of reading. The adjusted $R^2$ of the new model is still very low (0.012). JOYREAD has the smallest p-value of the three predictors (0.105), but it carries no significance marker, so it does not reach even the 10% level. The residuals can be viewed in graphical form below.

The trend line of the residuals has a smaller slope than in the previous models, which indicates smaller systematic deviations. The Cook's distance plot shows values below 0.01, which means that outliers do not affect the results to a large degree. We conclude that, even after adding new control variables, reading fluency is likely to depend on other factors.

The coefficient values for the three models are:
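Cook's distance can also be inspected numerically rather than only in the plot. The sketch below fits a model to simulated data with one deliberately extreme observation; the data, the object names and the 4/n cutoff (a common rule of thumb, not a formal test) are assumptions for illustration.

```r
set.seed(99)

# Simulated predictor and outcome with a weak relationship.
n <- 100
x <- rnorm(n)
y <- 16 + 0.2 * x + rnorm(n, sd = 2)
y[1] <- 2   # one unusually low score

fit <- lm(y ~ x)

# Influence of each observation on the fitted coefficients.
cooks <- cooks.distance(fit)

# Flag observations above the rule-of-thumb cutoff.
cutoff  <- 4 / n
flagged <- which(cooks > cutoff)

round(max(cooks), 3)
flagged
```

Observations flagged this way are candidates for closer inspection, not automatic removal.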

	x
(Intercept)	18.3622080
TMINS	-0.0012583

	x
(Intercept)	16.3509697
COMPETE	0.2185807
STIMREAD	0.1913300

	x
(Intercept)	16.3547401
JOYREAD	0.1816142
COMPETE	0.2199192
STIMREAD	0.1627531

The intercepts of the two models with control variables are nearly identical (16.35), and the intercept of the TMINS model (18.36) is of a similar size.