This section presents an analysis of an intervention designed to promote a growth mindset in children learning to read. There are two groups in the analysis: a control group (control) and an intervention group (growth, or growth_mindset). The control group receives the usual classroom activities, whereas the growth mindset group spends an hour each week of the year on activities aimed at promoting a growth mindset. Each child is tested at the beginning of the program (January), halfway through the program (June), and again at the end of the program (December). The analysis provides descriptive statistics comparing the two groups.
Loading the essential libraries:
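A sketch of the likely setup; the exact package list is an assumption inferred from the functions used later in this report:

```r
# Packages assumed from the functions called later
# (rbindlist, clean_names, describeBy/KMO/fa/omega,
#  ggplot, read_csv)
library(data.table)
library(janitor)
library(psych)
library(ggplot2)
library(readr)
```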
Loading the RDS files with the command below:
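A minimal sketch; the file names are assumptions, since the original paths are not shown:

```r
# Read each group's data from its RDS file
# (file names assumed; substitute the actual paths)
control <- readRDS("control.rds")
growth  <- readRDS("growth.rds")
```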
We can use the `clean_names()` function from the janitor package to clean the column names, as in tutorial 1, with the code below:
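A sketch of that step:

```r
library(janitor)
# Standardise column names to snake_case
control <- clean_names(control)
growth  <- clean_names(growth)
```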
Getting summary statistics for each dataset individually:
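The tables below are the output of base R's `summary()`; a sketch of the calls:

```r
# Per-group five-number summaries plus means
summary(control)
summary(growth)
```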
Control group:

| participant_id | time | reading_score |
|---|---|---|
| Min. :1.0 | Length:15 | Min. : 7.00 |
| 1st Qu.:4.0 | Class :character | 1st Qu.: 9.00 |
| Median :6.0 | Mode :character | Median :14.00 |
| Mean :5.4 | | Mean :14.33 |
| 3rd Qu.:7.0 | | 3rd Qu.:17.50 |
| Max. :9.0 | | Max. :22.00 |
Growth group:

| participant_id | time | reading_score |
|---|---|---|
| Min. : 2.0 | Length:15 | Min. : 9.00 |
| 1st Qu.: 3.0 | Class :character | 1st Qu.:11.00 |
| Median : 5.0 | Mode :character | Median :19.50 |
| Mean : 5.6 | | Mean :18.93 |
| 3rd Qu.: 8.0 | | 3rd Qu.:26.00 |
| Max. :10.0 | | Max. :32.00 |
| | | NA's :1 |
The individual summary statistics show that one participant in the growth group has a missing reading score, while the control group data are complete. The quartiles shown here will be analysed with box plots later. We bind the two datasets row-wise below.
Joining the two datasets and viewing the data types of the columns:
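A sketch of the binding step, consistent with the `data.table` class shown in the `str()` output below:

```r
library(data.table)
# Stack the two groups into one table (30 rows = 15 + 15)
df <- rbindlist(list(control, growth))
str(df)
```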
## Classes 'data.table' and 'data.frame': 30 obs. of 3 variables:
## $ participant_id: int 1 1 1 4 4 4 6 6 6 7 ...
## $ time : chr "January" "June" "December" "January" ...
## $ reading_score : num 9 14 22 9 13 18 7 9 16 9 ...
## - attr(*, ".internal.selfref")=<externalptr>
We can observe that the column participant_id is of integer type, the time column is of character type, and reading_score is numeric. Knowing the data types helps in several ways when visualising the data with boxplots, density plots, etc.
The following table provides descriptive statistics of reading scores across the two groups. The mean is the arithmetic average, while the standard deviation measures how far the results deviate from the mean. Notably, the standard deviation of reading_score is large (about 7), showing a wide score range: the minimum was 7 and the maximum 32 across the whole sample. The time variable, by contrast, is a categorical factor coded 1–3 in this output, so its summary statistics are not substantively meaningful. The median reading score is 16, which can also be seen in the boxplots below.
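The table is produced with `psych::describeBy()`; the warning below appears because no grouping variable was passed (a plain `describe(df)` would avoid it):

```r
library(psych)
# Descriptive statistics for the combined data
describeBy(df)
```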
## Warning in describeBy(df): no grouping variable requested
| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| participant_id | 1 | 30 | 5.50000 | 2.9213837 | 5.5 | 5.50 | 3.7065 | 1 | 10 | 9 | 0.0000000 | -1.340653 | 0.5333693 |
| time* | 2 | 30 | 2.00000 | 0.8304548 | 2.0 | 2.00 | 1.4826 | 1 | 3 | 2 | 0.0000000 | -1.598333 | 0.1516196 |
| reading_score | 3 | 29 | 16.55172 | 6.9825610 | 16.0 | 16.08 | 8.8956 | 7 | 32 | 25 | 0.5130776 | -0.827273 | 1.2966290 |
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
The boxplot is an excellent illustration of the two datasets. We can observe that the median, indicated by the middle black line in each of the three boxplots above, is not at the same level, which shows substantial variation in scores across the time points. Furthermore, the scores in January are much lower than in the other two months for both groups, and there is one outlier for January. We can analyse the groups separately with the same kind of plot, as in the sketch below.
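A sketch of how the grouped boxplot might be produced; the group labels are an assumption, since the combined table has no group column in the `str()` output above (the ordering assumes the control rows come first):

```r
library(ggplot2)
# Add a group label (assumed ordering: control first, then growth)
df$group <- rep(c("control", "growth"), each = 15)
# Keep the months in chronological order
df$time <- factor(df$time, levels = c("January", "June", "December"))
# Boxplots of reading score by month, split by group
ggplot(df, aes(x = time, y = reading_score, fill = group)) +
  geom_boxplot()
```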
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
The boxplot above clearly shows that the median scores of the growth group are higher in every month, and the same holds for the 1st and 3rd quartiles in both groups. Further, the correlation between the two groups' scores can be checked with a correlation test. Correlation for reading scores:
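The test call, as confirmed by the output below (pairs with a missing value are dropped, hence df = 12):

```r
# Pearson correlation between the two groups' reading scores
cor.test(control$reading_score, growth$reading_score)
```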
##
## Pearson's product-moment correlation
##
## data: control$reading_score and growth$reading_score
## t = 4.3859, df = 12, p-value = 0.0008871
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4354933 0.9286596
## sample estimates:
## cor
## 0.7847463
The correlation test gives a coefficient of 0.78, indicating a positive correlation between the reading scores of the two groups; a value between 0.7 and 0.9 is conventionally regarded as a strong relationship. We can say with some confidence that there is a relationship, because the test's p-value is much smaller than the significance level of 0.05. The correlation could still be misleading if the variables do not have a linear relationship, which we check with the scatter plot below.
The scatter plot above confirms that as one group's reading scores increase, so do the other's, in line with the correlation test. The normality of the reading scores for both groups can be visualised with the density plot below; this will be helpful if we opt for ANOVA-style statistical tests.
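A sketch of how the density plot might be drawn, reusing the assumed group column from the boxplot step; the warnings below come from the original code's handling of the colour aesthetic:

```r
# Density of reading scores, one curve per group
ggplot(df, aes(x = reading_score, colour = group)) +
  geom_density()
```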
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Histograms of the reading scores of the individual groups:
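A sketch of the histogram code, assuming the same group column as above:

```r
# Histograms of reading scores, one panel per group
ggplot(df, aes(x = reading_score)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ group)
```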
The above histograms indicate that the highest scores are observed in the growth group. Similarly, the number of participants with high reading scores is greater in the growth group than in the control group. Neither group's scores appear normally distributed.
The correlation coefficient of 0.78 indicates a strong linear association between the groups, so a regression analysis is reasonable. According to the given conditions, the two hypotheses are:
Null hypothesis: the growth group will improve more than the control group does over time.
Alternative hypothesis: both groups will improve over time.
This hypothesis can be checked with a linear regression model showing the differences in scores between the control and intervention groups. With only two groups an ANOVA offers no advantage over simpler methods; a t-test on the group means would have been the other natural option.
We check the first hypothesis defined above and compare the reading scores of the two groups.
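The model call, as confirmed by the output below (the object name is an assumption):

```r
# Regress the growth group's scores on the control group's scores
model <- lm(growth$reading_score ~ control$reading_score)
summary(model)
```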
##
## Call:
## lm(formula = growth$reading_score ~ control$reading_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6663 -3.5473 -0.7782 4.0801 8.4385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.533 4.206 0.365 0.721818
## control$reading_score 1.224 0.279 4.386 0.000887 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.238 on 12 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6158, Adjusted R-squared: 0.5838
## F-statistic: 19.24 on 1 and 12 DF, p-value: 0.0008871
The results from the linear regression model indicate that the model fits: the F-statistic is significant at the 5% level, with a p-value much smaller than 0.05. The R-squared, the explanatory power of the model, is 0.62, meaning the model explains about 62% of the variance in the growth group's scores. This is good explanatory power given the small number of variables in the model. The fitted regression equation is growth$reading_score = 1.224 × control$reading_score + 1.533. The QQ plot shows outliers at both ends of the combined dataset; the points deviate from the straight reference line, so the reading scores do not appear normally distributed. The adjusted \(R^2\) of 0.58 suggests the model explains about 58% of the variance in reading scores after adjusting for the number of predictors. The t-value is quite large and positive, and the three asterisks next to Pr(>|t|) indicate that the coefficient on the control group's reading score is significant in our model. In view of these results we reject our hypothesis that the growth group will improve more than the control group over time, and accept the alternative hypothesis that both groups will improve over time.
Before running a factor analysis, it is important to explore the distribution of the data through mean scores and how far the average score deviates from the norm (the standard deviation). In this section we run descriptive summaries.
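A sketch of the loading step; the file name is an assumption:

```r
library(readr)
# Load the PISA subset (file name assumed)
pisa <- read_csv("pisa.csv")
```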
## Rows: 500 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (23): CNTSTUID, AGE, RESILIENCE, COMPETE, GFOFAIL, HISEI, HEDRES, PARED,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning in describeBy(pisa): no grouping variable requested
| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNTSTUID | 1 | 500 | 5.540405e+07 | 2439.6899810 | 55404165.0000 | 5.540406e+07 | 3300.2676000 | 55400007.0000 | 5.540812e+07 | 8113.0000 | -0.0353487 | -1.3215724 | 109.1062528 |
| AGE | 2 | 500 | 1.578112e+01 | 0.2796970 | 15.8300 | 1.578450e+01 | 0.3706500 | 15.2500 | 1.633000e+01 | 1.0800 | -0.0855295 | -1.0633293 | 0.0125084 |
| RESILIENCE | 3 | 481 | -3.750350e-02 | 0.8491172 | -0.0614 | -9.818680e-02 | 0.6763621 | -2.5400 | 2.369300e+00 | 4.9093 | 0.6925647 | 0.7358690 | 0.0387164 |
| COMPETE | 4 | 485 | 6.569710e-02 | 0.9270388 | 0.1956 | 4.544420e-02 | 0.8769579 | -2.3450 | 2.005400e+00 | 4.3504 | 0.0589005 | 0.2644566 | 0.0420947 |
| GFOFAIL | 5 | 483 | 2.307613e-01 | 1.0117618 | 0.1097 | 2.492607e-01 | 1.0541286 | -1.8939 | 1.890500e+00 | 3.7844 | -0.0296047 | -0.4881378 | 0.0460368 |
| HISEI | 6 | 472 | 5.738426e+01 | 20.6198941 | 62.3900 | 5.878032e+01 | 20.9046600 | 14.2100 | 8.870000e+01 | 74.4900 | -0.5670280 | -0.9092040 | 0.9491076 |
| HEDRES | 7 | 493 | -2.132900e-02 | 1.0544390 | 0.0477 | 7.008910e-02 | 1.6314530 | -4.4106 | 1.179300e+00 | 5.5899 | -0.5743895 | 0.2289931 | 0.0474895 |
| PARED | 8 | 477 | 1.312264e+01 | 2.1589055 | 14.0000 | 1.334726e+01 | 1.4826000 | 3.0000 | 1.500000e+01 | 12.0000 | -1.0608398 | 1.1561274 | 0.0988495 |
| SCREADCOMP | 9 | 488 | 6.256450e-02 | 1.0106729 | 0.1222 | 4.666940e-02 | 0.8557567 | -2.4403 | 1.883900e+00 | 4.3242 | 0.0725264 | -0.1001044 | 0.0457510 |
| SCREADDIFF | 10 | 491 | 1.506204e-01 | 1.0475901 | 0.3059 | 1.703288e-01 | 0.8904496 | -1.8876 | 2.775200e+00 | 4.6628 | -0.0218754 | 0.0332926 | 0.0472771 |
| JOYREAD | 11 | 493 | -5.987650e-02 | 1.1559570 | -0.1358 | -6.934530e-02 | 0.9246976 | -2.7316 | 2.613100e+00 | 5.3447 | 0.0756668 | 0.2492023 | 0.0520617 |
| STIMREAD | 12 | 489 | 1.362789e-01 | 0.9594024 | 0.2432 | 1.401313e-01 | 0.8501228 | -2.3003 | 2.087100e+00 | 4.3874 | -0.1390130 | 0.3585063 | 0.0433857 |
| TMINS | 13 | 363 | 1.571915e+03 | 338.9679834 | 1500.0000 | 1.521065e+03 | 185.3250000 | 500.0000 | 3.000000e+03 | 2500.0000 | 1.6577159 | 4.3127298 | 17.7912051 |
| ST188Q01HA | 14 | 480 | 3.106250e+00 | 0.4996476 | 3.0000 | 3.098958e+00 | 0.0000000 | 1.0000 | 4.000000e+00 | 3.0000 | 0.0051398 | 1.7035229 | 0.0228057 |
| ST188Q02HA | 15 | 480 | 3.231250e+00 | 0.6118715 | 3.0000 | 3.273438e+00 | 0.0000000 | 1.0000 | 4.000000e+00 | 3.0000 | -0.4487851 | 0.7416076 | 0.0279280 |
| ST188Q03HA | 16 | 479 | 2.778706e+00 | 0.7550982 | 3.0000 | 2.781818e+00 | 0.0000000 | 1.0000 | 4.000000e+00 | 3.0000 | -0.2515310 | -0.2203333 | 0.0345013 |
| ST188Q06HA | 17 | 479 | 2.759917e+00 | 0.7966856 | 3.0000 | 2.789610e+00 | 0.0000000 | 1.0000 | 4.000000e+00 | 3.0000 | -0.3874875 | -0.1894055 | 0.0364015 |
| ST188Q07HA | 18 | 478 | 3.031381e+00 | 0.6117804 | 3.0000 | 3.062500e+00 | 0.0000000 | 1.0000 | 4.000000e+00 | 3.0000 | -0.5100236 | 1.3750783 | 0.0279822 |
| ST183Q01HA | 19 | 484 | 2.766529e+00 | 0.9265106 | 3.0000 | 2.832474e+00 | 1.4826000 | 1.0000 | 4.000000e+00 | 3.0000 | -0.3492380 | -0.7165204 | 0.0421141 |
| ST183Q02HA | 20 | 485 | 2.723711e+00 | 0.9033653 | 3.0000 | 2.776350e+00 | 1.4826000 | 1.0000 | 4.000000e+00 | 3.0000 | -0.2192383 | -0.7534035 | 0.0410197 |
| ST183Q03HA | 21 | 484 | 2.828512e+00 | 0.9421865 | 3.0000 | 2.907217e+00 | 1.4826000 | 1.0000 | 4.000000e+00 | 3.0000 | -0.3507155 | -0.8145897 | 0.0428267 |
| reading_fluency | 22 | 500 | 1.610200e+01 | 2.7111430 | 17.0000 | 1.649750e+01 | 1.4826000 | 0.0000 | 1.900000e+01 | 19.0000 | -1.7352602 | 4.5304120 | 0.1212460 |
| ST184Q01HA | 23 | 484 | 2.154959e+00 | 0.9001703 | 2.0000 | 2.095361e+00 | 1.4826000 | 1.0000 | 4.000000e+00 | 3.0000 | 0.3378031 | -0.7078958 | 0.0409168 |
We observed that there are many missing (NA) values in the dataset, which need to be removed. The means of the Likert-scale items lie between roughly 2 and 4. First we remove the missing values from the dataset.
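A sketch, assuming complete-case removal (the later summaries showing n = 324 are consistent with this step plus the column selection below):

```r
# Keep only complete cases (exact step assumed)
pisa <- na.omit(pisa)
```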
As required in the question, the distribution of the resilience variable can be inspected with a frequency chart.
Similarly for fear of failure:
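A sketch of the two charts, using histograms since both variables are continuous derived scores:

```r
library(ggplot2)
# Frequency distribution of the resilience scores
ggplot(pisa, aes(x = RESILIENCE)) + geom_histogram(bins = 20)
# ... and of fear of failure
ggplot(pisa, aes(x = GFOFAIL)) + geom_histogram(bins = 20)
```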
The charts indicate that the resilience scores are approximately normally distributed, while the fear-of-failure data are right-skewed.
We can now remove the columns not needed for the analysis: as given in the question, we keep only the resilience-scale and fear-of-failure-scale items to compute a two-factor structure.
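A sketch of the selection; the item names match the summary below:

```r
# Keep only the resilience items (ST188*) and
# fear-of-failure items (ST183*)
pisa <- pisa[, c("ST188Q01HA", "ST188Q02HA", "ST188Q03HA",
                 "ST188Q06HA", "ST188Q07HA",
                 "ST183Q01HA", "ST183Q02HA", "ST183Q03HA")]
```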
Computing the summary statistics of the remaining columns:
## Warning in describeBy(pisa): no grouping variable requested
## vars n mean sd median trimmed mad min max range skew kurtosis
## ST188Q01HA 1 324 3.17 0.50 3 3.16 0.00 2 4 2 0.32 0.32
## ST188Q02HA 2 324 3.28 0.63 3 3.33 0.00 1 4 3 -0.59 0.76
## ST188Q03HA 3 324 2.86 0.77 3 2.88 0.00 1 4 3 -0.31 -0.26
## ST188Q06HA 4 324 2.75 0.83 3 2.78 1.48 1 4 3 -0.35 -0.36
## ST188Q07HA 5 324 3.04 0.62 3 3.07 0.00 1 4 3 -0.49 1.23
## ST183Q01HA 6 324 2.75 0.95 3 2.81 1.48 1 4 3 -0.31 -0.83
## ST183Q02HA 7 324 2.71 0.91 3 2.77 1.48 1 4 3 -0.19 -0.80
## ST183Q03HA 8 324 2.84 0.96 3 2.91 1.48 1 4 3 -0.30 -0.96
## se
## ST188Q01HA 0.03
## ST188Q02HA 0.04
## ST188Q03HA 0.04
## ST188Q06HA 0.05
## ST188Q07HA 0.03
## ST183Q01HA 0.05
## ST183Q02HA 0.05
## ST183Q03HA 0.05
After selecting the Likert-scale items, we observed that the mean values range from about 2 to 3.3, so there is not much variability across the scale items. The standard errors (se) are also very small, although some items are skewed.
Before running a principal component analysis we need to inspect the correlation matrix. To understand the structure ahead of the factor analysis, we should look at the correlations among our variables.
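A sketch of the correlation matrix call:

```r
# Pairwise correlations among the eight items
round(cor(pisa), 2)
```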
The correlation coefficients among the resilience and GFOFAIL Likert items range from about 0 to 0.68, spanning everything from no correlation to moderately strong positive correlation. Some negative values indicate items that are negatively correlated with each other. To look at diagnostics we run the Bartlett test.
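A sketch matching the output below; note this is base R's Bartlett test of homogeneity of variances applied across the item columns (psych's `cortest.bartlett()` is the sphericity variant more commonly used before factor analysis):

```r
# Bartlett test across the eight item columns
bartlett.test(pisa)
```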
##
## Bartlett test of homogeneity of variances
##
## data: pisa
## Bartlett's K-squared = 232.83, df = 7, p-value < 2.2e-16
The KMO value required for factor analysis is >0.6. We can check it by running the `KMO()` command below.
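The call, as confirmed in the output:

```r
library(psych)
# Kaiser-Meyer-Olkin measure of sampling adequacy
KMO(pisa)
```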
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pisa)
## Overall MSA = 0.69
## MSA for each item =
## ST188Q01HA ST188Q02HA ST188Q03HA ST188Q06HA ST188Q07HA ST183Q01HA ST183Q02HA
## 0.60 0.67 0.81 0.70 0.65 0.73 0.67
## ST183Q03HA
## 0.69
The results of the above two tests suggest that factor analysis is feasible, since the overall MSA of 0.69 exceeds the threshold of 0.6, and the small Bartlett p-value also supports factorability. The MSA values for all items are at or above 0.6, hence we have adequate sampling accuracy.
**Factor model: principal component analysis**
We now need to determine how many factors are appropriate. We will run a PCA and analyse the eigenvalues, then use the Kaiser criterion to decide the number of potential components.
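A sketch of how the eigenvalues might be obtained:

```r
# Eigenvalues of the item correlation matrix
eigen(cor(pisa))$values
```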
## [1] 2.7401522 1.8338042 0.8166770 0.8023333 0.6212019 0.5785285 0.3531824
## [8] 0.2541204
Eigenvalues indicate how much of the variance in the original dataset each eigenvector explains. The vector of values shows that, under Kaiser's rule of extracting factors with eigenvalues greater than 1, there are two potential components (a third eigenvalue is close to, but below, 1). Let's check the scree plot for further guidance.
The scree plot shows at least one clear component, after which the steepness of the eigenvalue line (the amount of variance accounted for) levels out.
The two rules of thumb may lead to different conclusions, so we instead use Horn's parallel analysis (see the sketch after this list), which:

* generates many random samples of uncorrelated data of the same size as the study sample (i.e. 'nonsense' data)
* computes a correlation matrix for each of the random samples
* factor-analyses these and generates eigenvalues
* concludes there is a 'true' component explaining variance in the variables whenever an eigenvalue from the study sample is greater than the corresponding eigenvalue from the random, uncorrelated data
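A sketch of the parallel analysis call from psych:

```r
# Horn's parallel analysis: compares observed eigenvalues
# with those from random data of the same dimensions
fa.parallel(pisa)
```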
We use the `fa()` function with a rotation method that does not require the factors to be correlated (an orthogonal rotation), which suits our case.
Recall that the overall KMO of 0.69 is above 0.6, so the condition for a two-factor analysis is met.
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
Considering the two-factor solution, as required by the question:
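The call, as confirmed in the `principal()` output further below (the object name is an assumption):

```r
library(psych)
# Two-component PCA without rotation
pca2 <- principal(r = pisa, nfactors = 2, rotate = "none")
pca2
```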
## [1] 2.7401522 1.8338042 0.8166770 0.8023333 0.6212019 0.5785285 0.3531824
## [8] 0.2541204
##
## Loadings:
## PC1 PC2
## ST188Q01HA 0.343 0.556
## ST188Q02HA 0.218 0.565
## ST188Q03HA 0.556 0.442
## ST188Q06HA 0.632 0.303
## ST188Q07HA 0.625 0.441
## ST183Q01HA -0.646 0.537
## ST183Q02HA -0.756 0.482
## ST183Q03HA -0.698 0.452
##
## PC1 PC2
## SS loadings 2.740 1.834
## Proportion Var 0.343 0.229
## Cumulative Var 0.343 0.572
The fear-of-failure items could be problematic in the analysis, since they have negative loadings on PC1, as calculated above. Now we compute the cumulative variance across the components.
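A sketch of the cumulative-variance computation:

```r
# Cumulative proportion of variance explained by the components
ev <- eigen(cor(pisa))$values
cumsum(ev / sum(ev))
```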
| Cumulative proportion of variance |
|---|
| 0.3425190 |
| 0.5717446 |
| 0.6738292 |
| 0.7741208 |
| 0.8517711 |
| 0.9240872 |
| 0.9682349 |
| 1.0000000 |
## Principal Components Analysis
## Call: principal(r = pisa, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 h2 u2 com
## ST188Q01HA 0.34 0.56 0.43 0.57 1.7
## ST188Q02HA 0.22 0.56 0.37 0.63 1.3
## ST188Q03HA 0.56 0.44 0.50 0.50 1.9
## ST188Q06HA 0.63 0.30 0.49 0.51 1.4
## ST188Q07HA 0.63 0.44 0.59 0.41 1.8
## ST183Q01HA -0.65 0.54 0.70 0.30 1.9
## ST183Q02HA -0.76 0.48 0.80 0.20 1.7
## ST183Q03HA -0.70 0.45 0.69 0.31 1.7
##
## PC1 PC2
## SS loadings 2.74 1.83
## Proportion Var 0.34 0.23
## Cumulative Var 0.34 0.57
## Proportion Explained 0.60 0.40
## Cumulative Proportion 0.60 1.00
##
## Mean item complexity = 1.7
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 192.14 with prob < 5.6e-34
##
## Fit based upon off diagonal values = 0.88
It was found that there were two clear factors, with eigenvalues of 2.74 and 1.83, which together accounted for 57% of the total variance. Horn's parallel analysis likewise produced a neat two-factor solution.
The residual correlation matrix, the difference between the original and reproduced correlations, can be computed as sketched below.
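A sketch using psych's residuals helper; the object names are assumptions:

```r
# Difference between observed correlations and those
# reproduced by the two-component solution
resid_mat <- factor.residuals(cor(pisa), pca2$loadings)
round(resid_mat, 2)
```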
The residuals are all much smaller than 0.1, showing only small differences between the original and reproduced correlations. This supports the suitability of the data for factor analysis.
* ωt – total common variance
* ωg – variance due to one general factor
* ωh – used when there are 3 or more factors
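A sketch of the omega call that produces the warnings and estimates below:

```r
# McDonald's omega reliability estimates for a
# two-factor solution
omega(pisa, nfactors = 2)
```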
##
## Three factors are required for identification -- general factor loadings set to be equal.
## Proceed with caution.
## Think about redoing the analysis with alternative values of the 'option' setting.
## [1] 0.706547
## total general group
## g 0.7884379 0.3467895 0.4587300
## F1* 0.8374006 0.2403100 0.5970906
## F2* 0.6934474 0.1844601 0.5089873
The raw alpha shows a Cronbach's α of 0.71 for the overall computed scale, which is acceptable but not high. This could be because most respondents were biased toward agreement.
It should also be noted that while a high value of Cronbach's alpha indicates good internal consistency of the items in a scale, it does not mean that the scale is unidimensional; factor analysis, in this case a Principal Components Analysis (PCA), is the method for assessing dimensionality. According to the requirements of the question, reading fluency will now become the dependent variable in a regression, and the independent variable(s) are our choice. We proceed by reloading the dataset and removing the missing values.
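A sketch of the reload (same assumed file name as before):

```r
library(readr)
# Reload the full PISA subset and drop incomplete rows
pisa <- read_csv("pisa.csv")
pisa <- na.omit(pisa)
```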
## Rows: 500 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (23): CNTSTUID, AGE, RESILIENCE, COMPETE, GFOFAIL, HISEI, HEDRES, PARED,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Viewing the column names to decide on predictor variables for the linear regression model:
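A one-line sketch:

```r
# List the available variables
names(pisa)
```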
## [1] "CNTSTUID" "AGE" "RESILIENCE" "COMPETE"
## [5] "GFOFAIL" "HISEI" "HEDRES" "PARED"
## [9] "SCREADCOMP" "SCREADDIFF" "JOYREAD" "STIMREAD"
## [13] "TMINS" "ST188Q01HA" "ST188Q02HA" "ST188Q03HA"
## [17] "ST188Q06HA" "ST188Q07HA" "ST183Q01HA" "ST183Q02HA"
## [21] "ST183Q03HA" "reading_fluency" "ST184Q01HA"
We can see that many variables could have an effect on reading fluency. For example, reading fluency could depend on parents' occupational status (HISEI), competitiveness (COMPETE), enjoyment of reading (JOYREAD), etc. We can enter all variables at once or do the analysis one by one.

We start our regression analysis by choosing two variables, COMPETE and STIMREAD. Here the teachers' stimulation of reading engagement will serve as a control variable, meaning its effect is held constant (adjusted for) in the model.
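The model call, as confirmed by the output (the object name is an assumption):

```r
# Reading fluency predicted by competitiveness, controlling
# for teachers' stimulation of reading engagement
model1 <- lm(reading_fluency ~ COMPETE + STIMREAD, data = pisa)
summary(model1)
```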
##
## Call:
## lm(formula = reading_fluency ~ COMPETE + STIMREAD, data = pisa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.545 -1.152 0.556 1.652 3.225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.3510 0.1359 120.331 <2e-16 ***
## COMPETE 0.2186 0.1465 1.492 0.137
## STIMREAD 0.1913 0.1311 1.460 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.411 on 321 degrees of freedom
## Multiple R-squared: 0.01335, Adjusted R-squared: 0.007203
## F-statistic: 2.172 on 2 and 321 DF, p-value: 0.1157
The above results indicate that neither of the two variables we chose is a good predictor of reading_fluency, since no significance asterisks appear for either variable. Moreover, the p-values are much higher than the assumed significance level of 5%, which in turn indicates that the independent variables do not have a significant influence on the dependent variable. Remember that our two hypotheses are:
Null hypothesis \(H_0\): competition and teachers' stimulation of reading engagement do not have a significant effect on reading fluency.

Alternative hypothesis \(H_A\): competition and teachers' stimulation of reading engagement have a significant effect on reading fluency.

Hence we fail to reject our null hypothesis.
Now we proceed with just one predictor variable. The two hypotheses now are:

Null hypothesis \(H_0\): learning time per week does not have a significant effect on reading fluency.

Alternative hypothesis \(H_A\): learning time per week has a significant effect on reading fluency.
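The model call:

```r
# Reading fluency predicted by weekly learning time (minutes)
model2 <- lm(reading_fluency ~ TMINS, data = pisa)
summary(model2)
```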
##
## Call:
## lm(formula = reading_fluency ~ TMINS, data = pisa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4748 -0.9511 0.4497 1.5252 3.6576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.362208 0.649018 28.292 < 2e-16 ***
## TMINS -0.001258 0.000407 -3.092 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.389 on 322 degrees of freedom
## Multiple R-squared: 0.02883, Adjusted R-squared: 0.02581
## F-statistic: 9.557 on 1 and 322 DF, p-value: 0.002165
The new result indicates that reading fluency is influenced by learning time, since the p-value is less than 0.05. Another indicator is the F-statistic, which is much higher this time than in the previous two-variable model. One important thing to notice is that, both with and without control variables, our \(R^2\) is very low. We can further examine the model with the standard regression diagnostic plots, sketched below.
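A sketch of the diagnostic plots:

```r
# Standard lm diagnostics: residuals vs fitted, QQ plot,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model2)
```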
The residuals-versus-fitted plot shows the residuals still spread widely around the fitted line, which does not suggest a strong relationship between the two variables in our model. The QQ plot deviates markedly from the straight reference line, indicating a number of outliers in the subset.
In conclusion, we reject our null hypothesis that learning time per week does not have a significant effect on reading fluency.
The above two models have not been able to fully predict our dependent variable, so this time we add a further predictor, JOYREAD, alongside the two earlier variables to see the final relationship.
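The model call:

```r
# Add enjoyment of reading to the two earlier predictors
model3 <- lm(reading_fluency ~ JOYREAD + COMPETE + STIMREAD,
             data = pisa)
summary(model3)
```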
##
## Call:
## lm(formula = reading_fluency ~ JOYREAD + COMPETE + STIMREAD,
## data = pisa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5409 -1.0951 0.5088 1.5929 3.4748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.3547 0.1356 120.648 <2e-16 ***
## JOYREAD 0.1816 0.1118 1.624 0.105
## COMPETE 0.2199 0.1461 1.505 0.133
## STIMREAD 0.1628 0.1319 1.234 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.405 on 320 degrees of freedom
## Multiple R-squared: 0.02142, Adjusted R-squared: 0.01224
## F-statistic: 2.335 on 3 and 320 DF, p-value: 0.07387
In this case we observe that the p-values are still greater than the significance level \(\alpha = 0.05\). According to the result above, reading fluency does not depend significantly on students' enjoyment of reading. In the new model the adjusted \(R^2\) is still very low (about 0.01). One thing to notice is that JOYREAD has the smallest p-value (0.105) of the three predictors, so reading fluency may be somewhat related to enjoyment of reading, though not significantly at the 5% level. The residuals can be viewed in graphical form below.
The trend line for the residuals has a smaller slope than in the previous models, which indicates smaller residual values. The Cook's distance plot shows values below 0.01, which means the outliers are not affecting our results to a large degree. So we conclude that, even with the addition of new control variables, reading fluency appears to depend on other factors.
The coefficient values for the three models can also be tabulated:
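A sketch of how the tables might be produced (model object names as assumed above):

```r
# Coefficient vectors for the three fitted models
coef(model2)  # TMINS model
coef(model1)  # COMPETE + STIMREAD model
coef(model3)  # JOYREAD + COMPETE + STIMREAD model
```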
| Term | Coefficient |
|---|---|
| (Intercept) | 18.3622080 |
| TMINS | -0.0012583 |

| Term | Coefficient |
|---|---|
| (Intercept) | 16.3509697 |
| COMPETE | 0.2185807 |
| STIMREAD | 0.1913300 |

| Term | Coefficient |
|---|---|
| (Intercept) | 16.3547401 |
| JOYREAD | 0.1816142 |
| COMPETE | 0.2199192 |
| STIMREAD | 0.1627531 |
The intercept values do not differ much across the three models, with or without the control variables.