Time Series Analysis

Nirmal Ghimire, Ph.D. (https://www.linkedin.com/in/nirmal-ghimire-5b96a034/), K-16 Literacy Center at University of Texas at Tyler (https://www.uttyler.edu/education/literacy-center/)
2023-04-23

Packages

Loading Data

    STD_ID  PST_ID TIME GRD_LVL GENDER ETHNICITY     MINORITY     SES
1 F16_T1_1 F2016_1    0       3 Female     White non_Minority low_SES
2 F16_T1_2 F2016_1    0       3 Female     White non_Minority low_SES
3 F16_T1_3 F2016_1    0       3 Female     White non_Minority low_SES
4 F16_T1_4 F2016_1    0       3   Male Hispanics     Minority low_SES
5 F16_T1_5 F2016_1    0       3   Male     White non_Minority low_SES
6 F16_T1_6 F2016_1    0       3 Female Hispanics     Minority low_SES
                       ESE    ESOL PRETEST POSTTEST
1 non_Exceptional Students     ELs   31.94    68.06
2          Gifted Students Non ELs   59.72    91.67
3 non_Exceptional Students Non ELs   47.22    83.33
4 non_Exceptional Students     ELs   51.39    87.50
5 non_Exceptional Students Non ELs   51.39    87.50
6          Gifted Students Non ELs   59.72   100.00
[1] 13163    12
'data.frame':   13163 obs. of  12 variables:
 $ STD_ID   : chr  "F16_T1_1" "F16_T1_2" "F16_T1_3" "F16_T1_4" ...
 $ PST_ID   : chr  "F2016_1" "F2016_1" "F2016_1" "F2016_1" ...
 $ TIME     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GRD_LVL  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ GENDER   : chr  "Female" "Female" "Female" "Male" ...
 $ ETHNICITY: chr  "White" "White" "White" "Hispanics" ...
 $ MINORITY : chr  "non_Minority" "non_Minority" "non_Minority" "Minority" ...
 $ SES      : chr  "low_SES" "low_SES" "low_SES" "low_SES" ...
 $ ESE      : chr  "non_Exceptional Students" "Gifted Students" "non_Exceptional Students" "non_Exceptional Students" ...
 $ ESOL     : chr  "ELs" "Non ELs" "Non ELs" "ELs" ...
 $ PRETEST  : num  31.9 59.7 47.2 51.4 51.4 ...
 $ POSTTEST : num  68.1 91.7 83.3 87.5 87.5 ...

Preparing Time column to be included in the data

    STD_ID  PST_ID GRD_LVL GENDER ETHNICITY     MINORITY     SES                      ESE
1 F16_T1_1 F2016_1       3 Female     White non_Minority low_SES non_Exceptional Students
2 F16_T1_2 F2016_1       3 Female     White non_Minority low_SES          Gifted Students
3 F16_T1_3 F2016_1       3 Female     White non_Minority low_SES non_Exceptional Students
4 F16_T1_4 F2016_1       3   Male Hispanics     Minority low_SES non_Exceptional Students
5 F16_T1_5 F2016_1       3   Male     White non_Minority low_SES non_Exceptional Students
6 F16_T1_6 F2016_1       3 Female Hispanics     Minority low_SES          Gifted Students
     ESOL PRETEST POSTTEST       DATE
1     ELs   31.94    68.06 2016-08-16
2 Non ELs   59.72    91.67 2016-08-16
3 Non ELs   47.22    83.33 2016-08-16
4     ELs   51.39    87.50 2016-08-16
5 Non ELs   51.39    87.50 2016-08-16
6 Non ELs   59.72   100.00 2016-08-16
'data.frame':   13163 obs. of  12 variables:
 $ STD_ID   : chr  "F16_T1_1" "F16_T1_2" "F16_T1_3" "F16_T1_4" ...
 $ PST_ID   : chr  "F2016_1" "F2016_1" "F2016_1" "F2016_1" ...
 $ GRD_LVL  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ GENDER   : chr  "Female" "Female" "Female" "Male" ...
 $ ETHNICITY: chr  "White" "White" "White" "Hispanics" ...
 $ MINORITY : chr  "non_Minority" "non_Minority" "non_Minority" "Minority" ...
 $ SES      : chr  "low_SES" "low_SES" "low_SES" "low_SES" ...
 $ ESE      : chr  "non_Exceptional Students" "Gifted Students" "non_Exceptional Students" "non_Exceptional Students" ...
 $ ESOL     : chr  "ELs" "Non ELs" "Non ELs" "ELs" ...
 $ PRETEST  : num  31.9 59.7 47.2 51.4 51.4 ...
 $ POSTTEST : num  68.1 91.7 83.3 87.5 87.5 ...
 $ DATE     : Date, format: "2016-08-16" "2016-08-16" ...

Checking for Missing Data

   STD_ID    PST_ID   GRD_LVL    GENDER ETHNICITY  MINORITY       SES       ESE      ESOL 
        0         0         0         0       352         0       173         0         0 
  PRETEST  POSTTEST      DATE 
        0         0         0 
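A per-column count like the one above is commonly produced with `colSums(is.na(...))`. The sketch below shows that idiom on a small made-up data frame; the values and the object name `df` are assumptions for illustration, not the study data.

```r
# Toy data frame standing in for the study data (values are made up).
df <- data.frame(
  ETHNICITY = c("White", NA, "Hispanics", NA),
  SES       = c("low_SES", "low_SES", NA, "low_SES"),
  PRETEST   = c(31.94, 59.72, 47.22, 51.39),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE cells per column.
na_counts <- colSums(is.na(df))
print(na_counts)

# Share of rows that have at least one missing value.
incomplete_share <- mean(!complete.cases(df))
print(incomplete_share)
```

Counting per column first makes it easy to see that only some variables (here ETHNICITY and SES) are affected before deciding whether to drop rows or columns.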

Subsetting the Data

'data.frame':   13163 obs. of  4 variables:
 $ esol    : chr  "ELs" "Non ELs" "Non ELs" "ELs" ...
 $ pretest : num  31.9 59.7 47.2 51.4 51.4 ...
 $ posttest: num  68.1 91.7 83.3 87.5 87.5 ...
 $ date    : Date, format: "2016-08-16" "2016-08-16" ...

Working in ESOL Column

       ELs Exited ELs    Non ELs 
      1416        354      11393 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   3.000   2.758   3.000   3.000 
    esol  pretest posttest     date 
       0        0        0        0 

Splitting the Data into Training and Testing Datasets (70/30)
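The report does not show how the 70/30 split was made. For time-ordered data, one common choice is a chronological split, so that the test set lies entirely after the training set. The sketch below assumes that approach; the data frame `df` and its values are hypothetical.

```r
set.seed(1)

# Hypothetical stand-in for the subsetted data: one score per date.
df <- data.frame(
  date     = seq(as.Date("2016-08-01"), by = "month", length.out = 20),
  posttest = round(runif(20, 60, 100), 2)
)

# Sort by date so the split respects time order.
df <- df[order(df$date), ]

# First 70% of rows for training, remaining 30% for testing.
n_train <- floor(0.7 * nrow(df))
train   <- df[seq_len(n_train), ]
test    <- df[(n_train + 1):nrow(df), ]

print(c(train = nrow(train), test = nrow(test)))
```

A random split would instead use `sample()` on the row indices, but that mixes past and future observations, which is usually undesirable when the goal is forecasting.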

Running Time Series Analysis in Posttest Scores

Convert the data frame to a time series object

[1] "ts"
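The conversion code is not shown in this report. One possible approach, sketched below under the assumption that a single value per date is wanted, is to average the scores by date and pass the result to `ts()`. The data values and the name `by_date` are illustrative only.

```r
# Made-up rows in the same shape as the subsetted data (posttest + date).
df <- data.frame(
  date     = as.Date(c("2016-08-16", "2016-08-16", "2016-09-01",
                       "2016-09-01", "2016-10-01")),
  posttest = c(68.06, 91.67, 83.33, 87.50, 100.00)
)

# Average the scores within each date; aggregate() returns rows sorted by date.
by_date <- aggregate(posttest ~ date, data = df, FUN = mean)

# Wrap the averaged values in a time series object.
time_series <- ts(by_date$posttest)

print(class(time_series))
print(time_series)
```

Passing the raw, student-level column to `ts()` also yields an object of class "ts", but it then treats every student record as a separate time point.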

Decompose the time series into its components


    Augmented Dickey-Fuller Test

data:  time_series
Dickey-Fuller = -17.222, Lag order = 20, p-value = 0.01
alternative hypothesis: stationary

The Augmented Dickey-Fuller (ADF) test is used to determine whether a time series is stationary or not. Stationarity refers to a time series having a constant mean, variance, and autocovariance structure over time.

In the output above, the test statistic (Dickey-Fuller) is -17.222, the p-value is 0.01, and the lag order used for the test is 20.

The null hypothesis of the ADF test is that the time series is non-stationary. The alternative hypothesis is that the time series is stationary.

Since the p-value (0.01) is less than the significance level of 0.05, we can reject the null hypothesis and conclude that the time series is stationary. This means that the mean, variance, and autocovariance structure of the time series do not change over time.
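To make the logic of the test concrete, the sketch below runs a simplified Dickey-Fuller regression by hand on simulated data. It is not the augmented test reported above: it omits the lagged differences and the trend term, and it uses base R only. The function name `df_stat` and the simulated series are assumptions for illustration.

```r
set.seed(42)
n <- 500

# Two simulated series: white noise (stationary) and a random walk (unit root).
stationary  <- rnorm(n)
random_walk <- cumsum(rnorm(n))

# Simplified Dickey-Fuller regression: diff(y)_t = a + g * y_(t-1) + e_t.
# The statistic is the t value of g; strongly negative values speak against
# a unit root. No lag augmentation or trend term is included here.
df_stat <- function(y) {
  dy   <- diff(y)
  ylag <- y[-length(y)]
  fit  <- lm(dy ~ ylag)
  coef(summary(fit))["ylag", "t value"]
}

stat_stationary <- df_stat(stationary)
stat_walk       <- df_stat(random_walk)

print(c(stationary = stat_stationary, random_walk = stat_walk))
```

The statistic has to be compared with Dickey-Fuller critical values rather than the usual t table, which is why packaged implementations are normally preferred for the actual test.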

Make a forecast for the next 6 periods
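The forecasting code and the model specification for this step are not shown. As a minimal sketch, the example below fits an AR(1) model with `arima()` from base R to a simulated series and forecasts six periods ahead with `predict()`. The AR(1) order and the simulated data are assumptions, not the model used in this report.

```r
set.seed(7)

# Simulated series fluctuating around 80 (hypothetical stand-in for scores).
y <- ts(80 + arima.sim(model = list(ar = 0.5), n = 60))

# Fit an AR(1) model with a mean term; the order is an assumption.
fit <- arima(y, order = c(1, 0, 0))

# Forecast the next 6 periods; predict() returns point forecasts and
# their standard errors.
fc <- predict(fit, n.ahead = 6)

print(fc$pred)
print(fc$se)
```

The standard errors grow with the horizon, so the later forecasts should be read with wider uncertainty than the first one.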

Evaluating the Performance of Time Series Model

[1] "numeric"
[1] "numeric"
MSE: 366.7782 
RMSE: 19.15145 
MAPE: Inf %

The MSE (Mean Squared Error) is a measure of the average squared differences between the predicted values and the actual values. The lower the MSE, the better the model fits the data. In this case, the MSE is 366.7782.

The RMSE (Root Mean Squared Error) is the square root of the MSE, and is a measure of the average distance between the predicted values and the actual values. Like the MSE, the lower the RMSE, the better the model fits the data. In this case, the RMSE is 19.15145.

The MAPE (Mean Absolute Percentage Error) is a measure of the accuracy of the model as a percentage of the actual values. It indicates the average percentage difference between the predicted and actual values. A value of 0% indicates a perfect fit. In this case, the MAPE is Inf %. An infinite MAPE does not by itself show that the model is inaccurate: it arises because at least one actual value is 0, so the percentage error divides by zero. MAPE is therefore uninformative for these data, and the other metrics are the better guides.

The Mean Absolute Error (MAE) between the predicted values and the actual values in the test dataset is approximately 15.01643. In other words, the predictions differ from the actual values by about 15.02 units on average. A lower MAE indicates better accuracy of the model.
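The four metrics discussed above can be computed in a few lines. The sketch below uses made-up vectors (one actual value is deliberately 0) to show why MAPE becomes infinite; the function name `forecast_metrics` is hypothetical.

```r
# Compute MAE, MSE, RMSE, and MAPE for paired actual and predicted values.
forecast_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    MAE  = mean(abs(err)),
    MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    MAPE = mean(abs(err / actual)) * 100  # divides by zero when actual == 0
  )
}

# Made-up values; the third actual score is 0 on purpose.
actual    <- c(80, 100, 0, 90)
predicted <- c(70, 90, 10, 100)

m_all <- forecast_metrics(actual, predicted)
print(m_all)

# One possible workaround: compute the metrics on non-zero actual values only.
keep      <- actual != 0
m_nonzero <- forecast_metrics(actual[keep], predicted[keep])
print(m_nonzero)
```

Dropping the zero scores changes what the MAPE describes, so MAE or RMSE are usually the simpler choice when zeros are legitimate observations.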

Calculate MAE, MSE, RMSE, and MAPE

Comparing EL

Subsetting Data into Three Distinct Datasets

'data.frame':   1416 obs. of  4 variables:
 $ esol    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ pretest : num  31.9 51.4 43.1 11.1 23.6 ...
 $ posttest: num  68.1 87.5 87.5 72.2 87.5 ...
 $ date    : Date, format: "2016-08-16" "2016-08-16" ...
'data.frame':   11393 obs. of  4 variables:
 $ esol    : num  3 3 3 3 3 3 3 3 3 3 ...
 $ pretest : num  59.7 47.2 51.4 59.7 27.8 ...
 $ posttest: num  91.7 83.3 87.5 100 79.2 ...
 $ date    : Date, format: "2016-08-16" "2016-08-16" ...
'data.frame':   354 obs. of  4 variables:
 $ esol    : num  2 2 2 2 2 2 2 2 2 2 ...
 $ pretest : num  37.5 42.5 50 60 33.3 ...
 $ posttest: num  90 90 90 93.3 75 ...
 $ date    : Date, format: "2016-08-16" "2016-08-16" ...

Fitting the ARIMA Models for EL, exited-EL and non-ELs on Pretest & posttest

ELs Pretest Model Metrics:
MAE: 33.81893 
MSE: 1447.849 
RMSE: 38.05061 
MAPE: Inf %
ELs Posttest Model Metrics:
MAE: 18.37413 
MSE: 509.6918 
RMSE: 22.57635 
MAPE: Inf %

The metrics printed for each model represent different measures of error between the predicted values and the actual values of the response variable for the ELs group, pretest and posttest.

MAE (Mean Absolute Error) is the average of the absolute difference between the predicted values and the actual values. In the case of the ELs Pretest Model, the average absolute difference is 33.82, and for the ELs Posttest Model, it is 18.37.

MSE (Mean Squared Error) is the average of the squared difference between the predicted values and the actual values. In the case of the ELs Pretest Model, the average squared difference is 1447.85, and for the ELs Posttest Model, it is 509.69.

RMSE (Root Mean Squared Error) is the square root of the average of the squared difference between the predicted values and the actual values. In the case of the ELs Pretest Model, the square root of the average squared difference is 38.05, and for the ELs Posttest Model, it is 22.58.

MAPE (Mean Absolute Percentage Error) is the average of the absolute percentage difference between the predicted values and the actual values. In the case of both models, the MAPE is calculated as Inf% due to the presence of 0 values in the actual values.

Therefore, these metrics can help us compare the performance of the two models and determine which model provides better predictions for the ELs group, pretest and posttest.

The posttest model for ELs performed better than the pretest model, as its MAE, MSE, and RMSE are all lower. The infinite MAPE for both models reflects zero values among the actual scores rather than poor accuracy, so it cannot be used to compare the two models.

Plotting the Pretest and Posttest ARIMA EL models Side by Side

Separate procedure

Time Series Linear Model for All Students

      esol          pretest          posttest           date           
 Min.   :1.000   Min.   :  0.00   Min.   :  0.00   Min.   :2016-08-16  
 1st Qu.:3.000   1st Qu.: 30.00   1st Qu.: 70.83   1st Qu.:2017-01-01  
 Median :3.000   Median : 48.81   Median : 85.42   Median :2017-08-01  
 Mean   :2.758   Mean   : 48.66   Mean   : 80.65   Mean   :2017-07-21  
 3rd Qu.:3.000   3rd Qu.: 66.67   3rd Qu.: 95.83   3rd Qu.:2018-01-01  
 Max.   :3.000   Max.   :100.00   Max.   :100.00   Max.   :2018-08-01  

Call:
lm(formula = ts_data[, 2] ~ ts_data[, 1])

Residuals:
     Min       1Q   Median       3Q      Max 
-21.0536  -3.2297  -0.9865   2.9144  15.0461 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  64.74101    3.81267  16.980 1.66e-14 ***
ts_data[, 1]  0.46206    0.07788   5.933 4.77e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.249 on 23 degrees of freedom
Multiple R-squared:  0.6048,    Adjusted R-squared:  0.5876 
F-statistic:  35.2 on 1 and 23 DF,  p-value: 4.766e-06

This output is from a linear regression model. Consistent with the models that follow, ts_data[, 2] appears to be the posttest score and ts_data[, 1] the pretest score, so the model regresses posttest on pretest.

The intercept of 64.74 is the predicted posttest score when the pretest score is 0. The coefficient of 0.46 means that for each one-unit increase in the pretest score, the predicted posttest score increases by 0.46 units.

The p-value of 4.77e-06 is less than 0.05, which means that the pretest coefficient is statistically significant. The R-squared of 0.6048 indicates that 60.48% of the variance in posttest scores is explained by the linear relationship with pretest. The Adjusted R-squared of 0.5876 is only slightly lower, as expected for a model with a single predictor and 25 observations.

The residuals range from -21.05 to 15.05, with the middle half between -3.23 and 2.91, so most fitted values are within a few units of the observed scores while a small number of observations are fit less well.
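A two-variable fit of this kind can be reproduced in outline as follows. The sketch uses simulated pretest and posttest vectors of length 25 (matching the 23 residual degrees of freedom above); the numbers are made up and will not reproduce the reported estimates.

```r
set.seed(123)

# Simulated scores: 25 observations, as implied by 23 residual df.
pretest  <- runif(25, 20, 80)
posttest <- 65 + 0.45 * pretest + rnorm(25, sd = 8)

# Simple linear regression with one predictor.
fit <- lm(posttest ~ pretest)
s   <- summary(fit)

print(coef(fit))         # intercept and slope
print(s$r.squared)       # share of variance explained
print(s$adj.r.squared)   # R-squared penalized for the number of predictors

# Predicted score for a hypothetical value of the predictor.
pred_50 <- predict(fit, newdata = data.frame(pretest = 50))
print(pred_50)
```

Using named columns in the formula, rather than matrix indexing such as `ts_data[, 1]`, makes the summary output self-describing and allows `predict()` to be used with new data.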

Plotting the Model

Adding ESOL in the model


Call:
lm(formula = ts_data[, 2] ~ ts_data[, 1] + ts_data[, 3])

Residuals:
    Min      1Q  Median      3Q     Max 
-18.533  -4.195  -1.192   5.465  14.777 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  60.32892    4.89498  12.325 2.38e-11 ***
ts_data[, 1]  0.40356    0.08708   4.634 0.000128 ***
ts_data[, 3]  2.86639    2.05433   1.395 0.176855    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.084 on 22 degrees of freedom
Multiple R-squared:  0.6369,    Adjusted R-squared:  0.6039 
F-statistic:  19.3 on 2 and 22 DF,  p-value: 1.445e-05

The linear model was fitted using the lm() function, with posttest as the response variable and pretest and esol as predictor variables. Pretest has a statistically significant relationship with posttest (p = 0.00013), whereas esol does not (p = 0.1769). The intercept is also statistically significant, with a p-value of 2.38e-11. The intercept of 60.32892 is the predicted posttest score when both the pretest score and the ESOL code are zero; because esol is coded from 1 to 3, this is an extrapolation with no direct interpretation. The residual standard error is 8.084.

The coefficient of the pretest score is 0.40356, indicating that for every one-unit increase in the pretest score, the posttest score is expected to increase by 0.40356 units, holding other variables constant. This coefficient is statistically significant at the 0.001 level, meaning that the relationship between pretest and posttest scores is highly likely not due to chance.

The coefficient of the ESOL level is 2.86639, indicating that for every one-unit increase in the ESOL level, the posttest score is expected to increase by 2.86639 units, holding other variables constant. However, this coefficient is not statistically significant at the 0.05 level, meaning that the relationship between ESOL level and posttest scores could be due to chance.

The multiple R-squared value of 0.6369 suggests that the model explains 63.69% of the variance in the posttest scores, and the adjusted R-squared value of 0.6039 takes into account the number of variables in the model. The F-statistic value of 19.3 with a p-value of 1.445e-05 suggests that the overall model is statistically significant at the 0.001 level.

Therefore, based on this model, the pretest score is a significant predictor of posttest scores, while the ESOL code, entered as a single numeric variable, is not significant at the 0.05 level.

Dummy coding esol and creating new model


Call:
lm(formula = ts_data_time[, 2] ~ ts_data_time[, 1] + dummies)

Residuals:
    Min      1Q  Median      3Q     Max 
-97.095  -8.646   2.905  11.601  40.163 

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          66.654466   0.343871 193.835  < 2e-16 ***
ts_data_time[, 1]     0.304410   0.006087  50.012  < 2e-16 ***
dummiesesolel        -6.817758   0.490405 -13.902  < 2e-16 ***
dummiesesolexited-el -3.263711   0.933799  -3.495 0.000475 ***
dummiesesolnon-els          NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.29 on 13159 degrees of freedom
Multiple R-squared:  0.1813,    Adjusted R-squared:  0.1811 
F-statistic: 971.2 on 3 and 13159 DF,  p-value: < 2.2e-16
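The note "1 not defined because of singularities" and the NA row for the non-ELs dummy are what `lm()` reports when a full set of indicator columns is entered together with an intercept: the indicators sum to 1 in every row, so one of them is redundant. The sketch below reproduces this on simulated data and shows the usual alternative of passing a factor with a chosen reference level. All values are made up.

```r
set.seed(2)

# Simulated data with three ESOL groups (labels mirror the output above).
esol     <- factor(sample(c("el", "exited-el", "non-els"), 200, replace = TRUE))
pretest  <- runif(200, 0, 100)
posttest <- 65 + 0.3 * pretest - 7 * (esol == "el") -
  3 * (esol == "exited-el") + rnorm(200, sd = 10)

# Full indicator set: one 0/1 column per level, no column dropped.
dummies  <- model.matrix(~ esol - 1)
fit_full <- lm(posttest ~ pretest + dummies)
print(coef(fit_full))   # the last dummy is aliased with the intercept

# Alternative: let lm() code the factor, with non-ELs as the reference group.
fit_factor <- lm(posttest ~ pretest + relevel(esol, ref = "non-els"))
print(coef(fit_factor))
```

In both fits the remaining group coefficients are differences from the non-ELs group, so the NA does not indicate a failed model, only a redundant column.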

Working on EL, non-EL and exited-EL data

'data.frame':   1416 obs. of  4 variables:
 $ esol    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ pretest : num  31.9 51.4 43.1 11.1 23.6 ...
 $ posttest: num  68.1 87.5 87.5 72.2 87.5 ...
 $ date    : Date, format: "2016-08-16" "2016-08-16" ...
'data.frame':   11393 obs. of  4 variables:
 $ esol    : num  3 3 3 3 3 3 3 3 3 3 ...
 $ pretest : num  59.7 47.2 51.4 59.7 27.8 ...
 $ posttest: num  91.7 83.3 87.5 100 79.2 ...
 $ date    : Date, format: "2016-08-16" "2016-08-16" ...
'data.frame':   354 obs. of  4 variables:
 $ esol    : num  2 2 2 2 2 2 2 2 2 2 ...
 $ pretest : num  37.5 42.5 50 60 33.3 ...
 $ posttest: num  90 90 90 93.3 75 ...
 $ date    : Date, format: "2016-08-16" "2016-08-16" ...
     [,1]   [,2]      [,3]       [,4]  
[1,] "esol" "pretest" "posttest" "date"
[2,] "esol" "pretest" "posttest" "date"
[3,] "esol" "pretest" "posttest" "date"
      [,1] [,2]
[1,]  1416    4
[2,] 11393    4
[3,]   354    4

EL Only Time Series


Call:
lm(formula = ts_el[, 2] ~ ts_el[, 1])

Residuals:
    Min      1Q  Median      3Q     Max 
-24.507 -10.501   3.617   7.697  22.029 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  53.2903     6.2012   8.594 1.23e-08 ***
ts_el[, 1]    0.5159     0.1388   3.718  0.00113 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.49 on 23 degrees of freedom
Multiple R-squared:  0.3754,    Adjusted R-squared:  0.3483 
F-statistic: 13.83 on 1 and 23 DF,  p-value: 0.001129

The output shows the estimated coefficients of the model. The intercept is 53.2903 and the slope of the regression line is 0.5159. The p-value associated with the slope coefficient is very small (0.00113), which means that the slope is significantly different from zero at the 5% significance level. This indicates that there is a significant linear relationship between the two variables in the model.

The multiple R-squared value of 0.3754 indicates that approximately 37.54% of the variability in the dependent variable is explained by the independent variable in the model. The adjusted R-squared value of 0.3483 suggests that the independent variable explains a moderate amount of the variation in the dependent variable after adjusting for the number of predictor variables.

The F-statistic value of 13.83 and its associated p-value of 0.001129 indicate that the model as a whole is significant. Finally, the residual standard error of 12.49 estimates the standard deviation of the error term, that is, the typical amount by which the observed response values deviate from the fitted regression line.

Exited-EL Only Model


Call:
lm(formula = ts_exited_els[, 2] ~ ts_exited_els[, 1])

Residuals:
    Min      1Q  Median      3Q     Max 
-28.774  -6.447   3.313   6.865  20.598 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         70.4437     4.8035  14.665 3.67e-13 ***
ts_exited_els[, 1]   0.2872     0.1114   2.579   0.0168 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.56 on 23 degrees of freedom
Multiple R-squared:  0.2243,    Adjusted R-squared:  0.1906 
F-statistic:  6.65 on 1 and 23 DF,  p-value: 0.01679

The Intercept coefficient is 70.4437, meaning that when the independent variable is zero, the dependent variable is expected to have a value of 70.44. The coefficient for the independent variable is 0.2872, indicating that for every one unit increase in the independent variable, the dependent variable is expected to increase by 0.2872 units.

The p-value associated with the coefficient for the independent variable is 0.0168, indicating that it is statistically significant at the 0.05 level. This suggests that there is evidence to support the conclusion that the independent variable has a statistically significant relationship with the dependent variable.

The R-squared value is 0.2243, meaning that approximately 22.43% of the variation in the dependent variable is explained by the variation in the independent variable. The Adjusted R-squared value is 0.1906; it penalizes R-squared for the number of predictors relative to the sample size, and with one predictor and 25 observations the adjustment is small. The residual standard error measures the typical amount by which the observed values deviate from the predicted values and is 11.56 in this model.

Non-ELs Only Model


Call:
lm(formula = ts_non_els[, 2] ~ ts_non_els[, 1])

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0907  -3.9488  -0.9474   4.1931  13.1336 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     69.79407    3.84996  18.129 4.07e-15 ***
ts_non_els[, 1]  0.37161    0.06948   5.349 1.97e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.304 on 23 degrees of freedom
Multiple R-squared:  0.5543,    Adjusted R-squared:  0.5349 
F-statistic: 28.61 on 1 and 23 DF,  p-value: 1.973e-05

The intercept (69.79) represents the predicted posttest score when the pretest score is 0. The coefficient for the pretest score (0.37) indicates that, on average, for every one-unit increase in pretest score, the posttest score increased by 0.37 units.

The p-value for the coefficient of pretest score is <0.001, which means that this coefficient is statistically significant. Therefore, we can conclude that pretest scores are a significant predictor of posttest scores for non-EL students.

The Multiple R-squared value (0.5543) indicates that about 55.43% of the variability in the posttest scores can be explained by the pretest scores. The Adjusted R-squared value (0.5349) is slightly lower because it penalizes R-squared for the number of predictors relative to the sample size; it does not indicate a problem with the fit.

The F-statistic (28.61) and the p-value (1.973e-05) for the model as a whole indicate that the model is statistically significant and that the predictor (pretest score) contributes significantly to the model’s explanatory power.

The residual standard error (6.304) represents the standard deviation of the errors or residuals of the model, which is an estimate of the variability of the posttest scores that is not explained by the pretest scores.

Conclusion

The three outputs all show the results of regression models with one independent variable and one dependent variable, but they differ in their interpretation and significance.

In the first output (EL Only), the intercept is 53.2903, and the slope coefficient is 0.5159. The p-value associated with the slope coefficient is small (0.00113), indicating that the slope is significantly different from zero at the 5% significance level. This suggests a moderate linear relationship between the two variables in the model: approximately 37.54% of the variability in the dependent variable is explained by the independent variable.

In the second output (Exited-EL), the intercept is 70.4437, and the slope coefficient is 0.2872. The p-value associated with the slope coefficient is 0.0168, indicating that the slope is statistically significant at the 5% level. The R-squared value (0.2243) is lower than in the first output, so the independent variable explains less of the variation for this group. The Adjusted R-squared value (0.1906) is slightly lower again because it penalizes for the number of predictors.

In the third output (non-EL), the intercept is 69.79, close to that of the second output, but the slope coefficient is 0.3716 and its p-value is 1.97e-05. The R-squared value is 0.5543 and the Adjusted R-squared value is 0.5349, the highest of the three models, and the residual standard error (6.304) is the lowest.

Overall, the third output (non-EL) shows the strongest linear relationship, as indicated by the highest R-squared value, followed by the first output (EL Only) and then the second (Exited-EL). The EL Only model has the steepest slope (0.5159), meaning that a one-unit difference in the independent variable is associated with the largest difference in the dependent variable for that group. All three slopes are statistically significant at the 5% level.
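To compare the three models side by side, the statistics printed in the three summaries above can be collected into one table. The sketch below copies those reported values by hand into a data frame (the object and column names are illustrative) and ranks the models.

```r
# Values transcribed from the three lm() summaries above.
models <- data.frame(
  group     = c("EL", "Exited-EL", "Non-EL"),
  intercept = c(53.2903, 70.4437, 69.79407),
  slope     = c(0.5159, 0.2872, 0.37161),
  p_value   = c(0.00113, 0.0168, 1.97e-05),
  r_squared = c(0.3754, 0.2243, 0.5543),
  adj_r_sq  = c(0.3483, 0.1906, 0.5349),
  resid_se  = c(12.49, 11.56, 6.304),
  stringsAsFactors = FALSE
)

# Rank the models by share of variance explained, highest first.
print(models[order(-models$r_squared), ])

# Which group has the best fit, and which has the steepest slope?
best_fit <- models$group[which.max(models$r_squared)]
steepest <- models$group[which.max(models$slope)]
print(c(best_fit = best_fit, steepest = steepest))
```

Laying the numbers out this way separates two questions that are easy to conflate: how well the line fits (R-squared, residual standard error) and how steep it is (slope).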

Plotting EL Only Model

Plotting the Exited-EL Only

Plotting the Non-EL Only

Putting them together