Data Description

The data set used for this analysis is the Daily Female Births data set, a daily count of female births in California in 1959. It contains 365 observations and 2 variables, Date and Births. Date is a date variable that runs from 1959-01-01 to 1959-12-31. Births is an integer that counts the births for each observation; its min, mean, and max are, respectively, 23, 41.98, and 73. There seem to be no missing values. The data set's structure and summary output follow. A graphic of the time series for female births follows as well.

'data.frame':   365 obs. of  2 variables:
 $ Date  : Date, format: "1959-01-01" "1959-01-02" ...
 $ Births: int  35 32 30 31 44 29 45 43 38 27 ...
      Date                Births     
 Min.   :1959-01-01   Min.   :23.00  
 1st Qu.:1959-04-02   1st Qu.:37.00  
 Median :1959-07-02   Median :42.00  
 Mean   :1959-07-02   Mean   :41.98  
 3rd Qu.:1959-10-01   3rd Qu.:46.00  
 Max.   :1959-12-31   Max.   :73.00  

Training and Testing Data

Next, the data is split into training and test data sets. The most recent ten periods are kept for the test data set. Tabular output of these counts follows.

Data Split
           Count
Train        355
Test          10
Original     365
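
The split described above can be sketched in a few lines of Python. This is only an illustration of holding out the most recent observations; the function name split_train_test and the placeholder series are assumptions and are not taken from the original analysis.

```python
def split_train_test(values, horizon):
    """Hold out the last `horizon` observations as the test set."""
    if not 0 < horizon < len(values):
        raise ValueError("horizon must be between 1 and len(values) - 1")
    return values[:-horizon], values[-horizon:]


# Placeholder series of 365 values standing in for the daily births
# (hypothetical data, used only to show the resulting sizes).
births = list(range(365))
train, test = split_train_test(births, 10)
print(len(train), len(test), len(births))
```

Slicing from the end keeps the time order intact, which matters because the test set must be the most recent periods rather than a random sample.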

Time Series Object

A time series object was constructed using the ts() function. The frequency was set to 12 even though the data are daily observations over 365 days. This is owed to the behavior of the snaive function. When the ts object was given a daily frequency, the snaive function's output differed from what was expected: it returned the first ten observations of the training data set instead of the first ten of the last 12 observations in the training data set. However, the meanf, naive, and rwf functions produced the same output under either frequency. Output of the ts object follows. Because the frequency is 12, the month and year labels in the output are only positional; each cell is one of the 355 daily training observations.

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1959  35  32  30  31  44  29  45  43  38  27  38  33
1960  55  47  45  37  50  43  41  52  34  53  39  32
1961  37  43  39  35  44  38  24  23  31  44  38  50
1962  38  51  31  31  51  36  45  51  34  52  47  45
1963  46  39  48  37  35  52  42  45  39  37  30  35
1964  28  45  34  36  50  44  39  32  39  45  43  39
1965  31  27  30  42  46  41  36  45  46  43  38  34
1966  35  56  36  32  50  41  39  41  47  34  36  33
1967  35  38  38  34  53  34  34  38  35  32  42  34
1968  46  30  46  45  54  34  37  35  40  42  58  51
1969  32  35  38  33  39  47  38  52  30  34  40  35
1970  42  41  42  38  24  34  43  36  55  41  45  41
1971  37  43  39  33  43  40  38  45  46  34  35  48
1972  51  36  33  46  42  48  34  41  35  40  34  30
1973  36  40  39  45  38  47  33  30  42  43  41  41
1974  59  43  45  38  37  45  42  57  46  51  41  47
1975  26  35  44  41  42  36  45  45  45  47  38  42
1976  35  36  39  45  43  47  36  41  50  39  41  46
1977  64  45  34  38  44  48  46  44  37  39  44  45
1978  33  44  38  46  46  40  39  44  48  50  41  42
1979  51  41  44  38  68  40  42  51  44  45  36  57
1980  44  42  53  42  34  40  56  44  53  55  39  59
1981  55  73  55  44  43  40  47  51  56  49  54  56
1982  47  44  43  42  45  50  48  43  40  59  41  42
1983  51  49  45  43  42  38  47  38  36  42  35  28
1984  44  36  45  46  48  49  43  42  59  45  52  46
1985  42  40  40  45  35  35  40  39  33  42  47  51
1986  44  40  57  49  45  49  51  46  44  52  45  32
1987  46  41  34  33  36  49  43  43  34  39  35  52
1988  47  52  39  40  42  42  53                    
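
To make the labels in this output easier to read, the following Python sketch maps an observation's position to the label a frequency-12 layout assigns to it, and to its actual calendar date. The helper names ts_label and real_date are hypothetical; the start date 1959-01-01 comes from the data description.

```python
from datetime import date, timedelta

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def ts_label(position, start_year=1959, frequency=12):
    """Label given to the 1-based observation `position` when the
    series is laid out with 12 observations per printed row."""
    cycle, offset = divmod(position - 1, frequency)
    return start_year + cycle, MONTHS[offset]


def real_date(position, start=date(1959, 1, 1)):
    """Calendar date of the 1-based daily observation `position`."""
    return start + timedelta(days=position - 1)


# The last training observation (position 355) is printed under
# "1988 Jul", although it was actually recorded in December 1959.
print(ts_label(355), real_date(355))
```

Under this reading, the row labels 1959 to 1988 simply count blocks of 12 consecutive days.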

Forecasting

Forecasting was conducted using the meanf, naive, snaive, and rwf functions, with the forecast horizon set to 10. Tabular output of the 10 predictions follows. The columns mAVG, Naive, SNaive, and Drift correspond to meanf, naive, snaive, and rwf, respectively. The output seems consistent with the expected output of each function.

Forecast, h=10
    mAVG  Naive  SNaive     Drift
41.93239     53      43  53.05085
41.93239     53      34  53.10169
41.93239     53      39  53.15254
41.93239     53      35  53.20339
41.93239     53      52  53.25424
41.93239     53      47  53.30508
41.93239     53      52  53.35593
41.93239     53      39  53.40678
41.93239     53      40  53.45763
41.93239     53      42  53.50847
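
The arithmetic behind the four columns can be sketched in plain Python. This is a minimal illustration of the textbook definitions, not the implementation behind the R functions; the function names are hypothetical. The 12 values in tail are the last 12 training observations as printed in the time series object, and 35 is the first training observation.

```python
def mean_forecast(train, h):
    """Every forecast equals the mean of the training values."""
    return [sum(train) / len(train)] * h


def naive_forecast(train, h):
    """Every forecast equals the last training value."""
    return [train[-1]] * h


def seasonal_naive_forecast(train, h, m):
    """Repeat the last m training values, cycling if h > m."""
    if len(train) < m:
        # Assumption of this sketch only: refuse series shorter
        # than one full season instead of guessing a fallback.
        raise ValueError("need at least m training values")
    return [train[len(train) - m + (k % m)] for k in range(h)]


def drift_forecast(train, h):
    """Last value plus the average change per step, extrapolated."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + (k + 1) * slope for k in range(h)]


# Last 12 training observations, read from the printed ts object.
tail = [43, 34, 39, 35, 52, 47, 52, 39, 40, 42, 42, 53]
print(naive_forecast(tail, 10))
print(seasonal_naive_forecast(tail, 10, 12))

# The drift slope needs only the first value (35), the last value
# (53), and the number of training observations (355).
slope = (53 - 35) / (355 - 1)
print([round(53 + step * slope, 5) for step in range(1, 11)])
```

With a season length of 12 and a horizon of 10, the seasonal naive forecast is the first ten of the last 12 training values, which matches the SNaive column. A season length greater than the number of training observations leaves no complete season to repeat, which may be related to the unexpected output seen with the daily frequency.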

Visualization

A time series graph follows. It depicts the last 25 observations of the original data set. Note that the training data set stops at the x marker 355. From 356 onward, the test data set values are shown together with the predictions derived from the training observations. A legend in the upper right-hand corner of the graph identifies each prediction series. The moving average, in red, seems constant at just under 42. The Naive, in royal blue, seems constant at 53. The Drift, in navy, seems to crawl upward from 53. The SNaive, in purple, seems to mimic the first ten of the last 12 training observations. These visuals seem consistent with each method’s expected output.

Accuracy Metrics

Accuracy metrics were derived. These measures reflect both the absolute and relative errors produced by the predictions. The specific ones used were the MAPE, MAD, and MSE. Notice that in each case the moving average has the lowest value among the four methods, and that the Drift trails the Naive. The MAD values, as reported, appear to be sums of the absolute errors over the ten forecasts rather than means: a mean absolute error of 61 could not coexist with an MSE of 49.33, since the MSE is never smaller than the square of the mean absolute error. Dividing by 10 gives 6.1, 9.7, 8.0, and about 9.89.

Further, note that each metric aggregates the distances of the predictions from the actual values. The moving average’s value is the mean of the 355 training observations out of the 365 in the original data set; it may be close to the mean of both the original and test data sets. A constant forecast near the center of the actual values tends to keep these aggregated distances small. This may explain why the moving average has the lowest value even though, visually, the SNaive seems to follow the test observations well. Further, remember that the Drift and Naive values are similar; their integer parts don’t differ in this analysis. However, the Drift adds a small positive increment to each prediction, so it is always greater than the Naive here. This may explain why it trails the Naive under each metric. Based on this information, determining which method to use may be difficult.

Forecasting Errors
                    MAPE       MAD        MSE
Moving Average  13.59549  61.00000   49.33443
Naive           24.94274  97.00000  132.70000
Seasonal Naive  19.03335  80.00000   92.80000
Drift           25.39648  98.88136  136.57242
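
For reference, the three measures can be written out as below. This is a sketch of the usual definitions, with MAD taken as the mean absolute deviation of the forecasts from the actual values and a summed variant included for comparison. The function names and the two-point example are hypothetical, since the test values are not listed in this report.

```python
def abs_errors(actual, forecast):
    """Absolute distance between each actual value and its forecast."""
    return [abs(a - f) for a, f in zip(actual, forecast)]


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(ratios)


def mad(actual, forecast):
    """Mean of the absolute errors."""
    errors = abs_errors(actual, forecast)
    return sum(errors) / len(errors)


def sum_abs_error(actual, forecast):
    """Sum of the absolute errors, without dividing by the count."""
    return sum(abs_errors(actual, forecast))


def mse(actual, forecast):
    """Mean of the squared errors."""
    squares = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return sum(squares) / len(squares)


# Hypothetical two-point example, not data from the analysis.
actual = [50, 40]
forecast = [45, 44]
print(mape(actual, forecast), mad(actual, forecast),
      sum_abs_error(actual, forecast), mse(actual, forecast))
```

MAPE is the relative measure of the three, while MAD and MSE are on the scale of the births and of the squared births, respectively.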

Discussion

Four forecasting methods were used to predict the count of female births for the last ten days of the year: the moving average, naive, seasonal naive, and drift methods. The methods were trained on 355 of the data’s 365 observations. The data used was the daily female births in California from 1959-01-01 to 1959-12-31.

Prior to forecasting, a time series object was created. The frequency used was 12 despite the data occurring daily. When the object was created with a daily frequency, the output of the seasonal naive method differed from expectation. However, none of the other methods differed when a monthly frequency was used.

After forecasting, accuracy metrics were derived. These measures reflected both the absolute and relative errors produced by the predictions. Those metrics were the MAPE, MAD, and MSE. Under each measure the moving average produced the smallest error, followed by the seasonal naive method, then the naive method, and then the drift method. Interpretations of the output were given and may shed light on why that occurred. Mainly, the moving average is close to the center of the actual values and may therefore be best positioned to minimize the summed errors. Nonetheless, based on peculiarities inherent in each method, a final method for forecasting could not be decided upon. A graphic depicting all of the methods is presented as well. One may note that, despite the moving average’s performance, the seasonal naive predictions visually seem to follow the test data well.

Conclusion

A final method for forecasting could not be decided upon. This may be owed to peculiarities inherent in each method. Whether the moving average is the best method to use may depend on the values in the data set. If the last value in the training set is close to the average, then the drift and naive methods may perform as well as the moving average. Likewise, if conditions are right, the seasonal naive method may perform just as well as or better than any of the other methods.