The data set used for this analysis is the Daily Female Births data set, a daily count of female births in California in 1959. It has 365 observations and 2 variables, Date and Births. Date is a date variable that runs from 1959-01-01 to 1959-12-31. Births is an integer that counts the births for each observation; its minimum, mean, and maximum are 23, 41.98, and 73, respectively. There seem to be no missing values. The data set's structure and summary output follow; a time series plot of female births follows as well.
'data.frame': 365 obs. of 2 variables:
$ Date : Date, format: "1959-01-01" "1959-01-02" ...
$ Births: int 35 32 30 31 44 29 45 43 38 27 ...
Date Births
Min. :1959-01-01 Min. :23.00
1st Qu.:1959-04-02 1st Qu.:37.00
Median :1959-07-02 Median :42.00
Mean :1959-07-02 Mean :41.98
3rd Qu.:1959-10-01 3rd Qu.:46.00
Max. :1959-12-31 Max. :73.00
Next, the data are split into training and test data sets. The most recent ten observations are kept for the test data set. The resulting counts are tabulated below.
| Data set | Observations |
|---|---|
| Train | 355 |
| Test | 10 |
| Original | 365 |
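The split described above can be sketched as follows. This is a minimal illustration in Python, not the R code used for the analysis; `split_train_test` is a hypothetical helper name, and the `births` list holds placeholder values rather than the real counts.

```python
def split_train_test(series, n_test):
    """Hold out the last n_test observations as the test set."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]


# Placeholder values standing in for the 365 daily counts (not the real data).
births = list(range(365))
train, test = split_train_test(births, n_test=10)
print(len(train), len(test), len(births))  # 355 10 365
```

Holding out the most recent observations, rather than a random sample, keeps the time order intact so that the forecasts are always made forward in time.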
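A time series object was constructed using the ts() function. The frequency was set to 12 even though the data are daily observations spanning 365 days. This choice is owed to the behavior of the snaive function. When the ts object was given a daily frequency, the snaive function's output differed from what was expected: it output the first ten observations from the test data set instead of the first ten of the last 12 observations in the training data set. The meanf, naive, and rwf functions, however, do not depend on the frequency and produced the same forecasts under either setting. A side effect of the frequency of 12 is that the printed output labels the 355 daily training observations as months running from January 1959 to July 1988; those labels are not real dates. Output of the ts object follows.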
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1959 35 32 30 31 44 29 45 43 38 27 38 33
1960 55 47 45 37 50 43 41 52 34 53 39 32
1961 37 43 39 35 44 38 24 23 31 44 38 50
1962 38 51 31 31 51 36 45 51 34 52 47 45
1963 46 39 48 37 35 52 42 45 39 37 30 35
1964 28 45 34 36 50 44 39 32 39 45 43 39
1965 31 27 30 42 46 41 36 45 46 43 38 34
1966 35 56 36 32 50 41 39 41 47 34 36 33
1967 35 38 38 34 53 34 34 38 35 32 42 34
1968 46 30 46 45 54 34 37 35 40 42 58 51
1969 32 35 38 33 39 47 38 52 30 34 40 35
1970 42 41 42 38 24 34 43 36 55 41 45 41
1971 37 43 39 33 43 40 38 45 46 34 35 48
1972 51 36 33 46 42 48 34 41 35 40 34 30
1973 36 40 39 45 38 47 33 30 42 43 41 41
1974 59 43 45 38 37 45 42 57 46 51 41 47
1975 26 35 44 41 42 36 45 45 45 47 38 42
1976 35 36 39 45 43 47 36 41 50 39 41 46
1977 64 45 34 38 44 48 46 44 37 39 44 45
1978 33 44 38 46 46 40 39 44 48 50 41 42
1979 51 41 44 38 68 40 42 51 44 45 36 57
1980 44 42 53 42 34 40 56 44 53 55 39 59
1981 55 73 55 44 43 40 47 51 56 49 54 56
1982 47 44 43 42 45 50 48 43 40 59 41 42
1983 51 49 45 43 42 38 47 38 36 42 35 28
1984 44 36 45 46 48 49 43 42 59 45 52 46
1985 42 40 40 45 35 35 40 39 33 42 47 51
1986 44 40 57 49 45 49 51 46 44 52 45 32
1987 46 41 34 33 36 49 43 43 34 39 35 52
1988 47 52 39 40 42 42 53
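How a frequency of 12 produces these labels, and which training observations a seasonal lag of 12 reaches back to, can be sketched as follows. This is an illustrative Python sketch, not the behavior of R's ts() or snaive() code itself; `period_label` is a hypothetical helper, and the January 1959 start is read off the printed output above.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def period_label(position, start_year=1959):
    """Label of the 1-based observation `position` when the series is
    treated as monthly data starting in January of start_year."""
    cycle, offset = divmod(position - 1, 12)
    return start_year + cycle, MONTHS[offset]


n_train = 355
print(period_label(1))        # (1959, 'Jan')
print(period_label(n_train))  # (1988, 'Jul')

# With a seasonal lag of 12, the forecast for step h (h <= 12) repeats
# the training observation at position n_train - 12 + h.
source_positions = [n_train - 12 + h for h in range(1, 11)]
print(source_positions[0], source_positions[-1])  # 344 353
print(period_label(344), period_label(353))       # (1987, 'Aug') (1988, 'May')
```

Under this reading, the ten seasonal naive forecasts come from the cells labeled Aug 1987 through May 1988 in the output above.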
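Forecasting was conducted using the meanf, naive, snaive, and rwf functions, with the forecast horizon set to 10 periods. Tabular output of the 10 predictions follows. The mAVG column holds the meanf forecast, which is the mean of the training data rather than a rolling moving average. The output is consistent with the expected behavior of each function.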
| mAVG | Naive | SNaive | Drift |
|---|---|---|---|
| 41.93239 | 53 | 43 | 53.05085 |
| 41.93239 | 53 | 34 | 53.10169 |
| 41.93239 | 53 | 39 | 53.15254 |
| 41.93239 | 53 | 35 | 53.20339 |
| 41.93239 | 53 | 52 | 53.25424 |
| 41.93239 | 53 | 47 | 53.30508 |
| 41.93239 | 53 | 52 | 53.35593 |
| 41.93239 | 53 | 39 | 53.40678 |
| 41.93239 | 53 | 40 | 53.45763 |
| 41.93239 | 53 | 42 | 53.50847 |
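The arithmetic behind the four columns can be sketched in a few lines. This is a plain Python illustration under stated assumptions, not the forecast package's implementation: the function names are hypothetical, the last 12 training values are copied from the ts output above, and the drift slope uses the first training value (35), the last (53), and the training length (355).

```python
def mean_forecast(y, h):
    """Every forecast equals the mean of the training series."""
    return [sum(y) / len(y)] * h


def naive_forecast(y, h):
    """Every forecast equals the last training observation."""
    return [y[-1]] * h


def seasonal_naive_forecast(y, h, m):
    """Each forecast repeats the value observed m periods earlier."""
    start = len(y) - m
    return [y[start + (i % m)] for i in range(h)]


def drift_forecast(first, last, n, h):
    """Last value plus the average historical change per step."""
    slope = (last - first) / (n - 1)
    return [last + slope * step for step in range(1, h + 1)]


# Last 12 training observations, read from the ts output above.
last_12_train = [43, 34, 39, 35, 52, 47, 52, 39, 40, 42, 42, 53]

print(naive_forecast(last_12_train, 10))               # ten copies of 53
print(seasonal_naive_forecast(last_12_train, 10, 12))  # first ten of the last 12
drift = drift_forecast(first=35, last=53, n=355, h=10)
print([round(v, 5) for v in drift])                    # 53.05085 up to 53.50847
```

Reproducing the 41.93239 in the mAVG column would require passing all 355 training values to `mean_forecast`, so that call is left out here.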
A time series graph follows. It depicts the last 25 observations of the original data set. Note that the training data set stops at the x-axis marker 355. From 356 onward, the test data set values and the predictions for the same periods are plotted. A legend in the upper right-hand corner of the graph identifies each prediction series. The moving average, in red, is constant at about 41.9; the Naive, in royal blue, is constant at 53; the Drift, in navy, creeps slowly upward from 53; and the SNaive, in purple, repeats the first ten of the last 12 training observations. These visuals are consistent with each method’s expected output.
Accuracy metrics were derived. These measures reflect both the absolute and relative errors produced by the predictions. The specific ones used were the MAPE, MAD, and MSE. Notice that in each case the moving average has the lowest value among them, and that the Drift trails the Naive. Note also that each metric aggregates the distances of the predictions from the actual values. The MAD values in the table appear to be totals over the ten forecasts rather than averages (a mean absolute error of 61 is not possible when the births range from 23 to 73); this does not affect the ranking. The moving average’s value is the mean of 355 of the 365 observations in the original data set, so it is likely close to the mean of both the original and the test data sets. A constant forecast near the mean of the actual values keeps the summed squared distances small, since the mean itself is the constant that minimizes them. This may explain why the moving average has the lowest value under each metric, even though, visually, the SNaive seems to follow the test observations well. Recall, further, that the Drift and Naive values are similar; their integer parts do not differ in this analysis. However, the Drift adds a small positive increment to each prediction, so it is always greater than the Naive here. When the actual values lie mostly below 53, as the error totals suggest, that increment moves the Drift further from them; this may explain why it trails the Naive under each metric. Based on this information, determining which method to use may be difficult.
| | MAPE | MAD | MSE |
|---|---|---|---|
| Moving Average | 13.59549 | 61.00000 | 49.33443 |
| Naive | 24.94274 | 97.00000 | 132.70000 |
| Seasonal Naive | 19.03335 | 80.00000 | 92.80000 |
| Drift | 25.39648 | 98.88136 | 136.57242 |
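For reference, the three metrics can be sketched as below. The report does not state its formulas, so these are the common textbook definitions and should be read as an assumption; the function names are hypothetical and the two-point `actual` and `forecast` lists are made-up values, not the test data. A summed absolute error is included because the MAD column above looks like a total over the ten forecasts rather than a mean.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(ratios)


def total_abs_error(actual, forecast):
    """Sum of absolute errors over all forecasts."""
    return sum(abs(a - f) for a, f in zip(actual, forecast))


def mad(actual, forecast):
    """Mean absolute deviation of the forecasts from the actual values."""
    return total_abs_error(actual, forecast) / len(actual)


def mse(actual, forecast):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Made-up values for illustration only (not the test observations).
actual = [40, 50]
forecast = [44, 45]

print(round(mape(actual, forecast), 2))   # 10.0
print(total_abs_error(actual, forecast))  # 9
print(mad(actual, forecast))              # 4.5
print(mse(actual, forecast))              # 20.5
```

MAPE is relative to the size of each actual value, while MAD and MSE are in the units of the data (births and squared births), which is why the three columns are not directly comparable with one another.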
Four forecasting methods were used to predict the count of female births for the last ten days of the year: the moving average (the meanf mean forecast), naive, seasonal naive, and drift methods. The methods were trained on 355 of the data’s 365 observations. The data used were the daily female births in California from 1959-01-01 to 1959-12-31. After forecasting, accuracy metrics were derived. These measures reflected both the absolute and relative errors produced by the predictions. Those metrics were the MAPE, MAD, and MSE. Under each measure the moving average minimized the error the most, followed by the seasonal naive method, then the naive method, and finally the drift method. Interpretations of the output were given and may shed light on why that occurred. Mainly, the moving average is close to the center of the actual values and may therefore be best positioned to minimize the summed errors. Nonetheless, because of peculiarities inherent in each method, a final method for forecasting could not be decided upon. Prior to forecasting, a time series object was created. The frequency used was 12 even though the data are daily. When the object was created with a daily frequency, the output of the seasonal naive method differed from expectation, whereas none of the other methods changed when a monthly frequency was used. A graphic depicting all of the methods is presented as well. One may note that, despite the moving average’s performance, the seasonal naive predictions visually seem to follow the test data well.
A final method for forecasting could not be decided upon. This may be owed to peculiarities inherent in each method. Whether the moving average is the best method to use may depend on the values in the data set. If the last value in the training set is close to the average, then the drift and naive methods may perform as well as the moving average. Likewise, if conditions are right, the seasonal naive method may perform as well as or better than any of the other methods.