Using the johnsonjohnson data in R datasets The johnsonjohnson data is a data that shows the Quarterly Earnings per Johnson & Johnson Share
library(datasets)
data("JohnsonJohnson")
summary(JohnsonJohnson)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.440 1.248 3.510 4.800 7.133 16.200
print(JohnsonJohnson)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1960 0.71 0.63 0.85 0.44
## 1961 0.61 0.69 0.92 0.55
## 1962 0.72 0.77 0.92 0.60
## 1963 0.83 0.80 1.00 0.77
## 1964 0.92 1.00 1.24 1.00
## 1965 1.16 1.30 1.45 1.25
## 1966 1.26 1.38 1.86 1.56
## 1967 1.53 1.59 1.83 1.86
## 1968 1.53 2.07 2.34 2.25
## 1969 2.16 2.43 2.70 2.25
## 1970 2.79 3.42 3.69 3.60
## 1971 3.60 4.32 4.32 4.05
## 1972 4.86 5.04 5.04 4.41
## 1973 5.58 5.85 6.57 5.31
## 1974 6.03 6.39 6.93 5.85
## 1975 6.93 7.74 7.83 6.12
## 1976 7.74 8.91 8.28 6.84
## 1977 9.54 10.26 9.54 8.73
## 1978 11.88 12.06 12.15 8.91
## 1979 14.04 12.96 14.85 9.99
## 1980 16.20 14.67 16.02 11.61
Assign the data to a name then plot for visualization
data_1 <- JohnsonJohnson
plot(data_1, main= "Quarterly Earnings per Johnson & Johnson Share",
xlab= "Time", ylab= "Earnings per Johnson & Johnson Share", col="red")
The visualization of the data shows that the data has trend and seasonality, and thus it is not stationary. We can also see that the variance and mean are not constant which is also a sign of non-stationarity.
Asides visualization, we can also check the stationarity of a data through the adf test.
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(data_1)
## Warning in adf.test(data_1): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: data_1
## Dickey-Fuller = 1.9321, Lag order = 4, p-value = 0.99
## alternative hypothesis: stationary
#Here we can see that the p-value is greater than 0.05 which shows that the data is not stationary.
Now, since the data is not stationary, it means the variance and mean are not constant. To make the Variance constant, we first log transform the data.
data_2 <- log(data_1)
data_2
## Qtr1 Qtr2 Qtr3 Qtr4
## 1960 -0.34249031 -0.46203546 -0.16251893 -0.82098055
## 1961 -0.49429632 -0.37106368 -0.08338161 -0.59783700
## 1962 -0.32850407 -0.26136476 -0.08338161 -0.51082562
## 1963 -0.18632958 -0.22314355 0.00000000 -0.26136476
## 1964 -0.08338161 0.00000000 0.21511138 0.00000000
## 1965 0.14842001 0.26236426 0.37156356 0.22314355
## 1966 0.23111172 0.32208350 0.62057649 0.44468582
## 1967 0.42526774 0.46373402 0.60431597 0.62057649
## 1968 0.42526774 0.72754861 0.85015093 0.81093022
## 1969 0.77010822 0.88789126 0.99325177 0.81093022
## 1970 1.02604160 1.22964055 1.30562646 1.28093385
## 1971 1.28093385 1.46325540 1.46325540 1.39871688
## 1972 1.58103844 1.61740608 1.61740608 1.48387469
## 1973 1.71918878 1.76644166 1.88251383 1.66959184
## 1974 1.79674701 1.85473427 1.93585981 1.76644166
## 1975 1.93585981 2.04640169 2.05796251 1.81156210
## 1976 2.04640169 2.18717424 2.11384297 1.92278773
## 1977 2.25549349 2.32825284 2.25549349 2.16676537
## 1978 2.47485631 2.48989419 2.49732917 2.18717424
## 1979 2.64191040 2.56186769 2.69799987 2.30158459
## 1980 2.78501124 2.68580459 2.77383794 2.45186680
plot(data_2, col="blue", main= "Quarterly Earnings per Johnson & Johnson Share", xlab= "Time", ylab= "Earnings per Johnson & Johnson Share")
Then, difference the data, to make the mean constant, thereby reducing the trend and seasonal stationarity.
data_3 <- diff(data_2)
data_3
## Qtr1 Qtr2 Qtr3 Qtr4
## 1960 -0.119545151 0.299516530 -0.658461623
## 1961 0.326684230 0.123232640 0.287682072 -0.514455392
## 1962 0.269332934 0.067139303 0.177983155 -0.427444015
## 1963 0.324496046 -0.036813973 0.223143551 -0.261364764
## 1964 0.177983155 0.083381609 0.215111380 -0.215111380
## 1965 0.148420005 0.113944259 0.109199292 -0.148420005
## 1966 0.007968170 0.090971778 0.298492989 -0.175890666
## 1967 -0.019418086 0.038466281 0.140581951 0.016260521
## 1968 -0.195308752 0.302280872 0.122602322 -0.039220713
## 1969 -0.040821995 0.117783036 0.105360516 -0.182321557
## 1970 0.215111380 0.203598955 0.075985907 -0.024692613
## 1971 0.000000000 0.182321557 0.000000000 -0.064538521
## 1972 0.182321557 0.036367644 0.000000000 -0.133531393
## 1973 0.235314087 0.047252885 0.116072171 -0.212921997
## 1974 0.127155175 0.057987258 0.081125545 -0.169418152
## 1975 0.169418152 0.110541874 0.011560822 -0.246400413
## 1976 0.234839591 0.140772554 -0.073331273 -0.191055237
## 1977 0.332705754 0.072759354 -0.072759354 -0.088728116
## 1978 0.308090944 0.015037877 0.007434978 -0.310154928
## 1979 0.454736157 -0.080042708 0.136132174 -0.396415273
## 1980 0.483426650 -0.099206650 0.088033349 -0.321971146
plot(data_3, col="blue", main= "Quarterly Earnings per Johnson & Johnson Share",
xlab= "Time", ylab= "Earnings per Johnson & Johnson Share")
Now, the mean and variance of the data is now constant, then we can use the adf test again to check if it is now stationary.
adf.test(data_3)
## Warning in adf.test(data_3): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: data_3
## Dickey-Fuller = -4.5649, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
#The p-value is now less than 0.05, which shows the data is stationary.
The Autocorrelation function is the correlation of a time series with a lagged version of itself. It tells how much the current value of the series is related to its past values at different time lags.
The original data has a clear pattern (a trend) over time. It’s like a steadily rising or falling line.
acf(data_1)
Now, the trend has been removed by log transform and differencing, the data looks much more stable. Because they now all tending towards zero.
acf(data_3)
Decomposing a data, reveals the underlying components of the data by dividing it into 4 components which includes; the original data, the seasonal component, trend component, and the random which is the remainder or noise. Each component can also be plotted individually.
decomposed_data <- decompose(data_1, type='multiplicative')
plot(decomposed_data)
Another method is the Seasonal Decomposition of Time Series by Loess(STL). Loess - Locally Estimated Scatterplot Smoothing
mydata <- JohnsonJohnson
mydata_stl <- stl(mydata, s.window = "periodic")
plot(mydata_stl)
mydata_stl2 <- stl(data_2, s.window = "periodic")
plot(mydata_stl2)
The longer the bar of a component, the lesser it’s effect on the data. Looking at the bar in the plot above, the data has more trend than seasonality, because the bar for the trend component is lesser than that of season which is longer. Note: ARIMA model can be used for this data since the trend as more effect, if its the seasonal we use SARIMA model.
To extract the each component and plot individually.
#Extracting the trend from the stl
stltrend <- mydata_stl$time.series[,"trend"]
plot (stltrend, ylab="Trend Component", xlab="Time")
#Extracting the seasonal from the stl
stlseasonal <- mydata_stl$time.series [, "seasonal"]
plot (stlseasonal, ylab="seasonal Component", xlab="Time")
#Extracting the remainder from the stl
stlremainder <- mydata_stl$time.series [, "remainder"]
plot (stlremainder, ylab="Remainder Component", xlab="Time")