Instagram is one of the most popular social media platforms among millennials right now. Here, I analyze and forecast my friend's Instagram activity. The data comes from Instagram itself, via Settings -> Security -> Download Data, as shown in the following picture:
A time series is a sequence of observations whose values depend on time. Predicting future values from the values in previous periods is called forecasting. Data formatted as a time series (ts) object must have the following characteristics:
* no missing intervals
* no missing values
* data should be ordered by time
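The three requirements above can be checked and repaired with base R alone. The following is a minimal sketch on a made-up data frame (the column names mirror the ones used later, but the values are hypothetical); it assumes that a day without any record means zero likes.

```r
# Hypothetical daily counts, unordered, with one calendar day (2017-01-03) missing.
likes <- data.frame(
  Date = as.Date(c("2017-01-04", "2017-01-01", "2017-01-02", "2017-01-05")),
  total_likes = c(3, 5, 2, 4)
)

# Build the complete daily calendar and left-join the data onto it,
# so that missing intervals become explicit rows. merge() also sorts by Date.
calendar <- data.frame(Date = seq(min(likes$Date), max(likes$Date), by = "day"))
likes_full <- merge(calendar, likes, by = "Date", all.x = TRUE)

# Replace missing values (assumption: no record means zero likes).
likes_full$total_likes[is.na(likes_full$total_likes)] <- 0

likes_full
```

The padr functions used later in this article (`pad()` and `fill_by_value()`) do the same job in a pipeline.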
A ts object can be decomposed into 3 main components, which are estimated and then used for forecasting. These components are:
- trend (T) : the global movement of the mean over the whole interval
- seasonal (S) : the pattern that repeats in every seasonal period
- error (E) : the pattern or value that cannot be captured by either trend or seasonal.
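As a small illustration of these components, the sketch below builds a synthetic monthly series (not the Instagram data) and checks that, for an additive decomposition, trend + seasonal + error reproduces the original series wherever the trend is defined.

```r
# Synthetic monthly series: linear trend + fixed seasonal pattern + noise.
set.seed(1)
n_years <- 6
trend <- seq(10, 30, length.out = 12 * n_years)
seasonal <- rep(c(-3, -2, 0, 2, 4, 5, 4, 2, 0, -2, -4, -6), times = n_years)
y <- ts(trend + seasonal + rnorm(12 * n_years, sd = 0.5),
        start = c(2015, 1), frequency = 12)

parts <- decompose(y, type = "additive")

# Additive model: y = T + S + E (the trend is NA at both ends of the series).
rebuilt <- parts$trend + parts$seasonal + parts$random
max_gap <- max(abs(rebuilt - y), na.rm = TRUE)
max_gap
```

`max_gap` should be zero up to floating-point rounding, because `decompose()` defines the error term as whatever is left after removing trend and seasonal.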
There are several ways to forecast a time series:
1. Naive method
2. Simple Moving Average (SMA)
3. Exponential Smoothing
- Simple exponential smoothing (SES) : smoothing error
- Double exponential smoothing (Holt) : smoothing error and trend
- Triple exponential smoothing (Holt-Winters) : smoothing error, trend, and seasonal
4. ARIMA
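To make the simplest of these methods concrete, here is a hand-written sketch of the naive method, a simple moving average, and simple exponential smoothing on a short made-up series. The window size `k` and smoothing weight `alpha` are arbitrary choices for illustration, not values from this analysis.

```r
# Hypothetical short series of daily like counts (no trend, no seasonality).
y <- ts(c(12, 15, 11, 14, 13, 16, 12, 15, 14, 13))
n <- length(y)

# 1. Naive method: the forecast equals the last observed value.
naive_fc <- y[n]

# 2. Simple moving average: mean of the last k observations.
k <- 3
sma_fc <- mean(y[(n - k + 1):n])

# 3. Simple exponential smoothing, written out by hand:
#    level_t = alpha * y_t + (1 - alpha) * level_{t-1}
alpha <- 0.5
level <- y[1]
for (t in 2:n) level <- alpha * y[t] + (1 - alpha) * level
ses_fc <- level

c(naive = naive_fc, sma = sma_fc, ses = ses_fc)
```

Holt and Holt-Winters extend the same recursion with trend and seasonal terms; in base R they are available through `HoltWinters()`, and ARIMA through `arima()`.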
Forecasting can be done with the forecast() function from the forecast package. Models can then be evaluated by comparing their prediction errors.
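Comparing prediction errors usually means computing a summary metric on a held-out period. The sketch below writes out RMSE and MAE by hand for two hypothetical forecasts; the helper names `rmse` and `mae` and all numbers are made up for illustration.

```r
# Hypothetical actual values and two competing forecasts for the same days.
actual     <- c(10, 12, 9, 14, 11)
forecast_a <- c(11, 11, 10, 13, 12)
forecast_b <- c(14, 8, 13, 10, 15)

# Root mean squared error: square root of the average squared error.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
# Mean absolute error: average absolute error.
mae <- function(pred, obs) mean(abs(pred - obs))

data.frame(
  model = c("A", "B"),
  RMSE  = c(rmse(forecast_a, actual), rmse(forecast_b, actual)),
  MAE   = c(mae(forecast_a, actual), mae(forecast_b, actual))
)
```

The model with the lower error on the test period is preferred; here forecast A is off by 1 on every day and forecast B by 4.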
There are two assumptions to check on the residuals of a time series model:
1. Normality : shapiro.test
- H0 : residuals are normally distributed
- H1 : residuals are not normally distributed
2. Autocorrelation : Box.test - Ljung-Box
- H0 : no autocorrelation in the forecast errors
- H1 : there is autocorrelation in the forecast errors
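Both checks are available in base R. The sketch below runs them on simulated residuals (independent normal draws, so both null hypotheses hold by construction); with a real model the residuals of the fitted object would be passed instead.

```r
set.seed(42)
# Hypothetical residuals: independent draws from a normal distribution.
res <- rnorm(200)

normality <- shapiro.test(res)
autocorr  <- Box.test(res, lag = 1, type = "Ljung-Box")

# p-value > 0.05: fail to reject H0; p-value <= 0.05: reject H0.
c(shapiro_p = normality$p.value, ljung_box_p = autocorr$p.value)
```

Note that `Box.test()` runs the Box-Pierce test unless `type = "Ljung-Box"` is given.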
Some data have multiple seasonal patterns and should be handled differently, for example with a seasonal time series approach (Seasonal ARIMA, etc.).
# Solution
## Import Library
library(lubridate) # to deal with dates
library(tidyverse) # for data wrangling
library(dplyr) # for data wrangling
library(ggplot2) # for basic EDA
library(TSstudio) # time series library
library(padr) # for padding
library(ggfortify) # for autoplot() of time series objects
library(forecast) # for forecasting
library(tseries) # for adf.test
library(gridExtra) # for grid.arrange()
library(MLmetrics) # for calculating error
The data was obtained from Instagram and contains Date and IG_account, the accounts liked by intan_ohana's account from 2017-01-01 until 2020-03-11.
## 'data.frame': 16384 obs. of 2 variables:
## $ Date : Factor w/ 1128 levels "","1/1/2017",..: 498 498 498 498 494 494 494 494 494 494 ...
## $ IG_account: Factor w/ 999 levels "","_citz","_katie_may",..: 881 680 448 448 121 834 216 797 262 900 ...
## Date IG_account
## 339 0
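The import and cleaning code is not shown, but the outputs above suggest that the raw export has 16384 rows, that Date arrives as text, and that 339 rows have no date (16384 - 339 = 16045). A minimal sketch of such a cleaning step on a made-up miniature follows; the month/day/year format is an assumption based on the level "1/1/2017", and the rows are hypothetical.

```r
# Hypothetical miniature of the raw export: dates arrive as text (factor),
# and some rows have an empty Date.
raw <- data.frame(
  Date = c("3/11/2020", "", "1/1/2017", "3/11/2020"),
  IG_account = c("9gag", "dagelan", "retnohening", "9gag"),
  stringsAsFactors = TRUE
)

# Assumption: dates are month/day/year. as.Date() returns NA for "".
raw$Date <- as.Date(as.character(raw$Date), format = "%m/%d/%Y")

colSums(is.na(raw))               # count missing values per column
clean <- raw[!is.na(raw$Date), ]  # keep only rows with a valid date
nrow(clean)
```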
The intan_2017 data consists of 16045 observations and 2 variables. The description of each feature is explained below:
- Date : the date when Intan liked a post from IG_account.
- IG_account : the IG account liked by Intan.
As a data scientist, I will develop a model that forecasts the number of likes Intan will give. Based on our data, we want to forecast the number of likes given by Intan on each date, which is why we need to make a new variable (total_likes).
# Top IG accounts liked by Intan
intan_likes <- intan_2017 %>%
  group_by(IG_account) %>%
  summarise(likes_peraccount = n()) %>%
  ungroup() %>%
  mutate(IG_account = as.factor(IG_account))
intan_likes_arrange <- intan_likes %>%
  arrange(desc(likes_peraccount)) %>% # arrange() sorts ascending by default
  head(10)
plot_top_likes <- ggplot(data = intan_likes_arrange,
                         aes(x = reorder(IG_account, likes_peraccount),
                             y = likes_peraccount, label = likes_peraccount)) +
  geom_col(aes(fill = likes_peraccount), show.legend = TRUE) +
  coord_flip() +
  theme_bw() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 18, colour = "black")) +
  geom_label(aes(fill = likes_peraccount),
             colour = "white",
             fontface = "bold",
             size = 5,
             position = position_stack(0.8)) +
  labs(title = "Total Likes given by Intan",
       subtitle = "Top 10 IG accounts liked by Intan from 2017-2020",
       x = "IG_Account",
       y = "Total Likes")
plot_top_likes
9gag is the IG account that received the most likes from Intan over the past 3 years, with 1374 likes, followed by retnohening (910 likes) and dagelan (582 likes).
# group by date and fill the missing dates with 0 likes
intan1 <- intan_2017 %>%
  group_by(Date) %>%
  summarise(total_likes = n()) %>%
  pad() %>%
  fill_by_value(total_likes, value = 0)
colSums(is.na(intan1))
## Date total_likes
## 0 0
## [1] "2017-01-01" "2020-03-11"
In this step I converted the data into ts format.
intan_ts <- ts(data = intan1$total_likes, start = c(2017,1), frequency = 365)
class(intan_ts) # check the data class
## [1] "ts"
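With `frequency = 365`, every observation advances the time index by 1/365 of a year. The sketch below uses a stand-in series of the same length as the data (1166 days, all zeros, not the real counts) to show how an observation number maps to the fractional-year labels that appear in the forecast tables later in this article.

```r
# Stand-in for the daily counts: 1166 zeros, starting on day 1 of 2017.
x <- ts(rep(0, 1166), start = c(2017, 1), frequency = 365)

time(x)[1]      # first observation sits at 2017
time(x)[1107]   # observation 1107 sits at 2017 + 1106/365

# The same observation as a calendar date (1106 days after 2017-01-01):
as.Date("2017-01-01") + 1106   # 2020-01-12
```

Observation 1107 has index 2017 + 1106/365, roughly 2020.0301, which matches the first row of the forecast tables. Because 2020 is a leap year while the index assumes 365 days per year, these labels only approximate calendar positions.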
# inspect data trend
intan_ts %>%
  ts_plot(title = "Total Likes Given By Intan from January-1-2017 until March-11-2020")
After creating the time series object, I inspected its components. I wanted to look at the trend and seasonality patterns in order to choose an appropriate model for forecasting intan_ts. I used decompose() to extract the trend, seasonality, and error of the series and visualized them using autoplot().
There is a decreasing trend from the second half of 2017 until the end of 2019.
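The decomposition call itself is not shown above. As a hedged sketch of what such a step looks like, the code below decomposes a synthetic daily series with a yearly period of 365 (not the Instagram data) and inspects the result.

```r
set.seed(7)
# Synthetic daily series: three "years" of 365 days with a downward trend
# and a yearly sine pattern.
n <- 365 * 3
day <- seq_len(n)
y <- ts(30 - 0.01 * day + 5 * sin(2 * pi * day / 365) + rnorm(n),
        start = c(2017, 1), frequency = 365)

parts <- decompose(y)   # additive decomposition by default

# The trend is a centred moving average over one full period, so it is
# undefined (NA) for the first and last 182 days.
sum(is.na(parts$trend))

# One seasonal value per position within the period.
length(parts$figure)
```

Calling `plot(parts)` draws the observed, trend, seasonal, and random panels with base graphics. With only about three periods of data, the estimated trend covers just the middle of the series, which is worth keeping in mind when reading a trend off the plot.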
# hold out the last 60 days as test data using `tail()`
intan_test <- tail(intan_ts, 60)
intan_train <- head(intan_ts,
                    length(intan_ts) - length(intan_test))
length(intan_ts)
## [1] 1166
## [1] 1106
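Whether `head()` and `tail()` keep the time attributes of a ts object can depend on the R version and on which packages are loaded. A version-independent alternative is `window()`, sketched here on a stand-in series of the same length (the values are just 1 to 1166, not the real counts).

```r
x <- ts(seq_len(1166), start = c(2017, 1), frequency = 365)
h <- 60            # number of days held out for testing
n <- length(x)

# window() keeps the ts attributes (start, end, frequency) on both pieces.
train <- window(x, end = time(x)[n - h])
test  <- window(x, start = time(x)[n - h + 1])

c(train = length(train), test = length(test))
```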
The decomposition shows both trend and seasonality, so I fit two models that can handle both: Holt-Winters and Seasonal ARIMA.
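The objects `intan_holt` and `intan_auto` used in the next step are not defined in the code shown here. The sketch below illustrates how two such models can be fitted with base R on a synthetic series with a weekly pattern; the data, the seasonal period of 7, and the hand-picked ARIMA orders are all assumptions for illustration, not the author's settings. The name `intan_auto` suggests `forecast::auto.arima()`, which searches for the orders automatically.

```r
set.seed(123)
# Synthetic daily counts with a weekly pattern; a stand-in for the training data.
n <- 7 * 30
day <- seq_len(n)
y <- ts(20 - 0.02 * day + 4 * sin(2 * pi * day / 7) + rnorm(n),
        frequency = 7)

# Triple exponential smoothing: level, trend, and seasonal terms.
fit_hw <- HoltWinters(y)

# Seasonal ARIMA with hand-picked orders (assumption, not tuned).
fit_sarima <- arima(y, order = c(0, 1, 1),
                    seasonal = list(order = c(0, 1, 1), period = 7))

# Forecast the next 14 days with each model.
h <- 14
fc_hw <- predict(fit_hw, n.ahead = h)
fc_sarima <- predict(fit_sarima, n.ahead = h)$pred

cbind(holt_winters = fc_hw, sarima = fc_sarima)
```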
# forecast
intan_holt_f <- forecast(intan_holt, h = 60)
intan_arima_f <- forecast(intan_auto, h = 60)
intan_holt_f
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020.0301 18.3538481 6.80628258 29.90141 0.6933706 36.01433
## 2020.0329 20.8903506 9.20189057 32.57881 3.0143936 38.76631
## 2020.0356 4.5543624 -7.27331378 16.38204 -13.5345074 22.64323
## 2020.0384 8.4774078 -3.48786494 20.44268 -9.8218977 26.77671
## 2020.0411 11.4434941 -0.65781078 23.54480 -7.0638546 29.95084
## 2020.0438 14.0949944 1.85916966 26.33082 -4.6180847 32.80807
## 2020.0466 10.0694512 -2.29943046 22.43833 -8.8471209 28.98602
## 2020.0493 13.5987425 1.09822009 26.09926 -5.5191568 32.71664
## 2020.0521 25.1464404 12.51564915 37.77723 5.8293121 44.46357
## 2020.0548 14.9241512 2.16442100 27.68388 -4.5901722 34.43847
## 2020.0575 16.3495020 3.46212279 29.23688 -3.3600437 36.05905
## 2020.0603 13.6104022 0.59662604 26.62418 -6.2924509 33.51326
## 2020.0630 14.9306839 1.79172669 28.06964 -5.1636171 35.02499
## 2020.0658 11.5914627 -1.67149416 24.85442 -8.6924794 31.87540
## 2020.0685 17.4543673 4.06855943 30.84018 -3.0174592 37.92619
## 2020.0712 11.3251008 -2.18244079 24.83264 -9.3329014 31.98310
## 2020.0740 18.2159514 4.58776338 31.84414 -2.6265635 39.05847
## 2020.0767 16.2957701 2.54799446 30.04355 -4.7296383 37.32118
## 2020.0795 14.7003834 0.83405139 28.56672 -6.5063413 35.90711
## 2020.0822 15.6141936 1.63031037 29.59808 -5.7723102 37.00070
## 2020.0849 13.8150540 -0.28540050 27.91551 -7.7497302 35.37984
## 2020.0877 15.8297549 1.61368498 30.04582 -5.9118479 37.57136
## 2020.0904 27.5932206 13.26246792 41.92397 5.6762257 49.51022
## 2020.0932 28.9111473 14.46662245 43.35567 6.8201529 51.00214
## 2020.0959 18.1199078 3.56249980 32.67732 -4.1437265 40.38354
## 2020.0986 20.4922731 5.82285073 35.16170 -1.9426724 42.92722
## 2020.1014 21.0108651 6.23027716 35.79145 -1.5940935 43.61582
## 2020.1041 19.8801946 4.98927089 34.77112 -2.8935080 42.65390
## 2020.1068 19.6723261 4.67187825 34.67277 -3.2688792 42.61353
## 2020.1096 15.2549130 0.14573486 30.36409 -7.8525809 38.36241
## 2020.1123 16.1855250 0.96839355 31.40266 -7.0870693 39.45812
## 2020.1151 18.7686495 3.44432513 34.09297 -4.6678822 42.20518
## 2020.1178 16.0599305 0.62915778 31.49070 -7.5393998 39.65926
## 2020.1205 17.8285530 2.29206137 33.36504 -5.9324605 41.58957
## 2020.1233 20.1348467 4.49335065 35.77634 -3.7867572 44.05645
## 2020.1260 14.7980372 -0.94776311 30.54384 -9.2830863 38.87916
## 2020.1288 16.6855192 0.83610103 32.53494 -7.5540741 40.92511
## 2020.1315 11.3768237 -4.57553921 27.32919 -13.0202100 35.77386
## 2020.1342 11.2000560 -4.85459164 27.25470 -13.3534087 35.75352
## 2020.1370 15.9503396 -0.20594510 32.10662 -8.7585656 40.65924
## 2020.1397 13.6019255 -2.65536104 29.85921 -11.2614487 38.46530
## 2020.1425 11.6221228 -4.73554177 27.97979 -13.3947663 36.63901
## 2020.1452 13.6438002 -2.81363035 30.10123 -11.5256678 38.81327
## 2020.1479 12.9782602 -3.57833509 29.53486 -12.3428672 38.29939
## 2020.1507 16.7409986 0.08582906 33.39617 -8.7308851 42.21288
## 2020.1534 12.9225023 -3.83066158 29.67567 -12.6992508 38.54426
## 2020.1562 10.8806735 -5.96991488 27.73126 -14.8900775 36.65142
## 2020.1589 9.0189101 -7.92854267 25.96636 -16.8999822 34.93780
## 2020.1616 7.6307527 -9.41301401 24.67452 -18.4354391 33.69694
## 2020.1644 -0.8797473 -18.01928667 16.25979 -27.0924108 25.33292
## 2020.1671 12.4961543 -4.73862554 29.73093 -13.8621669 38.85448
## 2020.1699 23.5444862 6.21498919 40.87398 -2.9586924 50.04766
## 2020.1726 15.3563970 -2.06730219 32.78010 -11.2908514 42.00365
## 2020.1753 10.8189143 -6.69848049 28.33631 -15.9716291 37.60946
## 2020.1781 16.9013293 -0.70926264 34.51192 -10.0317469 43.83441
## 2020.1808 13.1141803 -4.58911810 30.81748 -13.9606782 40.18904
## 2020.1836 7.3246893 -10.47083268 25.12021 -19.8912130 34.54059
## 2020.1863 16.6946197 -1.19265038 34.58189 -10.6615992 44.05084
## 2020.1890 15.2560938 -2.72245608 33.23464 -12.2397255 42.75191
## 2020.1918 10.4762305 -7.59313822 28.54560 -17.1584842 38.11095
intan_arima_f
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020.0301 14.54628 4.925006 24.16755 -0.16818884 29.26075
## 2020.0329 15.03610 5.213510 24.85869 0.01374448 30.05846
## 2020.0356 15.09772 5.206607 24.98884 -0.02943367 30.22488
## 2020.0384 15.04768 5.113208 24.98215 -0.14578520 30.24115
## 2020.0411 14.98231 5.011551 24.95307 -0.26664842 30.23126
## 2020.0438 14.91680 4.913955 24.91964 -0.38122912 30.21482
## 2020.0466 14.85400 4.822452 24.88556 -0.48793106 30.19594
## 2020.0493 14.79431 4.737001 24.85162 -0.58701741 30.17564
## 2020.0521 14.73766 4.657221 24.81809 -0.67903970 30.15435
## 2020.0548 14.68390 4.582698 24.78511 -0.76455782 30.13237
## 2020.0575 14.63291 4.513041 24.75277 -0.84409120 30.10990
## 2020.0603 14.58452 4.447893 24.72115 -0.91811407 30.08716
## 2020.0630 14.53862 4.386924 24.69031 -0.98705843 30.06429
## 2020.0658 14.49507 4.329833 24.66030 -1.05131801 30.04145
## 2020.0685 14.45375 4.276343 24.63116 -1.11125204 30.01875
## 2020.0712 14.41455 4.226200 24.60291 -1.16718863 29.99629
## 2020.0740 14.37736 4.179170 24.57556 -1.21942778 29.97416
## 2020.0767 14.34208 4.135038 24.54913 -1.26824412 29.95241
## 2020.0795 14.30861 4.093606 24.52361 -1.31388926 29.93111
## 2020.0822 14.27685 4.054691 24.49901 -1.35659398 29.91030
## 2020.0849 14.24672 4.018123 24.47532 -1.39657013 29.89002
## 2020.0877 14.21814 3.983747 24.45253 -1.43401240 29.87029
## 2020.0904 14.19102 3.951418 24.43062 -1.46909981 29.85114
## 2020.0932 14.16529 3.921002 24.40958 -1.50199719 29.83258
## 2020.0959 14.14088 3.892376 24.38939 -1.53285637 29.81462
## 2020.0986 14.11772 3.865423 24.37003 -1.56181737 29.79727
## 2020.1014 14.09575 3.840039 24.35147 -1.58900940 29.78052
## 2020.1041 14.07491 3.816123 24.33370 -1.61455180 29.76437
## 2020.1068 14.05514 3.793583 24.31669 -1.63855488 29.74883
## 2020.1096 14.03637 3.772334 24.30041 -1.66112068 29.73387
## 2020.1123 14.01857 3.752296 24.28485 -1.68234367 29.71949
## 2020.1151 14.00169 3.733395 24.26998 -1.70231138 29.70569
## 2020.1178 13.98567 3.715561 24.25577 -1.72110496 29.69244
## 2020.1205 13.97047 3.698730 24.24220 -1.73879970 29.67973
## 2020.1233 13.95605 3.682841 24.22925 -1.75546552 29.66756
## 2020.1260 13.94237 3.667839 24.21689 -1.77116737 29.65590
## 2020.1288 13.92939 3.653670 24.20510 -1.78596565 29.64474
## 2020.1315 13.91707 3.640286 24.19386 -1.79991658 29.63406
## 2020.1342 13.90539 3.627640 24.18314 -1.81307247 29.62385
## 2020.1370 13.89431 3.615689 24.17292 -1.82548207 29.61409
## 2020.1397 13.88379 3.604393 24.16319 -1.83719084 29.60477
## 2020.1425 13.87381 3.593715 24.15391 -1.84824117 29.59587
## 2020.1452 13.86435 3.583618 24.14508 -1.85867264 29.58737
## 2020.1479 13.85537 3.574069 24.13667 -1.86852221 29.57926
## 2020.1507 13.84685 3.565038 24.12866 -1.87782444 29.57152
## 2020.1534 13.83877 3.556495 24.12104 -1.88661161 29.56415
## 2020.1562 13.83110 3.548412 24.11379 -1.89491395 29.55711
## 2020.1589 13.82382 3.540764 24.10689 -1.90275975 29.55041
## 2020.1616 13.81692 3.533526 24.10032 -1.91017550 29.54402
## 2020.1644 13.81037 3.526676 24.09407 -1.91718603 29.53794
## 2020.1671 13.80416 3.520191 24.08813 -1.92381462 29.53214
## 2020.1699 13.79827 3.514052 24.08249 -1.93008310 29.52662
## 2020.1726 13.79268 3.508240 24.07711 -1.93601196 29.52137
## 2020.1753 13.78737 3.502737 24.07201 -1.94162044 29.51637
## 2020.1781 13.78234 3.497525 24.06715 -1.94692659 29.51161
## 2020.1808 13.77756 3.492589 24.06254 -1.95194740 29.50708
## 2020.1836 13.77303 3.487915 24.05815 -1.95669883 29.50277
## 2020.1863 13.76874 3.483487 24.05399 -1.96119588 29.49867
## 2020.1890 13.76466 3.479292 24.05003 -1.96545267 29.49477
## 2020.1918 13.76079 3.475318 24.04626 -1.96948249 29.49106
# visualization
plot_model_holt <- autoplot(intan_holt_f, series = "Holtwinters", fcol = "red") +
  autolayer(intan_ts, series = "Actual", color = "black") +
  labs(subtitle = "Likes given by Intan daily from 2017-01-01 until 2020-03-11",
       y = "Total likes") +
  theme_minimal()
# the second plot shows the ARIMA forecast, so it is labelled accordingly
plot_model_arima <- autoplot(intan_arima_f, series = "ARIMA", fcol = "red") +
  autolayer(intan_ts, series = "Actual", color = "black") +
  labs(subtitle = "Likes given by Intan daily from 2017-01-01 until 2020-03-11",
       y = "Total likes") +
  theme_minimal()
grid.arrange(plot_model_holt, plot_model_arima)
Holt-Winters is a better model than auto-ARIMA for this forecast.
# model evaluation: root mean squared error (RMSE)
data.frame(ETS = RMSE(intan_holt_f$mean, intan_test), ARIMA = RMSE(intan_arima_f$mean, intan_test))
1. Normality : shapiro.test
- H0 : residuals are normally distributed
- H1 : residuals are not normally distributed
##
## Shapiro-Wilk normality test
##
## data: intan_holt_f$residuals
## W = 0.95594, p-value = 0.00000000000004202
2. Autocorrelation : Box.test - Ljung-Box
- H0: No autocorrelation in the forecast errors
- H1: there is autocorrelation in the forecast errors
##
## Box-Ljung test
##
## data: intan_holt_f$residuals
## X-squared = 4.1319, df = 1, p-value = 0.04208
##
## Box-Ljung test
##
## data: intan_arima_f$residuals
## X-squared = 0.00033281, df = 1, p-value = 0.9854
Based on the assumption check, the ARIMA model's forecast residuals show no autocorrelation (p-value > 0.05), while the Holt-Winters residuals do show some (p-value = 0.042 < 0.05). The Holt-Winters residuals are also not normally distributed (Shapiro-Wilk p-value < 0.05), so they may not be spread symmetrically around their mean, as seen in the histogram.
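The decision rule behind this reading can be written out explicitly. The sketch below copies the p-values from the test outputs reported above and compares each with a 5% significance level; the vector names are made up for illustration.

```r
# p-values copied from the test outputs reported above.
p_values <- c(
  shapiro_holt_winters   = 0.00000000000004202,
  ljung_box_holt_winters = 0.04208,
  ljung_box_arima        = 0.9854
)

alpha <- 0.05   # significance level
data.frame(
  p_value   = p_values,
  reject_h0 = p_values < alpha   # TRUE means the assumption is violated
)
```

By this rule, the Holt-Winters residuals reject both null hypotheses, while the ARIMA residuals pass the autocorrelation check.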
In a time series, such errors can emerge from various unpredictable events and are largely unavoidable. One strategy to deal with them is to analyze which kinds of unpredictable events occur, and how frequently. This can be done through time series analysis with seasonality adjustment. From that insight, one can develop standard operating procedures and smart strategies for dealing with such events.