The first part of the assignment involves forecasting personal consumption expenditures from the seasonally adjusted US data provided in PCE.csv. The data is used to fit three forecasting models: the drift method, Holt's linear exponential smoothing, and ARIMA.
The forecasting accuracy of each model is discussed in detail in the subsequent sections, and the best model is used to estimate personal consumption expenditure for October 2024. In addition, one-step-ahead rolling forecasts are produced for all three models, without re-estimating their parameters, to compare the models against one another.
Loading the data into a data frame and inspecting it shows that the data set runs from January 1959 to November 2023, giving 779 monthly readings spanning just under 65 years.
## DATE PCE
## 1 01/01/1959 306.1
## 2 01/02/1959 309.6
## 3 01/03/1959 312.7
## 4 01/04/1959 312.2
## 5 01/05/1959 316.1
## 6 01/06/1959 318.2
## DATE PCE
## 774 01/06/2023 18485.4
## 775 01/07/2023 18595.4
## 776 01/08/2023 18651.6
## 777 01/09/2023 18791.5
## 778 01/10/2023 18812.2
## 779 01/11/2023 18858.9
A time-series object is then created from the start date with a frequency of 12 observations per year. The plot of the series shows a non-linear trend: expenditure appears to have grown almost exponentially over the years. There is a sudden dip in 2020, which can be attributed to the global pandemic; apart from that, the series follows the same non-linear trend throughout, and, as already mentioned, the data is seasonally adjusted. The series also contains missing (NA) values, which need to be imputed before further analysis.
## [1] "Number of Null values : 43"
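As a minimal sketch of the step above, the following base R snippet builds a monthly time-series object and counts its missing values. The values are synthetic stand-ins rather than the contents of PCE.csv, and the names `pce_values` and `pce_ts` are assumptions made for illustration.

```r
# Synthetic stand-in for the PCE column: 24 monthly values with two gaps.
# These numbers are made up for illustration; they are not the real data.
pce_values <- round(300 * 1.005^(0:23), 1)
pce_values[c(4, 9)] <- NA

# Monthly series starting in January 1959 (12 observations per year).
pce_ts <- ts(pce_values, start = c(1959, 1), frequency = 12)

# Count the missing observations that will need imputing.
n_missing <- sum(is.na(pce_ts))
print(paste("Number of missing values :", n_missing))

# Inspect the period covered by the series.
print(start(pce_ts))
print(end(pce_ts))
```

Only the start date and the frequency are needed here; R derives the end date from the number of observations.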
When filling missing values in a time series, it is essential to select an imputation method that can capture the complexity and dynamics of the data. The commonly used imputation techniques in the imputeTS package are as follows:
In this assignment, the interpolation function is used for imputation with option = “stine”, since Stineman interpolation can provide better estimates when the data has a non-linear trend.
Since the non-linear trend is evident from the time-series plot, multiplicative decomposition is used to visualise the underlying components. The decomposition plot shows that:
Next, the ACF plot is used to learn more about the time series. The slow decay of the autocorrelations suggests that the trend component strongly influences the data and must be accounted for when choosing a model in order to forecast effectively. It also indicates that the series is non-stationary.
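To illustrate what interpolation-based imputation does, here is a small base R sketch. It uses linear interpolation through `approx()` as a simpler stand-in, not the Stineman method used in the report; with imputeTS installed, the corresponding call would presumably be `na_interpolation(x, option = "stine")`. The toy series and variable names are assumptions.

```r
# Toy monthly series with a smooth upward trend and two gaps (made-up values).
x <- ts(round(300 * 1.01^(0:11), 2), start = c(1959, 1), frequency = 12)
x[c(4, 9)] <- NA

# Linear interpolation over the time index using base R.
# This is a stand-in for the Stineman option described in the text.
idx <- seq_along(x)
observed <- !is.na(x)
filled <- approx(x = idx[observed], y = as.numeric(x)[observed], xout = idx)$y

# Restore the time-series attributes after filling the gaps.
x_filled <- ts(filled, start = start(x), frequency = frequency(x))

sum(is.na(x_filled))  # no missing values remain
```

Each gap is filled from its nearest observed neighbours, so the imputed points follow the local shape of the series rather than a global mean.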
An 80:20 ratio is used for splitting the data. The last 13 years (156 monthly observations, starting December 2010) are held out for testing (the validation set), and the preceding 623 observations, roughly the first 52 years, are used for training the models. After comparing accuracies, the best model is applied to the entire data set to forecast expenditure for the subsequent 11 periods, up to October 2024.
# Hold out the final 13 years (13 * 12 = 156 monthly observations) for testing
train_TS <- subset(ip_TS, end = length(ip_TS) - 13 * 12)
length(train_TS) # checking train set length
## [1] 623
test_TS <- subset(ip_TS, start = length(ip_TS) - 13 * 12 + 1)
length(test_TS) # checking validation set length
## [1] 156
The drift method is a simple variation of the random walk (naive) forecast, used mainly for short-term forecasting. The forecast equals the last observation adjusted by the average change observed in the historical data, on the assumption that the series will continue to change at the same average rate. This is equivalent to extrapolating the straight line drawn between the first and last observations, so the method can follow a linear trend but cannot capture curvature in the trend or any seasonality.
The drift model's accuracy on the training and test sets is shown below.
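The drift calculation can be sketched in a few lines of base R. The function name `drift_forecast` and the toy series are assumptions made for illustration; the forecast package's `rwf()` with `drift = TRUE` implements the same idea, though the report does not say which function was used.

```r
# Drift forecast: last observation plus h times the average historical change.
drift_forecast <- function(y, h) {
  n <- length(y)
  drift <- (y[n] - y[1]) / (n - 1)  # average change per period
  y[n] + seq_len(h) * drift
}

# Toy series (made-up values, not the PCE data).
y <- c(100, 104, 103, 109, 112)

# Average change is (112 - 100) / 4 = 3 per period.
drift_forecast(y, h = 3)  # 115 118 121
```

Because the average change depends only on the first and last observations, every forecast lies on the straight line through those two points.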
## ME RMSE MAE MPE MAPE MASE
## Training set 2.564585e-14 24.83965 16.72727 -0.8283615 1.131619 0.08248904
## Test set 1.925996e+03 2545.27790 1926.57567 12.5872452 12.591982 9.50073503
## ACF1 Theil's U
## Training set 0.1215274 NA
## Test set 0.9702413 10.38423
Exponential smoothing methods produce forecasts by giving more weight to recent observations than to older ones. Here, Holt's linear method is used to forecast the expenditure series because a trend is present, and simple exponential smoothing cannot account for an underlying trend or seasonality in the data.
Holt's model accuracy on the training and test sets is shown below.
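The level and trend updates behind Holt's linear method can be sketched as follows. This is a hand-written illustration with fixed smoothing parameters; `holt_forecast`, `alpha = 0.8` and `beta = 0.2` are assumed values, whereas a fitted model would estimate the parameters from the training data.

```r
# Holt's linear method with fixed smoothing parameters (illustrative values).
holt_forecast <- function(y, h, alpha = 0.8, beta = 0.2) {
  level <- y[1]         # initial level: first observation
  trend <- y[2] - y[1]  # initial trend: first observed change
  for (t in 2:length(y)) {
    prev_level <- level
    # The new level blends the observation with the previous one-step forecast.
    level <- alpha * y[t] + (1 - alpha) * (prev_level + trend)
    # The new trend blends the latest level change with the previous trend.
    trend <- beta * (level - prev_level) + (1 - beta) * trend
  }
  # The h-step forecast extends the last level along the last trend.
  level + seq_len(h) * trend
}

# On a perfectly linear toy series, the forecasts continue the same line.
y <- c(10, 12, 14, 16, 18)
holt_forecast(y, h = 3)
```

Unlike the drift method, the trend here is updated at every step, so recent changes in slope influence the forecast more than older ones.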
## ME RMSE MAE MPE MAPE MASE
## Training set 0.4355405 22.48205 12.3941 0.02469141 0.3976693 0.06112038
## Test set 571.1309814 1151.49047 649.7501 3.30069373 3.9445167 3.20418408
## ACF1 Theil's U
## Training set -0.01424022 NA
## Test set 0.95696315 4.420796
The AutoRegressive Integrated Moving Average (ARIMA) method is a statistical forecasting approach in which the series is predicted from a linear combination of its own past values and past forecast errors. The autoregressive and moving-average parts assume a stationary series, so a non-stationary series is first differenced until it becomes stationary. The model is denoted ARIMA(p, d, q), where p is the order of the autoregressive part, d is the number of times the series is differenced, and q is the order of the moving-average part.
Here the auto.arima() function is used to select the model orders for the training data. The model summary below shows that the selected orders are (3,2,2).
## Series: train_TS
## ARIMA(3,2,2)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## 0.4622 0.1941 0.0650 -1.5322 0.5413
## s.e. 0.1633 0.0445 0.0576 0.1604 0.1562
##
## sigma^2 = 494.7: log likelihood = -2806.23
## AIC=5624.46 AICc=5624.6 BIC=5651.05
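To make the role of d = 2 concrete, the sketch below differences a simulated series twice and fits a small ARIMA with base R's `arima()`. The simulated data, the seed and the orders (1, 2, 1) are assumptions chosen for illustration; the report itself selects its orders with `auto.arima()` on the PCE training data.

```r
set.seed(42)

# Simulate a toy series that needs two rounds of differencing (not the PCE data).
y <- arima.sim(model = list(order = c(1, 2, 1), ar = 0.5, ma = 0.3), n = 200)

# d = 2 means the ARMA part is fitted to the twice-differenced series.
d2 <- diff(y, differences = 2)
length(y) - length(d2)  # two observations are lost to differencing

# Fit an ARIMA with fixed orders p = 1, d = 2, q = 1 by maximum likelihood.
fit <- arima(y, order = c(1, 2, 1), method = "ML")
fit$coef  # estimated ar1 and ma1 coefficients

# Forecast three steps ahead on the original (undifferenced) scale.
predict(fit, n.ahead = 3)$pred
```

The fitted coefficients describe the differenced series, while `predict()` integrates the forecasts back to the original scale.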
The snippet below shows the ARIMA(3,2,2) model's accuracy on the training and test sets.
## ME RMSE MAE MPE MAPE MASE
## Training set 1.258107 22.116 12.38468 0.06718656 0.4055402 0.06107394
## Test set 1020.890220 1592.717 1042.39499 6.35958873 6.5328026 5.14047736
## ACF1 Theil's U
## Training set -0.005215809 NA
## Test set 0.963688415 6.239238
The models are evaluated using the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE). Below is a visualisation of these metrics for the three models on the test data:
The plots show that the Holt model performs best on the test data: its test RMSE is 1151.49 and its MAE is 649.75, compared with 1592.72 and 1042.39 for ARIMA(3,2,2) and 2545.28 and 1926.58 for the drift model. Plotting the predictions over the test window makes this clearer still.
From the above plots, exponential smoothing with the Holt model performs best on the given data and is therefore used to estimate expenditure for October 2024. When interpreting the performance of the models, the following points must be considered:
The point forecast for October 2024 is 19566.92 USD. The full forecast window is given below, along with the 80% and 95% prediction intervals.
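For reference, the two metrics can be computed as in the sketch below. The helper names `rmse` and `mae` and the toy vectors are assumptions; the `test_scores` table simply copies the test-set RMSE and MAE values reported in the accuracy tables of this report.

```r
# RMSE penalises large errors more heavily; MAE weights all errors equally.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Toy example (made-up values).
actual    <- c(100, 110, 120, 130)
predicted <- c(102, 108, 123, 126)
rmse(actual, predicted)  # sqrt(33 / 4), about 2.87
mae(actual, predicted)   # 11 / 4 = 2.75

# Test-set scores copied from the accuracy tables in this report.
test_scores <- data.frame(
  model = c("Drift", "Holt", "ARIMA(3,2,2)"),
  RMSE  = c(2545.27790, 1151.49047, 1592.717),
  MAE   = c(1926.57567, 649.7501, 1042.39499)
)

# Rank the models from lowest to highest test RMSE.
test_scores[order(test_scores$RMSE), ]
```

Both metrics are in the units of the series, which makes them directly comparable across models fitted to the same data.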
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Dec 2023 18923.27 18805.22 19041.31 18742.74 19103.80
## Jan 2024 18987.63 18819.53 19155.74 18730.54 19244.73
## Feb 2024 19052.00 18844.67 19259.33 18734.91 19369.08
## Mar 2024 19116.36 18875.29 19357.44 18747.67 19485.06
## Apr 2024 19180.73 18909.32 19452.14 18765.65 19595.81
## May 2024 19245.09 18945.73 19544.46 18787.25 19702.94
## Jun 2024 19309.46 18983.88 19635.04 18811.52 19807.40
## Jul 2024 19373.83 19023.38 19724.27 18837.86 19909.79
## Aug 2024 19438.19 19063.95 19812.43 18865.83 20010.55
## Sep 2024 19502.56 19105.39 19899.72 18895.14 20109.97
## Oct 2024 19566.92 19147.56 19986.29 18925.56 20208.29
Rolling forecasts are an effective way of comparing forecasting models on a single set of training data. For this task, the existing models are applied to the entire expenditure data set, with the parameters estimated on the training data left unchanged, and the one-step-ahead forecasts from December 2010 onwards (as per the train-test split) are compared with the test data for accuracy. The rolling forecasts are obtained with the fitted() function, available through the fpp package.
The accuracy of the one-step-ahead forecasts from the existing models is shown below.
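The idea of rolling one-step-ahead forecasts without re-estimation can be sketched in base R with the drift method, since its only parameter is the average change. The toy series and variable names are assumptions. For ARIMA models, the forecast package supports a similar pattern by passing a previously fitted model to `Arima()` through its `model` argument, although the report does not show its exact calls.

```r
# Toy series (made-up values): first 5 points for training, last 3 for testing.
y <- c(100, 104, 103, 109, 112, 115, 121, 120)
n_train <- 5
train <- y[1:n_train]

# Estimate the drift once on the training data and hold it fixed afterwards.
drift <- (train[n_train] - train[1]) / (n_train - 1)

# Each one-step forecast uses the actual value from the previous period,
# so the forecast origin rolls forward through the test set.
test_idx <- (n_train + 1):length(y)
one_step <- y[test_idx - 1] + drift

# Compare the rolling forecasts with the held-out observations.
errors <- y[test_idx] - one_step
sqrt(mean(errors^2))  # one-step-ahead RMSE
mean(abs(errors))     # one-step-ahead MAE
```

Because every forecast starts from the most recent actual value, one-step-ahead errors are typically far smaller than errors from a single multi-step forecast over the same window.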
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 30.16155 201.3577 75.68835 0.1880269 0.541034 0.1829177 0.9762526
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 16.99518 200.1223 69.40865 0.1012731 0.5012335 0.1739243 0.9742215
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8.839923 221.2283 73.83458 0.04964673 0.5357159 0.2610437 1.084094
These results are consistent with the earlier finding that the Holt model performs best on the given data, although the one-step-ahead errors of the three models are much closer to one another than the multi-step errors (RMSE between roughly 200 and 221).
In this task, three different models were used to forecast seasonally adjusted US personal consumption expenditure, and Holt's exponential smoothing was found to work best on the given data. As found in the course of the analysis, the data has a non-linear trend, is seasonally adjusted and is non-stationary. These attributes make it difficult for the drift and ARIMA models to forecast the test window well. Both the RMSE and MAE scores support this inference. Of the three models considered, the Holt model is therefore expected to give the most reliable estimate for October 2024.
For the second task, a set of online hotel reviews is provided along with their ratings on a scale of 1 to 5, with 1 denoting low satisfaction and 5 denoting high satisfaction. The task is to design a text analysis model that identifies the factors discussed in positive and negative reviews respectively. The subsequent sections discuss in detail how reviews are classed as positive or negative, the steps of the analysis, and the criteria used to decide the number of topics. Finally, the topics are labelled in order to identify the top factors that drive customer satisfaction or grievances.
The entire data set of 10000 hotel reviews in HotelsData.csv is loaded into a data frame, and the following steps are then executed to prepare the data:
## Review.score
## 2175 4
## Text.1
## 2175 We were there for 3 nights. The hotel is located in a very nice aneighborhood but a bit of a walk from public transportation. It's a lovely old building with lots of character. My room was spacious for London standard and clean and quiet. Bedding was comfortable. Bathroom was spacious but water pressure was limited. Need to wait 10 minutes in between flush. Staff was friendly and breakfast was excellent. Wi-fi didn't work very well. Despite some limitations, I would recommend this place to others and stay here again especially if they upgrade their wi-fi.
## Review.score
## 7872 1
## Text.1
## 7872 (1)Dirty; housekeeping doesn't clean the room well. (2)We were in a triple share room and everyday they only gave us two towels despite following up everyday repeatedly. (3)Takes forever to call any housekeeping - always leaving us on hold
After separating the positive and negative reviews, the respective data sets contain around 1760 and 240 reviews. Next, a term-frequency document-term matrix is created; it serves as the input from which the topic model learns which words frequently occur together in a document. The steps are broken down in the following points.
Two corpora are created from the review data sets after converting the document contents to UTF-8 encoding, as the text contains some characters that the tm package cannot handle.
Next, the DocumentTermMatrix() function is used to form the matrix from the text data. Through its control options it also handles cleaning steps such as converting all tokens to lowercase and removing punctuation, numbers and stopwords. Inflected forms are not merged in this step: the term lists below still contain both “room” and “rooms”, for example.
The output of DocumentTermMatrix() is then converted to a matrix to find the frequencies of the most used words, and these frequencies are used to plot the word clouds. The top 10 words in the positive and negative review corpora are shown below, followed by the respective word clouds.
## hotel room staff london good stay great breakfast
## 2636 2219 1352 1145 1090 1059 973 922
## location rooms
## 818 701
## room hotel staff one breakfast stay good bed
## 539 465 146 142 137 135 133 124
## london rooms
## 124 118
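The cleaning and counting steps can be illustrated with a small base R sketch. It is a hand-rolled stand-in for what the tm package does, not the report's code; the three example reviews, the short stopword list and the helper `tokenise` are all made up for illustration.

```r
# Three made-up reviews standing in for the hotel review corpus.
reviews <- c(
  "The room was clean and the staff were friendly.",
  "Dirty room, and the staff kept us on hold!",
  "Great location; friendly staff, excellent breakfast."
)

# A tiny illustrative stopword list (tm ships a much longer one).
stop_words <- c("the", "was", "and", "were", "us", "on")

# Lowercase, strip punctuation and digits, split on whitespace, drop stopwords.
tokenise <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]|[[:digit:]]", " ", text)
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% stop_words]
}

tokens_by_doc <- lapply(reviews, tokenise)
vocab <- sort(unique(unlist(tokens_by_doc)))

# Document-term matrix: one row per review, one column per term.
dtm <- t(sapply(tokens_by_doc, function(tok) table(factor(tok, levels = vocab))))

# Overall term frequencies, most frequent first.
term_freq <- sort(colSums(dtm), decreasing = TRUE)
head(term_freq, 3)
```

Summing the columns of the matrix gives the corpus-wide frequencies that feed the top-10 lists and word clouds.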
Topic modelling uses a probabilistic model to determine how strongly a particular word is associated with a certain topic, based on the co-occurrence of words in the documents. In R, the ldatuning and topicmodels libraries are used for this purpose. There are two major steps.
Using the ldatuning library, the optimal number of topics (k) for LDA to generate from the review set is decided. Three of the criteria available in the library are chosen here. The code evaluates the Arun2010 and CaoJuan2009 criteria, which should be minimised, and the Griffiths2004 criterion, which should be maximised, over a range of 5 to 20 possible topics, and the optimal number of topics is decided by graphical inspection.
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
This step generates the list of topics covered by the documents and groups the documents by the topics found. The LDA() function is run with 1000 iterations for both corpora, taking the number of topics (k) decided in the previous step as input.
To label the topics, the terms() function is applied to the output of LDA() and the top 10 terms for each topic are inspected. The labels are based on the predominant themes conveyed by those top 10 words.
Referring to the top 10 terms in each topic of the positive reviews, the labels are as follows:
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "staff" "london" "get" "room" "great" "good"
## [2,] "friendly" "rooms" "like" "shower" "stay" "breakfast"
## [3,] "helpful" "hotel" "can" "bathroom" "location" "food"
## [4,] "clean" "hotels" "even" "bed" "stayed" "clean"
## [5,] "stay" "stayed" "quite" "small" "definitely" "comfortable"
## [6,] "comfortable" "well" "want" "desk" "service" "value"
## [7,] "excellent" "always" "people" "one" "recommend" "price"
## [8,] "extremely" "business" "little" "water" "hotel" "location"
## [9,] "pleasant" "trip" "one" "door" "will" "modern"
## [10,] "stayed" "staying" "really" "front" "staff" "quality"
## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12
## [1,] "breakfast" "room" "hotel" "hotel" "room" "walk"
## [2,] "tea" "check" "stay" "nice" "night" "station"
## [3,] "well" "day" "really" "street" "just" "tube"
## [4,] "also" "went" "time" "park" "one" "close"
## [5,] "free" "arrived" "everything" "around" "two" "easy"
## [6,] "nice" "asked" "will" "rooms" "floor" "minutes"
## [7,] "lovely" "early" "next" "well" "bed" "restaurants"
## [8,] "room" "first" "much" "small" "nights" "walking"
## [9,] "lounge" "said" "back" "near" "got" "distance"
## [10,] "etc" "told" "visit" "quiet" "reception" "minute"
## Topic 13 Topic 14
## [1,] "hotel" "service"
## [2,] "london" "perfect"
## [3,] "area" "lovely"
## [4,] "big" "experience"
## [5,] "location" "wonderful"
## [6,] "city" "staff"
## [7,] "also" "made"
## [8,] "view" "special"
## [9,] "large" "amazing"
## [10,] "access" "best"
Referring to the top 10 terms in each topic of the negative reviews, the labels are as follows:
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
## [1,] "hotel" "breakfast" "one" "room" "room" "london" "reception"
## [2,] "will" "good" "even" "door" "night" "hotel" "told"
## [3,] "well" "didnt" "made" "area" "floor" "stayed" "asked"
## [4,] "time" "small" "desk" "dirty" "get" "hotels" "manager"
## [5,] "days" "location" "going" "looked" "couldnt" "just" "check"
## [6,] "many" "close" "front" "however" "hot" "can" "said"
## [7,] "now" "people" "well" "next" "sleep" "little" "went"
## [8,] "problem" "enough" "price" "stayed" "bad" "station" "left"
## [9,] "phone" "poor" "just" "just" "windows" "found" "one"
## [10,] "small" "coffee" "feel" "stay" "really" "tube" "morning"
## Topic 8 Topic 9 Topic 10 Topic 11
## [1,] "rooms" "bed" "staff" "room"
## [2,] "back" "shower" "stay" "service"
## [3,] "bar" "water" "clean" "booked"
## [4,] "like" "bathroom" "nice" "also"
## [5,] "stay" "tiny" "place" "like"
## [6,] "got" "room" "great" "first"
## [7,] "get" "need" "friendly" "see"
## [8,] "never" "really" "helpful" "much"
## [9,] "put" "nothing" "walk" "beds"
## [10,] "around" "think" "away" "know"
From the above labels, it can be inferred that some of the topics are interrelated, but the reviews are mostly concerned with overall service quality, stay experience, location and cleanliness of the hotels. To determine the top factors governing the nature of a review, further analysis is done in the next section.
To determine the factors that affect customer satisfaction and grievances, each original review is assigned to one of the topics modelled in the previous section. The counts per topic label are then compared to understand the relevance of those factors across the review set and to determine the top 3 factors for both positive and negative reviews.
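The assignment and counting step can be sketched as follows, assuming each review is assigned to its most probable topic. The probability matrix `gamma` and the topic labels are hypothetical; in topicmodels, such probabilities would typically come from the fitted model (for example through `posterior()`), and `topics()` returns the most likely topic per document.

```r
# Hypothetical document-topic probabilities: one row per review,
# one column per labelled topic. Values and labels are made up.
gamma <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.6, 0.3,
    0.5, 0.3, 0.2,
    0.2, 0.2, 0.6,
    0.8, 0.1, 0.1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Staff and service", "Location", "Room and bathroom"))
)

# Assign each review to the topic with the highest probability.
dominant <- colnames(gamma)[apply(gamma, 1, which.max)]

# Count reviews per topic and rank the topics by how often they dominate.
topic_counts <- sort(table(dominant), decreasing = TRUE)
topic_counts
```

Ranking the counts in this way gives the most discussed factors for a review set; the same procedure is applied separately to the positive and negative reviews.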
As visualised in the count plot below, the topics most discussed in positive reviews are:
As visualised in the count plot below, the topics most discussed in negative reviews are:
In the course of this text analysis task, a variety of transformations were carried out on the online hotel review data to understand the factors affecting customer satisfaction. While the words “hotel” and “room” were the most frequent in the token corpus, the broader subjects of discussion across the reviews were found to be different. Across both the positive and negative review sets, accessibility to transport and location was a common key topic of interest, alongside overall quality of service and cleanliness. It can be concluded that the location of a hotel and the kind of hospitality offered to guests are the most prominent factors captured in the online reviews.