Part 1

1. Introduction

The first part of the assignment involves forecasting personal consumption expenditures (PCE) from the seasonally adjusted US data given in PCE.csv. The data is used with three different forecasting models:

Simple forecasting method
Exponential smoothing method
ARIMA model

The forecasting accuracy of each model is evaluated in the subsequent sections, and the best model is used to estimate personal consumption expenditure for October 2024. Further, one-step-ahead rolling forecasting is performed for all three models, without re-estimating the parameters, to compare them on a like-for-like basis.

2. Data description and Preprocessing

On loading the given data into a data-frame and inspecting it, the data-set is found to start in January 1959 and end in November 2023, giving a total of 779 monthly readings spanning nearly 65 years.

##         DATE   PCE
## 1 01/01/1959 306.1
## 2 01/02/1959 309.6
## 3 01/03/1959 312.7
## 4 01/04/1959 312.2
## 5 01/05/1959 316.1
## 6 01/06/1959 318.2

##           DATE     PCE
## 774 01/06/2023 18485.4
## 775 01/07/2023 18595.4
## 776 01/08/2023 18651.6
## 777 01/09/2023 18791.5
## 778 01/10/2023 18812.2
## 779 01/11/2023 18858.9

Next, a time-series object is created using the start date, the end date and a frequency of 12 observations per year. Plotting the time-series shows a non-linear trend: personal expenditure has increased almost exponentially over the years. There is a sudden dip in 2020, which can be attributed to the COVID-19 pandemic; apart from that the series rises smoothly, and no seasonal pattern is expected since the data is seasonally adjusted. The time-series also contains missing (NA) values, which need to be imputed before further analysis.

## [1] "Number of Null values :  43"
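As a minimal sketch of this step, the snippet below builds a monthly time-series object and counts its missing values using base R only. The object names (pce_values, pce_ts) and the values after the first few readings are assumptions made for illustration; they are not the names or data used in the report.

```r
# Stand-in for the PCE column: the first readings follow the data shown above,
# the NA positions and later values are made up for illustration
pce_values <- c(306.1, 309.6, NA, 312.2, 316.1, 318.2, NA,
                320.5, 322.0, 324.1, 325.8, 327.0, 329.4, NA)

# Monthly series starting January 1959 (frequency = 12 observations per year)
pce_ts <- ts(pce_values, start = c(1959, 1), frequency = 12)

# Count the missing observations
n_missing <- sum(is.na(pce_ts))
print(paste("Number of missing values :", n_missing))
```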

2.1 Time-series Imputation

When filling missing values in a time-series, it is essential to select an imputation method that can capture the complexity and dynamics of the data. The commonly used imputation functions in the imputeTS package are as follows:

na_interpolation imputes missing values by interpolation. It supports linear, spline and Stineman (stine) interpolation.
na_ma imputes missing values using a moving average of the neighbouring observations.
na_kalman imputes missing values using Kalman smoothing on a state space model.

In this assignment na_interpolation is used with option = “stine”, since Stineman interpolation can provide better estimates when the data has a non-linear trend.
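The idea of interpolation-based imputation can be illustrated with a small sketch. To stay free of package dependencies it swaps in plain linear interpolation from base R (approx) rather than the Stineman option described above, so it is not the exact method of the report; the toy values are made up.

```r
# Toy monthly series with two gaps (values are made up)
x <- ts(c(100, 104, NA, 112, 117, NA, 128, 134),
        start = c(2000, 1), frequency = 12)

idx <- seq_along(x)
observed <- !is.na(x)

# Linear interpolation across the gaps, using only the observed points
filled <- x
filled[!observed] <- approx(idx[observed], x[observed],
                            xout = idx[!observed])$y

print(filled)
```

With imputeTS installed, the call described in the text would take the place of the approx step, with the series as the first argument and option = "stine".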

2.2 Decomposition and Analysis

Since the non-linear trend of the data is evident from the time-series plot, multiplicative decomposition is used to visualise the underlying components. The decomposition plot shows that:

The trend component rises roughly exponentially with time.
The seasonal component repeats with a constant pattern over time, which is consistent with the data being seasonally adjusted.
The random component shows limited fluctuations over time, with the sudden dip around 2020 reflecting the shock of the global pandemic.

Next, the ACF plot is used to learn more about the time-series. The slow decay of the autocorrelations suggests that the trend component strongly influences the data and must be accounted for when choosing a forecasting model. It also indicates that the time-series is non-stationary.
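The decomposition and ACF checks can be sketched with base R as follows. The series here is synthetic (an exponential trend with mild noise), chosen only to mimic the shape described above; it is not the PCE data, and the object names are assumptions.

```r
set.seed(1)
n <- 240
t_idx <- 1:n

# Synthetic series: exponential trend times mild multiplicative noise
y <- ts(300 * exp(0.005 * t_idx) * exp(rnorm(n, sd = 0.005)),
        start = c(1959, 1), frequency = 12)

# Classical multiplicative decomposition: y = trend * seasonal * random
parts <- decompose(y, type = "multiplicative")

# Autocorrelations for lags 1 to 24, computed without drawing the plot
acf_vals <- acf(y, lag.max = 24, plot = FALSE)$acf[-1]
print(round(acf_vals[c(1, 12, 24)], 3))
```

For a trending series like this the autocorrelations start close to 1 and fall off slowly, which is the pattern read as non-stationarity in the text.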

3. Train-test split

An 80:20 ratio has been used here for splitting the data. Of the 779 monthly observations, the first 623 (January 1959 to November 2010, nearly 52 years) are used for training the models and the remaining 156 (13 years, December 2010 to November 2023) are used for testing (validation set). After comparing the accuracies, the best model is used on the entire data-set to forecast expenditure for the subsequent 11 periods, extending to October 2024.

train_TS <- subset(ip_TS, end = (length(ip_TS) - (13*12)))
length(train_TS) # checking train set length

## [1] 623

test_TS <- subset(ip_TS,start = (length(ip_TS) - (13*12) + 1))
length(test_TS) # checking validation set length

## [1] 156

4. Forecasting using different models

4.1 Simple Forecasting Method - Drift

Drift is a simple forecasting method, a variation of the random walk (naive) method, used mainly for short-term forecasts. The forecast is the last observation adjusted by the average change observed in the historical data, assuming that the time series will continue to change at the same average rate. It does not assume any richer model structure, so it can only extrapolate a straight line and cannot capture a non-linear trend or seasonality.
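Written out, the h-step drift forecast is the last observation plus h times the average change, (last - first) / (n - 1). The hand-written function below is a sketch of that formula on the first six PCE readings shown earlier; the function name drift_forecast is an assumption for illustration and is not the packaged routine used for the results in this report.

```r
# Drift forecast: last value plus h times the average historical change
drift_forecast <- function(y, h) {
  n <- length(y)
  avg_change <- (y[n] - y[1]) / (n - 1)  # slope of the line joining first and last points
  y[n] + avg_change * seq_len(h)
}

y <- c(306.1, 309.6, 312.7, 312.2, 316.1, 318.2)  # first six readings of the data
print(drift_forecast(y, h = 3))
```

Because the same average change is added at every step, the forecasts always lie on a straight line.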

Below is a snippet of the Drift model accuracies on the test-data.

##                        ME       RMSE        MAE        MPE      MAPE       MASE
## Training set 2.564585e-14   24.83965   16.72727 -0.8283615  1.131619 0.08248904
## Test set     1.925996e+03 2545.27790 1926.57567 12.5872452 12.591982 9.50073503
##                   ACF1 Theil's U
## Training set 0.1215274        NA
## Test set     0.9702413  10.38423

4.2 Exponential Smoothing Method - Holt

Exponential smoothing methods produce forecasts by giving more weight to recent observations than to older ones. Here Holt’s linear method is used to forecast the expenditure time-series, because a trend is present and simple exponential smoothing cannot account for a trend in the data.
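Holt's linear method keeps two smoothed quantities, a level and a trend, and updates both at every observation. The sketch below writes the recursions by hand with fixed smoothing parameters; the function name, the parameter values (alpha = 0.8, beta = 0.2) and the simple initialisation are assumptions for illustration, whereas a packaged implementation would estimate them from the data.

```r
# Holt's linear method with fixed smoothing parameters
# alpha smooths the level, beta smooths the trend
holt_forecast <- function(y, h, alpha = 0.8, beta = 0.2) {
  level <- y[1]
  trend <- y[2] - y[1]            # simple initialisation (an assumption of this sketch)
  for (t in 2:length(y)) {
    prev_level <- level
    level <- alpha * y[t] + (1 - alpha) * (prev_level + trend)
    trend <- beta * (level - prev_level) + (1 - beta) * trend
  }
  level + trend * seq_len(h)      # continue the last level along the last trend
}

y <- 100 + 5 * (0:9)              # perfectly linear toy series
print(holt_forecast(y, h = 3))
```

On a perfectly linear series the level and trend are tracked exactly, so the forecasts simply continue the line.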

Below is a snippet of Holt’s model accuracies on the test-data.

##                       ME       RMSE      MAE        MPE      MAPE       MASE
## Training set   0.4355405   22.48205  12.3941 0.02469141 0.3976693 0.06112038
## Test set     571.1309814 1151.49047 649.7501 3.30069373 3.9445167 3.20418408
##                     ACF1 Theil's U
## Training set -0.01424022        NA
## Test set      0.95696315  4.420796

4.3 ARIMA model

The AutoRegressive Integrated Moving Average (ARIMA) method is a statistical forecasting approach in which the time-series is predicted using a linear combination of its past values and past errors. The autoregressive and moving average parts assume a stationary series, so a non-stationary series is first differenced until it becomes stationary. The ARIMA model is denoted as ARIMA(p, d, q), where:

p: The number of autoregressive terms, i.e. the number of lagged observations used to predict the current observation.
d: The degree of differencing, indicating the number of times differencing is required to achieve stationarity.
q: The number of moving average terms, i.e. the number of lagged forecast errors included in the model.
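The roles of p, d and q can be illustrated on simulated data with base R's arima(). The orders and the AR coefficient below are chosen for the simulation only and are unrelated to the model selected for the PCE data; the object names are assumptions.

```r
set.seed(42)

# Simulate an ARIMA(1,1,0) series with AR coefficient 0.5 (synthetic data)
x <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 300)

# One round of differencing (d = 1) removes the stochastic trend
dx <- diff(x, differences = 1)

# Fit the same structure and forecast 12 steps ahead
fit <- arima(x, order = c(1, 1, 0))
fc <- predict(fit, n.ahead = 12)

print(coef(fit))
print(length(fc$pred))
```

The estimated ar1 coefficient should land near the value used in the simulation, which is a quick sanity check that the orders are specified as intended.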

Here the auto.arima() function is used to select the model orders for the training data. The model summary below shows the selected orders (p, d, q) as (3,2,2).

## Series: train_TS 
## ARIMA(3,2,2) 
## 
## Coefficients:
##          ar1     ar2     ar3      ma1     ma2
##       0.4622  0.1941  0.0650  -1.5322  0.5413
## s.e.  0.1633  0.0445  0.0576   0.1604  0.1562
## 
## sigma^2 = 494.7:  log likelihood = -2806.23
## AIC=5624.46   AICc=5624.6   BIC=5651.05

Snippet below captures the ARIMA(3,2,2) model’s accuracy on the test set.

##                       ME     RMSE        MAE        MPE      MAPE       MASE
## Training set    1.258107   22.116   12.38468 0.06718656 0.4055402 0.06107394
## Test set     1020.890220 1592.717 1042.39499 6.35958873 6.5328026 5.14047736
##                      ACF1 Theil's U
## Training set -0.005215809        NA
## Test set      0.963688415  6.239238

5. Evaluation and Visualisation

The models are evaluated using the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE). Below is a visualisation of these metrics for the three models on the test data:

It is clear from the plots that the Holt model performs best on the test data. Plotting the predictions over the test window makes this even clearer.
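For reference, the two metrics behind this comparison can be written as short functions. The sketch below defines them, checks them on a tiny made-up example, and then ranks the three models using the test-set RMSE and MAE values reported in the accuracy tables of section 4. The function and object names are assumptions for illustration.

```r
# Error metrics used for the comparison
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Tiny made-up example
actual    <- c(10, 12, 14, 16)
predicted <- c(11, 12, 13, 18)
print(c(RMSE = rmse(actual, predicted), MAE = mae(actual, predicted)))

# Test-set metrics as reported in the accuracy tables of section 4
test_metrics <- data.frame(
  model = c("Drift", "Holt", "ARIMA"),
  RMSE  = c(2545.27790, 1151.49047, 1592.717),
  MAE   = c(1926.57567, 649.7501, 1042.39499)
)

# Model with the lowest test RMSE
best <- test_metrics$model[which.min(test_metrics$RMSE)]
print(as.character(best))
```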

It is safe to conclude from the above plots that exponential smoothing using Holt’s model performs best on the test data and should be used for estimating the expenditure for October 2024. The following points help explain the relative performance of the models:

The Drift model, as discussed earlier, is a simple method that extrapolates only the average historical change as a straight line. It cannot follow the roughly exponential growth of the series, so it performs very poorly over the forecasting window.
The ARIMA model handles the non-stationarity indicated by the ACF plot through differencing (d = 2), but its forecasts over the 13-year test window still deviate more from the actual values than Holt’s (test RMSE 1592.72 against 1151.49), so it does not work as well here.
Holt’s exponential smoothing method is able to capture the trend in data that has no seasonality, and hence performs best on the test set.

The point forecast (mean) of expenditure for October 2024 is 19566.92, in the units of the PCE series (billions of US dollars). The full forecast window is given below, along with the 80% and 95% prediction intervals.

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Dec 2023       18923.27 18805.22 19041.31 18742.74 19103.80
## Jan 2024       18987.63 18819.53 19155.74 18730.54 19244.73
## Feb 2024       19052.00 18844.67 19259.33 18734.91 19369.08
## Mar 2024       19116.36 18875.29 19357.44 18747.67 19485.06
## Apr 2024       19180.73 18909.32 19452.14 18765.65 19595.81
## May 2024       19245.09 18945.73 19544.46 18787.25 19702.94
## Jun 2024       19309.46 18983.88 19635.04 18811.52 19807.40
## Jul 2024       19373.83 19023.38 19724.27 18837.86 19909.79
## Aug 2024       19438.19 19063.95 19812.43 18865.83 20010.55
## Sep 2024       19502.56 19105.39 19899.72 18895.14 20109.97
## Oct 2024       19566.92 19147.56 19986.29 18925.56 20208.29

6. One-step ahead rolling forecasting without re-estimation of the parameters

Rolling forecasts are an effective way of comparing forecasting models on a single set of training data. For this task, the existing models are applied to the entire expenditure data-set, keeping the parameters estimated on the training set, and the one-step-ahead forecasts from December 2010 onwards (the start of the test set in the train-test split) are compared with the test data for accuracy. The rolling forecasts are calculated using the fitted() function from the fpp package.
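The mechanics of a one-step-ahead rolling forecast with fixed parameters can be sketched by hand for the drift method. The drift term is estimated once on the training portion and then reused unchanged; each forecast is the actual previous observation plus that fixed drift. The function name and the toy series are assumptions for illustration, and the report's own numbers come from fitted() rather than from this code.

```r
# One-step-ahead rolling forecasts with a drift term estimated once on the training data
rolling_drift <- function(y, n_train) {
  train <- y[1:n_train]
  drift <- (train[n_train] - train[1]) / (n_train - 1)  # estimated once, never updated
  test_idx <- (n_train + 1):length(y)
  # Each forecast uses the actual previous observation plus the fixed drift
  y[test_idx - 1] + drift
}

y <- c(100, 102, 104, 106, 108, 111, 113, 116)  # made-up series
preds <- rolling_drift(y, n_train = 5)
errors <- y[6:8] - preds

print(preds)
print(sqrt(mean(errors^2)))  # RMSE of the one-step forecasts
```

Because every forecast starts from the latest actual value, one-step errors are far smaller than the multi-step errors in section 4, which explains the much lower RMSE values in the tables that follow.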

Below the accuracies of the one-step ahead forecasts are calculated on the existing models.

Drift Model Accuracies :

##                ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set 30.16155 201.3577 75.68835 0.1880269 0.541034 0.1829177 0.9762526

Holt Model Accuracies :

##                ME     RMSE      MAE       MPE      MAPE      ACF1 Theil's U
## Test set 16.99518 200.1223 69.40865 0.1012731 0.5012335 0.1739243 0.9742215

ARIMA Model Accuracies :

##                ME     RMSE      MAE        MPE      MAPE      ACF1 Theil's U
## Test set 8.839923 221.2283 73.83458 0.04964673 0.5357159 0.2610437  1.084094

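The above results are consistent with the earlier finding that the Holt model performs best on the given data: it has the lowest RMSE (200.12) and MAE (69.41) of the three models, although its RMSE margin over the Drift model (201.36) is small.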

7. Conclusion

In this task three different models have been used to forecast seasonally adjusted US personal consumption expenditure data, and Holt’s exponential smoothing has been shown to work best on the given data. As found in the course of the analysis, the data has a non-linear trend, is seasonally adjusted and is non-stationary. These attributes make it difficult for the Drift and ARIMA models to forecast well over the test window. The model evaluations using both RMSE and MAE support this inference. The Holt model is therefore expected to give the most reliable of the three estimates for October 2024.

Part 2

1. Introduction

For the second task, a set of online hotel reviews is given along with their respective ratings on a scale of 1 to 5, with 1 denoting low satisfaction and 5 denoting high satisfaction. As part of the task, a text analysis model has to be designed to identify the factors that are discussed in positive and negative reviews respectively. The subsequent sections discuss in detail the rule for deciding positive and negative reviews, the steps for carrying out the analysis, and the criteria for deciding the number of topics. Finally, the topics are labelled in order to identify the top factors that affect customer satisfaction or grievances.

2. Data Preprocessing

The entire data-set of 10000 hotel reviews in HotelsData.csv is loaded into a data-frame, and the following steps are executed to prepare the data:

All non-English reviews are removed using the textcat library.
The given score is used to decide the sentiment of each review. Reviews with a score greater than 3 are considered positive and reviews with a score less than 3 are considered negative; reviews scoring exactly 3 are considered neutral.
After dropping the neutral reviews and rows with NULL values, a sample is drawn from the remaining data using the method specified in the assignment brief.
The sample is then divided into positive and negative sets for the next steps of analysis and topic modelling. (Single examples of a positive and a negative review are given below.)

##      Review.score
## 2175            4
##      Text.1
## 2175 We were there for 3 nights. The hotel is located in a very nice aneighborhood but a bit of a walk from public transportation. It's a lovely old building with lots of character. My room was spacious for London standard and clean and quiet. Bedding was comfortable. Bathroom was spacious but water pressure was limited. Need to wait 10 minutes in between flush. Staff was friendly and breakfast was excellent. Wi-fi didn't work very well. Despite some limitations, I would recommend this place to others and stay here again especially if they upgrade their wi-fi.

##      Review.score
## 7872            1
##      Text.1
## 7872 (1)Dirty; housekeeping doesn't clean the room well. (2)We were in a triple share room and everyday they only gave us two towels despite following up everyday repeatedly. (3)Takes forever to call any housekeeping - always leaving us on hold
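The score-based labelling described in this section can be sketched in base R as follows. The column names Review.score and Text.1 follow the output shown above, while the mini data set and the sentiment column name are made up for illustration; the language filtering and sampling steps are not reproduced here.

```r
# Hypothetical mini data set with the same column names as in the output above
reviews <- data.frame(
  Review.score = c(5, 3, 1, 4, 2, NA),
  Text.1 = c("Great stay", "It was ok", "Dirty room",
             "Friendly staff", "Noisy at night", "No score"),
  stringsAsFactors = FALSE
)

# Drop rows with missing scores and neutral (score 3) reviews
reviews <- reviews[!is.na(reviews$Review.score) & reviews$Review.score != 3, ]

# Scores above 3 are positive, scores below 3 are negative
reviews$sentiment <- ifelse(reviews$Review.score > 3, "positive", "negative")

positive_reviews <- reviews[reviews$sentiment == "positive", ]
negative_reviews <- reviews[reviews$sentiment == "negative", ]
print(table(reviews$sentiment))
```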

3. Creating Document Term Matrix with Term Frequency and Word Cloud Visualisation

After separating the positive and negative reviews, there are around 1760 and 240 reviews in the respective data-sets. Next, a term-frequency document term matrix (DTM) is created for each set. The DTM records how often each word occurs in each document, and is the input from which the topic model learns which words are frequently found together. The steps are broken down in the following points.

Two corpora are created from the review data-sets after converting the document contents to UTF-8 encoding, as some characters in the text cannot be handled by the tm package.
Next, the DocumentTermMatrix() function is used to form the matrix from the text data. Through its control options it also takes care of data cleaning steps such as removal of punctuation, numbers and stopwords, and conversion of all tokens to lowercase. Note that word variants are not merged at this step, since both “room” and “rooms” appear among the top terms below.
The output of the DTM function is then converted to a matrix to find the frequencies of the most used words, and those frequencies are used to plot the word clouds. The top 10 words in the positive and negative review corpora are shown below, in that order, followed by the respective word clouds.

##     hotel      room     staff    london      good      stay     great breakfast 
##      2636      2219      1352      1145      1090      1059       973       922 
##  location     rooms 
##       818       701

##      room     hotel     staff       one breakfast      stay      good       bed 
##       539       465       146       142       137       135       133       124 
##    london     rooms 
##       124       118

4. Topic Modelling with Latent Dirichlet Allocation (LDA)

Topic modelling estimates, from the co-occurrence of words in the documents, the probability that a particular word belongs to a certain topic. In R, the ldatuning and topicmodels libraries are used for this purpose. There are two major steps.

4.1 Determining the number of topics

Using the ldatuning library, the optimal number of topics (k) for LDA to generate from the review set is decided. Three criteria are chosen out of the options available in the library: Arun2010 and CaoJuan2009, which are to be minimised, and Griffiths2004, which is to be maximised. The metrics are computed over a range of 5 to 20 possible topics, and the optimal number of topics is decided by graphical inspection.

For the positive review DTM, 14 is chosen as the optimal number of topics. As seen in the plots below, Griffiths2004 reaches an elbow near its maximum at k = 14, and Arun2010 does not converge further beyond k = 13. Increasing the number of topics beyond this point is therefore likely to produce overlapping topics.

## fit models... done.
## calculate metrics:
##   Griffiths2004... done.
##   CaoJuan2009... done.
##   Arun2010... done.

For the negative review DTM, 11 is chosen as the optimal number of topics. In the plots, at k = 11 CaoJuan2009 appears to reach its global minimum, and Griffiths2004 reaches an elbow and shows no major increase beyond this point.

## fit models... done.
## calculate metrics:
##   Griffiths2004... done.
##   CaoJuan2009... done.
##   Arun2010... done.

4.2 Modelling with LDA

This step involves generating the list of topics covered by the documents and grouping the documents by the topics that were found. The LDA() function is used with 1000 iterations for both corpora, taking the number of topics (k) decided in the previous step as input.

5. Topic Labelling

For labelling the topics, the terms() function is applied to the output of LDA and the top 10 terms for each topic are inspected. The labels are based on the predominant themes conveyed by those top 10 words.
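Extracting the top terms amounts to sorting each topic's word probabilities and keeping the first few. The base R sketch below shows this on a small hypothetical topic-by-term probability matrix; the matrix values, the name beta and the helper top_terms are assumptions for illustration and do not come from the fitted models.

```r
# Hypothetical topic-by-term probability matrix (rows: topics, columns: terms)
beta <- matrix(
  c(0.40, 0.30, 0.20, 0.05, 0.05,
    0.05, 0.10, 0.15, 0.30, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Topic 1", "Topic 2"),
                  c("staff", "friendly", "helpful", "tube", "station"))
)

# For each topic, list the k terms with the highest probability
top_terms <- function(beta, k) {
  apply(beta, 1, function(p) colnames(beta)[order(p, decreasing = TRUE)[seq_len(k)]])
}

print(top_terms(beta, k = 3))
```

The result has one column per topic and one row per rank, which is the same layout as the term tables in this section.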

5.1 Positive topics labelling

Referring to the top 10 terms in each topic, the labels are as follows:

Topic 1 : Staff and Service quality
Topic 2 : Business travel in London
Topic 3 : Overall Guest Impression
Topic 4 : Room and Bathroom Facilities
Topic 5 : Recommendations based on Overall Satisfaction
Topic 6 : Food and Cleanliness
Topic 7 : Complimentary Amenities
Topic 8 : Check-in Process and Guest Experiences
Topic 9 : Stay Experience
Topic 10 : Hotel Surroundings
Topic 11 : Stay Duration
Topic 12 : Accessibility and Proximity to Transportation
Topic 13 : Location and size of Hotel
Topic 14 : Exceptional Hospitality

##       Topic 1       Topic 2    Topic 3  Topic 4    Topic 5      Topic 6      
##  [1,] "staff"       "london"   "get"    "room"     "great"      "good"       
##  [2,] "friendly"    "rooms"    "like"   "shower"   "stay"       "breakfast"  
##  [3,] "helpful"     "hotel"    "can"    "bathroom" "location"   "food"       
##  [4,] "clean"       "hotels"   "even"   "bed"      "stayed"     "clean"      
##  [5,] "stay"        "stayed"   "quite"  "small"    "definitely" "comfortable"
##  [6,] "comfortable" "well"     "want"   "desk"     "service"    "value"      
##  [7,] "excellent"   "always"   "people" "one"      "recommend"  "price"      
##  [8,] "extremely"   "business" "little" "water"    "hotel"      "location"   
##  [9,] "pleasant"    "trip"     "one"    "door"     "will"       "modern"     
## [10,] "stayed"      "staying"  "really" "front"    "staff"      "quality"    
##       Topic 7     Topic 8   Topic 9      Topic 10 Topic 11    Topic 12     
##  [1,] "breakfast" "room"    "hotel"      "hotel"  "room"      "walk"       
##  [2,] "tea"       "check"   "stay"       "nice"   "night"     "station"    
##  [3,] "well"      "day"     "really"     "street" "just"      "tube"       
##  [4,] "also"      "went"    "time"       "park"   "one"       "close"      
##  [5,] "free"      "arrived" "everything" "around" "two"       "easy"       
##  [6,] "nice"      "asked"   "will"       "rooms"  "floor"     "minutes"    
##  [7,] "lovely"    "early"   "next"       "well"   "bed"       "restaurants"
##  [8,] "room"      "first"   "much"       "small"  "nights"    "walking"    
##  [9,] "lounge"    "said"    "back"       "near"   "got"       "distance"   
## [10,] "etc"       "told"    "visit"      "quiet"  "reception" "minute"     
##       Topic 13   Topic 14    
##  [1,] "hotel"    "service"   
##  [2,] "london"   "perfect"   
##  [3,] "area"     "lovely"    
##  [4,] "big"      "experience"
##  [5,] "location" "wonderful" 
##  [6,] "city"     "staff"     
##  [7,] "also"     "made"      
##  [8,] "view"     "special"   
##  [9,] "large"    "amazing"   
## [10,] "access"   "best"

5.2 Negative topics labelling

Referring to the top 10 terms in each topic, the labels are as follows:

Topic 1 : Overall Hotel Issues
Topic 2 : Breakfast and Location problems
Topic 3 : Value for Money
Topic 4 : Cleanliness and Maintenance
Topic 5 : Ventilation and temperature complaints
Topic 6 : London Hotel Experiences
Topic 7 : Reception and Staff Issues
Topic 8 : Room Comfort and Bar Issues
Topic 9 : Bathroom and Shower Issues
Topic 10 : Overall Stay Experience
Topic 11 : Staff and Service Quality

##       Topic 1   Topic 2     Topic 3 Topic 4   Topic 5   Topic 6   Topic 7    
##  [1,] "hotel"   "breakfast" "one"   "room"    "room"    "london"  "reception"
##  [2,] "will"    "good"      "even"  "door"    "night"   "hotel"   "told"     
##  [3,] "well"    "didnt"     "made"  "area"    "floor"   "stayed"  "asked"    
##  [4,] "time"    "small"     "desk"  "dirty"   "get"     "hotels"  "manager"  
##  [5,] "days"    "location"  "going" "looked"  "couldnt" "just"    "check"    
##  [6,] "many"    "close"     "front" "however" "hot"     "can"     "said"     
##  [7,] "now"     "people"    "well"  "next"    "sleep"   "little"  "went"     
##  [8,] "problem" "enough"    "price" "stayed"  "bad"     "station" "left"     
##  [9,] "phone"   "poor"      "just"  "just"    "windows" "found"   "one"      
## [10,] "small"   "coffee"    "feel"  "stay"    "really"  "tube"    "morning"  
##       Topic 8  Topic 9    Topic 10   Topic 11 
##  [1,] "rooms"  "bed"      "staff"    "room"   
##  [2,] "back"   "shower"   "stay"     "service"
##  [3,] "bar"    "water"    "clean"    "booked" 
##  [4,] "like"   "bathroom" "nice"     "also"   
##  [5,] "stay"   "tiny"     "place"    "like"   
##  [6,] "got"    "room"     "great"    "first"  
##  [7,] "get"    "need"     "friendly" "see"    
##  [8,] "never"  "really"   "helpful"  "much"   
##  [9,] "put"    "nothing"  "walk"     "beds"   
## [10,] "around" "think"    "away"     "know"

From the above labels, it can be inferred that some of the topics are interrelated, but the reviews mostly revolve around overall service quality, stay experience, location and cleanliness of the hotels. To determine the top factors governing the nature of a review, further analysis is done in the next section.

6. Top factors affecting customer sentiments

To determine the factors that affect customer satisfaction and grievances, the original reviews are assigned to the topics modelled in the previous section. The number of reviews per topic label is then compared to understand the relevance of each factor across the review set and to determine the top 3 factors for both positive and negative reviews.
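One common way to make this assignment, assumed here because the text does not spell out the rule, is to give each review the topic with the highest probability for that document and then count reviews per topic. The document-by-topic matrix below is hypothetical, and the object names are chosen for illustration only.

```r
# Hypothetical document-by-topic probabilities (rows: reviews, columns: topics)
gamma <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.6, 0.3, 0.1,
    0.2, 0.2, 0.6,
    0.5, 0.4, 0.1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Topic 1", "Topic 2", "Topic 3"))
)

# Assign each review to its most probable topic (an assumption of this sketch)
dominant <- colnames(gamma)[apply(gamma, 1, which.max)]

# Count reviews per topic and rank the topics
topic_counts <- sort(table(dominant), decreasing = TRUE)
print(topic_counts)
```

The first few entries of the ranked counts would then correspond to the top factors.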

6.1 Top 3 factors in Positive reviews

As visualised in the countplot below, the topics most discussed in satisfactory reviews are :

Topic 12 : Accessibility and Proximity to Transportation
Topic 1 : Staff and Service quality
Topic 14 : Exceptional Hospitality

6.2 Top 3 factors in Negative reviews

As visualised in the countplot below, the topics most discussed in unpleasant reviews are :

Topic 2 : Breakfast and Location problems
Topic 6 : London Hotel Experiences
Topic 9 : Bathroom and Shower Issues

7. Conclusion

In the course of this text analysis task, a variety of transformations have been carried out on the online hotel review data to understand the factors affecting customer satisfaction. While the words “hotel” and “room” were the most frequently used in both corpora, the broader subjects of discussion across the reviews were found to be different. Across both the positive and negative review sets, accessibility to transport and location has been a common key topic of interest, alongside overall quality of service and cleanliness. It can be concluded that the location of a hotel and the kind of hospitality offered to the guests are the most prominent factors captured in the online reviews.

Assessed Coursework
