library(tidyverse)
library(ggplot2)
library(dplyr)
library(reshape2)
library(GGally)
library(data.table)
library(formattable)
library(car)
workation_data <- read.csv("/Users/sindhuherle/Documents/DAM/best cities for a workation.csv")

## change column names as they are long
colnames(workation_data) <- c('ranking','city','country', 'internet', 'co_working_spaces','coffee_price','travel_price','post_work_drinks','accomodation_price','food_price','average_climate','tourist_attraction','insta_photos')

Project Contents

1. Project Overview

1.1. Introduction
1.2. Basic EDA

2. Visualizations

2.1. Heat Map
2.2. Scatterplots
2.3. Bar Charts
2.4. Box Plot

3. Linear Regression

4. Appendix

1. Project Overview

1.1 Introduction

As a team we have finalized the dataset on workcation from Kaggle (https://www.kaggle.com/prasertk/the-best-cities-for-a-workation?select=best+cities+for+a+workation.csv). When the pandemic struck, organizations had to allow their employees to work remotely. Now despite attempts to go back to the pre-covid days, most organizations are still preferring to continue with the remote work options. With this new flexibility, a lot of employees, mostly travel enthusiasts, are exploring options of being able to combine their passion for travel without draining their paid leave. It is this interest that attracted our team to the dataset, and we were curious to understand the inferences from analyzing this data.

The dataset has 13 variables and 148 observations. The variables include details like the city name, country name, average Wi-Fi speed per city, number of co-working spaces, average price of coffee, average price for taxi, beer and food, average price for accommodation per month, number of tourist destinations, number of photos per city that have been referenced by Instagram hashtags and number of sunshine hours.

1.2 Basic EDA

Our first approach was to import the libraries and loaded the dataset and perform a simple preview and statistics of our dataset. We used the head() , str(), summary(), sum(is.na()). The output showed us 147 observations and 13 variables and no null values. None of the column data types were changed and most of our data type was numeric. Our dataset is reasonable in size. The summary function shows each variable minimum, 1sy quartile, median, mean, 3rd quartile, and maximum. Some of the cities have the same ranking and hence we decided that we wouldn’t use the Ranking variable any longer for our analysis. We renamed the column names to provide short and suitable names.

2. Visualizations

Before analyzing, let us view the dataset in a table form to get a better understanding of all the variables

head(workation_data)
##   ranking         city   country internet co_working_spaces coffee_price
## 1       1      Bangkok  Thailand       28               117         1.56
## 2       2    New Delhi     India       12               165         1.42
## 3       3       Lisbon  Portugal       33                95         1.56
## 4       4    Barcelona     Spain       37               136         1.59
## 5       5 Buenos Aires Argentina       17                67         1.22
## 6       5     Budapest   Hungary       37                40         1.20
##   travel_price post_work_drinks accomodation_price food_price average_climate
## 1         0.82             3.08             415.18       1.54            2624
## 2         0.19             2.90             179.25       2.90            2685
## 3         0.40             3.42             736.19       7.69            2806
## 4         1.01             5.12             768.46      10.25            2591
## 5         0.47             2.16             229.55       5.15            2525
## 6         0.72             2.40             366.66       4.81            1988
##   tourist_attraction insta_photos
## 1               2262     28386616
## 2               2019     28528249
## 3               1969     10205538
## 4               2739     62894055
## 5               1660     21293975
## 6               1468     14267880
2.1 Correlation Matrix Heatmap

The correlation matrix above supports us to analyze the effect of different parameters over ranking. The correlation matrix heap here is very noteworthy. As opposed to what we might consider internet speed, co-working spaces, or tourist attraction as the parameters for affecting the rank of best cities to work in, it is the quite opposite. We can see that the highest correlation with the ranking are cost of post work drinks, food, travel, and caffeine which represents that people are looking for more quality of life

2.2 Scatter Plots

To further understand the parameters which deeply affect the global ranking of cities, we created scatterplots for the four variables which had highest coorelation with ranking. From the scatterplots, its evident that lower the cost of food, beverages, and transportation, higher the ranking of cities for working.

2.3 Bar Charts

In this analysis, we are interpreting the dataset to understand the top 15 cheapest and top 15 most expensive cities for a workcation. To arrive at this conclusion, we have calculated the average monthly expenses for meals, beverages, transportation, and accommodation. We are then ranking the cities based on the cost of living. The higher the cost of living, the larger their ranking. We have then plotted the following based on the top 15 and least 15 rankings in the sorted data.

2.4 Box Plots

We are analyzing the four parameters from the dataset to observe the outliers. Box plot helps us in understanding the statistics of the dataset (mean, median, quartiles, and outliers). From the above plots, we can see that there are not many outliers in the dataset, therefore we need not remove them.

3. Linear Regression

Initially, we ran an individual regression for each particular variable in order to see their effect on the total ranking of each country. After compiling all the individual regressions (Figures 1.1 – 1.10 in Appendix), we saw that coffee price, travel price, post work drink price, and food price are the only variables with an R-Squared value greater than 0.2. In a strong and predictive model, the R-Squared value is ideally greater than 0.7, which is obviously not the case for any of these variables. Because of this, we thought it might be beneficial to run a multiple linear regression with the four variables with the highest R-Squared value.

#Multiple Linear Regression Model

multiple_lr_model <- lm(workation_data$ranking ~ workation_data$coffee_price + workation_data$travel_price + workation_data$post_work_drinks + workation_data$food_price)
summary(multiple_lr_model)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$coffee_price + 
##     workation_data$travel_price + workation_data$post_work_drinks + 
##     workation_data$food_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.078 -27.234   1.834  26.735  68.448 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.885      8.251   1.198 0.232920    
## workation_data$coffee_price        9.383      4.691   2.000 0.047404 *  
## workation_data$travel_price       23.966      6.357   3.770 0.000239 ***
## workation_data$post_work_drinks    1.614      1.240   1.301 0.195214    
## workation_data$food_price          1.048      1.034   1.014 0.312205    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.37 on 142 degrees of freedom
## Multiple R-squared:  0.3668, Adjusted R-squared:  0.3489 
## F-statistic: 20.56 on 4 and 142 DF,  p-value: 2.2e-13

While the multiple linear regression model still did not produce an R-Squared value greater than 0.7 indicating it is strong and predictive, it did produce a value greater than any of the individual linear regression models. This shows that while the combination of these variables isn’t entirely predictive of the ranking for which city is the best for a workstation, it is better than any of them individually. We see that our r square value is low but p value is significant, so this indicates that our model doesn’t explain much of variation of the data but it is significant (better than not having a model)

avPlots(multiple_lr_model)

4. Appendix

Figure 1.1 - Internet to Ranking Linear Regression Model

# Internet to Ranking Regression

regression_internet <- lm(workation_data$ranking ~ workation_data$internet)
summary(regression_internet)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$internet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.904 -32.537   0.785  35.758  74.135 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              82.9244     6.7235  12.334   <2e-16 ***
## workation_data$internet  -0.3350     0.2141  -1.565     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.39 on 145 degrees of freedom
## Multiple R-squared:  0.0166, Adjusted R-squared:  0.009822 
## F-statistic: 2.448 on 1 and 145 DF,  p-value: 0.1198

Figure 1.2 - Co-Working Spaces to Ranking Linear Regression Model

#Co-Working Spaces to Ranking Regression

regression_co_working_spaces <- lm(workation_data$ranking ~ workation_data$co_working_spaces)
summary(regression_co_working_spaces)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$co_working_spaces)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.302 -30.703  -3.312  32.804 123.287 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      91.22865    4.44481  20.525  < 2e-16 ***
## workation_data$co_working_spaces -0.35697    0.06375  -5.599 1.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.76 on 145 degrees of freedom
## Multiple R-squared:  0.1778, Adjusted R-squared:  0.1721 
## F-statistic: 31.35 on 1 and 145 DF,  p-value: 1.046e-07

Figure 1.3 - Coffee Price to Ranking Linear Regression Model

#Coffee Price to Ranking Regression

regression_coffee <- lm(workation_data$ranking ~ workation_data$coffee_price)
summary(regression_coffee)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$coffee_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.932 -31.211   1.981  33.019  70.886 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   20.562      8.647   2.378   0.0187 *  
## workation_data$coffee_price   23.413      3.542   6.609 6.88e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 145 degrees of freedom
## Multiple R-squared:  0.2315, Adjusted R-squared:  0.2262 
## F-statistic: 43.68 on 1 and 145 DF,  p-value: 6.881e-10

Figure 1.4 - Travel Price to Ranking Linear Regression Model

#Travel Price to Ranking Regression

regression_travel <- lm(workation_data$ranking ~ workation_data$travel_price)
summary(regression_travel)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$travel_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.788 -30.576  -0.432  26.452  82.046 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   35.412      6.256   5.660 7.83e-08 ***
## workation_data$travel_price   38.319      5.437   7.048 6.81e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.89 on 145 degrees of freedom
## Multiple R-squared:  0.2552, Adjusted R-squared:   0.25 
## F-statistic: 49.67 on 1 and 145 DF,  p-value: 6.812e-11

Figure 1.5 - Post Work Drinks to Ranking Linear Regression Model

#Post Work Drinks to Ranking Regression

regression_drinks <- lm(workation_data$ranking ~ workation_data$post_work_drinks)
summary(regression_drinks)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$post_work_drinks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -87.216 -33.236   3.404  28.103  85.518 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      40.6947     6.3484   6.410 1.92e-09 ***
## workation_data$post_work_drinks   5.3584     0.8882   6.033 1.28e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.22 on 145 degrees of freedom
## Multiple R-squared:  0.2006, Adjusted R-squared:  0.1951 
## F-statistic:  36.4 on 1 and 145 DF,  p-value: 1.279e-08

Figure 1.6 - Accommodation Price to Ranking Linear Regression Model

#Accommodation Price to Ranking Regression

regression_accommodation <- lm(workation_data$ranking ~ workation_data$accomodation_price)
summary(regression_accommodation)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$accomodation_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.330 -38.107   2.205  36.309  69.562 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       57.018770   6.760284   8.434 3.07e-14 ***
## workation_data$accomodation_price  0.023515   0.008098   2.904  0.00426 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.56 on 145 degrees of freedom
## Multiple R-squared:  0.05496,    Adjusted R-squared:  0.04844 
## F-statistic: 8.432 on 1 and 145 DF,  p-value: 0.004264

Figure 1.7 - Food Price to Ranking Linear Regression Model

#Food Price to Ranking Regression

regression_food <- lm(workation_data$ranking ~ workation_data$food_price)
summary(regression_food)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$food_price)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.111 -30.145  -0.596  28.688  83.447 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                34.2540     6.4677   5.296 4.30e-07 ***
## workation_data$food_price   4.8641     0.6989   6.960 1.09e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.01 on 145 degrees of freedom
## Multiple R-squared:  0.2504, Adjusted R-squared:  0.2452 
## F-statistic: 48.44 on 1 and 145 DF,  p-value: 1.09e-10

Figure 1.8 - Average Climate to Ranking Linear Regression Model

#Average Climate to Ranking Regression

regression_climate <- lm(workation_data$ranking ~ workation_data$average_climate)
summary(regression_climate)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$average_climate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.935 -34.381   0.354  37.416  81.568 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    96.780031  13.189742   7.338 1.43e-11 ***
## workation_data$average_climate -0.009479   0.005279  -1.796   0.0746 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.28 on 145 degrees of freedom
## Multiple R-squared:  0.02175,    Adjusted R-squared:  0.01501 
## F-statistic: 3.224 on 1 and 145 DF,  p-value: 0.07463

Figure 1.9 - Tourist Attraction to Ranking Linear Regression Model

#Tourist Attraction to Ranking Regression

regression_attractions <- lm(workation_data$ranking ~ workation_data$tourist_attraction)
summary(regression_attractions)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$tourist_attraction)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.173 -33.258  -0.478  30.749  94.333 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       87.873083   4.471587  19.651  < 2e-16 ***
## workation_data$tourist_attraction -0.013361   0.002901  -4.606 8.89e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.93 on 145 degrees of freedom
## Multiple R-squared:  0.1277, Adjusted R-squared:  0.1216 
## F-statistic: 21.22 on 1 and 145 DF,  p-value: 8.891e-06

Figure 1.10 - Instagram Photos to Ranking Linear Regression Model

#Instagram Photos to Ranking Regression

regression_instagram <- lm(workation_data$ranking ~ workation_data$insta_photos)
summary(regression_instagram)
## 
## Call:
## lm(formula = workation_data$ranking ~ workation_data$insta_photos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.568 -30.050  -1.813  33.763  97.801 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  8.404e+01  4.139e+00  20.308  < 2e-16 ***
## workation_data$insta_photos -5.364e-07  1.301e-07  -4.125 6.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.44 on 145 degrees of freedom
## Multiple R-squared:  0.105,  Adjusted R-squared:  0.09884 
## F-statistic: 17.01 on 1 and 145 DF,  p-value: 6.227e-05