library(tidyverse)
library(ggplot2)
library(dplyr)
library(reshape2)
library(GGally)
library(data.table)
library(formattable)
library(car)
workation_data <- read.csv("/Users/sindhuherle/Documents/DAM/best cities for a workation.csv")
## change column names as they are long
colnames(workation_data) <- c('ranking','city','country', 'internet', 'co_working_spaces','coffee_price','travel_price','post_work_drinks','accomodation_price','food_price','average_climate','tourist_attraction','insta_photos')
As a team we have finalized the dataset on workcation from Kaggle (https://www.kaggle.com/prasertk/the-best-cities-for-a-workation?select=best+cities+for+a+workation.csv). When the pandemic struck, organizations had to allow their employees to work remotely. Now despite attempts to go back to the pre-covid days, most organizations are still preferring to continue with the remote work options. With this new flexibility, a lot of employees, mostly travel enthusiasts, are exploring options of being able to combine their passion for travel without draining their paid leave. It is this interest that attracted our team to the dataset, and we were curious to understand the inferences from analyzing this data.
The dataset has 13 variables and 148 observations. The variables include details like the city name, country name, average Wi-Fi speed per city, number of co-working spaces, average price of coffee, average price for taxi, beer and food, average price for accommodation per month, number of tourist destinations, number of photos per city that have been referenced by Instagram hashtags and number of sunshine hours.
Our first approach was to import the libraries and loaded the dataset and perform a simple preview and statistics of our dataset. We used the head() , str(), summary(), sum(is.na()). The output showed us 147 observations and 13 variables and no null values. None of the column data types were changed and most of our data type was numeric. Our dataset is reasonable in size. The summary function shows each variable minimum, 1sy quartile, median, mean, 3rd quartile, and maximum. Some of the cities have the same ranking and hence we decided that we wouldn’t use the Ranking variable any longer for our analysis. We renamed the column names to provide short and suitable names.
Before analyzing, let us view the dataset in a table form to get a better understanding of all the variables
head(workation_data)
## ranking city country internet co_working_spaces coffee_price
## 1 1 Bangkok Thailand 28 117 1.56
## 2 2 New Delhi India 12 165 1.42
## 3 3 Lisbon Portugal 33 95 1.56
## 4 4 Barcelona Spain 37 136 1.59
## 5 5 Buenos Aires Argentina 17 67 1.22
## 6 5 Budapest Hungary 37 40 1.20
## travel_price post_work_drinks accomodation_price food_price average_climate
## 1 0.82 3.08 415.18 1.54 2624
## 2 0.19 2.90 179.25 2.90 2685
## 3 0.40 3.42 736.19 7.69 2806
## 4 1.01 5.12 768.46 10.25 2591
## 5 0.47 2.16 229.55 5.15 2525
## 6 0.72 2.40 366.66 4.81 1988
## tourist_attraction insta_photos
## 1 2262 28386616
## 2 2019 28528249
## 3 1969 10205538
## 4 2739 62894055
## 5 1660 21293975
## 6 1468 14267880
The correlation matrix above supports us to analyze the effect of different parameters over ranking. The correlation matrix heap here is very noteworthy. As opposed to what we might consider internet speed, co-working spaces, or tourist attraction as the parameters for affecting the rank of best cities to work in, it is the quite opposite. We can see that the highest correlation with the ranking are cost of post work drinks, food, travel, and caffeine which represents that people are looking for more quality of life
To further understand the parameters which deeply affect the global ranking of cities, we created scatterplots for the four variables which had highest coorelation with ranking. From the scatterplots, its evident that lower the cost of food, beverages, and transportation, higher the ranking of cities for working.
In this analysis, we are interpreting the dataset to understand the top 15 cheapest and top 15 most expensive cities for a workcation. To arrive at this conclusion, we have calculated the average monthly expenses for meals, beverages, transportation, and accommodation. We are then ranking the cities based on the cost of living. The higher the cost of living, the larger their ranking. We have then plotted the following based on the top 15 and least 15 rankings in the sorted data.
We are analyzing the four parameters from the dataset to observe the outliers. Box plot helps us in understanding the statistics of the dataset (mean, median, quartiles, and outliers). From the above plots, we can see that there are not many outliers in the dataset, therefore we need not remove them.
Initially, we ran an individual regression for each particular variable in order to see their effect on the total ranking of each country. After compiling all the individual regressions (Figures 1.1 – 1.10 in Appendix), we saw that coffee price, travel price, post work drink price, and food price are the only variables with an R-Squared value greater than 0.2. In a strong and predictive model, the R-Squared value is ideally greater than 0.7, which is obviously not the case for any of these variables. Because of this, we thought it might be beneficial to run a multiple linear regression with the four variables with the highest R-Squared value.
#Multiple Linear Regression Model
multiple_lr_model <- lm(workation_data$ranking ~ workation_data$coffee_price + workation_data$travel_price + workation_data$post_work_drinks + workation_data$food_price)
summary(multiple_lr_model)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$coffee_price +
## workation_data$travel_price + workation_data$post_work_drinks +
## workation_data$food_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.078 -27.234 1.834 26.735 68.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.885 8.251 1.198 0.232920
## workation_data$coffee_price 9.383 4.691 2.000 0.047404 *
## workation_data$travel_price 23.966 6.357 3.770 0.000239 ***
## workation_data$post_work_drinks 1.614 1.240 1.301 0.195214
## workation_data$food_price 1.048 1.034 1.014 0.312205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.37 on 142 degrees of freedom
## Multiple R-squared: 0.3668, Adjusted R-squared: 0.3489
## F-statistic: 20.56 on 4 and 142 DF, p-value: 2.2e-13
While the multiple linear regression model still did not produce an R-Squared value greater than 0.7 indicating it is strong and predictive, it did produce a value greater than any of the individual linear regression models. This shows that while the combination of these variables isn’t entirely predictive of the ranking for which city is the best for a workstation, it is better than any of them individually. We see that our r square value is low but p value is significant, so this indicates that our model doesn’t explain much of variation of the data but it is significant (better than not having a model)
avPlots(multiple_lr_model)
Figure 1.1 - Internet to Ranking Linear Regression Model
# Internet to Ranking Regression
regression_internet <- lm(workation_data$ranking ~ workation_data$internet)
summary(regression_internet)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$internet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.904 -32.537 0.785 35.758 74.135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.9244 6.7235 12.334 <2e-16 ***
## workation_data$internet -0.3350 0.2141 -1.565 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.39 on 145 degrees of freedom
## Multiple R-squared: 0.0166, Adjusted R-squared: 0.009822
## F-statistic: 2.448 on 1 and 145 DF, p-value: 0.1198
Figure 1.2 - Co-Working Spaces to Ranking Linear Regression Model
#Co-Working Spaces to Ranking Regression
regression_co_working_spaces <- lm(workation_data$ranking ~ workation_data$co_working_spaces)
summary(regression_co_working_spaces)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$co_working_spaces)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.302 -30.703 -3.312 32.804 123.287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.22865 4.44481 20.525 < 2e-16 ***
## workation_data$co_working_spaces -0.35697 0.06375 -5.599 1.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.76 on 145 degrees of freedom
## Multiple R-squared: 0.1778, Adjusted R-squared: 0.1721
## F-statistic: 31.35 on 1 and 145 DF, p-value: 1.046e-07
Figure 1.3 - Coffee Price to Ranking Linear Regression Model
#Coffee Price to Ranking Regression
regression_coffee <- lm(workation_data$ranking ~ workation_data$coffee_price)
summary(regression_coffee)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$coffee_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.932 -31.211 1.981 33.019 70.886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.562 8.647 2.378 0.0187 *
## workation_data$coffee_price 23.413 3.542 6.609 6.88e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 145 degrees of freedom
## Multiple R-squared: 0.2315, Adjusted R-squared: 0.2262
## F-statistic: 43.68 on 1 and 145 DF, p-value: 6.881e-10
Figure 1.4 - Travel Price to Ranking Linear Regression Model
#Travel Price to Ranking Regression
regression_travel <- lm(workation_data$ranking ~ workation_data$travel_price)
summary(regression_travel)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$travel_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.788 -30.576 -0.432 26.452 82.046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.412 6.256 5.660 7.83e-08 ***
## workation_data$travel_price 38.319 5.437 7.048 6.81e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.89 on 145 degrees of freedom
## Multiple R-squared: 0.2552, Adjusted R-squared: 0.25
## F-statistic: 49.67 on 1 and 145 DF, p-value: 6.812e-11
Figure 1.5 - Post Work Drinks to Ranking Linear Regression Model
#Post Work Drinks to Ranking Regression
regression_drinks <- lm(workation_data$ranking ~ workation_data$post_work_drinks)
summary(regression_drinks)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$post_work_drinks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.216 -33.236 3.404 28.103 85.518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.6947 6.3484 6.410 1.92e-09 ***
## workation_data$post_work_drinks 5.3584 0.8882 6.033 1.28e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.22 on 145 degrees of freedom
## Multiple R-squared: 0.2006, Adjusted R-squared: 0.1951
## F-statistic: 36.4 on 1 and 145 DF, p-value: 1.279e-08
Figure 1.6 - Accommodation Price to Ranking Linear Regression Model
#Accommodation Price to Ranking Regression
regression_accommodation <- lm(workation_data$ranking ~ workation_data$accomodation_price)
summary(regression_accommodation)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$accomodation_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.330 -38.107 2.205 36.309 69.562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.018770 6.760284 8.434 3.07e-14 ***
## workation_data$accomodation_price 0.023515 0.008098 2.904 0.00426 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.56 on 145 degrees of freedom
## Multiple R-squared: 0.05496, Adjusted R-squared: 0.04844
## F-statistic: 8.432 on 1 and 145 DF, p-value: 0.004264
Figure 1.7 - Food Price to Ranking Linear Regression Model
#Food Price to Ranking Regression
regression_food <- lm(workation_data$ranking ~ workation_data$food_price)
summary(regression_food)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$food_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.111 -30.145 -0.596 28.688 83.447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.2540 6.4677 5.296 4.30e-07 ***
## workation_data$food_price 4.8641 0.6989 6.960 1.09e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.01 on 145 degrees of freedom
## Multiple R-squared: 0.2504, Adjusted R-squared: 0.2452
## F-statistic: 48.44 on 1 and 145 DF, p-value: 1.09e-10
Figure 1.8 - Average Climate to Ranking Linear Regression Model
#Average Climate to Ranking Regression
regression_climate <- lm(workation_data$ranking ~ workation_data$average_climate)
summary(regression_climate)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$average_climate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.935 -34.381 0.354 37.416 81.568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.780031 13.189742 7.338 1.43e-11 ***
## workation_data$average_climate -0.009479 0.005279 -1.796 0.0746 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.28 on 145 degrees of freedom
## Multiple R-squared: 0.02175, Adjusted R-squared: 0.01501
## F-statistic: 3.224 on 1 and 145 DF, p-value: 0.07463
Figure 1.9 - Tourist Attraction to Ranking Linear Regression Model
#Tourist Attraction to Ranking Regression
regression_attractions <- lm(workation_data$ranking ~ workation_data$tourist_attraction)
summary(regression_attractions)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$tourist_attraction)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.173 -33.258 -0.478 30.749 94.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.873083 4.471587 19.651 < 2e-16 ***
## workation_data$tourist_attraction -0.013361 0.002901 -4.606 8.89e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.93 on 145 degrees of freedom
## Multiple R-squared: 0.1277, Adjusted R-squared: 0.1216
## F-statistic: 21.22 on 1 and 145 DF, p-value: 8.891e-06
Figure 1.10 - Instagram Photos to Ranking Linear Regression Model
#Instagram Photos to Ranking Regression
regression_instagram <- lm(workation_data$ranking ~ workation_data$insta_photos)
summary(regression_instagram)
##
## Call:
## lm(formula = workation_data$ranking ~ workation_data$insta_photos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.568 -30.050 -1.813 33.763 97.801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.404e+01 4.139e+00 20.308 < 2e-16 ***
## workation_data$insta_photos -5.364e-07 1.301e-07 -4.125 6.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.44 on 145 degrees of freedom
## Multiple R-squared: 0.105, Adjusted R-squared: 0.09884
## F-statistic: 17.01 on 1 and 145 DF, p-value: 6.227e-05