R Takehome Homework
1.
#Importing the data
bali_data <- read.csv("Bali Popular Destination for Tourist 2022 - Sheet1.csv", header = TRUE, stringsAsFactors = FALSE)
#Creating new variable (review density)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bali_clean <- bali_data %>%
  filter(complete.cases(.)) %>%
  #Review density = number of reviews per rating point, rounded to a whole number
  mutate(ReviewDensity = round(Google.Reviews.Count / Google.Maps.Rating, 0)) %>%
  #Dropping columns that are not used in the analysis
  select(-Location, -Source, -Description)
head(bali_clean)
## Place Google.Maps.Rating Google.Reviews.Count
## 1 Tanah Lot 4.6 75899
## 2 Mount Batur 4.5 2580
## 3 Uluwatu Temple 4.6 28800
## 4 Ubud Monkey Forest 4.5 36099
## 5 Goa Gajah 4.2 6683
## 6 Jatiluwih Rice Terraces in Bali 4.7 7798
## Entrance.Fee ReviewDensity
## 1 Paid 16500
## 2 Paid 573
## 3 Paid 6261
## 4 Paid 8022
## 5 Paid 1591
## 6 Free 1659
After cleaning, the dataset contains the destination name (Place), the Google Maps rating, the number of Google reviews, whether an entrance fee is required, and the new ReviewDensity variable (number of reviews per rating point). These variables help us analyse which attractions are most popular and how charging an entrance fee may relate to tourist interest.
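As a quick check on the new variable, the sketch below recomputes review density for the first two destinations shown in the output above. The small data frame `check` is a hypothetical stand-in built from those printed values, not the full dataset.

```r
# Hypothetical two-row stand-in using values printed by head(bali_clean)
check <- data.frame(
  Place = c("Tanah Lot", "Mount Batur"),
  Google.Maps.Rating = c(4.6, 4.5),
  Google.Reviews.Count = c(75899, 2580)
)

# Review density = reviews per rating point, rounded to a whole number
check$ReviewDensity <- round(check$Google.Reviews.Count / check$Google.Maps.Rating, 0)
check
```

The two results should agree with the ReviewDensity column printed for these destinations.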
2.
#Renaming the columns
bali_clean <- bali_clean %>%
rename(
Destination = Place,
`GMaps Rating` = Google.Maps.Rating,
`GReviews Count` = Google.Reviews.Count,
`Entrance Fee` = Entrance.Fee,
`Review Density` = ReviewDensity)
#Creating separate datasets for paid and free destinations
paid_destinations <- bali_clean %>%
filter(`Entrance Fee` == "Paid")
free_destinations <- bali_clean %>%
filter(`Entrance Fee` == "Free")
head(paid_destinations)
## Destination GMaps Rating GReviews Count Entrance Fee Review Density
## 1 Tanah Lot 4.6 75899 Paid 16500
## 2 Mount Batur 4.5 2580 Paid 573
## 3 Uluwatu Temple 4.6 28800 Paid 6261
## 4 Ubud Monkey Forest 4.5 36099 Paid 8022
## 5 Goa Gajah 4.2 6683 Paid 1591
## 6 Pura Ulun Danu Bratan 4.7 29178 Paid 6208
head(free_destinations)
## Destination GMaps Rating GReviews Count Entrance Fee
## 1 Jatiluwih Rice Terraces in Bali 4.7 7798 Free
## 2 Tenggalang Rice Terrace 4.4 33732 Free
## 3 Seminyak Beach 4.5 3195 Free
## 4 Nusa Dua Beach 4.6 6171 Free
## 5 Kuta Beach 4.5 37663 Free
## 6 Pura Penataran Agung Lempuyang 4.3 6192 Free
## Review Density
## 1 1659
## 2 7666
## 3 710
## 4 1342
## 5 8370
## 6 1440
3.
#Creating summary
summary(paid_destinations[, c("GMaps Rating", "GReviews Count", "Review Density")])
## GMaps Rating GReviews Count Review Density
## Min. :4.200 Min. : 2422 Min. : 563.0
## 1st Qu.:4.500 1st Qu.: 4467 1st Qu.: 960.5
## Median :4.550 Median :11672 Median : 2508.5
## Mean :4.533 Mean :16656 Mean : 3664.9
## 3rd Qu.:4.600 3rd Qu.:18179 3rd Qu.: 4092.0
## Max. :4.800 Max. :75899 Max. :16500.0
The mean gives the overall average across all paid destinations in this dataset. A mean rating of 4.533 suggests that Bali’s paid spots are generally well reviewed, and the high mean review count of 16,656 indicates strong tourist engagement.
The median is less sensitive to extreme values. The mean review count is 16,656 but the median is only 11,672, which suggests that a few destinations with very high review counts (the maximum is 75,899) pull the average up.
The minimum GMaps Rating among paid destinations is 4.2, so even the lowest-rated paid attraction is still rated well by visitors. This suggests that paid spots in Bali deliver consistently satisfying experiences. For many visitors, these destinations may be a first encounter with Balinese traditions, landscapes, or spiritual sites, which can elevate their sense of wonder and appreciation.
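The gap between mean and median can be illustrated with a small sketch. It uses only the six paid destinations printed by head(paid_destinations), so the numbers differ from the full summary above. The vector name `reviews` is introduced here for illustration.

```r
# Review counts of the six paid destinations printed by head(paid_destinations)
reviews <- c(75899, 2580, 28800, 36099, 6683, 29178)

mean_reviews <- mean(reviews)      # pulled upward by Tanah Lot (75,899)
median_reviews <- median(reviews)  # middle of the sorted values

# A positive gap between mean and median points to right skew
mean_reviews - median_reviews
```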
4.
#Making a scatterplot
library(ggplot2)
ggplot(paid_destinations, aes(x = `GMaps Rating`, y = `GReviews Count`)) +
geom_point(color = "forestgreen", size = 2) +
labs(title = "Reviews vs. Rating",
x = "Google Maps Rating",
y = "Google Reviews Count") +
theme_classic()
The scatter plot shows that paid destinations in Bali receive consistently high ratings, while the number of reviews varies widely. This suggests that visitor engagement is influenced by factors beyond satisfaction, such as location, cultural significance, or the age of the site. For example, Tanah Lot temple existed long before the Dutch and Portuguese colonised Indonesia and may have been featured in European travel literature for generations. Its historical prominence could explain its exceptionally high number of reviews (75,899). Notably, some destinations with moderate ratings still attract tens of thousands of reviews, highlighting their enduring popularity among tourists.
#Task 2
1.
#Importing the Excel dataset
library(readxl)
mba_data <- read_excel("Business School.xlsx")
#Creating a bar chart to show the number of graduates from each major
library(ggplot2)
ggplot(mba_data, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
geom_bar(color = "white") +
scale_fill_manual(values = c(
"Business" = "darkorange",
"Engineering" = "purple",
"Art" = "forestgreen",
"Science" = "aquamarine",
"Computer Science" = "lightblue",
"Finance" = "red"
)) +
scale_y_continuous(breaks = seq(0, 40, by = 5)) +
labs(title = "Distribution of Undergraduate Degrees",
x = "Undergraduate Degree",
y = "Count") +
theme_minimal()
The bar chart shows that Business is the most common undergraduate degree among the MBA students.
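The same conclusion can be read from a frequency table rather than the chart. The sketch below uses a hypothetical vector `degrees`, because the real values live in mba_data and are not printed in this report.

```r
# Hypothetical degree labels; the real ones are in mba_data$`Undergrad Degree`
degrees <- c("Business", "Engineering", "Business", "Art", "Finance",
             "Business", "Computer Science", "Engineering")

# Frequency table sorted from most to least common
degree_counts <- sort(table(degrees), decreasing = TRUE)
degree_counts

# Name of the most common category
most_common <- names(degree_counts)[1]
most_common
```

On the real data, replacing `degrees` with the Undergrad Degree column would give the exact counts behind each bar.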
2.
#Creating histogram
library(ggplot2)
library(scales)
ggplot(mba_data, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 10000, fill = "darkolivegreen", color = "white") +
scale_x_continuous(labels = comma) +
labs(title = "Distribution of Annual Salaries",
x = "Annual Salary (USD)",
y = "Number of Students") +
theme_minimal()
The histogram illustrates the distribution of annual salaries among MBA students. Most earn between $75,000 and $125,000, with the highest concentration around $100,000. The distribution is skewed to the right, meaning that while the majority fall within a mid-range salary bracket, a smaller group earns significantly more (up to $300,000). This skew reflects the presence of high-earning outliers, potentially due to differences in industry placement, prior experience, or negotiation outcomes. Overall, the shape reveals a strong central cluster with a gradual extension toward higher salaries, highlighting income variability within the cohort.
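Right skew can also be checked numerically. The sketch below computes a moment-based skewness on a hypothetical `salary` vector. The values are made up to resemble the shape described above and are not taken from mba_data.

```r
# Hypothetical salaries in USD; the real values are in mba_data$`Annual Salary`
salary <- c(75000, 82000, 90000, 98000, 100000, 104000, 110000, 125000, 180000, 300000)

# Moment-based sample skewness: mean of cubed z-scores (population SD in the denominator)
m <- mean(salary)
s <- sqrt(mean((salary - m)^2))
skewness <- mean(((salary - m) / s)^3)

skewness             # positive for a right-skewed distribution
m > median(salary)   # the mean typically exceeds the median under right skew
```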
3. We’re testing whether the mean MBA grade is equal to 74.
Null hypothesis (H₀): μ = 74
Alternative hypothesis (H₁): μ ≠ 74
#Testing the hypothesis with t.test
t.test(mba_data$"MBA Grade", mu = 74)
##
## One Sample t-test
##
## data: mba_data$"MBA Grade"
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
(mean(mba_data$`MBA Grade`) - 74) / sd(mba_data$`MBA Grade`)
## [1] 0.2658658
The sample mean MBA grade is 76.04, which is significantly higher than the hypothesized value of 74. The p-value of 0.00915 is below the conventional alpha level of 0.05, so we reject the null hypothesis and conclude that the difference is statistically significant. However, the effect size (Cohen’s d = 0.27) indicates that the practical significance of this difference is small, suggesting only a modest improvement in average performance.
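The effect size and the test statistic are directly linked: for a one-sample t-test, t equals Cohen's d multiplied by the square root of n. The sketch below recovers the reported t and p-value from the reported d, assuming n = 100 (implied by df = 99).

```r
# Values reported above
d <- 0.2658658   # Cohen's d = (mean - 74) / sd
n <- 100         # sample size (df = 99, so n = df + 1)

# For a one-sample test, t = d * sqrt(n)
t_stat <- d * sqrt(n)

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(t_stat, 4)
round(p_value, 5)
```

This shows why a small effect can still be statistically significant: with 100 observations, even d = 0.27 gives a t value above 2.6.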
#Task 3
library(readxl)
library(dplyr)
Apartments <- read_excel("Apartments.xlsx")
View(Apartments)
Description:
Apartments <- Apartments %>%
mutate(across(c(Balcony, Parking), as.factor))
t.test(Apartments$Price, mu = 1900)
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Null hypothesis (H₀): μ = 1900
Alternative hypothesis (H₁): μ ≠ 1900
Based on the one-sample t-test (t = 2.90, df = 84, p = 0.0047), we reject the null hypothesis at the 5% significance level. The sample mean of €2018.94 is significantly higher than €1900, and the 95% confidence interval (€1937.44 to €2100.44) lies entirely above the hypothesized value.
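The confidence interval can be rebuilt by hand from the reported mean, t value, and degrees of freedom. This is a sketch using only numbers printed in the t.test output. The variable names are introduced here for illustration.

```r
# Values reported by t.test(Apartments$Price, mu = 1900)
x_bar <- 2018.941
mu0 <- 1900
t_stat <- 2.9022
df <- 84

# Standard error recovered from t = (x_bar - mu0) / se
se <- (x_bar - mu0) / t_stat

# 95% confidence interval: x_bar +/- critical t value * se
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se
round(ci, 2)
```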
fit1 <- lm(Price ~ Age, data = Apartments)
cor(Apartments$Price, Apartments$Age)
## [1] -0.230255
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Coefficient of Correlation: The correlation coefficient of -0.230 indicates a weak negative relationship between apartment age and price. Older apartments tend to be slightly cheaper, but age alone is not a strong predictor of price. This is consistent with the regression model, which shows a small but statistically significant negative slope (p = 0.034) and a low R² value.
Regression Coefficient: The slope (-8.98) indicates that each additional year of age is associated with a decrease in apartment price of approximately €8.98, on average.
Coefficient of Determination: R² = 0.053, so only 5.3% of the variation in apartment prices is explained by age. Age is therefore not a strong predictor of Price on its own, and other factors (such as Distance, Parking, or Balcony) likely play a bigger role.
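In a simple regression the three quantities above are tied together: R² is the squared correlation, and the F-statistic is the squared t value of the slope. The sketch below checks this with the numbers printed for fit1.

```r
# Values reported above for fit1
r <- -0.230255        # correlation between Price and Age
slope <- -8.975       # estimated coefficient of Age
se_slope <- 4.164     # its standard error

r_squared <- r^2              # equals Multiple R-squared in a simple regression
t_slope <- slope / se_slope   # t value of the slope
f_stat <- t_slope^2           # equals the F-statistic when there is one predictor

round(c(r_squared = r_squared, t_slope = t_slope, f_stat = f_stat), 4)
```

Small differences from the printed output come from rounding of the reported coefficients.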
library(GGally)
ggpairs(Apartments[, c("Price", "Age", "Distance")])
Based on the scatterplot matrix and correlation coefficients, there is
no evidence of multicollinearity between the independent variables Age
and Distance. Their correlation is very low (0.043), so both can be
safely included in a multiple regression model.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(fit2)
## Age Distance
## 1.001845 1.001845
The Variance Inflation Factor (VIF) values for Age and Distance are both approximately 1.00, indicating no multicollinearity. This confirms that the predictors are not linearly dependent and can be safely included in the multiple regression model without inflating standard errors or compromising interpretability.
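With only two predictors, the VIF follows directly from their correlation: VIF = 1 / (1 - r²). The sketch below applies this to the rounded correlation of 0.043 reported in the scatterplot matrix.

```r
# Correlation between Age and Distance reported in the scatterplot matrix
r_age_distance <- 0.043

# With two predictors, VIF = 1 / (1 - r^2) for both of them
vif_by_hand <- 1 / (1 - r_age_distance^2)
round(vif_by_hand, 4)
```

The result agrees with the vif() output up to rounding of the correlation.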
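#Standardized residuals and Cook's distances for fit2
standardized_res <- rstandard(fit2)
cooks_d <- cooks.distance(fit2)
#Observations with an absolute standardized residual above 2
which(abs(standardized_res) > 2)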
## 33 38 53
## 33 38 53
clean_apartments <- Apartments[-c(33, 38, 53), ]
fit2_clean <- lm(Price ~ Age + Distance, data = clean_apartments)
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = clean_apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -404.0 -230.9 -51.4 190.6 504.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2455.768 73.296 33.505 < 2e-16 ***
## Age -6.011 3.086 -1.948 0.055 .
## Distance -23.543 2.665 -8.834 2.05e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared: 0.5179, Adjusted R-squared: 0.5057
## F-statistic: 42.44 on 2 and 79 DF, p-value: 3.042e-13
After removing the three observations with an absolute standardized residual above 2 (rows 33, 38, and 53), the multiple regression model was refitted on the remaining 82 apartments. The adjusted R² is 0.506, so about half of the variation in apartment prices is explained by age and distance. Distance is a strong and statistically significant predictor (p < 0.001), while age shows a weaker, borderline effect (p = 0.055). Because the removed observations were selected for their large residuals, part of the improvement in fit is expected by construction.
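The outlier screening step can be sketched on R's built-in cars data, used here only for illustration because the Apartments file is not reproduced in this report. The 4/n cutoff for Cook's distance is a common rule of thumb and an assumption of this sketch, not something used in the analysis above.

```r
# Built-in example data (stopping distance vs. speed), used only for illustration
fit_demo <- lm(dist ~ speed, data = cars)

# Flag observations by standardized residual and by Cook's distance
std_res <- rstandard(fit_demo)
cooks_d <- cooks.distance(fit_demo)
flag_res <- which(abs(std_res) > 2)
flag_cook <- which(cooks_d > 4 / nrow(cars))  # 4/n is a rule of thumb, not a strict cutoff

# Refit without the residual-based outliers and compare adjusted R-squared
cars_clean <- if (length(flag_res) > 0) cars[-flag_res, ] else cars
fit_demo_clean <- lm(dist ~ speed, data = cars_clean)
c(before = summary(fit_demo)$adj.r.squared,
  after = summary(fit_demo_clean)$adj.r.squared)
```

Reporting the fit before and after removal makes it easier to judge how much of the change comes from dropping the flagged rows.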
#Residuals vs. fitted values for the cleaned model
plot(scale(fitted(fit2_clean)), rstandard(fit2_clean),
  xlab = "Standardized Fitted Values",
  ylab = "Standardized Residuals",
  main = "Residuals vs. Standardized Fitted Values")
abline(h = 0, col = "green")
The residual plot for fit2_clean is used to check for heteroskedasticity. If the standardized residuals scatter randomly around zero with roughly constant spread across the fitted values, the assumption of homoscedasticity is supported, and the model’s standard errors and significance tests can be trusted.
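A visual check for heteroskedasticity can be backed up with a formal test. The sketch below computes the studentized (Koenker) version of the Breusch-Pagan test by hand in base R. This test is not part of the analysis above, and the built-in cars data stands in for the apartment data.

```r
# Built-in example data, used only for illustration
fit_demo <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the same predictor(s)
sq_res <- residuals(fit_demo)^2
aux <- lm(sq_res ~ speed, data = cars)

# Studentized (Koenker) Breusch-Pagan statistic: n * R-squared of the auxiliary fit
n <- nrow(cars)
k <- 1   # number of predictors in the auxiliary regression
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = k, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)
```

A small p-value would point to non-constant residual variance, while a large one is consistent with homoscedasticity.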
standardized_res <- rstandard(fit2_clean)
hist(standardized_res,
breaks = 20,
col = "skyblue",
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals")
qqnorm(standardized_res,
main = "Q-Q Plot of Standardized Residuals")
qqline(standardized_res, col = "red", lwd = 2)
shapiro.test(rstandard(fit2_clean))
##
## Shapiro-Wilk normality test
##
## data: rstandard(fit2_clean)
## W = 0.93911, p-value = 0.0007304
The Shapiro-Wilk test indicates a statistically significant deviation from normality (W = 0.939, p = 0.00073). However, visual inspection of the histogram and Q-Q plot suggests that the residuals are approximately symmetric and bell-shaped. Therefore, while the residuals are not perfectly normal, the deviation is likely mild, and with 82 observations the t- and F-tests of the regression are reasonably robust to it.
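The cleaned regression model shows that both apartment age and distance from the city center are associated with lower prices. Holding the other variable constant, each additional year of age lowers the expected price by about €6.01 (p = 0.055), and each additional unit of distance lowers it by about €23.54 (p < 0.001). Distance is therefore a strong and statistically significant predictor, while the effect of age is borderline. The model explains just over half of the variation in apartment prices (R² = 0.518). The Shapiro-Wilk test rejected exact normality of the residuals, so the normality assumption holds only approximately.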
Apartments$Parking <- as.factor(Apartments$Parking)
Apartments$Balcony <- as.factor(Apartments$Balcony)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1 196.168 62.868 3.120 0.00251 **
## Balcony1 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
The regression model shows that apartment price decreases with age (€6.80 per year) and with distance from the city center (€18.05 per unit of distance). Apartments with parking are on average €196 more expensive, holding the other variables constant, and this effect is statistically significant (p = 0.0025). The presence of a balcony has no statistically significant effect (€2, p = 0.974). The model explains about half of the variation in prices (R² = 0.50).
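To see how the dummy coefficients enter a prediction, the sketch below applies the fit3 coefficients by hand. It assumes that Parking1 and Balcony1 refer to apartments that have the feature (factor level 1). The apartment's characteristics and the helper `predict_price` are hypothetical.

```r
# Coefficients reported by summary(fit3)
b <- c(intercept = 2301.667, age = -6.799, distance = -18.045,
       parking = 196.168, balcony = 1.935)

# Predicted price for given characteristics (parking and balcony coded 0/1)
predict_price <- function(age, distance, parking, balcony) {
  unname(b["intercept"] + b["age"] * age + b["distance"] * distance +
    b["parking"] * parking + b["balcony"] * balcony)
}

# Hypothetical apartment: 10 years old, distance 5, no balcony
with_parking <- predict_price(10, 5, parking = 1, balcony = 0)
without_parking <- predict_price(10, 5, parking = 0, balcony = 0)
with_parking - without_parking   # equals the Parking1 coefficient
```

The difference between the two predictions is the Parking1 coefficient, whatever values are chosen for age and distance.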
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test shows that adding Parking and Balcony significantly improves the model fit (F = 4.87, p = 0.010). This means fit3 explains more variation in apartment prices than fit2, and the improvement is unlikely to be due to chance. Since Balcony is not individually significant in fit3, the improvement comes mainly from Parking. Therefore, fit3 is statistically preferred over fit2.
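The F value in the ANOVA table can be reproduced from the two residual sums of squares. The sketch below uses only numbers printed in the table. The variable names are introduced here for illustration.

```r
# Values reported in the ANOVA table
rss_fit2 <- 6720983; df_fit2 <- 82
rss_fit3 <- 5991088; df_fit3 <- 80

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
q <- df_fit2 - df_fit3
f_stat <- ((rss_fit2 - rss_fit3) / q) / (rss_fit3 / df_fit3)
p_value <- pf(f_stat, q, df_fit3, lower.tail = FALSE)

round(f_stat, 4)
round(p_value, 5)
```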
The overall F-statistic of fit3 is 20.03 on 4 and 80 degrees of freedom (p = 1.849e-11, well below 0.001), so the model is statistically significant as a whole. This means that at least one of the predictors (Age, Distance, Parking, or Balcony) has a non-zero effect on apartment price.
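The overall F-statistic is a function of R², the number of predictors, and the residual degrees of freedom. The sketch below recomputes it from the rounded values in the fit3 summary, so the result matches the printed 20.03 only up to rounding.

```r
# Values reported by summary(fit3)
r_squared <- 0.5004
k <- 4           # predictors: Age, Distance, Parking, Balcony
df_resid <- 80   # residual degrees of freedom

f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

round(f_stat, 2)
signif(p_value, 3)
```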
Apartments$fitted_fit3 <- fitted(fit3)
Apartments$residual_fit3 <- residuals(fit3)
Apartments[2, c("Price", "fitted_fit3", "residual_fit3")]
## # A tibble: 1 × 3
## Price fitted_fit3 residual_fit3
## <dbl> <dbl> <dbl>
## 1 2800 2357. 443.
The apartment in row 2 is priced at €2800, which is €442.59 above the model’s predicted value of €2357.41. This positive residual indicates that the apartment is more expensive than expected based on its age, distance, parking, and balcony, possibly due to factors not captured in the model.