R Takehome Homework

Task 1

1.

#Importing the data
bali_data <- read.csv("Bali Popular Destination for Tourist 2022 - Sheet1.csv", header = TRUE, stringsAsFactors = FALSE)
#Creating new variable (review density)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
bali_clean <- bali_data %>% 
  filter(complete.cases(.))
bali_clean <- bali_clean %>%
  mutate(ReviewDensity = round(Google.Reviews.Count / Google.Maps.Rating, 0)) %>%
  select(-Location, -Source, -Description)

head(bali_clean)
##                             Place Google.Maps.Rating Google.Reviews.Count
## 1                       Tanah Lot                4.6                75899
## 2                     Mount Batur                4.5                 2580
## 3                  Uluwatu Temple                4.6                28800
## 4              Ubud Monkey Forest                4.5                36099
## 5                       Goa Gajah                4.2                 6683
## 6 Jatiluwih Rice Terraces in Bali                4.7                 7798
##   Entrance.Fee ReviewDensity
## 1         Paid         16500
## 2         Paid           573
## 3         Paid          6261
## 4         Paid          8022
## 5         Paid          1591
## 6         Free          1659

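After cleaning, the dataset contains the destination name (Place), the Google Maps rating, the number of Google reviews, whether an entrance fee is charged (Paid or Free), and the new variable ReviewDensity (review count divided by rating, rounded to a whole number). These variables help us analyse which attractions are most popular and how an entrance fee may relate to tourist interest. Location was dropped, so comparisons between areas of Bali are not part of this analysis.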
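As a quick check on the derived variable, the sketch below recomputes review density for the first two rows printed above, using base R only. The data frame `toy` is a hypothetical stand-in for illustration, not the full dataset.

```r
# Hypothetical two-row stand-in for bali_clean (values copied from head() above)
toy <- data.frame(
  Place = c("Tanah Lot", "Mount Batur"),
  Google.Maps.Rating = c(4.6, 4.5),
  Google.Reviews.Count = c(75899, 2580)
)

# Review density = number of reviews divided by the rating, rounded to an integer
toy$ReviewDensity <- round(toy$Google.Reviews.Count / toy$Google.Maps.Rating, 0)
print(toy)
```

The two values agree with the ReviewDensity column shown for these destinations (16500 and 573).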

2.

#Renaming the variables
bali_clean <- bali_clean %>%
  rename(
    Destination = Place,
    `GMaps Rating` = Google.Maps.Rating,
    `GReviews Count` = Google.Reviews.Count,
    `Entrance Fee` = Entrance.Fee,
    `Review Density` = ReviewDensity)
#Create new datasets
paid_destinations <- bali_clean %>%
  filter(`Entrance Fee` == "Paid")
free_destinations <- bali_clean %>%
  filter(`Entrance Fee` == "Free")
head(paid_destinations)
##             Destination GMaps Rating GReviews Count Entrance Fee Review Density
## 1             Tanah Lot          4.6          75899         Paid          16500
## 2           Mount Batur          4.5           2580         Paid            573
## 3        Uluwatu Temple          4.6          28800         Paid           6261
## 4    Ubud Monkey Forest          4.5          36099         Paid           8022
## 5             Goa Gajah          4.2           6683         Paid           1591
## 6 Pura Ulun Danu Bratan          4.7          29178         Paid           6208
head(free_destinations)
##                       Destination GMaps Rating GReviews Count Entrance Fee
## 1 Jatiluwih Rice Terraces in Bali          4.7           7798         Free
## 2         Tenggalang Rice Terrace          4.4          33732         Free
## 3                  Seminyak Beach          4.5           3195         Free
## 4                  Nusa Dua Beach          4.6           6171         Free
## 5                      Kuta Beach          4.5          37663         Free
## 6  Pura Penataran Agung Lempuyang          4.3           6192         Free
##   Review Density
## 1           1659
## 2           7666
## 3            710
## 4           1342
## 5           8370
## 6           1440

3.

#Creating summary
summary(paid_destinations[, c("GMaps Rating", "GReviews Count", "Review Density")])
##   GMaps Rating   GReviews Count  Review Density   
##  Min.   :4.200   Min.   : 2422   Min.   :  563.0  
##  1st Qu.:4.500   1st Qu.: 4467   1st Qu.:  960.5  
##  Median :4.550   Median :11672   Median : 2508.5  
##  Mean   :4.533   Mean   :16656   Mean   : 3664.9  
##  3rd Qu.:4.600   3rd Qu.:18179   3rd Qu.: 4092.0  
##  Max.   :4.800   Max.   :75899   Max.   :16500.0

The mean gives the overall average across all paid destinations in this dataset. For example, a mean rating of 4.533 suggests that Bali’s paid spots are generally well reviewed. The high average review count of 16,656 indicates strong tourist engagement.

The median tells a slightly different story: while the mean review count is 16,656, the median is lower at 11,672, suggesting that a few destinations with extremely high review counts pull the average up.
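The gap between mean and median can be illustrated on a small subset. The sketch below uses only the six review counts printed by head(paid_destinations), so its numbers differ from the full-sample summary; it is meant to show the mechanism, not to reproduce the table.

```r
# Review counts of the six paid destinations printed by head(paid_destinations)
reviews <- c(75899, 2580, 28800, 36099, 6683, 29178)

mean_reviews   <- mean(reviews)    # pulled upward by Tanah Lot (75,899)
median_reviews <- median(reviews)  # middle of the sorted values

# A positive gap (mean above median) is a simple sign of right skew
skew_gap <- mean_reviews - median_reviews
print(c(mean = mean_reviews, median = median_reviews, gap = skew_gap))
```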

The minimum GMaps Rating among paid destinations is 4.2, indicating that even the lowest-rated paid attractions are still considered good by visitors. This suggests that paid spots in Bali consistently deliver satisfying experiences. For many visitors, these destinations may represent a first-time encounter with Balinese traditions, landscapes, or spiritual sites, which can elevate their sense of wonder and appreciation.

4.

#Making a scatterplot
library(ggplot2)
ggplot(paid_destinations, aes(x = `GMaps Rating`, y = `GReviews Count`)) +
  geom_point(color = "forestgreen", size = 2) +
  labs(title = "Reviews vs. Rating",
       x = "Google Maps Rating",
       y = "Google Reviews Count") +
  theme_classic()

The scatter plot reveals that while paid destinations in Bali tend to receive consistently high ratings, the number of reviews varies significantly. This suggests that visitor engagement is influenced by factors beyond satisfaction—such as location, cultural significance, or the age of the site. For example, Tanah Lot temple has existed long before the Dutch and Portuguese colonised Indonesia, and may have been featured in European travel literature for generations. Its historical prominence could explain its exceptionally high number of reviews. Notably, some destinations with moderate ratings still attract tens of thousands of reviews, highlighting their enduring popularity among tourists.

Task 2

1.

#Importing the Excel dataset
library(readxl)
mba_data <- read_excel("Business School.xlsx")

#Creating a bar chart to show the number of graduates from each major
library(ggplot2)

ggplot(mba_data, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
  geom_bar(color = "white") +
  scale_fill_manual(values = c(
    "Business" = "darkorange",    
    "Engineering" = "purple",  
    "Art" = "forestgreen",       
    "Science" = "aquamarine",      
    "Computer Science" = "lightblue",         
    "Finance" = "red"         
   )) +
  scale_y_continuous(breaks = seq(0, 40, by = 5)) + 
  labs(title = "Distribution of Undergraduate Degrees",
       x = "Undergraduate Degree",
       y = "Count") +
  theme_minimal() 

The bar chart shows that Business is the most common undergraduate degree among MBA students.

2.

#Creating histogram
library(ggplot2)
library(scales)

ggplot(mba_data, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 10000, fill = "darkolivegreen", color = "white") +
  scale_x_continuous(labels = comma) +  
  labs(title = "Distribution of Annual Salaries",
       x = "Annual Salary (USD)",
       y = "Number of Students") +
  theme_minimal()

The histogram illustrates the distribution of annual salaries among MBA students. Most earn between $75,000 and $125,000, with the highest concentration around $100,000. The distribution is skewed to the right, meaning that while the majority fall within a mid-range salary bracket, a smaller group earns significantly more (up to $300,000). This skew reflects the presence of high-earning outliers, potentially due to differences in industry placement, prior experience, or negotiation outcomes. Overall, the shape reveals a strong central cluster with a gradual extension toward higher salaries, highlighting income variability within the cohort.

3. We’re testing whether the mean MBA grade is equal to 74.

Null hypothesis (H₀): μ = 74

Alternative hypothesis (H₁): μ ≠ 74

#Testing the hypothesis with t.test
t.test(mba_data$"MBA Grade", mu = 74)
## 
##  One Sample t-test
## 
## data:  mba_data$"MBA Grade"
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
(mean(mba_data$`MBA Grade`) - 74) / sd(mba_data$`MBA Grade`)
## [1] 0.2658658

The sample mean MBA grade is 76.04, which is significantly higher than the hypothesized value of 74. The p-value of 0.00915 is below the conventional alpha level of 0.05, so we reject the null hypothesis and conclude that the difference is statistically significant. However, the effect size (Cohen’s d = 0.27) indicates that the practical significance of this difference is small, suggesting only a modest improvement in average performance.
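The effect size and the test statistic are linked: for a one-sample test, t equals Cohen’s d multiplied by the square root of n. The sketch below recomputes t and the two-sided p-value from the numbers printed above (n = 100 is inferred from df = 99); it is a consistency check, not a replacement for t.test.

```r
# Values copied from the output above
d <- 0.2658658   # Cohen's d = (sample mean - 74) / sample sd
n <- 100         # df = 99, so n = 100

t_stat <- d * sqrt(n)                        # one-sample t statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

print(c(t = t_stat, p = p_val))
```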

Task 3

Import the dataset Apartments.xlsx

library(readxl)
library(dplyr)
Apartments <- read_excel("Apartments.xlsx")
View(Apartments)

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments <- Apartments %>%
  mutate(across(c(Balcony, Parking), as.factor))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price, mu = 1900)
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Null hypothesis (H₀): μ = 1900

Alternative hypothesis (H₁): μ ≠ 1900

Based on the one-sample t-test (t = 2.90, df = 84, p = 0.005), we reject the null hypothesis at the 5% significance level. The sample mean of €2018.94 differs significantly from €1900, and the 95% confidence interval (€1937.44 to €2100.44) lies entirely above €1900, so the true mean price is likely higher than the hypothesized value.
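The confidence interval can be rebuilt from the printed test statistic. The sketch below backs out the standard error from t and then applies the usual formula, mean ± t-quantile × standard error; all inputs are copied from the t.test output above.

```r
# Values copied from the t.test output above
x_bar  <- 2018.941   # sample mean
mu0    <- 1900       # hypothesized mean
t_stat <- 2.9022     # reported t statistic
df     <- 84         # degrees of freedom (n - 1)

se <- (x_bar - mu0) / t_stat                   # standard error of the mean
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se    # 95% confidence interval

print(round(ci, 2))
```

Because the lower bound stays above 1900, the interval leads to the same decision as the p-value.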

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments)
cor(Apartments$Price, Apartments$Age)
## [1] -0.230255
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Coefficient of correlation: The correlation coefficient of -0.230 indicates a weak negative relationship between apartment age and price. While older apartments tend to be slightly cheaper, age alone is not a strong predictor of price. This is consistent with the regression model, which shows a small but statistically significant negative slope and a low R² value.

Regression coefficient: The slope (-8.98) indicates that for each additional year of age, the price per m² decreases by approximately €8.98, on average (p = 0.034).

Coefficient of determination: R² = 0.053, so only 5.3% of the variation in apartment prices is explained by age. Age is therefore not a strong predictor of price on its own; other factors (such as Distance, Parking or Balcony) likely play a bigger role.
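In a simple regression the coefficient of determination is the square of the correlation coefficient. The sketch below checks this with the printed values and also reproduces the adjusted R²; n = 85 is inferred from the 83 residual degrees of freedom.

```r
# Values copied from the output above
r <- -0.230255   # correlation between Price and Age
n <- 85          # 83 residual df + 2 estimated coefficients

r_squared     <- r^2                                       # R-squared of fit1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)   # one predictor

print(round(c(R2 = r_squared, adjR2 = adj_r_squared), 5))
```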

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity.

library(GGally)
ggpairs(Apartments[, c("Price", "Age", "Distance")])

Based on the scatterplot matrix and correlation coefficients, there is no evidence of multicollinearity between the independent variables Age and Distance. Their correlation is very low (0.043), so both can be safely included in a multiple regression model.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)

Check the multicollinearity with VIF statistics. Explain the findings.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(fit2)
##      Age Distance 
## 1.001845 1.001845

The Variance Inflation Factor (VIF) values for Age and Distance are both approximately 1.00, indicating no multicollinearity. This confirms that the predictors are not linearly dependent and can be safely included in the multiple regression model without inflating standard errors or compromising interpretability.
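With exactly two predictors, each VIF equals 1 / (1 - r²), where r is the correlation between the two predictors. The sketch below applies this to the rounded correlation of 0.043 reported from the scatterplot matrix, so the result matches vif(fit2) only up to rounding.

```r
# Correlation between Age and Distance, as reported from the scatterplot matrix
r_age_distance <- 0.043

# With two predictors, the R-squared of one on the other is simply r^2
vif_two_predictors <- 1 / (1 - r_age_distance^2)

print(round(vif_two_predictors, 4))
```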

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

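#Standardized residuals and Cook's distances for fit2
standardized_res <- rstandard(fit2)
cooks_d <- cooks.distance(fit2)
#Units with |standardized residual| > 2 are treated as potential outliers
which(abs(standardized_res) > 2)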
## 33 38 53 
## 33 38 53
clean_apartments <- Apartments[-c(33, 38, 53), ]

fit2_clean <- lm(Price ~ Age + Distance, data = clean_apartments)
summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = clean_apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -404.0 -230.9  -51.4  190.6  504.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2455.768     73.296  33.505  < 2e-16 ***
## Age           -6.011      3.086  -1.948    0.055 .  
## Distance     -23.543      2.665  -8.834 2.05e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared:  0.5179, Adjusted R-squared:  0.5057 
## F-statistic: 42.44 on 2 and 79 DF,  p-value: 3.042e-13

After removing the three observations with absolute standardized residuals above 2 (units 33, 38 and 53), the model was re-estimated on the remaining 82 apartments. The adjusted R² is 0.506, so about half of the variation in apartment prices is explained by age and distance. Distance is a strong and statistically significant predictor (p < 0.001), while age shows a weaker, borderline effect (p = 0.055). Cook’s distances were computed as well but were not used as a removal criterion.
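Cook’s distances can be screened alongside the standardized residuals. The sketch below is an illustration on R’s built-in mtcars data rather than the apartment data, and the 4/n cutoff is a common rule of thumb assumed here, not a threshold taken from the assignment.

```r
# Illustration on built-in data; the model and variables are stand-ins
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

std_res <- rstandard(fit_demo)        # standardized residuals
cooks   <- cooks.distance(fit_demo)   # influence of each unit on the fit

# Flag units that are outliers (|residual| > 2) or influential (D > 4/n)
flagged <- which(abs(std_res) > 2 | cooks > 4 / n)
print(flagged)

# Drop flagged rows; guard against the empty case, where x[-integer(0), ]
# would return zero rows instead of the full data
demo_clean <- if (length(flagged) > 0) mtcars[-flagged, ] else mtcars
```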

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

#Standardized residuals vs. standardized fitted values for the cleaned model
#(fit3 is not estimated until later, so the check uses fit2_clean)
plot(as.numeric(scale(fitted(fit2_clean))), rstandard(fit2_clean),
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Standardized Residuals vs. Standardized Fitted Values")
abline(h = 0, col = "green")

In this plot, heteroskedasticity would appear as a funnel shape, with the spread of the residuals widening or narrowing along the fitted values. The residuals show no such pattern: they are scattered randomly around zero with roughly constant variance, which supports the assumption of homoscedasticity. The model’s standard errors and significance tests can therefore be treated as valid.
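A visual check can be complemented by a formal test. The sketch below hand-computes Koenker’s studentized version of the Breusch-Pagan test in base R on the built-in mtcars data; the test itself, the stand-in model and the data are additions for illustration and are not part of the assignment output.

```r
# Stand-in model on built-in data (not the apartment data)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, e2 = residuals(fit_demo)^2)
aux_fit  <- lm(e2 ~ wt + hp, data = aux_data)

n <- nrow(aux_data)   # number of observations
k <- 2                # number of predictors in the auxiliary regression

# Studentized Breusch-Pagan statistic: n * R-squared, chi-squared with k df
lm_stat <- n * summary(aux_fit)$r.squared
p_value <- pchisq(lm_stat, df = k, lower.tail = FALSE)

print(c(statistic = lm_stat, p = p_value))
```

A small p-value would point to heteroskedasticity; a large one is consistent with constant error variance.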

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

standardized_res <- rstandard(fit2_clean)
hist(standardized_res,
     breaks = 20,
     col = "skyblue",
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals")

qqnorm(standardized_res,
       main = "Q-Q Plot of Standardized Residuals")
qqline(standardized_res, col = "red", lwd = 2)

shapiro.test(rstandard(fit2_clean))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(fit2_clean)
## W = 0.93911, p-value = 0.0007304

The Shapiro-Wilk test rejects the null hypothesis of normally distributed residuals at the 5% level (W = 0.939, p = 0.0007). The histogram and Q-Q plot nevertheless suggest that the residuals are roughly symmetric and bell-shaped, so the deviation appears mild. With 82 observations the coefficient estimates remain usable, but the t-tests and p-values should be read with some caution.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2_clean <- lm(Price ~ Age + Distance, data = clean_apartments)
summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = clean_apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -404.0 -230.9  -51.4  190.6  504.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2455.768     73.296  33.505  < 2e-16 ***
## Age           -6.011      3.086  -1.948    0.055 .  
## Distance     -23.543      2.665  -8.834 2.05e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared:  0.5179, Adjusted R-squared:  0.5057 
## F-statistic: 42.44 on 2 and 79 DF,  p-value: 3.042e-13

The intercept (2455.77) is the expected price per m² of a new apartment (Age = 0) located in the city center (Distance = 0). Age: each additional year of age lowers the expected price by €6.01 per m², holding distance constant; this effect is only borderline significant (p = 0.055). Distance: each additional kilometre from the city center lowers the expected price by €23.54 per m², holding age constant (p < 0.001). The model explains about 52% of the variation in apartment prices (R² = 0.518). Because the formal normality test was rejected, the inference relies on the residuals being only approximately normal.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

Apartments$Parking <- as.factor(Apartments$Parking)
Apartments$Balcony <- as.factor(Apartments$Balcony)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

The regression model shows that apartment price decreases with age (€6.80 per year) and with distance from the city center (€18.05 per km), holding the other variables constant. Apartments with parking are on average €196 more expensive than those without (p = 0.003), while a balcony has no statistically significant effect (€2, p = 0.974). The model explains about half of the variation in prices (R² = 0.50); its assumptions were not re-checked for fit3.

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test shows that adding Parking and Balcony significantly improves the model fit (p = 0.010). This means fit3 explains more variation in apartment prices than fit2, and the improvement is unlikely due to chance. Therefore, fit3 is statistically preferred over fit2.
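The F-statistic in the ANOVA table follows directly from the two residual sums of squares. The sketch below recomputes it from the printed values; small differences in the last digit come from rounding in the table.

```r
# Values copied from the ANOVA table above
rss_fit2 <- 6720983; df_fit2 <- 82   # restricted model (Age + Distance)
rss_fit3 <- 5991088; df_fit3 <- 80   # full model (adds Parking + Balcony)

q <- df_fit2 - df_fit3               # number of added coefficients

# F = (drop in RSS per added coefficient) / (residual variance of full model)
f_stat <- ((rss_fit2 - rss_fit3) / q) / (rss_fit3 / df_fit3)
p_val  <- pf(f_stat, q, df_fit3, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```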

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

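Parking1 (196.17): holding age, distance and balcony constant, apartments with parking are on average €196.17 per m² more expensive than apartments without parking (the reference category); the difference is statistically significant (p = 0.003). Balcony1 (1.94): apartments with a balcony are on average €1.94 per m² more expensive than those without, but this difference is not statistically significant (p = 0.974).

The F-statistic tests H₀: β(Age) = β(Distance) = β(Parking) = β(Balcony) = 0 against H₁: at least one of these coefficients differs from 0. With F = 20.03 on 4 and 80 degrees of freedom and a p-value < 0.001, we reject H₀: the model is statistically significant overall, meaning that at least one predictor helps to explain apartment price.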
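The overall F-statistic can also be recovered from R² alone. The sketch below uses the printed R² of fit3 with k = 4 slope coefficients and 80 residual degrees of freedom; because R² is rounded to four decimals, the result matches the output only approximately.

```r
# Values copied from summary(fit3) above
r2     <- 0.5004   # multiple R-squared
k      <- 4        # slope coefficients: Age, Distance, Parking1, Balcony1
df_res <- 80       # residual degrees of freedom

# Overall F: explained variation per coefficient over unexplained per df
f_stat <- (r2 / k) / ((1 - r2) / df_res)
p_val  <- pf(f_stat, k, df_res, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```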

Save fitted values and calculate the residual for apartment ID2.

Apartments$fitted_fit3 <- fitted(fit3)
Apartments$residual_fit3 <- residuals(fit3)
Apartments[2, c("Price", "fitted_fit3", "residual_fit3")]
## # A tibble: 1 × 3
##   Price fitted_fit3 residual_fit3
##   <dbl>       <dbl>         <dbl>
## 1  2800       2357.          443.

Apartment ID2 is priced €2800, which is €442.59 above the model’s predicted value of €2357.41. This positive residual indicates that the apartment is more expensive than expected based on its age, location, and amenities — possibly due to factors not captured in the model.