Task 1

Data source: https://www.kaggle.com/datasets/ramisashararnidhi/mall-customer-csv

mydata <- read.table("./Mall customers.csv",
                              header = TRUE, 
                              sep = ";",
                              dec=".")
head(mydata)
##   CustomerID Gender Age Annual_income Visit_duration Total_Spend
## 1          1   Male  19            15             39     191.014
## 2          2   Male  21            15             81     468.744
## 3          3 Female  20            16              6     337.465
## 4          4 Female  23            16             77     284.474
## 5          5 Female  31            17             40      66.329
## 6          6 Female  22            17             76      66.061

Description:

mydata$Used_Car <- sample(c(0, 1), size = nrow(mydata), replace = TRUE)

head(mydata)
##   CustomerID Gender Age Annual_income Visit_duration Total_Spend Used_Car
## 1          1   Male  19            15             39     191.014        1
## 2          2   Male  21            15             81     468.744        0
## 3          3 Female  20            16              6     337.465        0
## 4          4 Female  23            16             77     284.474        1
## 5          5 Female  31            17             40      66.329        0
## 6          6 Female  22            17             76      66.061        0

Description: Used_Car: Did a mall customer use a car to go to the mall? (1: Yes, 0: No)

mydata <- mydata[ ,-1]

head(mydata, 3)
##   Gender Age Annual_income Visit_duration Total_Spend Used_Car
## 1   Male  19            15             39     191.014        1
## 2   Male  21            15             81     468.744        0
## 3 Female  20            16              6     337.465        0
mydata[100,3] <- 45

tail(mydata, 3)
##     Gender Age Annual_income Visit_duration Total_Spend Used_Car
## 98  Female  27            60             50     376.285        1
## 99    Male  48            61             42     201.567        0
## 100   Male  20            45             49     183.962        1
mydata$Gender <- factor(mydata$Gender)
library(pastecs)
round(stat.desc(mydata[ , -1]), 2)
##                  Age Annual_income Visit_duration Total_Spend Used_Car
## nbr.val       100.00        100.00         100.00      100.00   100.00
## nbr.null        0.00          0.00           0.00        0.00    53.00
## nbr.na          0.00          0.00           0.00        0.00     0.00
## min            18.00         15.00           3.00       18.73     0.00
## max            70.00         61.00          99.00      498.98     1.00
## range          52.00         46.00          96.00      480.26     1.00
## sum          3975.00       3940.00        4993.00    24838.37    47.00
## median         36.50         41.00          50.00      237.74     0.00
## mean           39.75         39.40          49.93      248.38     0.47
## SE.mean         1.56          1.41           2.17       14.78     0.05
## CI.mean.0.95    3.10          2.79           4.30       29.34     0.10
## var           244.19        197.62         469.00    21859.06     0.25
## std.dev        15.63         14.06          21.66      147.85     0.50
## coef.var        0.39          0.36           0.43        0.60     1.07

The youngest customer visiting a mall was 18 years old, and was 52 years younger than than the oldest person visiting a mall. Half of the sample customers earn $41.000 annually or less, and half of the sample customers earn more than $41.000 annually. On average, customers spend 49.93 minutes at the mall, while most visits are -+21.66 minutes shorter or longer than 49.93 minutes.

#install.packages("modeest")

library(modeest)

mydata$Used_Car <- factor(mydata$Used_Car, levels = c("1", "0"))

mlv(mydata$Used_Car, method = "mfv")
## [1] 0
## Levels: 1 0

Most of the customers did not use a car to visit the mall.

hist(mydata$Age,
     main = "Distribution of age of mall customers",
     xlab = "Age in years",
     ylab = "Number of customers",
     breaks = seq(from = 0, to = 70, by = 5))

The distribution of age among mall customers is slightly multimodal, reaching peaks at 20-25 and 45-50 years of age and 15 customers in each range. There is a secondary peak of less than 15 customers aged 30 to 35.

library(car)

scatterplot(y = mydata$Total_Spend,
            x = mydata$Annual_income,
            ylab = "Total expense of a visit (in $)",
            xlab = "Annual income (in $1,000)",
            smooth = FALSE, 
            col = "orange")

The effect of total annual income on a total expense of a visit is slightly positive, meaning a larger annual income only slightly affects higher expenditure on a mall visit.

scatterplot(Total_Spend ~ Annual_income | Gender,
            ylab = "Total expense of a visit (in $)",
            xlab = "Annual income (in $1,000)",
            smooth = FALSE, 
            data = mydata)

The positive effect of higher income is more apparent for male customers, if their annual salary is higher, they tend to be less frugal when visiting the mall. The results for females are reversed, if their annual salary is higher, they tend to be slightly more frugal.

library(ggplot2)

ggplot(mydata, aes(x = Total_Spend, fill = Gender)) +
  geom_histogram(position = "dodge", binwidth = 50, colour = "gray") +
  xlab("Total expense of a visit (in $)") + 
  ylab("Frequency")

  labs(fill = "Gender")
## <ggplot2::labels> List of 1
##  $ fill: chr "Gender"

More male customers tend to spend less on average, compared to female customers whose purchases are more expensive.

Task 2

library(readxl)

Business_School <- read_excel("./Business School.xlsx")

Business_School <- as.data.frame(Business_School)

head(Business_School)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before) Employability (After)
## 1          1         Business            68.4      90.2              No                    252                   276
## 2          2 Computer Science            70.2      68.7             Yes                    101                   119
## 3          3          Finance            76.4      83.3              No                    401                   462
## 4          4         Business            82.6      88.7              No                    287                   342
## 5          5          Finance            76.9      75.4              No                    275                   347
## 6          6 Computer Science            83.3      82.1              No                    254                   313
##   Status Annual Salary
## 1 Placed        111000
## 2 Placed        107000
## 3 Placed        109000
## 4 Placed        148000
## 5 Placed        255500
## 6 Placed        103500
#install.packages("forcats")
library(ggplot2)

ggplot(Business_School, aes(x = reorder(`Undergrad Degree`, -table(`Undergrad Degree`)[`Undergrad Degree`]), 
                            fill = `Undergrad Degree`)) +
  geom_bar(colour = "black") +
  xlab("Undergraduate Degree") +
  ylab("Count of Students") +
  labs(fill = "Degree") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")

sum(Business_School$`Undergrad Degree` == "Business")
## [1] 35

The most common degree is a degree in business. There is 35 students in the MBA programme that got an undergraduate degree in business.

summary(Business_School$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

The total number of units in the sample is 100. The average annual salary of MBA students is $109,058. Half of the students make $103,500 and less annually, half of the students make more than that.

library(ggplot2)
#install.packages(scales)
library(scales)

ggplot(Business_School, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 15000, fill = "seagreen3", colour = "black") +
  xlab("Annual Salary ($)") +
  ylab("Count of Students") +
  ggtitle("Distribution of Annual Salary") +
  scale_x_continuous(labels = comma)

The distribution of annual salary among MBA students is asymetrical to the right, which means it has a positive skew of 2.22. It has outliers, both on the bottom and top of the salary range.

iqr_value <- IQR(Business_School$`Annual Salary`)

iqr_value
## [1] 36875

The central 50 % of the values of Annual Salaries are spread across a band that is 36,875 wide.

t.test(Business_School$`MBA Grade`, 
       mu = 74, 
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
#install.packages("effectsize")
library(effectsize)

effectsize::cohens_d(Business_School$`MBA Grade`, mu=74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.
effectsize::interpret_cohens_d(0.27, rules="sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)

We can reject the null hypothesis of the average MBA grade for this years students was 74, at the p-value of 0.009 and within the 95% confidence interval at alpha 5%.

This years students MBA grades are slightly higher than 74 (Cohen d = 0.27) indicating a minor difference.

Task 3

Import the dataset Apartments.xlsx

library(readxl)

Apartments <- read_excel("./Apartments.xlsx")

Apartments <- as.data.frame(Apartments)

head(Apartments, 3)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

Apartments$Balcony <- factor(Apartments$Balcony, 
                             levels = c(0, 1),
                             labels = c("No", "Yes"))
head(Apartments, 3)
##   Age Distance Price Parking Balcony
## 1   7       28  1640      No     Yes
## 2  18        1  2800     Yes      No
## 3   7       28  1660      No      No

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price, 
       mu = 1900, 
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941
library(effectsize)

effectsize::cohens_d(Apartments$Price, mu=1900)
## Cohen's d |       95% CI
## ------------------------
## 0.31      | [0.10, 0.53]
## 
## - Deviation from a difference of 1900.
effectsize::interpret_cohens_d(0.31, rules="sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)

We can conclude, that the average price of apartments was not 1900 eur per m2, and we can reject the null hypothesis with the p-value of 0.004. The 95% confidence interval does not include a price of 1900 eur per m2 and it suggests that the sample average price of a m2 of apartments is 2018.9. The effect size was small (Cohen’s d = 0.31).

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, 
           data = Apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(Apartments$Price, Apartments$Age)
## [1] -0.230255

The regression coefficient is 2185.5 meaning that newly built apartments (age=0) are expected to cost 2185.5 eur per m2. The coefficient of correlation is -0.23 meaning that the linear relationship between price and age is negative and weak. As age increases, price slightly decreases.
The coefficient of determination is 0.0530 meaning that only 5.3% of variability of price is explained by the linear effect of age. This means that age alone is not a strong predictor of price and other factors have a bigger effect, like availability of parking, a balcony and the distance from the city center.

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

scatterplotMatrix(~ Price + Age + Distance, 
                   data = Apartments,
                  smooth = FALSE)

library(Hmisc)
rcorr(as.matrix(Apartments[ , c(-4, -5)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

There is no multicolinearity issues among age and distance, as the values are spread out around the graph and are not close to the linear function.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

Because the VIF values are under 5 it means that no variable is perfectly explained by another variable, which means there is no perfect multicolinearity and therefore no need to remove certain variables.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$CooksD <- round(cooks.distance(fit2), 3)

hist(Apartments$StdResid, 
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     xlim = c(-3, 3),
     col = "orangered")

hist(Apartments$CooksD, 
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances",
     col = "skyblue")

head(Apartments[order(Apartments$StdResid),], 6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013
## 20  13        8  1800      No      No   -1.381  0.012
## 35  14       16  1660      No     Yes   -1.261  0.008
## 36  24        5  1830     Yes      No   -1.189  0.012
head(Apartments[order(-Apartments$CooksD),], 6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038

Because of the cut point for influential cases for this sample is 4/n=4/85= 0.047, we have influential units in rows 38, 55, 33, 53 and 22.

Apartments$ID <- seq_len(nrow(Apartments))

library(dplyr)
Apartments <- Apartments %>%
  filter(!(ID %in% c(38, 55, 33, 53, 22)))
fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)
hist(Apartments$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "New histogram of Cooks distances")

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Apartments$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = Apartments$StdResid, x = Apartments$StdFitted,
            ylab= "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE, 
            smooth = FALSE)

#install.packages("olsrr")
library(olsrr)
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

Heteroskedasticity may be present because the variance of the residuals increases as the fitted values increase. We tested the assumption with the Breusch Pagan test, which discovered that we cannot reject the null hypothesis of a constant variance, that means there is no strong evidence of heteroskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)
hist(Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "pink")

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.93418, p-value = 0.0004761

Standardized residuals are not normally distributed, they are asymetrical to the right. Using the Shapiro-Wilk test we can reject the null hypothesis of normally distributed with the p-value of 0.0004. This means that the standardized residuals are not normally distributed which affects the t-test and confidence intervals.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.732187

If distance from the city center is increased by 1 km, the price per m2 decreases by 24.06 eur on average at p<0.001 controlling for all other variables. If the age of an apartment in years is increased by 1 year, the price per m2 decreases by 8.67 eur on average at p<0.001 controlling for all other variables. 53.61% of variability can be explained by the linnear effect of age and distance. The predicted price for newly built apartments (age = 0) is 2502.46 eur per m2.

There is a strong linear relationship, meaning the variables age and distance predict price reasonably well.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = Apartments)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

We cannot reject the null hypothesis that adding balcony and parking does not significantly improve the model. This means that adding parking and balcony does not significantly improve the model and that age and price explain most of the variation in price.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = Apartments)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Given the age, distance and parking, apartments with a balcony are predicted to cost 6.032 eur more on average than apartments with no balcony. At the p-value of 0.9165 this result is not statistically significant at alpha=0.05. Given the age, distance and availability of a balcony, apartments with parking are predicted to cost 128.7 eur more on average than apartments wiht no parking. At the p-value of 0.037 this result is statistically significant at alpha=0.05.

The F-test is testing the significance of the whole model. The hypothesis being tested is H0: bAge = bDistance = bParkingYes = bBalconyYes = 0. The alternative hypothesis is that at least one of the coefficients does not equal 0.

Save fitted values and calculate the residual for apartment ID2.

Apartments$Fitted <- fitted(fit3)

Apartments$Residual <- residuals(fit3)

Apartments[2, c("Fitted", "Residual")]
##     Fitted Residual
## 2 2356.597 443.4026