Data source: https://www.kaggle.com/datasets/ramisashararnidhi/mall-customer-csv
mydata <- read.table("./Mall customers.csv",
header = TRUE,
sep = ";",
dec=".")
head(mydata)
## CustomerID Gender Age Annual_income Visit_duration Total_Spend
## 1 1 Male 19 15 39 191.014
## 2 2 Male 21 15 81 468.744
## 3 3 Female 20 16 6 337.465
## 4 4 Female 23 16 77 284.474
## 5 5 Female 31 17 40 66.329
## 6 6 Female 22 17 76 66.061
Description:
mydata$Used_Car <- sample(c(0, 1), size = nrow(mydata), replace = TRUE)
head(mydata)
## CustomerID Gender Age Annual_income Visit_duration Total_Spend Used_Car
## 1 1 Male 19 15 39 191.014 1
## 2 2 Male 21 15 81 468.744 0
## 3 3 Female 20 16 6 337.465 0
## 4 4 Female 23 16 77 284.474 1
## 5 5 Female 31 17 40 66.329 0
## 6 6 Female 22 17 76 66.061 0
Description: Used_Car: Did a mall customer use a car to go to the mall? (1: Yes, 0: No)
mydata <- mydata[ ,-1]
head(mydata, 3)
## Gender Age Annual_income Visit_duration Total_Spend Used_Car
## 1 Male 19 15 39 191.014 1
## 2 Male 21 15 81 468.744 0
## 3 Female 20 16 6 337.465 0
mydata[100,3] <- 45
tail(mydata, 3)
## Gender Age Annual_income Visit_duration Total_Spend Used_Car
## 98 Female 27 60 50 376.285 1
## 99 Male 48 61 42 201.567 0
## 100 Male 20 45 49 183.962 1
mydata$Gender <- factor(mydata$Gender)
library(pastecs)
round(stat.desc(mydata[ , -1]), 2)
## Age Annual_income Visit_duration Total_Spend Used_Car
## nbr.val 100.00 100.00 100.00 100.00 100.00
## nbr.null 0.00 0.00 0.00 0.00 53.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 18.00 15.00 3.00 18.73 0.00
## max 70.00 61.00 99.00 498.98 1.00
## range 52.00 46.00 96.00 480.26 1.00
## sum 3975.00 3940.00 4993.00 24838.37 47.00
## median 36.50 41.00 50.00 237.74 0.00
## mean 39.75 39.40 49.93 248.38 0.47
## SE.mean 1.56 1.41 2.17 14.78 0.05
## CI.mean.0.95 3.10 2.79 4.30 29.34 0.10
## var 244.19 197.62 469.00 21859.06 0.25
## std.dev 15.63 14.06 21.66 147.85 0.50
## coef.var 0.39 0.36 0.43 0.60 1.07
The youngest customer visiting a mall was 18 years old, and was 52 years younger than than the oldest person visiting a mall. Half of the sample customers earn $41.000 annually or less, and half of the sample customers earn more than $41.000 annually. On average, customers spend 49.93 minutes at the mall, while most visits are -+21.66 minutes shorter or longer than 49.93 minutes.
#install.packages("modeest")
library(modeest)
mydata$Used_Car <- factor(mydata$Used_Car, levels = c("1", "0"))
mlv(mydata$Used_Car, method = "mfv")
## [1] 0
## Levels: 1 0
Most of the customers did not use a car to visit the mall.
hist(mydata$Age,
main = "Distribution of age of mall customers",
xlab = "Age in years",
ylab = "Number of customers",
breaks = seq(from = 0, to = 70, by = 5))
The distribution of age among mall customers is slightly multimodal, reaching peaks at 20-25 and 45-50 years of age and 15 customers in each range. There is a secondary peak of less than 15 customers aged 30 to 35.
library(car)
scatterplot(y = mydata$Total_Spend,
x = mydata$Annual_income,
ylab = "Total expense of a visit (in $)",
xlab = "Annual income (in $1,000)",
smooth = FALSE,
col = "orange")
The effect of total annual income on a total expense of a visit is slightly positive, meaning a larger annual income only slightly affects higher expenditure on a mall visit.
scatterplot(Total_Spend ~ Annual_income | Gender,
ylab = "Total expense of a visit (in $)",
xlab = "Annual income (in $1,000)",
smooth = FALSE,
data = mydata)
The positive effect of higher income is more apparent for male customers, if their annual salary is higher, they tend to be less frugal when visiting the mall. The results for females are reversed, if their annual salary is higher, they tend to be slightly more frugal.
library(ggplot2)
ggplot(mydata, aes(x = Total_Spend, fill = Gender)) +
geom_histogram(position = "dodge", binwidth = 50, colour = "gray") +
xlab("Total expense of a visit (in $)") +
ylab("Frequency")
labs(fill = "Gender")
## <ggplot2::labels> List of 1
## $ fill: chr "Gender"
More male customers tend to spend less on average, compared to female customers whose purchases are more expensive.
library(readxl)
Business_School <- read_excel("./Business School.xlsx")
Business_School <- as.data.frame(Business_School)
head(Business_School)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before) Employability (After)
## 1 1 Business 68.4 90.2 No 252 276
## 2 2 Computer Science 70.2 68.7 Yes 101 119
## 3 3 Finance 76.4 83.3 No 401 462
## 4 4 Business 82.6 88.7 No 287 342
## 5 5 Finance 76.9 75.4 No 275 347
## 6 6 Computer Science 83.3 82.1 No 254 313
## Status Annual Salary
## 1 Placed 111000
## 2 Placed 107000
## 3 Placed 109000
## 4 Placed 148000
## 5 Placed 255500
## 6 Placed 103500
#install.packages("forcats")
library(ggplot2)
ggplot(Business_School, aes(x = reorder(`Undergrad Degree`, -table(`Undergrad Degree`)[`Undergrad Degree`]),
fill = `Undergrad Degree`)) +
geom_bar(colour = "black") +
xlab("Undergraduate Degree") +
ylab("Count of Students") +
labs(fill = "Degree") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
sum(Business_School$`Undergrad Degree` == "Business")
## [1] 35
The most common degree is a degree in business. There is 35 students in the MBA programme that got an undergraduate degree in business.
summary(Business_School$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
The total number of units in the sample is 100. The average annual salary of MBA students is $109,058. Half of the students make $103,500 and less annually, half of the students make more than that.
library(ggplot2)
#install.packages(scales)
library(scales)
ggplot(Business_School, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 15000, fill = "seagreen3", colour = "black") +
xlab("Annual Salary ($)") +
ylab("Count of Students") +
ggtitle("Distribution of Annual Salary") +
scale_x_continuous(labels = comma)
The distribution of annual salary among MBA students is asymetrical to the right, which means it has a positive skew of 2.22. It has outliers, both on the bottom and top of the salary range.
iqr_value <- IQR(Business_School$`Annual Salary`)
iqr_value
## [1] 36875
The central 50 % of the values of Annual Salaries are spread across a band that is 36,875 wide.
t.test(Business_School$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
#install.packages("effectsize")
library(effectsize)
effectsize::cohens_d(Business_School$`MBA Grade`, mu=74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
effectsize::interpret_cohens_d(0.27, rules="sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
We can reject the null hypothesis of the average MBA grade for this years students was 74, at the p-value of 0.009 and within the 95% confidence interval at alpha 5%.
This years students MBA grades are slightly higher than 74 (Cohen d = 0.27) indicating a minor difference.
library(readxl)
Apartments <- read_excel("./Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
head(Apartments, 3)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
Description:
Apartments$Parking <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(Apartments, 3)
## Age Distance Price Parking Balcony
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
library(effectsize)
effectsize::cohens_d(Apartments$Price, mu=1900)
## Cohen's d | 95% CI
## ------------------------
## 0.31 | [0.10, 0.53]
##
## - Deviation from a difference of 1900.
effectsize::interpret_cohens_d(0.31, rules="sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
We can conclude, that the average price of apartments was not 1900 eur per m2, and we can reject the null hypothesis with the p-value of 0.004. The 95% confidence interval does not include a price of 1900 eur per m2 and it suggests that the sample average price of a m2 of apartments is 2018.9. The effect size was small (Cohen’s d = 0.31).
fit1 <- lm(Price ~ Age,
data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(Apartments$Price, Apartments$Age)
## [1] -0.230255
The regression coefficient is 2185.5 meaning that newly built
apartments (age=0) are expected to cost 2185.5 eur per m2. The
coefficient of correlation is -0.23 meaning that the linear relationship
between price and age is negative and weak. As age increases, price
slightly decreases.
The coefficient of determination is 0.0530 meaning that only 5.3% of
variability of price is explained by the linear effect of age. This
means that age alone is not a strong predictor of price and other
factors have a bigger effect, like availability of parking, a balcony
and the distance from the city center.
library(car)
scatterplotMatrix(~ Price + Age + Distance,
data = Apartments,
smooth = FALSE)
library(Hmisc)
rcorr(as.matrix(Apartments[ , c(-4, -5)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
There is no multicolinearity issues among age and distance, as the values are spread out around the graph and are not close to the linear function.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Because the VIF values are under 5 it means that no variable is perfectly explained by another variable, which means there is no perfect multicolinearity and therefore no need to remove certain variables.
Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
xlim = c(-3, 3),
col = "orangered")
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances",
col = "skyblue")
head(Apartments[order(Apartments$StdResid),], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 53 7 2 1760 No Yes -2.152 0.066
## 13 12 14 1650 No Yes -1.499 0.013
## 72 12 14 1650 No No -1.499 0.013
## 20 13 8 1800 No No -1.381 0.012
## 35 14 16 1660 No Yes -1.261 0.008
## 36 24 5 1830 Yes No -1.189 0.012
head(Apartments[order(-Apartments$CooksD),], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 39 40 2 2400 No Yes 1.091 0.038
Because of the cut point for influential cases for this sample is 4/n=4/85= 0.047, we have influential units in rows 38, 55, 33, 53 and 22.
Apartments$ID <- seq_len(nrow(Apartments))
library(dplyr)
Apartments <- Apartments %>%
filter(!(ID %in% c(38, 55, 33, 53, 22)))
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "New histogram of Cooks distances")
Apartments$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = Apartments$StdResid, x = Apartments$StdFitted,
ylab= "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
#install.packages("olsrr")
library(olsrr)
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
Heteroskedasticity may be present because the variance of the residuals increases as the fitted values increase. We tested the assumption with the Breusch Pagan test, which discovered that we cannot reject the null hypothesis of a constant variance, that means there is no strong evidence of heteroskedasticity.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "pink")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.93418, p-value = 0.0004761
Standardized residuals are not normally distributed, they are asymetrical to the right. Using the Shapiro-Wilk test we can reject the null hypothesis of normally distributed with the p-value of 0.0004. This means that the standardized residuals are not normally distributed which affects the t-test and confidence intervals.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
If distance from the city center is increased by 1 km, the price per m2 decreases by 24.06 eur on average at p<0.001 controlling for all other variables. If the age of an apartment in years is increased by 1 year, the price per m2 decreases by 8.67 eur on average at p<0.001 controlling for all other variables. 53.61% of variability can be explained by the linnear effect of age and distance. The predicted price for newly built apartments (age = 0) is 2502.46 eur per m2.
There is a strong linear relationship, meaning the variables age and distance predict price reasonably well.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
We cannot reject the null hypothesis that adding balcony and parking does not significantly improve the model. This means that adding parking and balcony does not significantly improve the model and that age and price explain most of the variation in price.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Given the age, distance and parking, apartments with a balcony are predicted to cost 6.032 eur more on average than apartments with no balcony. At the p-value of 0.9165 this result is not statistically significant at alpha=0.05. Given the age, distance and availability of a balcony, apartments with parking are predicted to cost 128.7 eur more on average than apartments wiht no parking. At the p-value of 0.037 this result is statistically significant at alpha=0.05.
The F-test is testing the significance of the whole model. The hypothesis being tested is H0: bAge = bDistance = bParkingYes = bBalconyYes = 0. The alternative hypothesis is that at least one of the coefficients does not equal 0.
Apartments$Fitted <- fitted(fit3)
Apartments$Residual <- residuals(fit3)
Apartments[2, c("Fitted", "Residual")]
## Fitted Residual
## 2 2356.597 443.4026