mydataTask1 <- read.table("./Customer Purchasing Behaviors.csv",
header = TRUE,
sep = ",")
head(mydataTask1)
## UserId age AnnualIncome PurchaseAmount LoyaltyScore Region PurchaseFrequency
## 1 1 25 45000 200 4.5 North 12
## 2 2 34 55000 350 7.0 South 18
## 3 3 45 65000 500 8.0 West 22
## 4 4 22 30000 150 3.0 East 10
## 5 5 29 47000 220 4.8 North 13
## 6 6 41 61000 480 7.8 South 21
• Age: The age of the customer.
• Annual income: The customer’s annual income (in USD).
• Purchase amount: The total amount of purchases made by the customer (in USD).
• Loyalty score: The customer’s loyalty score (on a scale 1-10; 10 being highest).
• Region: The region where the customer lives (north, west, south, east).
• Purchase Frequency: Frequency of customer purchases (number of times per year).
mydataTask1$AnnualIncome1000 <- mydataTask1$AnnualIncome / 1000
head(mydataTask1)
## UserId age AnnualIncome PurchaseAmount LoyaltyScore Region PurchaseFrequency AnnualIncome1000
## 1 1 25 45000 200 4.5 North 12 45
## 2 2 34 55000 350 7.0 South 18 55
## 3 3 45 65000 500 8.0 West 22 65
## 4 4 22 30000 150 3.0 East 10 30
## 5 5 29 47000 220 4.8 North 13 47
## 6 6 41 61000 480 7.8 South 21 61
I created a new variable, which presents the Annual Income in 1000 USD.
colnames(mydataTask1) [2] <- "Age"
I renamed the variable “age” into “Age, for more cohesiveness.
mydataTask1.1 <- subset(mydataTask1, Age > 30)
head(mydataTask1.1)
## UserId Age AnnualIncome PurchaseAmount LoyaltyScore Region PurchaseFrequency AnnualIncome1000
## 2 2 34 55000 350 7.0 South 18 55
## 3 3 45 65000 500 8.0 West 22 65
## 6 6 41 61000 480 7.8 South 21 61
## 7 7 36 54000 400 6.5 West 19 54
## 9 9 50 70000 600 9.0 North 25 70
## 10 10 31 50000 320 5.5 South 17 50
I created a new database with the variable “Age” value over 30.
summary(mydataTask1[, c("Age", "AnnualIncome", "PurchaseAmount")])
## Age AnnualIncome PurchaseAmount
## Min. :22.00 Min. :30000 Min. :150.0
## 1st Qu.:31.00 1st Qu.:50000 1st Qu.:320.0
## Median :39.00 Median :59000 Median :440.0
## Mean :38.68 Mean :57408 Mean :425.6
## 3rd Qu.:46.75 3rd Qu.:66750 3rd Qu.:527.5
## Max. :55.00 Max. :75000 Max. :640.0
Age:
Min: The minimum age of customers from my data set is 22 years.
Max: The maximum age of customers from my data set is 55 years.
Annual income:
Median: Half of the customers’ annual income was lower than 59000 USD, and half of the customers’ annual income was higher.
3rd Quartile: 25 % of the customers’ annual income was higher than 66750 USD.
Purchase amount:
hist(mydataTask1$AnnualIncome,
main = "Distribution of Annual Income",
ylab = "Frequency",
xlab = "Annual Income (in USD)",
breaks = seq(30000, 80000, (10000)),
right = FALSE,
col = "magenta",
fill = "magenta4")
## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter
## Warning in axis(1, ...): "fill" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter
Explanation of the histogram for Annual Income: The customers’ annual income (in USD) has a normal distribution, which is slightly skewed to the left.
boxplot(mydataTask1$PurchaseAmount,
main = "Purchase Amount",
ylab = "USD",
col = "magenta",
border = "magenta4")
Explanation of the boxplot for Purchase Amount: We can see that there are no outliers; from the boxplot we can also see that the the median is 440 USD, it shows us the minimum (150 USD), the maximum (640 USD), the first quartile (320 USD) and the third quartile (527,5 USD).
library(readxl)
mydataTask2 <- read_xlsx("./Business School.xlsx")
head(mydataTask2)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade` `Work Experience` `Employability (Before)`
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1 Business 68.4 90.2 No 252
## 2 2 Computer Science 70.2 68.7 Yes 101
## 3 3 Finance 76.4 83.3 No 401
## 4 4 Business 82.6 88.7 No 287
## 5 5 Finance 76.9 75.4 No 275
## 6 6 Computer Science 83.3 82.1 No 254
## # ℹ 3 more variables: `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
library(ggplot2)
ggplot(mydataTask2, aes(x=`Undergrad Degree`)) +
geom_bar(colour = "magenta4", fill = "magenta") +
ylab("Frequency") +
xlab("Undergrad Degree")
From the histogram we can see that the most common Undergrad Degree among the MBA students is in Business.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:car':
##
## logit
describeBy(mydataTask2$`Annual Salary`)
## Warning in describeBy(mydataTask2$`Annual Salary`): no grouping variable requested
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22 9.41 4150.15
mydataTask2$`Annual Salary1000` <- mydataTask2$`Annual Salary` / 1000
library(ggplot2)
ggplot(mydataTask2, aes(x=`Annual Salary1000`))+
geom_histogram(color="magenta4", fill ="magenta")+
xlab("Annual Salary (in 1000 USD)")+
ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Explanation of the histogram for Annual Salary (in 1000 USD): we can see that the histogram is unimodal and it is skewed to the right. We can also notice that there are some outliers on the right, which we can see from the big gaps.
t.test(mydataTask2$`MBA Grade`,
mu=74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydataTask2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
H0: mu = 74
H1: mu ≠ 74
We can reject the null hypothesis at p = 0.00915, and accept H1, which means that the average of the MBA grades is different from 74.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d(mydataTask2$`MBA Grade`,
mu=74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
The value of Cohen’s d is 0.27, which means that there is a small effect size (95% confidence interval).
The value of Cohen’s d being 0.27, indicates that there is a small effect size with (95 % confidence). There is a modest difference between this year’s generation and last year’s generation.
library(readxl)
mydataTask3 <- read_excel("./Apartments.xlsx")
head(mydataTask3)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes
mydataTask3$Balcony <- factor(mydataTask3$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
mydataTask3$Parking <- factor(mydataTask3$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
t.test(mydataTask3$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydataTask3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0: mu = 1900
H1: mu ≠ 1900
We can reject the null hypothesis at p = 0.004731, and accept H1, which means that the average price of the apartments per m2 is different from 1900.
fit1 <- lm(Price ~ Age, data = mydataTask3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydataTask3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
If the age of an apartment increases by 1 year, the price per m2 decreases by 8.975 euros on average (p = 0.034).
The correlation coefficient is 0.23, which means that the linear correlation between price and age of an apartment is weak and positive.
The coefficient of determination is 0.053, which means that age explains only 5.3 % of the variability of price per m2.
scatterplotMatrix(mydataTask3[c("Price", "Age", "Distance")], smooth = FALSE)
Based on the matrix we can see that there is no problem with potential multicollinearity, because none of the scatter plots show a strong relationship.
fit2 <- lm(Price ~ Age + Distance, data = mydataTask3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataTask3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
All VIF statistics are below 5 and very close to 1, and the average VIF statistic is 1.002, which is very close to 1. This means that there is no multicollinearity between the variables.
mydataTask3$StdResid <- round(rstandard(fit2), 3)
hist(mydataTask3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "magenta",
fill = "magenta4")
## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter
## Warning in axis(1, ...): "fill" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter
From the histogram we can see that all values are between 3 and -3, so there are no outliers.
mydataTask3$CooksD <- round(cooks.distance(fit2), 3)
hist(mydataTask3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances",
col = "magenta",
fill = "magenta4")
## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter
## Warning in axis(1, ...): "fill" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter
From the Cook’s distance histogram we can see that at least 1 unit has a value that is to some extent greater than the values of other units, so we are checking (which we can see from the gap between 0.15 and 0.30).
head(mydataTask3[order(-mydataTask3$CooksD), "CooksD"], 10)
## # A tibble: 10 × 1
## CooksD
## <dbl>
## 1 0.32
## 2 0.104
## 3 0.069
## 4 0.066
## 5 0.061
## 6 0.038
## 7 0.037
## 8 0.034
## 9 0.032
## 10 0.03
If we print the first 10 units with the highest Cook’s distances, we can see that the units with the greatest impact are 0.320 and 0.104. There’s another gap between the 5th and 6th value, so we can decide to remove them as well.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydataTask3 <- mydataTask3 %>%
filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))
fit2 <- lm(Price ~ Age + Distance, data = mydataTask3)
mydataTask3$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y=mydataTask3$StdResid, x=mydataTask3$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The points in the scatterplot of the standardized residuals against the standardized fitted values are randomly distributed and in a horizontal band of constant variability, so there is no problem of heteroskedasticity.
hist(mydataTask3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "magenta",
fill = "magenta4")
## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter
## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter
## Warning in axis(1, ...): "fill" is not a graphical parameter
## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter
From the histogram of standardized residuals we can see that all values are between 3 and -3, so there are no outliers. The standardized residuals are slightly assimetrically distributed to the right.
shapiro.test(mydataTask3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataTask3$StdResid
## W = 0.93418, p-value = 0.0004761
The p-value is lower than 0.05, so we can reject the null hypothesis that the standardized residuals are normally distributed and we can conclude that errors are not distributed normally.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataTask3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
Age: If the age of an apartment increases by 1 year, the price per m2 decreases on average by 8.674 USD (p-value < 0.05), assuming distance remains constant.
Distance: If the distance from the city center increases by 1 km, the price per m2 decreases on average by 24.063 USD (p-value < 0.001), assuming age is constant.
Coefficient of determination = 0.5361, which means that 53.61 % of the variability in price per m2 is explained by the linear effect of age and distance.
Multiple correlation coefficient = 0.7322 , which means that the linear correlation between price, age and distance of the apartment is strong.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydataTask3)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
The results from the comparison of the two regression models show that the fit3 model does not fit the data better (p-value = 0.1135)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydataTask3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Parking: Given the values of the other explanatory variables, apartments with a parking space have on average a 128.700 USD higher price per m2 in comparison to the apartments without a parking space (p-value = 0.0376).
Balcony: we found no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not, because it is not statistically significant (p-value = 0.9165).
The F-statistic is testing H0: R-squared = 0.
mydataTask3$StdFittedValues <- fitted.values(fit3)
mydataTask3$StdResid <- residuals(fit3)
head(mydataTask3[ , colnames(mydataTask3) %in% c("ID", "Price", "StdFittedValues", "StdResid")])
## # A tibble: 6 × 3
## Price StdResid StdFittedValues
## <dbl> <dbl> <dbl>
## 1 1640 -88.6 1729.
## 2 2800 443. 2357.
## 3 1660 -62.6 1723.
## 4 1850 311. 1539.
## 5 1640 -349. 1989.
## 6 1770 -143. 1913.
Based on the estimated regression function, the expected price per m2 for this apartment is 2356.597 euros, whereas the actual price is 2800 euros, so the residual is 443.4 euros.