if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
Marketing_Campaign <- read_excel("Desktop/R Take Home Exam 2025/Task 1/Marketing Campaign.xlsx")
head(Marketing_Campaign,15)
## # A tibble: 15 × 16
## ID Age Job Marital Education Default Balance Housing Loan
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 58 management married tertiary no 2143 yes no
## 2 2 44 technician single secondary no 29 yes no
## 3 3 33 entrepreneur married secondary no 2 yes yes
## 4 4 47 blue-collar married unknown no 1506 yes no
## 5 5 33 unknown single unknown no 1 no no
## 6 6 35 management married tertiary no 231 yes no
## 7 7 28 management single tertiary no 447 yes yes
## 8 8 42 entrepreneur divorced tertiary yes 2 yes no
## 9 9 58 retired married primary no 121 yes no
## 10 10 43 technician single secondary no 593 yes no
## 11 11 41 admin. divorced secondary no 270 yes no
## 12 12 29 admin. single secondary no 390 yes no
## 13 13 53 technician married secondary no 6 yes no
## 14 14 58 technician married unknown no 71 yes no
## 15 15 57 services married secondary no 162 yes no
## # ℹ 7 more variables: Month_Contacted <chr>, Duration <dbl>, Campaign <dbl>,
## # Pdays <dbl>, Previous <dbl>, Poutcome <chr>, Y <chr>
LOGIC BEHIND THE DATA: Dataset comes from a Portuguese bank’s direct marketing campaigns from 2012. The campaigns were run over the phone: bank employees called clients to offer them financial products. The main product promoted was a term deposit (also called a time deposit or fixed deposit). There were 45211 clients contacted and 15 different variables recorded.
Age: Age of the client Job: Type of Job Marital: Marital status Education: Level of education Default: Credit in default (too late in payment of credit) Balance: Average yearly Balance in client’s bank account (in EUR) Housing: Has housing loan Loan: Has personal loan Month_Contacted: Month when the client was last contacted (during current campaign) Duration: Duration of the last contact (call) with the client (in seconds) Campaign: Number of contacts performed during this campaign and for this client Pdays: Number of days that passed by after the client was last contacted from a previous campaign (-1 means not previously contacted) Previous: Number of contacts performed before this campaign and for this client Poutcome: Outcome of the previous marketing campaign Y: Has the client subscribed a term deposit (marketing campaign)
Marketing_Campaign2 <- Marketing_Campaign |> rename(`Account Balance` = Balance) ##Change the name of variable to Account Balance
Marketing_Campaign2$BankName <- c(1) ## Add new f variable (as the last column) with value "1"
Marketing_Campaign2$BankNameF <- factor(Marketing_Campaign2$BankName,
levels = c(1),
labels = c("Portuguese Bank")) #Rename the new variable
Marketing_Campaign3 <- Marketing_Campaign[, -4] ##Delete fifth column
Marketing_Campaign4 <- na.omit(Marketing_Campaign) ## Delete rows with N/A values
summary(Marketing_Campaign)
## ID Age Job Marital
## Min. : 1 Min. :18.00 Length:45211 Length:45211
## 1st Qu.:11304 1st Qu.:33.00 Class :character Class :character
## Median :22606 Median :39.00 Mode :character Mode :character
## Mean :22606 Mean :40.94
## 3rd Qu.:33908 3rd Qu.:48.00
## Max. :45211 Max. :95.00
## Education Default Balance Housing
## Length:45211 Length:45211 Min. : -8019 Length:45211
## Class :character Class :character 1st Qu.: 72 Class :character
## Mode :character Mode :character Median : 448 Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## Loan Month_Contacted Duration Campaign
## Length:45211 Length:45211 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 103.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.2 Mean : 2.764
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :63.000
## Pdays Previous Poutcome Y
## Min. : -1.0 Min. : 0.0000 Length:45211 Length:45211
## 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character Class :character
## Median : -1.0 Median : 0.0000 Mode :character Mode :character
## Mean : 40.2 Mean : 0.5803
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
summary(Marketing_Campaign2)
## ID Age Job Marital
## Min. : 1 Min. :18.00 Length:45211 Length:45211
## 1st Qu.:11304 1st Qu.:33.00 Class :character Class :character
## Median :22606 Median :39.00 Mode :character Mode :character
## Mean :22606 Mean :40.94
## 3rd Qu.:33908 3rd Qu.:48.00
## Max. :45211 Max. :95.00
## Education Default Account Balance Housing
## Length:45211 Length:45211 Min. : -8019 Length:45211
## Class :character Class :character 1st Qu.: 72 Class :character
## Mode :character Mode :character Median : 448 Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## Loan Month_Contacted Duration Campaign
## Length:45211 Length:45211 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 103.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.2 Mean : 2.764
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :63.000
## Pdays Previous Poutcome Y
## Min. : -1.0 Min. : 0.0000 Length:45211 Length:45211
## 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character Class :character
## Median : -1.0 Median : 0.0000 Mode :character Mode :character
## Mean : 40.2 Mean : 0.5803
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
## BankName BankNameF
## Min. :1 Portuguese Bank:45211
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
summary(Marketing_Campaign3)
## ID Age Job Education
## Min. : 1 Min. :18.00 Length:45211 Length:45211
## 1st Qu.:11304 1st Qu.:33.00 Class :character Class :character
## Median :22606 Median :39.00 Mode :character Mode :character
## Mean :22606 Mean :40.94
## 3rd Qu.:33908 3rd Qu.:48.00
## Max. :45211 Max. :95.00
## Default Balance Housing Loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## Month_Contacted Duration Campaign Pdays
## Length:45211 Min. : 0.0 Min. : 1.000 Min. : -1.0
## Class :character 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.: -1.0
## Mode :character Median : 180.0 Median : 2.000 Median : -1.0
## Mean : 258.2 Mean : 2.764 Mean : 40.2
## 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.: -1.0
## Max. :4918.0 Max. :63.000 Max. :871.0
## Previous Poutcome Y
## Min. : 0.0000 Length:45211 Length:45211
## 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Mode :character Mode :character
## Mean : 0.5803
## 3rd Qu.: 0.0000
## Max. :275.0000
summary(Marketing_Campaign4)
## ID Age Job Marital
## Min. : 1 Min. :18.00 Length:45211 Length:45211
## 1st Qu.:11304 1st Qu.:33.00 Class :character Class :character
## Median :22606 Median :39.00 Mode :character Mode :character
## Mean :22606 Mean :40.94
## 3rd Qu.:33908 3rd Qu.:48.00
## Max. :45211 Max. :95.00
## Education Default Balance Housing
## Length:45211 Length:45211 Min. : -8019 Length:45211
## Class :character Class :character 1st Qu.: 72 Class :character
## Mode :character Mode :character Median : 448 Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## Loan Month_Contacted Duration Campaign
## Length:45211 Length:45211 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 103.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.2 Mean : 2.764
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :63.000
## Pdays Previous Poutcome Y
## Min. : -1.0 Min. : 0.0000 Length:45211 Length:45211
## 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character Class :character
## Median : -1.0 Median : 0.0000 Mode :character Mode :character
## Mean : 40.2 Mean : 0.5803
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
library(psych)
describe(Marketing_Campaign[,c(-1,-3,-4,-5,-9,-10,-12,-13,-14,-15)])
## vars n mean sd median trimmed mad min max range
## Age 1 45211 40.94 10.62 39 40.25 10.38 18 95 77
## Default* 2 45211 1.02 0.13 1 1.00 0.00 1 2 1
## Balance 3 45211 1362.27 3044.77 448 767.21 664.20 -8019 102127 110146
## Housing* 4 45211 1.56 0.50 2 1.57 0.00 1 2 1
## Duration 5 45211 258.16 257.53 180 210.87 137.88 0 4918 4918
## Y* 6 45211 1.12 0.32 1 1.02 0.00 1 2 1
## skew kurtosis se
## Age 0.68 0.32 0.05
## Default* 7.24 50.49 0.00
## Balance 8.36 140.73 14.32
## Housing* -0.22 -1.95 0.00
## Duration 3.14 18.15 1.21
## Y* 2.38 3.68 0.00
I ALSO WANTED TO GET DATA FROM THIS INPUT BUT R CANNOT FIND STAT.DESC: “round(stat.desc(Marketing_Campaign[,c(-1,-3,-4,-5,-9,-10,-12,-13,-14,-15)]),1)”
From the descriptive statistic we remove the variables we do not tend to analyze. - to analyze the AGE of clients variable, we can see that the youngest client is 18 years old and the oldest 95 years old. Furthermore we can conclude that the bank has at least one client between the ages 18 and 95, as there are 77 distinct values. Average Age is 40,94, median 40,5. 50% of clients are older than 39 years. - Balance variable shows us that average balance on client’s account is 1362 eur, where 50% of clients have less than 448 eur on the account. Clients have in total 61,5 MIO eur in this bank. I would like to point out also std.dev (how spread out the values are around the mean), since it is quiet large. On average clients’ balances differ by 3044 eur from the mean. - Duration of a last call with client (in seconds): average call duration is 258sec (above 4 minutes), where 50% of calls lasted less or equal to 180 sec (3 minutes). Std.deviation is also big here, 257,5, almost the same as the mean, meaning that average call duration difffers about 257,5 seconds from the mean.
hist(Marketing_Campaign$Age,
breaks = seq(15,100,2), #Each bar accounts for span of 2 years of age
xlab = "Age",
main = "Histogram of Age",
right = FALSE)
From the next Graph below we can see that units are visibly asymetric to the right. We had to include also negative balances on the x axis. The graph isn’t perfect, as I didn’t find solution to draw all bars from 0 up (probably of too high differences in numbers). Comments also in the code.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(Marketing_Campaign, aes(x = `Balance`)) +
geom_histogram(binwidth = 500, fill = "green", color = "grey") + #Each Bar represents a range of 500eur in Account balances
scale_y_log10() + # For better understanding, we need to log this scale, otherwise, the majority of data would not be visible
labs(title = "Histogram of Account Balance",
x = "Account Balance (EUR)",
y = "Frequency (log scale)") +
theme_minimal()
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
The next graph shows correlation between Default (not repaying the debt) and account balance. We have removed extreme balances above 30000eur and below -5000eur. We can see that clients who have defaulted also have much lower balance on average (most below 0eur). we can assume that correlation between account balance and default risk is high.
Marketing_Campaign5 <- Marketing_Campaign %>%
filter(Balance >= -5000 & Balance <= 30000) # Create new dataset without extreme balances
ggplot(Marketing_Campaign5, aes(x = Default, y = Balance)) +
geom_boxplot() +
xlab("Default") #Graphical distribution of Balances by Default in a boxplot
At the next graph we test Clients in deafult by Job (which job accounts for the most Defaults). We had to create a new data set in the beginning to ease the further steps. There are also explainatory comment in the code. The results show that the Job title with the highest amount of Defaults is “Blue-Collar”, then the “Management” and at the third place “Technician”. The least “defaulted” are students.
library(ggplot2)
library(dplyr)
# Count number of defaults per job
defaults_by_job <- Marketing_Campaign %>%
filter(Default == "yes") %>%
count(Job, sort = TRUE)
# Bar plot
ggplot(defaults_by_job, aes(x = reorder(defaults_by_job$Job, n), y= defaults_by_job$n )) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + # WE Flip x and y for better readability
labs(title = "Number of Clients in Default by Job",
x = "Job",
y = "Number of Defaults") +
theme_minimal()
## Warning: Use of `defaults_by_job$Job` is discouraged.
## ℹ Use `Job` instead.
## Warning: Use of `defaults_by_job$n` is discouraged.
## ℹ Use `n` instead.
library(readxl)
Business_School <- read_excel("Desktop/R Take Home Exam 2025/Task 2/Business School.xlsx")
header = TRUE
sep = ";"
dec = ","
head(Business_School)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
data <- read_excel("Desktop/R Take Home Exam 2025/Task 2/Business School.xlsx")
summary(Business_School)
## Student ID Undergrad Degree Undergrad Grade MBA Grade
## Min. : 1.00 Length:100 Min. : 61.20 Min. :58.14
## 1st Qu.: 25.75 Class :character 1st Qu.: 71.47 1st Qu.:71.14
## Median : 50.50 Mode :character Median : 76.65 Median :76.38
## Mean : 50.50 Mean : 76.90 Mean :76.04
## 3rd Qu.: 75.25 3rd Qu.: 81.70 3rd Qu.:82.15
## Max. :100.00 Max. :100.00 Max. :95.00
## Work Experience Employability (Before) Employability (After)
## Length:100 Min. :101.0 Min. :119.0
## Class :character 1st Qu.:245.8 1st Qu.:312.0
## Mode :character Median :256.8 Median :435.6
## Mean :257.9 Mean :422.7
## 3rd Qu.:261.0 3rd Qu.:529.0
## Max. :421.0 Max. :631.0
## Status Annual Salary
## Length:100 Min. : 20000
## Class :character 1st Qu.: 87125
## Mode :character Median :103500
## Mean :109058
## 3rd Qu.:124000
## Max. :340000
names(Business_School)
## [1] "Student ID" "Undergrad Degree" "Undergrad Grade"
## [4] "MBA Grade" "Work Experience" "Employability (Before)"
## [7] "Employability (After)" "Status" "Annual Salary"
ggplot(data, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "darkgreen") +
labs(title = "Distribution of Undergraduate Degrees",
x = "Undergraduate Degree",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
data %>%
count(`Undergrad Degree`, sort = TRUE)
## # A tibble: 5 × 2
## `Undergrad Degree` n
## <chr> <int>
## 1 Business 35
## 2 Computer Science 25
## 3 Finance 25
## 4 Engineering 9
## 5 Art 6
The most common degree is in Business, with 35 students
library(psych) #SHORTER ORDER FUNCTION
describe(data$`Annual Salary`)
## vars n mean sd median trimmed mad min max range skew
## X1 1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
## kurtosis se
## X1 9.41 4150.15
ggplot(data, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 8000, fill = "orange", color = "black") +
scale_x_continuous(labels = scales::comma) +
labs(title = "Distribution of Annual Salary",
x = "Annual Salary",
y = "Count of Students") +
theme_minimal()
From the graph shown, I assume that distribution of units is
asymetrically distributed to the right.
t.test(Business_School$`MBA Grade`, mu=74)
##
## One Sample t-test
##
## data: Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
t.test showed that p-value is below 0,05 (5%), which indicates that we can reject the H0 hypothesis as the arithmetic mean of MBA grade is higher than 74. With 95% confidence we can say, that the average MBA Grade is between 74,51 and 77,56.
library(readxl)
Apartments <- read_excel("Desktop/R Take Home Exam 2025/Task 3/Apartments.xlsx")
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Apartments$ParkingF <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$BalconyF <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(Apartments)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony ParkingF BalconyF
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(Apartments[c("Age", "Distance", "Price", "Parking", "Balcony")]), 2)
## Age Distance Price Parking Balcony
## nbr.val 85.00 85.00 85.00 85.00 85.00
## nbr.null 0.00 0.00 0.00 42.00 48.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 1.00 1.00 1400.00 0.00 0.00
## max 45.00 45.00 2820.00 1.00 1.00
## range 44.00 44.00 1420.00 1.00 1.00
## sum 1577.00 1209.00 171610.00 43.00 37.00
## median 18.00 12.00 1950.00 1.00 0.00
## mean 18.55 14.22 2018.94 0.51 0.44
## SE.mean 1.05 1.23 40.98 0.05 0.05
## CI.mean.0.95 2.09 2.45 81.50 0.11 0.11
## var 93.96 129.44 142764.34 0.25 0.25
## std.dev 9.69 11.38 377.84 0.50 0.50
## coef.var 0.52 0.80 0.19 0.99 1.15
I have displayed some descriptive statistics of the variables used in the Task
t.test(Apartments$Price, mu=1900)
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
With the use of t-test, the model checked all of the 84 rows (data units) We can reject the Hypothesis, as the average Price of apartments is not equal to 1900, but rather 2018,941 eur. With a 95% certainty we can say that the average price of apartments lies between 1937,44 eur and 2100,44eur.
fit1 <-lm(Apartments$Price ~ Apartments$Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Apartments$Price ~ Apartments$Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Apartments$Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
corr_coef <- cor(Apartments$Age, Apartments$Price)
print(corr_coef)
## [1] -0.230255
Regression function: Price= 2185,455 - 8,975 x Age The function shows us that if Age is zero, the average price of apartments is 2185,455 eur. When we increase by 1, the price falls by 8,975 eur.
Multiple R-squared statistic that Age variable explains 5,302% of variability in Price of apartments.
The p-value test showed that we can with 95% certainty say that coefficients are not equal to 0. Therefore, we can reject the null hypothesis (H0: p<0,05). Age does impact the Price of apartments.
The correlation between Price and age is negative, which means that age negatively influences the average prices. Higher the age, lower the price. However, the correlation (influence) is not strong -> (-0,23).
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
vars <- Apartments[, c("Price", "Age", "Distance")]
scatterplotMatrix(vars, ,
smooth = FALSE)
The graph/ picture shows us multicollinearity between explanatory
variables. We look at the graphs in the first row. The upper, middle
picture (graph) shows that higher age negatively impacts the price
(older the apartments, lower the price). The upper right graph shows the
impact of Distance (from the city center) on the Price, where bigger
distance means lower prices for apartments.
The slope of upper right graph is falling faster than the slope of upper middle graph, which means that Distance has stronger impact on Price, than Age.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
P-value of F-statistic shows us that we can reject the H0 hypothesis, as the number is lower than 0,05. At least one of the predictors/ explanatory variables (probably distance from the data above) has a meaningful effect on price.
Price and Age together explain 43,96% of average price of the apartments.
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF function is used to evaluate the strength of the correlation between the explanatory variables. The higher the VIF statistic, the more strongly the variable is related to other explanatory variables.
Because VIF statistics is close to 1 for both explanatory variables Age and Distance, there are no multicolinearity concerns. The problem occurs, if VIF statistics equals >5. Only then we should not include them in the model.
Apartments$std_resid <- round(rstandard(fit2),3)
hist(Apartments$std_resid,
xlim = c(-3,3),
ylim = c(0,20),
main = "Histogram of standardized residuals",
xlab = "standardized residuals",
col = "lightgreen")
Based on the distribution of standardized residuals, we try to predict
the distribution of errors in the population.
-From the picture we assume that residuals are slightly asymmetrically distributed to the right. We could test this also with shapiro test.
cooks_d <- cooks.distance(fit2)
Apartments$CooksD <- round(cooks.distance(fit2),3) ### First two rows are here to define "CooksD" variable
head(Apartments[order(-Apartments$std_resid), c("Distance", "Price","Age", "CooksD","std_resid")],)
## # A tibble: 6 × 5
## Distance Price Age CooksD std_resid
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 45 2180 5 0.32 2.58
## 2 11 2790 2 0.069 2.05
## 3 1 2800 18 0.03 1.78
## 4 1 2800 18 0.03 1.78
## 5 2 2820 8 0.037 1.66
## 6 1 2810 10 0.032 1.60
cooks_d <- cooks.distance(fit2)
Apartments$CooksD <- round(cooks.distance(fit2),3)
hist(Apartments$CooksD,
main = "Histogram of Cooks Distances",
xlab = "Cooks Distances",
col = "yellow")
Using the Cook’s Distance, we can spot units with a high impact on the
estimated regression function and remove them (similar to outliers).
Cook’s distance is a value above 0, where a higher number means a larger
impact.
From the graph and “head” code, we can see that there is one apartment that stands out with a value of 0,32, therefore too high influence. This is apartment no. 38 and it has to be removed. We can remove it with function “filter”
head(Apartments[order(-Apartments$CooksD), c("Distance", "Price","Age", "CooksD","std_resid")],)
## # A tibble: 6 × 5
## Distance Price Age CooksD std_resid
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 45 2180 5 0.32 2.58
## 2 37 1740 43 0.104 1.44
## 3 11 2790 2 0.069 2.05
## 4 2 1760 7 0.066 -2.15
## 5 3 2540 37 0.061 1.58
## 6 2 2400 40 0.038 1.09
With this function we can remove problematic units:
library(dplyr) Apartments <- Apartments %>% filter(!CooksD %in% (0.320))
std_fitted <- as.numeric(scale(fitted(fit2)))
Apartments$stdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(x=Apartments$stdFitted, y=Apartments$std_resid,
xlab = "Standardized Fitted Values",
ylab = "Standardised Residuals",
main = "Heteroskedasticity Check",
boxplots = FALSE,
regLine = FALSE,
smooth= FALSE,
)
The points should be randomly distributed in a horizontal band of constant variability. Heteroskedasticity occurs, if the variability changes, and it affects the reliability of the estimated standard errors. We also test it with the use of Breuch-Pagan test, below.
From the graph (picture), we can see random distribution in a horizontal band. We assume that Homoskedasticity is not violated (The variance of errors is constant). We can confirm our assumptions with Breuch-pagan test. We cannot reject H0 hypothesis (the variance is constant), because p-value is above 0,05.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
hist(Apartments$std_resid,
xlim = c(-3,3),
main = "Histogram of standardized residuals",
xlab = "standardized residuals",
ylab = "Density",
prob = TRUE, ## MEANS THAT YOU ADJUST Y AXIS TO THE RED CURVE
col = "lightgreen")
curve(dnorm(x, mean = mean(Apartments$std_resid), sd = sd (Apartments$std_resid)),
col = "red",
lwd = 2,
add = TRUE)
We drew the same graph that was drawn in the previous question. To this graph I added red curve, which simulates normal distribution. We assume that standardized residuals are not normally distributed.
We can formally test the distribution of Standardized Residuals with Shapiro test.
shapiro.test(Apartments$std_resid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$std_resid
## W = 0.95303, p-value = 0.003645
Shapiro test shows p-value of 0,0021, which is below our value of 0,05. Therefore, we can reject the H0: standardized residuals are normally distributed.
With the 95% certainty we assume that Standardized residuals are not normally distributed.
Apartments <- Apartments[!(Apartments$CooksD == 0.320),] # ! removes rows with 0,320 values
fit2 <- lm(Price ~ Age + Distance, data = Apartments) # lm states as linear regression model
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
As done in few steps above, we can again exclude the unit with Cooks Distance 0,320. After that we calculate fit2 again and summarize the results.
From the output, we can read new regression function: “Price = 2456.076 - 6.464 x Age - 22.955 x Distance”. We can see that Age and Distance “negatively” impact the price. Higher the Age and Bigger the Distance to the city center, the cheaper is the average price of Apartments. If age is 0 and everything else stays the same, the price is 2456,076 eur. In other case, if Age is 1 and everything else stays the same, the price drops by 6,464. For every unit of Age rise, price drops by 6,464. The same happens with Distance, with a difference that Price with every additional unit of Distance drops by 22,955.
H0 Hypothesis we were testing (H0= coefficients are equal to 0) can be rejected with a 95% probability, as p-value is below 0,05.
R-squared tells us that 48,38% of Price variability is explained by Age and Distance variables together.
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
The function above provides us with coefficient of correlation. The number 0,6955 stands for strong and positive correlation between Age, Price and Distance.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFYes 167.531 62.864 2.665 0.00933 **
## BalconyFYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
From p-value of F-statistics we reject the null hypothesis (H0: explanatory variables/ predictors have no effect) with 95% certainty, as p-value stands below 0,05. We conclude that explainatory variables together significantly improve the model compared to introducing no predictors.
At least one of the explanatory variables among Age, Distance, Parking and Balcony has a meaningful effect on price.
Price and Age together explain 52,7% of average price of the apartments.
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject the null hypothesis (H0: fit2 model is more suitable), as p-value is less than 0,05. We can confirm that fit3 model gives us more relevant result.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFYes 167.531 62.864 2.665 0.00933 **
## BalconyFYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Regression coefficient for Parking shows that, if apartment includes Parking and everything else is unchanged, it is on average 167,53 eur more expensive, than without a balcony. Regression coefficient for Balcony shows that, if apartment possesses a Balcony and everything else stays unchanged, it is on average 15,2 eur cheaper. However, balcony variable has no significant effect.
F-statistic tests whether the regression model as a whole explains a significant portion of the variation in apartment prices. H0: β1 = β2 = β3 = β4 = 0 (none of the 4 predictors or explanatory variables have any effect on the price) H1: At least one βi ≠ 0 (At least one predictor or explanatory variable explains a significant portion of the variation in price) The Result shows that modul is highly significant, as p-value is below 0,05 (and even 0,001)
Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
round(Apartments[2,12],3)
## # A tibble: 1 × 1
## Residuals
## <dbl>
## 1 428.
The residual for apartment ID2 tells us that the actual price for this apartment is by 427.8 eur higher than the estimated value from the regression.