HOMEWORK STATISTICS

NIK GREGORIČ

TASK 1

if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readxl)
Marketing_Campaign <- read_excel("Desktop/R Take Home Exam 2025/Task 1/Marketing Campaign.xlsx")

head(Marketing_Campaign,15)

## # A tibble: 15 × 16
##       ID   Age Job          Marital  Education Default Balance Housing Loan 
##    <dbl> <dbl> <chr>        <chr>    <chr>     <chr>     <dbl> <chr>   <chr>
##  1     1    58 management   married  tertiary  no         2143 yes     no   
##  2     2    44 technician   single   secondary no           29 yes     no   
##  3     3    33 entrepreneur married  secondary no            2 yes     yes  
##  4     4    47 blue-collar  married  unknown   no         1506 yes     no   
##  5     5    33 unknown      single   unknown   no            1 no      no   
##  6     6    35 management   married  tertiary  no          231 yes     no   
##  7     7    28 management   single   tertiary  no          447 yes     yes  
##  8     8    42 entrepreneur divorced tertiary  yes           2 yes     no   
##  9     9    58 retired      married  primary   no          121 yes     no   
## 10    10    43 technician   single   secondary no          593 yes     no   
## 11    11    41 admin.       divorced secondary no          270 yes     no   
## 12    12    29 admin.       single   secondary no          390 yes     no   
## 13    13    53 technician   married  secondary no            6 yes     no   
## 14    14    58 technician   married  unknown   no           71 yes     no   
## 15    15    57 services     married  secondary no          162 yes     no   
## # ℹ 7 more variables: Month_Contacted <chr>, Duration <dbl>, Campaign <dbl>,
## #   Pdays <dbl>, Previous <dbl>, Poutcome <chr>, Y <chr>

1. Explain the data set (the variables used in the analysis).

LOGIC BEHIND THE DATA: Dataset comes from a Portuguese bank’s direct marketing campaigns from 2012. The campaigns were run over the phone: bank employees called clients to offer them financial products. The main product promoted was a term deposit (also called a time deposit or fixed deposit). There were 45211 clients contacted and 15 different variables recorded.

Age: Age of the client Job: Type of Job Marital: Marital status Education: Level of education Default: Credit in default (too late in payment of credit) Balance: Average yearly Balance in client’s bank account (in EUR) Housing: Has housing loan Loan: Has personal loan Month_Contacted: Month when the client was last contacted (during current campaign) Duration: Duration of the last contact (call) with the client (in seconds) Campaign: Number of contacts performed during this campaign and for this client Pdays: Number of days that passed by after the client was last contacted from a previous campaign (-1 means not previously contacted) Previous: Number of contacts performed before this campaign and for this client Poutcome: Outcome of the previous marketing campaign Y: Has the client subscribed a term deposit (marketing campaign)

2. Perform some data manipulations (create new variable, delete some units due to missing data, rename variables, create new data.frame based on conditions, etc.).

Marketing_Campaign2 <- Marketing_Campaign |> rename(`Account Balance` = Balance) ##Change the name of variable to Account Balance

Marketing_Campaign2$BankName <- c(1) ## Add new f variable (as the last column) with value "1"
Marketing_Campaign2$BankNameF <- factor(Marketing_Campaign2$BankName, 
                                    levels = c(1), 
                                    labels = c("Portuguese Bank")) #Rename the new variable

Marketing_Campaign3 <- Marketing_Campaign[, -4] ##Delete fifth column
Marketing_Campaign4 <- na.omit(Marketing_Campaign) ## Delete rows with N/A values

summary(Marketing_Campaign)

##        ID             Age            Job              Marital         
##  Min.   :    1   Min.   :18.00   Length:45211       Length:45211      
##  1st Qu.:11304   1st Qu.:33.00   Class :character   Class :character  
##  Median :22606   Median :39.00   Mode  :character   Mode  :character  
##  Mean   :22606   Mean   :40.94                                        
##  3rd Qu.:33908   3rd Qu.:48.00                                        
##  Max.   :45211   Max.   :95.00                                        
##   Education           Default             Balance         Housing         
##  Length:45211       Length:45211       Min.   : -8019   Length:45211      
##  Class :character   Class :character   1st Qu.:    72   Class :character  
##  Mode  :character   Mode  :character   Median :   448   Mode  :character  
##                                        Mean   :  1362                     
##                                        3rd Qu.:  1428                     
##                                        Max.   :102127                     
##      Loan           Month_Contacted       Duration         Campaign     
##  Length:45211       Length:45211       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 103.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.2   Mean   : 2.764  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :63.000  
##      Pdays          Previous          Poutcome              Y            
##  Min.   : -1.0   Min.   :  0.0000   Length:45211       Length:45211      
##  1st Qu.: -1.0   1st Qu.:  0.0000   Class :character   Class :character  
##  Median : -1.0   Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   : 40.2   Mean   :  0.5803                                        
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                                        
##  Max.   :871.0   Max.   :275.0000

summary(Marketing_Campaign2)

##        ID             Age            Job              Marital         
##  Min.   :    1   Min.   :18.00   Length:45211       Length:45211      
##  1st Qu.:11304   1st Qu.:33.00   Class :character   Class :character  
##  Median :22606   Median :39.00   Mode  :character   Mode  :character  
##  Mean   :22606   Mean   :40.94                                        
##  3rd Qu.:33908   3rd Qu.:48.00                                        
##  Max.   :45211   Max.   :95.00                                        
##   Education           Default          Account Balance    Housing         
##  Length:45211       Length:45211       Min.   : -8019   Length:45211      
##  Class :character   Class :character   1st Qu.:    72   Class :character  
##  Mode  :character   Mode  :character   Median :   448   Mode  :character  
##                                        Mean   :  1362                     
##                                        3rd Qu.:  1428                     
##                                        Max.   :102127                     
##      Loan           Month_Contacted       Duration         Campaign     
##  Length:45211       Length:45211       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 103.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.2   Mean   : 2.764  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :63.000  
##      Pdays          Previous          Poutcome              Y            
##  Min.   : -1.0   Min.   :  0.0000   Length:45211       Length:45211      
##  1st Qu.: -1.0   1st Qu.:  0.0000   Class :character   Class :character  
##  Median : -1.0   Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   : 40.2   Mean   :  0.5803                                        
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                                        
##  Max.   :871.0   Max.   :275.0000                                        
##     BankName           BankNameF    
##  Min.   :1   Portuguese Bank:45211  
##  1st Qu.:1                          
##  Median :1                          
##  Mean   :1                          
##  3rd Qu.:1                          
##  Max.   :1

summary(Marketing_Campaign3)

##        ID             Age            Job             Education        
##  Min.   :    1   Min.   :18.00   Length:45211       Length:45211      
##  1st Qu.:11304   1st Qu.:33.00   Class :character   Class :character  
##  Median :22606   Median :39.00   Mode  :character   Mode  :character  
##  Mean   :22606   Mean   :40.94                                        
##  3rd Qu.:33908   3rd Qu.:48.00                                        
##  Max.   :45211   Max.   :95.00                                        
##    Default             Balance         Housing              Loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##  Month_Contacted       Duration         Campaign          Pdays      
##  Length:45211       Min.   :   0.0   Min.   : 1.000   Min.   : -1.0  
##  Class :character   1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.: -1.0  
##  Mode  :character   Median : 180.0   Median : 2.000   Median : -1.0  
##                     Mean   : 258.2   Mean   : 2.764   Mean   : 40.2  
##                     3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.: -1.0  
##                     Max.   :4918.0   Max.   :63.000   Max.   :871.0  
##     Previous          Poutcome              Y            
##  Min.   :  0.0000   Length:45211       Length:45211      
##  1st Qu.:  0.0000   Class :character   Class :character  
##  Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   :  0.5803                                        
##  3rd Qu.:  0.0000                                        
##  Max.   :275.0000

summary(Marketing_Campaign4)

##        ID             Age            Job              Marital         
##  Min.   :    1   Min.   :18.00   Length:45211       Length:45211      
##  1st Qu.:11304   1st Qu.:33.00   Class :character   Class :character  
##  Median :22606   Median :39.00   Mode  :character   Mode  :character  
##  Mean   :22606   Mean   :40.94                                        
##  3rd Qu.:33908   3rd Qu.:48.00                                        
##  Max.   :45211   Max.   :95.00                                        
##   Education           Default             Balance         Housing         
##  Length:45211       Length:45211       Min.   : -8019   Length:45211      
##  Class :character   Class :character   1st Qu.:    72   Class :character  
##  Mode  :character   Mode  :character   Median :   448   Mode  :character  
##                                        Mean   :  1362                     
##                                        3rd Qu.:  1428                     
##                                        Max.   :102127                     
##      Loan           Month_Contacted       Duration         Campaign     
##  Length:45211       Length:45211       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 103.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.2   Mean   : 2.764  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :63.000  
##      Pdays          Previous          Poutcome              Y            
##  Min.   : -1.0   Min.   :  0.0000   Length:45211       Length:45211      
##  1st Qu.: -1.0   1st Qu.:  0.0000   Class :character   Class :character  
##  Median : -1.0   Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   : 40.2   Mean   :  0.5803                                        
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                                        
##  Max.   :871.0   Max.   :275.0000

3. Present the DESCRIPTIVE statistics for the selected variables and explain at least 3 sample statistics (mean, median, etc.).

library(psych)
describe(Marketing_Campaign[,c(-1,-3,-4,-5,-9,-10,-12,-13,-14,-15)])

##          vars     n    mean      sd median trimmed    mad   min    max  range
## Age         1 45211   40.94   10.62     39   40.25  10.38    18     95     77
## Default*    2 45211    1.02    0.13      1    1.00   0.00     1      2      1
## Balance     3 45211 1362.27 3044.77    448  767.21 664.20 -8019 102127 110146
## Housing*    4 45211    1.56    0.50      2    1.57   0.00     1      2      1
## Duration    5 45211  258.16  257.53    180  210.87 137.88     0   4918   4918
## Y*          6 45211    1.12    0.32      1    1.02   0.00     1      2      1
##           skew kurtosis    se
## Age       0.68     0.32  0.05
## Default*  7.24    50.49  0.00
## Balance   8.36   140.73 14.32
## Housing* -0.22    -1.95  0.00
## Duration  3.14    18.15  1.21
## Y*        2.38     3.68  0.00

I ALSO WANTED TO GET DATA FROM THIS INPUT BUT R CANNOT FIND STAT.DESC: “round(stat.desc(Marketing_Campaign[,c(-1,-3,-4,-5,-9,-10,-12,-13,-14,-15)]),1)”

From the descriptive statistic we remove the variables we do not tend to analyze. - to analyze the AGE of clients variable, we can see that the youngest client is 18 years old and the oldest 95 years old. Furthermore we can conclude that the bank has at least one client between the ages 18 and 95, as there are 77 distinct values. Average Age is 40,94, median 40,5. 50% of clients are older than 39 years. - Balance variable shows us that average balance on client’s account is 1362 eur, where 50% of clients have less than 448 eur on the account. Clients have in total 61,5 MIO eur in this bank. I would like to point out also std.dev (how spread out the values are around the mean), since it is quiet large. On average clients’ balances differ by 3044 eur from the mean. - Duration of a last call with client (in seconds): average call duration is 258sec (above 4 minutes), where 50% of calls lasted less or equal to 180 sec (3 minutes). Std.deviation is also big here, 257,5, almost the same as the mean, meaning that average call duration difffers about 257,5 seconds from the mean.

4. Graph the distribution of the variables using histograms, scatterplots, and/or boxplots. Explain the results.

hist(Marketing_Campaign$Age,
     breaks = seq(15,100,2), #Each bar accounts for span of 2 years of age 
     xlab = "Age",
     main = "Histogram of Age",
     right = FALSE)

From the next Graph below we can see that units are visibly asymetric to the right. We had to include also negative balances on the x axis. The graph isn’t perfect, as I didn’t find solution to draw all bars from 0 up (probably of too high differences in numbers). Comments also in the code.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(Marketing_Campaign, aes(x = `Balance`)) +
  geom_histogram(binwidth = 500, fill = "green", color = "grey") + #Each Bar represents a range of 500eur in Account balances 
  scale_y_log10() +  # For better understanding, we need to log this scale, otherwise, the majority of data would not be visible
  labs(title = "Histogram of Account Balance", 
       x = "Account Balance (EUR)",
       y = "Frequency (log scale)") +
  theme_minimal()

## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

The next graph shows correlation between Default (not repaying the debt) and account balance. We have removed extreme balances above 30000eur and below -5000eur. We can see that clients who have defaulted also have much lower balance on average (most below 0eur). we can assume that correlation between account balance and default risk is high.

Marketing_Campaign5 <- Marketing_Campaign %>%
  filter(Balance >= -5000 & Balance <= 30000) # Create new dataset without extreme balances


ggplot(Marketing_Campaign5, aes(x = Default, y = Balance)) + 
  geom_boxplot() + 
  xlab("Default") #Graphical distribution of Balances by Default in a boxplot

At the next graph we test Clients in deafult by Job (which job accounts for the most Defaults). We had to create a new data set in the beginning to ease the further steps. There are also explainatory comment in the code. The results show that the Job title with the highest amount of Defaults is “Blue-Collar”, then the “Management” and at the third place “Technician”. The least “defaulted” are students.

library(ggplot2)
library(dplyr)

# Count number of defaults per job
defaults_by_job <- Marketing_Campaign %>%
  filter(Default == "yes") %>%
  count(Job, sort = TRUE)

# Bar plot
ggplot(defaults_by_job, aes(x = reorder(defaults_by_job$Job, n), y= defaults_by_job$n ))  +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # WE Flip x and y for better readability
  
  labs(title = "Number of Clients in Default by Job",
       x = "Job",
       y = "Number of Defaults") +
  theme_minimal()

## Warning: Use of `defaults_by_job$Job` is discouraged.
## ℹ Use `Job` instead.

## Warning: Use of `defaults_by_job$n` is discouraged.
## ℹ Use `n` instead.

TASK 2

1. Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common?

library(readxl)
Business_School <- read_excel("Desktop/R Take Home Exam 2025/Task 2/Business School.xlsx")
header = TRUE
sep = ";"
dec = ","
head(Business_School)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

data <- read_excel("Desktop/R Take Home Exam 2025/Task 2/Business School.xlsx")

summary(Business_School)

##    Student ID     Undergrad Degree   Undergrad Grade    MBA Grade    
##  Min.   :  1.00   Length:100         Min.   : 61.20   Min.   :58.14  
##  1st Qu.: 25.75   Class :character   1st Qu.: 71.47   1st Qu.:71.14  
##  Median : 50.50   Mode  :character   Median : 76.65   Median :76.38  
##  Mean   : 50.50                      Mean   : 76.90   Mean   :76.04  
##  3rd Qu.: 75.25                      3rd Qu.: 81.70   3rd Qu.:82.15  
##  Max.   :100.00                      Max.   :100.00   Max.   :95.00  
##  Work Experience    Employability (Before) Employability (After)
##  Length:100         Min.   :101.0          Min.   :119.0        
##  Class :character   1st Qu.:245.8          1st Qu.:312.0        
##  Mode  :character   Median :256.8          Median :435.6        
##                     Mean   :257.9          Mean   :422.7        
##                     3rd Qu.:261.0          3rd Qu.:529.0        
##                     Max.   :421.0          Max.   :631.0        
##     Status          Annual Salary   
##  Length:100         Min.   : 20000  
##  Class :character   1st Qu.: 87125  
##  Mode  :character   Median :103500  
##                     Mean   :109058  
##                     3rd Qu.:124000  
##                     Max.   :340000

names(Business_School)

## [1] "Student ID"             "Undergrad Degree"       "Undergrad Grade"       
## [4] "MBA Grade"              "Work Experience"        "Employability (Before)"
## [7] "Employability (After)"  "Status"                 "Annual Salary"

ggplot(data, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "darkgreen") +
  labs(title = "Distribution of Undergraduate Degrees",
       x = "Undergraduate Degree",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

data %>%
  count(`Undergrad Degree`, sort = TRUE)

## # A tibble: 5 × 2
##   `Undergrad Degree`     n
##   <chr>              <int>
## 1 Business              35
## 2 Computer Science      25
## 3 Finance               25
## 4 Engineering            9
## 5 Art                    6

The most common degree is in Business, with 35 students

2. Show the descriptive statistics of the Annual Salary and its distribution with the histogram (use the ggplot function). Describe the distribution.

library(psych) #SHORTER ORDER FUNCTION
describe(data$`Annual Salary`)

##    vars   n   mean       sd median  trimmed     mad   min    max  range skew
## X1    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
##    kurtosis      se
## X1     9.41 4150.15

ggplot(data, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 8000, fill = "orange", color = "black") +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Distribution of Annual Salary",
       x = "Annual Salary",
       y = "Count of Students") +
  theme_minimal()

From the graph shown, I assume that distribution of units is asymetrically distributed to the right.

3. Test the following hypothesis: 𝐻0:𝜇MBA Grade=74. Explain the result and interpret the effect size.

t.test(Business_School$`MBA Grade`, mu=74)

## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

t.test showed that p-value is below 0,05 (5%), which indicates that we can reject the H0 hypothesis as the arithmetic mean of MBA grade is higher than 74. With 95% confidence we can say, that the average MBA Grade is between 74,51 and 77,56.

TASK 3

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("Desktop/R Take Home Exam 2025/Task 3/Apartments.xlsx")
head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$ParkingF <- factor(Apartments$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

Apartments$BalconyF <- factor(Apartments$Balcony,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))
head(Apartments)

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingF BalconyF
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>   
## 1     7       28  1640       0       1 No       Yes     
## 2    18        1  2800       1       0 Yes      No      
## 3     7       28  1660       0       0 No       No      
## 4    28       29  1850       0       1 No       Yes     
## 5    18       18  1640       1       1 Yes      Yes     
## 6    28       12  1770       0       1 No       Yes

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

round(stat.desc(Apartments[c("Age", "Distance", "Price", "Parking", "Balcony")]), 2)

##                  Age Distance     Price Parking Balcony
## nbr.val        85.00    85.00     85.00   85.00   85.00
## nbr.null        0.00     0.00      0.00   42.00   48.00
## nbr.na          0.00     0.00      0.00    0.00    0.00
## min             1.00     1.00   1400.00    0.00    0.00
## max            45.00    45.00   2820.00    1.00    1.00
## range          44.00    44.00   1420.00    1.00    1.00
## sum          1577.00  1209.00 171610.00   43.00   37.00
## median         18.00    12.00   1950.00    1.00    0.00
## mean           18.55    14.22   2018.94    0.51    0.44
## SE.mean         1.05     1.23     40.98    0.05    0.05
## CI.mean.0.95    2.09     2.45     81.50    0.11    0.11
## var            93.96   129.44 142764.34    0.25    0.25
## std.dev         9.69    11.38    377.84    0.50    0.50
## coef.var        0.52     0.80      0.19    0.99    1.15

I have displayed some descriptive statistics of the variables used in the Task

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price, mu=1900)

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

With the use of t-test, the model checked all of the 84 rows (data units) We can reject the Hypothesis, as the average Price of apartments is not equal to 1900, but rather 2018,941 eur. With a 95% certainty we can say that the average price of apartments lies between 1937,44 eur and 2100,44eur.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <-lm(Apartments$Price ~ Apartments$Age, data = Apartments)
summary(fit1)

## 
## Call:
## lm(formula = Apartments$Price ~ Apartments$Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2185.455     87.043  25.108   <2e-16 ***
## Apartments$Age   -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

corr_coef <- cor(Apartments$Age, Apartments$Price)
print(corr_coef)

## [1] -0.230255

Regression function: Price= 2185,455 - 8,975 x Age The function shows us that if Age is zero, the average price of apartments is 2185,455 eur. When we increase by 1, the price falls by 8,975 eur.

Multiple R-squared statistic that Age variable explains 5,302% of variability in Price of apartments.

The p-value test showed that we can with 95% certainty say that coefficients are not equal to 0. Therefore, we can reject the null hypothesis (H0: p<0,05). Age does impact the Price of apartments.

The correlation between Price and age is negative, which means that age negatively influences the average prices. Higher the age, lower the price. However, the correlation (influence) is not strong -> (-0,23).

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

vars <- Apartments[, c("Price", "Age", "Distance")]
scatterplotMatrix(vars, ,
                  smooth = FALSE)

The graph/ picture shows us multicollinearity between explanatory variables. We look at the graphs in the first row. The upper, middle picture (graph) shows that higher age negatively impacts the price (older the apartments, lower the price). The upper right graph shows the impact of Distance (from the city center) on the Price, where bigger distance means lower prices for apartments.

The slope of upper right graph is falling faster than the slope of upper middle graph, which means that Distance has stronger impact on Price, than Age.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

P-value of F-statistic shows us that we can reject the H0 hypothesis, as the number is lower than 0,05. At least one of the predictors/ explanatory variables (probably distance from the data above) has a meaningful effect on price.

Price and Age together explain 43,96% of average price of the apartments.

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The VIF function is used to evaluate the strength of the correlation between the explanatory variables. The higher the VIF statistic, the more strongly the variable is related to other explanatory variables.

Because VIF statistics is close to 1 for both explanatory variables Age and Distance, there are no multicolinearity concerns. The problem occurs, if VIF statistics equals >5. Only then we should not include them in the model.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Standardized Residuals

Apartments$std_resid <- round(rstandard(fit2),3)
hist(Apartments$std_resid,
     xlim = c(-3,3),
     ylim = c(0,20),
     main = "Histogram of standardized residuals",
     xlab = "standardized residuals",
     col = "lightgreen")

Based on the distribution of standardized residuals, we try to predict the distribution of errors in the population.

-From the picture we assume that residuals are slightly asymmetrically distributed to the right. We could test this also with shapiro test.

Besides, we have no problems with outliers, as all the values are in between -3 and +3. This is shown also using the “head” function below.

cooks_d <- cooks.distance(fit2) 
Apartments$CooksD <- round(cooks.distance(fit2),3) ### First two rows are here to define "CooksD" variable

head(Apartments[order(-Apartments$std_resid), c("Distance", "Price","Age", "CooksD","std_resid")],)

## # A tibble: 6 × 5
##   Distance Price   Age CooksD std_resid
##      <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1       45  2180     5  0.32       2.58
## 2       11  2790     2  0.069      2.05
## 3        1  2800    18  0.03       1.78
## 4        1  2800    18  0.03       1.78
## 5        2  2820     8  0.037      1.66
## 6        1  2810    10  0.032      1.60

Cooks Distances

cooks_d <- cooks.distance(fit2)
Apartments$CooksD <- round(cooks.distance(fit2),3)
hist(Apartments$CooksD,
     main = "Histogram of Cooks Distances",
     xlab = "Cooks Distances",
     col = "yellow")

Using the Cook’s Distance, we can spot units with a high impact on the estimated regression function and remove them (similar to outliers). Cook’s distance is a value above 0, where a higher number means a larger impact.

From the graph and “head” code, we can see that there is one apartment that stands out with a value of 0,32, therefore too high influence. This is apartment no. 38 and it has to be removed. We can remove it with function “filter”

head(Apartments[order(-Apartments$CooksD), c("Distance", "Price","Age", "CooksD","std_resid")],)

## # A tibble: 6 × 5
##   Distance Price   Age CooksD std_resid
##      <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1       45  2180     5  0.32       2.58
## 2       37  1740    43  0.104      1.44
## 3       11  2790     2  0.069      2.05
## 4        2  1760     7  0.066     -2.15
## 5        3  2540    37  0.061      1.58
## 6        2  2400    40  0.038      1.09

With this function we can remove problematic units:

library(dplyr) Apartments <- Apartments %>% filter(!CooksD %in% (0.320))

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

std_fitted <- as.numeric(scale(fitted(fit2)))
Apartments$stdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(x=Apartments$stdFitted, y=Apartments$std_resid,
            xlab = "Standardized Fitted Values",
            ylab = "Standardised Residuals",
            main = "Heteroskedasticity Check",
            boxplots = FALSE,
            regLine = FALSE,
            smooth= FALSE,
            )

The points should be randomly distributed in a horizontal band of constant variability. Heteroskedasticity occurs, if the variability changes, and it affects the reliability of the estimated standard errors. We also test it with the use of Breuch-Pagan test, below.

From the graph (picture), we can see random distribution in a horizontal band. We assume that Homoskedasticity is not violated (The variance of errors is constant). We can confirm our assumptions with Breuch-pagan test. We cannot reject H0 hypothesis (the variance is constant), because p-value is above 0,05.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##        Test Summary         
##  ---------------------------
##  DF            =    1 
##  Chi2          =    0.968106 
##  Prob > Chi2   =    0.325153

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments$std_resid,
     xlim = c(-3,3),
     main = "Histogram of standardized residuals",
     xlab = "standardized residuals",
     ylab = "Density",
     prob = TRUE, ## MEANS THAT YOU ADJUST Y AXIS TO THE RED CURVE
     col = "lightgreen")

curve(dnorm(x, mean = mean(Apartments$std_resid), sd = sd (Apartments$std_resid)), 
      col = "red", 
      lwd = 2, 
      add = TRUE)

We drew the same graph that was drawn in the previous question. To this graph I added red curve, which simulates normal distribution. We assume that standardized residuals are not normally distributed.

We can formally test the distribution of Standardized Residuals with Shapiro test.

shapiro.test(Apartments$std_resid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$std_resid
## W = 0.95303, p-value = 0.003645

Shapiro test shows p-value of 0,0021, which is below our value of 0,05. Therefore, we can reject the H0: standardized residuals are normally distributed.

With the 95% certainty we assume that Standardized residuals are not normally distributed.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

Apartments <- Apartments[!(Apartments$CooksD == 0.320),] # ! removes rows with 0,320 values 
fit2 <- lm(Price ~ Age + Distance, data = Apartments) # lm states as linear regression model
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

As done in few steps above, we can again exclude the unit with Cooks Distance 0,320. After that we calculate fit2 again and summarize the results.

From the output, we can read new regression function: “Price = 2456.076 - 6.464 x Age - 22.955 x Distance”. We can see that Age and Distance “negatively” impact the price. Higher the Age and Bigger the Distance to the city center, the cheaper is the average price of Apartments. If age is 0 and everything else stays the same, the price is 2456,076 eur. In other case, if Age is 1 and everything else stays the same, the price drops by 6,464. For every unit of Age rise, price drops by 6,464. The same happens with Distance, with a difference that Price with every additional unit of Distance drops by 22,955.

H0 Hypothesis we were testing (H0= coefficients are equal to 0) can be rejected with a 95% probability, as p-value is below 0,05.

R-squared tells us that 48,38% of Price variability is explained by Age and Distance variables together.

sqrt(summary(fit2)$r.squared)

## [1] 0.6955609

The function above provides us with coefficient of correlation. The number 0,6955 stands for strong and positive correlation between Age, Price and Distance.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, 
           data = Apartments)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

From p-value of F-statistics we reject the null hypothesis (H0: explanatory variables/ predictors have no effect) with 95% certainty, as p-value stands below 0,05. We conclude that explainatory variables together significantly improve the model compared to introducing no predictors.

At least one of the explanatory variables among Age, Distance, Parking and Balcony has a meaningful effect on price.

Price and Age together explain 52,7% of average price of the apartments.

With function anova check if model fit3 fits data better than model fit2.

anova(fit2,fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis (H0: fit2 model is more suitable), as p-value is less than 0,05. We can confirm that fit3 model gives us more relevant result.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

Regression coefficient for Parking shows that, if apartment includes Parking and everything else is unchanged, it is on average 167,53 eur more expensive, than without a balcony. Regression coefficient for Balcony shows that, if apartment possesses a Balcony and everything else stays unchanged, it is on average 15,2 eur cheaper. However, balcony variable has no significant effect.

F-statistic tests whether the regression model as a whole explains a significant portion of the variation in apartment prices. H0: β1 = β2 = β3 = β4 = 0 (none of the 4 predictors or explanatory variables have any effect on the price) H1: At least one βi ≠ 0 (At least one predictor or explanatory variable explains a significant portion of the variation in price) The Result shows that modul is highly significant, as p-value is below 0,05 (and even 0,001)

Save fitted values and claculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)

round(Apartments[2,12],3)

## # A tibble: 1 × 1
##   Residuals
##       <dbl>
## 1      428.

The residual for apartment ID2 tells us that the actual price for this apartment is by 427.8 eur higher than the estimated value from the regression.

HOMEWORK STATISTICS- Bootcamp

Nik Gregorič

2025-09-22