Homework Project

Exercise 1

library(readxl)
Shark_Tank <- read_excel("Shark_Tank.xlsx")
View(Shark_Tank)

In this exercise, I will be analyzing the data sets from the reality show Shark Tank. The show depicts the entrepreneurial ecosystem in the USA, or more precisely, it shows the pitches of over a thousand entrepreneurs, who wish to receive investment from the “sharks” (investors). This data set encompasses different information, some of which I won’t be needing, so the first thing is to create a clean format.

I will create a new data set named Shark_Tanks_Clean that includes only the variables that I want:

Shark_Tank_Clean <- Shark_Tank [ , c(5, 6, 8, 10, 16, 17, 18, 19, 20, 21, 22, 23)]

Out of the total 48 variables, in the new data set, I included the following:

Start up Name
Industry
Pitcher’s Gender
Pitcher’s State (Where s/he lives)
Original Asked Amount (in $)
Original Offered Equity (in %)
Valuation Requested (in %)
Got Deal (0 - no, 1 - yes)
Total Deal Amount (in $)
Total Deal Equity (in %)
Deal Valuation (in $)
Number of Sharks in the Deal

Shark_Tank_Clean$`Got Deal` <- factor(Shark_Tank_Clean$`Got Deal` ,
                                      levels = c(0, 1),
                                      labels = c("No", "Yes"))

In the “Got deal” variable, the categorical observations include the numbers 0 and 1. Instead, I converted the zeros into “No” and the ones (1’s) into “Yes”.

summary(Shark_Tank_Clean)

##  Startup Name         Industry         Pitchers Gender    Pitchers State    
##  Length:1274        Length:1274        Length:1274        Length:1274       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Original Ask Amount Original Offered Equity Valuation Requested Got Deal 
##  Min.   :  10000     Min.   :  1.0           Min.   :    40000   No :509  
##  1st Qu.: 100000     1st Qu.: 10.0           1st Qu.:   666667   Yes:765  
##  Median : 200000     Median : 10.0           Median :  1500000            
##  Mean   : 284137     Mean   : 13.8           Mean   :  3550595            
##  3rd Qu.: 350000     3rd Qu.: 20.0           3rd Qu.:  4000000            
##  Max.   :5000000     Max.   :100.0           Max.   :100000000            
##                                                                           
##  Total Deal Amount Total Deal Equity Deal Valuation    
##  Min.   :  10000   Min.   :  0.00    Min.   :       0  
##  1st Qu.: 100000   1st Qu.: 15.00    1st Qu.:  400000  
##  Median : 200000   Median : 20.00    Median : 1000000  
##  Mean   : 296063   Mean   : 24.28    Mean   : 2178226  
##  3rd Qu.: 350000   3rd Qu.: 30.00    3rd Qu.: 2083333  
##  Max.   :5000000   Max.   :100.00    Max.   :36000000  
##  NA's   :509       NA's   :509       NA's   :509       
##  Number of sharks in deal
##  Min.   :1.000           
##  1st Qu.:1.000           
##  Median :1.000           
##  Mean   :1.311           
##  3rd Qu.:2.000           
##  Max.   :5.000           
##  NA's   :509

Shark_Tank_Stat <- Shark_Tank_Clean[ , c(-1, -2, -3, -4)]

summary(Shark_Tank_Stat)

##  Original Ask Amount Original Offered Equity Valuation Requested Got Deal 
##  Min.   :  10000     Min.   :  1.0           Min.   :    40000   No :509  
##  1st Qu.: 100000     1st Qu.: 10.0           1st Qu.:   666667   Yes:765  
##  Median : 200000     Median : 10.0           Median :  1500000            
##  Mean   : 284137     Mean   : 13.8           Mean   :  3550595            
##  3rd Qu.: 350000     3rd Qu.: 20.0           3rd Qu.:  4000000            
##  Max.   :5000000     Max.   :100.0           Max.   :100000000            
##                                                                           
##  Total Deal Amount Total Deal Equity Deal Valuation    
##  Min.   :  10000   Min.   :  0.00    Min.   :       0  
##  1st Qu.: 100000   1st Qu.: 15.00    1st Qu.:  400000  
##  Median : 200000   Median : 20.00    Median : 1000000  
##  Mean   : 296063   Mean   : 24.28    Mean   : 2178226  
##  3rd Qu.: 350000   3rd Qu.: 30.00    3rd Qu.: 2083333  
##  Max.   :5000000   Max.   :100.00    Max.   :36000000  
##  NA's   :509       NA's   :509       NA's   :509       
##  Number of sharks in deal
##  Min.   :1.000           
##  1st Qu.:1.000           
##  Median :1.000           
##  Mean   :1.311           
##  3rd Qu.:2.000           
##  Max.   :5.000           
##  NA's   :509

In the table above, we can see the summary of the numerical values only. I removed the categorical variables, while leaving out only the “deals gotten” categorical variable. Interpretations:

The minimum ask amount was 10,000$, while the maximum ask amount was 5,000,000 $.

Interestingly enough, the maximum offered equity that a start up extended to the investors was 100%. The average offered equity was 13.8%.

On average, the valuation requested by the start ups was 3,550,595 $. On the other hand, 50% of the start ups asked for an investment amounted below or equal to 1,500,000 $, while the other 50% asked for an investment amount more than 1,500,000 $.

Out of the total start ups, 765 of them received an investment (got a deal), while 509 of them didn’t (deal not gotten). The number of NA’s after the variable “Got deal” refer to the deals not gotten. I will remove these rows and will further work with “deals gotten only” data.

library(tidyr)
Shark_Tank_Deals <- drop_na(Shark_Tank_Stat)

Now I will calculate some statistical information regarding the deals gotten. I will also remove the “Got deal” variable because it is a categorical variable and these are not applicable when calculating statistical parameters.

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following object is masked from 'package:tidyr':
## 
##     extract

options(width = 120)
options(scipen = 999)
round(stat.desc(Shark_Tank_Deals), 2)

##              Original Ask Amount Original Offered Equity Valuation Requested Got Deal Total Deal Amount
## nbr.val                   765.00                  765.00              765.00       NA            765.00
## nbr.null                    0.00                    0.00                0.00       NA              0.00
## nbr.na                      0.00                    0.00                0.00       NA              0.00
## min                     10000.00                    1.00            40000.00       NA          10000.00
## max                   5000000.00                   51.00         50000000.00       NA        5000000.00
## range                 4990000.00                   50.00         49960000.00       NA        4990000.00
## sum                 200776000.00                 9996.10       2590975817.00       NA      226488166.00
## median                 200000.00                   10.00          1600000.00       NA         200000.00
## mean                   262452.29                   13.07          3386896.49       NA         296062.96
## SE.mean                 11245.23                    0.29           182999.08       NA          12973.46
## CI.mean.0.95            22075.23                    0.56           359240.72       NA          25467.87
## var               96738305639.39                   62.31   25618828274796.72       NA   128757715315.35
## std.dev                311027.82                    7.89          5061504.55       NA         358828.25
## coef.var                    1.19                    0.60                1.49       NA              1.21
##              Total Deal Equity    Deal Valuation Number of sharks in deal
## nbr.val                 765.00            765.00                   765.00
## nbr.null                  8.00              9.00                     0.00
## nbr.na                    0.00              0.00                     0.00
## min                       0.00              0.00                     1.00
## max                     100.00       36000000.00                     5.00
## range                   100.00       36000000.00                     4.00
## sum                   18572.18     1666343265.00                  1003.00
## median                   20.00        1000000.00                     1.00
## mean                     24.28        2178226.49                     1.31
## SE.mean                   0.56         136320.07                     0.02
## CI.mean.0.95              1.10         267606.37                     0.04
## var                     238.96 14216118786686.66                     0.38
## std.dev                  15.46        3770426.87                     0.61
## coef.var                  0.64              1.73                     0.47

Out of all deals gotten, the maximum deal valuation, was 36,000,000$.

Out of all the variables, the highest variation can be seen in deal valuation with a coefficient variation of 1.73.

The total deal amounts vary 358,828.25 $ around the mean.

hist(Shark_Tank_Deals$`Total Deal Equity`,
     main = "Distribution of Investor Deal Equities",
     xlab = "Percentages (%)",
     ylab = "Frequency")

The most frequent equity percentage negotiated was 20%. We can also check this information using the mode function:

library(modeest)
mlv(Shark_Tank_Deals$`Total Deal Equity`)

## Warning: argument 'method' is missing. Data are supposed to be continuous. 
##             Default method 'shorth' is used

## Warning: encountered a tie, and the difference between minimal and 
##                    maximal value is > length('x') * 'tie.limit'
## the distribution could be multimodal

## [1] 20.49922

Exercise 2

library(readxl)
Business_School <- read_excel("R Take Home Exam 2024/Task 2/Business School.xlsx")
View(Business_School)

Task no. 1 Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common?

library(pastecs)
round(stat.desc(Business_School[ , c(-1, -2, -5, -8)]), 2)

##              Undergrad Grade MBA Grade Employability (Before) Employability (After) Annual Salary
## nbr.val               100.00    100.00                 100.00                100.00        100.00
## nbr.null                0.00      0.00                   0.00                  0.00          0.00
## nbr.na                  0.00      0.00                   0.00                  0.00          0.00
## min                    61.20     58.14                 101.00                119.00      20000.00
## max                   100.00     95.00                 421.00                631.03     340000.00
## range                  38.80     36.86                 320.00                512.03     320000.00
## sum                  7689.90   7604.06               25793.08              42269.06   10905800.00
## median                 76.65     76.38                 256.83                435.64     103500.00
## mean                   76.90     76.04                 257.93                422.69     109058.00
## SE.mean                 0.75      0.77                   5.93                 12.92       4150.15
## CI.mean.0.95            1.48      1.52                  11.78                 25.64       8234.80
## var                    55.68     58.91                3522.10              16701.30 1722373474.75
## std.dev                 7.46      7.68                  59.35                129.23      41501.49
## coef.var                0.10      0.10                   0.23                  0.31          0.38

library(ggplot2)
ggplot(Business_School, aes(x=`Undergrad Degree`)) + 
         geom_bar(color= "white", fill= "deepskyblue4") +
  ylab("Frequency")

The most common undergraduate degree is Business.

Task no. 2 Show the descriptive statistics of the Annual Salary and its distribution with the histogram (use the ggplot function). Describe the distribution.

round(stat.desc(Business_School$`Annual Salary`), 2)

##       nbr.val      nbr.null        nbr.na           min           max         range           sum        median 
##        100.00          0.00          0.00      20000.00     340000.00     320000.00   10905800.00     103500.00 
##          mean       SE.mean  CI.mean.0.95           var       std.dev      coef.var 
##     109058.00       4150.15       8234.80 1722373474.75      41501.49          0.38

library(ggplot2)
ggplot(Business_School, aes(x= `Annual Salary`)) + 
  geom_histogram(width= 5, colour= "white", fill= "darkolivegreen") +
  ylab("Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The annual salary distribution is right skewed. The reason for that are the three values above 200,000, especially the highest that is pulling the data to be skewed. But this extreme amount is not an outlier because maybe in some cases it is normal for the annual starting salary to be above 30,000 - it’s just not as frequent. From the data we can see that the student had a previous working experience and we can assume that s/he started from a higher position. Either way, the value is explainable and will not be removed. From the graph we can analyze the most frequent salary which is 100,000.

Task no. 3 Test the following hypothesis: 𝐻0:𝜇MBA Grade = 74. Explain the result and interpret the effect size.

𝐻0:𝜇MBA Grade = 74

𝐻A:𝜇MBA Grade ≠ 74

Degrees of freedom => 100 - 1 = 99

t.test(Business_School$`MBA Grade`,
       mu= 74,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

p-value = 0,00915

Alpha = 0,05

Alpha > p-value

The p-value is less than the significance level of 0,05. So, we can reject the null hypothesis (p < 0,009). With 95% confidence interval, we can say that the population mean of the MBA Grade this year is not equal to 74 (conf. interval is 74,51 - 77,56) .

This year, the average of the MBA Grade has increased (mean = 76,04).

Exercise 3

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("R Take Home Exam 2024/Task 3/Apartments.xlsx")
View(Apartments)

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking,
                     levels= c(0, 1),
                     labels = c("No", "Yes"))

Apartments$Balcony <- factor(Apartments$Balcony,
                             levels= c(0, 1),
                             labels = c("No", "Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price,
       mu= 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The p-value is less than the significance level of 0,05. So, we can reject the null hypothesis (p < 0,005). With 95% confidence interval, we can say that the population mean of the prices of apartments is not equal to 1900 EUR. (conf. interval is 1937,4 - 2100,4) .

The average of the apartment prices is higher (mean = 2018.9).

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = Apartments)

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108 <0.0000000000000002 ***
## Age           -8.975      4.164  -2.156               0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Price = 2185,4 - 8.97*Age

On average, if the Age of the apartment increases for 1 year, the price per m2 will decrease for 8.97 EUR.

Regression coefficient = -8.97. It depicts the relationship between the dependent variable (Price) and the explanatory variable (Age). In this case, it tells us that on average, if the Age of the apartment increases for 1 year, the price per m2 will decrease for 8.97 EUR.

Coefficient of determination - how well the explanatory variable explains the dependent variable. In this case, the coefficient of determination is quite low and closer to zero than to 1, so we can conclude that it explains only 5.3% of the variability in the prices per m2.

cor(Apartments$Price, Apartments$Age)

## [1] -0.230255

Coefficient of correlation (Pearson Correlation Coefficient) - This coefficient shows what kind of relationship exists between the two variables. In this case, the relationship between price per m2 and age in years is linear, weak and negative.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

scatterplotMatrix(Apartments[c("Price", "Age", "Distance")],
                  smooth = TRUE)

Price - Age

The upper middle graph (where price = y and age = x) indicates a negative relationship, meaning that as age increases, price increases. However, as analysed before, this relationship is weak. As for the multicollinearity, there is no multicolinearity since we can not observe values that are quite close to the line which will create a multicolinearity problem. Even when using the “smooth” option, we can detect some observations stepping out of this predetermined line.

Price - Distance

The same observation and conclusion about multicollinearity goes for the relationship that can be observed on the top right graph. Though, compared to the price-age, this is a stronger negative relationship.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Because the VIF of both variables is the same = 1.001845, and is both below 5 and approximate to 1. We can conclude that there is no significant multicollinearity between age and distance.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Apartments$StdRes <- round(rstandard(fit2), 2)

hist(Apartments$StdRes,
     xlab = "Standardised Residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals",
     col = "magenta4")

shapiro.test(Apartments$StdRes)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdRes
## W = 0.95307, p-value = 0.003667

The standardised residuals are left skewed. The Shapiro test shows that the test distributions are not normally distributed because p < 0.05 - we reject H0 that the data is normally distributed.

The values above +3 and below -3 are outliers and should be removed.

Apartments$CooksDist <- round(cooks.distance(fit2), 3)

hist(Apartments$CooksDist,
     xlab = "Cooks Distances",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances",
     col = "darkseagreen3")

In this graph we can observe a value that has a great impace on our data, and through the below outlined function, we can detect outliers or high values.

head(Apartments[order(-Apartments$CooksDist), c("Price", "CooksDist")], 6)

## # A tibble: 6 × 2
##   Price CooksDist
##   <dbl>     <dbl>
## 1  2180     0.32 
## 2  1740     0.104
## 3  2790     0.069
## 4  1760     0.066
## 5  2540     0.061
## 6  2400     0.038

Outliers High Values:

Price - 2180; Price - 1740; Price - 2790; Price - 1760.

I remove the high values:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

ApartmentsNew <- Apartments %>% filter(!Price %in% c(2180, 1740, 2790, 1760))

fit2.2 <- lm(Price ~ Age + Distance,
           data = ApartmentsNew)


ApartmentsNew$CooksDist <- round(cooks.distance(fit2.2), 3)
hist(ApartmentsNew$CooksDist,
     xlab = "Cooks Distances",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances with removed outliers",
     col = "green4")

ApartmentsNew$StdRes <- round(rstandard(fit2.2), 3)

hist(ApartmentsNew$StdRes,
     xlab = "Standardised Residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals with removed outliers",
     col = "darksalmon")

shapiro.test(ApartmentsNew$StdRes)

## 
##  Shapiro-Wilk normality test
## 
## data:  ApartmentsNew$StdRes
## W = 0.94297, p-value = 0.001286

Despite removing the outliers, the p-value of the Shapiro-Wilk normality test is less than 0.05, which indicates that the data is not normally distributed. However, our sample size is over 30, so a violation of this assumption would not significantly affect the results.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

ApartmentsNew$Fitted <- scale (fit2.2$fitted.values)

library(car)
scatterplot(y = ApartmentsNew$StdRes, 
            x = ApartmentsNew$Fitted,
            ylab = "Standardized residuals", 
            xlab = "Standardized fitted values", 
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE,
            col = "red4")

Looking at this graph, there is no potential risk of heteroskedasticity. We can also check the heteroskedasticity using the Breusch-Pagan test below:

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2.2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.64762 
##  Prob > Chi2   =    0.1992832

Since the p-value is greater than 1, we can not reject the null hypothesis stating that the variance is constant. This confirms our data homoskedasticity.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Age

On average, if the age of the apartment increases for one year, the price of the apartment decreases for 7.9 EUR per m2 (p < 0.05).

Distance

On average, if the apartment’s distance from the city center increases for 1 km, the price of the apartment decreases for 20.6 EUR per m2 (p < 0.01).

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = Apartments)

With function anova check if model fit3 fits data better than model fit2.

anova(fit3, fit2)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance + Parking + Balcony
## Model 2: Price ~ Age + Distance
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5991088                              
## 2     82 6720983 -2   -729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, fit3 fits better than fit2 data.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415 < 0.0000000000000002 ***
## Age           -6.799      3.110  -2.186              0.03172 *  
## Distance     -18.045      2.758  -6.543        0.00000000528 ***
## ParkingYes   196.168     62.868   3.120              0.00251 ** 
## BalconyYes     1.935     60.014   0.032              0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 0.00000000001849

Balcony

It’s not statistically significant because p > 0.05.

Parking

Given the values of the other explanatory variables (age, distance, and balcony), apartments that have a parking are on average more expensive for 196.17 EUR per m2, than those who don’t have a parking (p < 0.05).

Hypothesis

H0: There is no relationship between the dependent variable and the explanatory variables. H1: There is a relationship between the dependent variable and the explanatory variables.

Since we don’t reject the H0, we can conclude that the regression model is significant, in a sense that, the dependent variable (price), can be explained with the given explanatory variables, i.e. they have an influence over the price.

Save fitted values and claculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)

head(Apartments[ , -6])

## # A tibble: 6 × 8
##     Age Distance Price Parking Balcony CooksDist Fitted Residuals
##   <dbl>    <dbl> <dbl> <fct>   <fct>       <dbl>  <dbl>     <dbl>
## 1     7       28  1640 No      Yes         0.007  1751.    -111. 
## 2    18        1  2800 Yes     No          0.03   2357.     443. 
## 3     7       28  1660 No      No          0.006  1749.     -88.8
## 4    28       29  1850 No      Yes         0.008  1590.     260. 
## 5    18       18  1640 Yes     Yes         0.005  2053.    -413. 
## 6    28       12  1770 No      Yes         0.005  1897.    -127.

Residual for apartment ID2 is the residual for the apartment in the second row, which in this example is 442.59 ≈ 443.