Source: I selected a data set that is built into R: the "Salaries" data set from the "carData" package. The package documentation gives no further information about the source of the data.
#data()
#data(package = .packages(all.available = TRUE))
#install.packages("carData")
library(carData)
mydata <- force(Salaries) # Copy the lazy-loaded built-in Salaries data set into mydata
head(mydata)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
Explanation of the data set:
The data contains the 2008-09 nine-month academic salaries of Assistant Professors, Associate Professors and Professors at a college in the U.S. The data set contains 397 observations and 6 variables.
Variables:
str(mydata[ , -1]) # Compact and informative description of the structure (the first column, rank, is excluded)
## 'data.frame': 397 obs. of 5 variables:
## $ discipline : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
## $ salary : int 139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
mydata2 <- mydata[ , -5] # Deleting the 5th column (sex)
head(mydata2)
## rank discipline yrs.since.phd yrs.service salary
## 1 Prof B 19 18 139750
## 2 Prof B 20 16 173200
## 3 AsstProf B 4 3 79750
## 4 Prof B 45 39 115000
## 5 Prof B 40 41 141500
## 6 AssocProf B 6 6 97000
In this step I removed the variable "sex", because I assume, for the purposes of this analysis, that this variable is not needed and does not affect salaries.
colnames(mydata2) <- c("Level", "Department", "Years.since.PhD", "Service", "Earnings") # Renaming the variables
head(mydata2)
## Level Department Years.since.PhD Service Earnings
## 1 Prof B 19 18 139750
## 2 Prof B 20 16 173200
## 3 AsstProf B 4 3 79750
## 4 Prof B 45 39 115000
## 5 Prof B 40 41 141500
## 6 AssocProf B 6 6 97000
mydata3 <- subset(mydata2, Service > 0) # Filtering the data frame
I filtered the data so that only professors with at least 1 year of service remain in the data set.
mydata3 <- mydata3[c(1, 5, 2, 3, 4)] # Reordering the variables
tail(mydata3, 3)
## Level Earnings Department Years.since.PhD Service
## 395 Prof 101738 A 42 25
## 396 Prof 95329 A 25 15
## 397 AsstProf 81035 A 8 4
In this step I reordered the columns and moved "Earnings" to the second column, next to the variable "Level". I also extracted the last 3 rows of the data frame.
mydata4 <- mydata3
mydata4$Earnings.in.thousands <- mydata3$Earnings * 0.001 # Earnings in thousands of dollars
head(mydata4)
## Level Earnings Department Years.since.PhD Service Earnings.in.thousands
## 1 Prof 139750 B 19 18 139.75
## 2 Prof 173200 B 20 16 173.20
## 3 AsstProf 79750 B 4 3 79.75
## 4 Prof 115000 B 45 39 115.00
## 5 Prof 141500 B 40 41 141.50
## 6 AssocProf 97000 B 6 6 97.00
In this data manipulation I multiplied all earnings by 0.001 (i.e. divided them by 1,000) to get a new variable that shows earnings in thousands of dollars. I did this only as an exercise and will not include this variable in my further analysis.
I will now present some descriptive statistics.
round(mean(mydata3$Earnings))
## [1] 114560
The arithmetic mean of "Earnings" is 114560, which means that the average nine-month academic salary of the professors included in the data set is 114,560 dollars.
library(psych)
describe(mydata3)
## vars n mean sd median trimmed mad min
## Level* 1 386 2.54 0.74 3.0 2.67 0.00 1
## Earnings 2 386 114560.02 30247.47 108150.0 112380.21 29133.09 57800
## Department* 3 386 1.54 0.50 2.0 1.55 0.00 1
## Years.since.PhD 4 386 22.82 12.69 21.5 22.38 14.08 1
## Service 5 386 18.12 12.84 17.0 17.01 14.83 1
## max range skew kurtosis se
## Level* 3 2 -1.23 -0.06 0.04
## Earnings 231545 173745 0.68 0.17 1539.56
## Department* 2 1 -0.17 -1.98 0.03
## Years.since.PhD 56 55 0.29 -0.80 0.65
## Service 60 59 0.65 -0.33 0.65
In R there are many different functions that we can use for descriptive statistics. Above I used two of them. The first is a single-value function that returns the result for one particular statistic only. The second function, "describe" from the psych package, gives a broader overview of many different statistics at once.
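For completeness, a few more single-value functions could be used in the same way (a minimal sketch on the same data frame, not part of the original analysis):
median(mydata3$Earnings)   # median salary
sd(mydata3$Earnings)       # standard deviation
quantile(mydata3$Earnings) # quartiles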
Below I interpreted three results:
summary(mydata3[c(1, 3)])
## Level Department
## AsstProf : 57 A:177
## AssocProf: 64 B:209
## Prof :265
Because I also have two factor variables, I used this function to present the number of observations in each category of the variables "Level" and "Department" (level A denotes "theoretical" departments and level B "applied" departments). A:177 means that 177 professors work in a theoretical department.
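An alternative (a sketch, not used in the original analysis) is the base-R table() function, which also allows a cross-tabulation of the two factors:
table(mydata3$Level)                     # counts per rank
table(mydata3$Level, mydata3$Department) # rank by department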
hist(mydata3$Earnings,
main = "Salaries",
xlab = "Salary",
ylab = "Frequency",
col = "antiquewhite",
breaks = seq(from = 0, to = 250000, by = 10000))
The histogram above shows that salary is positively (right) skewed. The most frequent salaries lie between 100,000 and 110,000 dollars, and only a few professors earn more than 200,000 dollars.
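To support the visual impression with a number, the skew() function from the already-loaded psych package could be used (a sketch; the describe() output above already reported a skewness of 0.68 for Earnings):
skew(mydata3$Earnings) # psych::skew; a positive value confirms the right skew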
#install.packages("car")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(y = mydata3$Earnings,
x = mydata3$Service,
ylab = "Salary",
xlab = "Years of service",
boxplots = FALSE,
smooth = FALSE)
I drew a scatter plot with the function "scatterplot" from the "car" package. On the x-axis I plotted years of service and on the y-axis salaries. We can see a positive correlation between years of service and salary: professors with more years of service have, on average, higher salaries.
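The strength of this relationship can be quantified with the Pearson correlation coefficient (a one-line sketch):
cor(mydata3$Service, mydata3$Earnings) # correlation between years of service and salary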
scatterplot(Earnings ~ Service | Department,
ylab = "Salary",
xlab = "Years of service",
smooth = FALSE,
data = mydata3)
In this scatter plot I also included the factor variable "Department". Level A denotes "theoretical" departments and level B "applied" departments. The slope of the line for "applied" departments is slightly steeper than the slope for "theoretical" departments, which means that in "applied" departments salaries increase faster with years of service.
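One way to test formally whether the two slopes differ (a sketch, not part of the original analysis; fit_dep is a hypothetical model name) would be an interaction model, where the Service:Department coefficient estimates the difference in slopes:
fit_dep <- lm(Earnings ~ Service * Department, data = mydata3) # interaction model
summary(fit_dep)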
In Task 2 I will analyse the effects of online schooling on body weight in ninth graders.
mydata_1 <- read.table("C:/LUKA/IMB/BootCamp/R Take Home Exam/R Take Home Exam/Task 2/Body mass.csv",
header = TRUE,
sep = ";",
dec = ",")
head(mydata_1)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata_1), 2) #Descriptive statistics, rounded to two decimal places
## ID Mass
## nbr.val 50.00 50.00
## nbr.null 0.00 0.00
## nbr.na 0.00 0.00
## min 1.00 49.70
## max 50.00 83.20
## range 49.00 33.50
## sum 1275.00 3143.80
## median 25.50 62.80
## mean 25.50 62.88
## SE.mean 2.06 0.85
## CI.mean.0.95 4.14 1.71
## var 212.50 36.14
## std.dev 14.58 6.01
## coef.var 0.57 0.10
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata_1[-1]), 2) #Excluding ID variable
## Mass
## nbr.val 50.00
## nbr.null 0.00
## nbr.na 0.00
## min 49.70
## max 83.20
## range 33.50
## sum 3143.80
## median 62.80
## mean 62.88
## SE.mean 0.85
## CI.mean.0.95 1.71
## var 36.14
## std.dev 6.01
## coef.var 0.10
hist(mydata_1$Mass,
main = "Distribution of 50 nine-graders weight",
ylab = "Frequency",
xlab = "Weight in kg",
breaks = seq(from = 0, to = 100, by = 5),
col = "darkgreen")
Given data:
* The average weight of ninth-graders in 2018/2019 was 59.5 kg.
Our hypotheses:
* H0: Mu = 59.5 kg
* H1: Mu is not equal to 59.5 kg
t.test(mydata_1$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata_1$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
For the hypothesis testing I used the "t.test" function.
From the t-test above we can reject H0 (p < 0.001), accept H1 and conclude that the average weight of ninth-graders measured in 2020/2021 is not equal to the average weight of ninth-graders measured in 2018/2019. The average weight of ninth-graders in 2020/2021 increased to 62.88 kg. The 95 percent confidence interval lies between 61.17 and 64.58, so we can say with 95% confidence that the average weight in 2020/2021 lies between 61.17 and 64.58 kg.
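The t statistic can also be reproduced by hand from the formula t = (xbar - mu0) / (s / sqrt(n)) (a sketch; xbar, s and n are hypothetical helper names):
xbar <- mean(mydata_1$Mass) # sample mean
s <- sd(mydata_1$Mass)      # sample standard deviation
n <- length(mydata_1$Mass)  # sample size
(xbar - 59.5) / (s / sqrt(n)) # matches t = 3.9711 reported above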
#install.packages("effectsize")
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d(mydata_1$Mass, mu = 59.5)
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
Above we can see that Cohen's d is 0.56. Using the interpretation rules by Sawilowsky (2009), we can say that the effect size is medium.
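Cohen's d for a one-sample test can also be computed directly as the mean deviation from mu divided by the standard deviation (a one-line sketch):
(mean(mydata_1$Mass) - 59.5) / sd(mydata_1$Mass) # approx. 0.56, as above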
#install.packages("readxl")
library(readxl)
mydata_2 <- read_xlsx("C:/LUKA/IMB/BootCamp/R Take Home Exam/R Take Home Exam/Task 3/Apartments.xlsx")
mydata_2 <- as.data.frame(mydata_2)
head(mydata_2)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: The data set contains 85 apartments from the Ljubljana region. The variables are Age (in years), Distance (from the city centre, in km), Price (in EUR per m2), and the dummy variables Parking and Balcony.
mydata_2$ParkingFactor <- factor(mydata_2$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata_2$BalconyFactor <- factor(mydata_2$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(mydata_2)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
H0: Mu price = 1900 EUR
H1: Mu price is not equal to 1900 EUR
t.test(mydata_2$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata_2$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject H0 (p < 0.05), accept H1 and conclude that the average price per m2 of apartments in the Ljubljana region is not equal to 1900 EUR. We can also say with 95% confidence that the average price per m2 lies between 1937.44 and 2100.44 EUR.
fit1 <- lm(Price ~ Age,
data = mydata_2)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata_2$Price, mydata_2$Age)
## [1] -0.230255
Regression function: Price = 2185.455 - 8.975 * Age
Regression coefficient: -8.975. This coefficient tells us that if the age of an apartment increases by 1 year, the price per m2 decreases on average by 8.975 EUR.
Correlation coefficient: -0.230255. This means there is a weak negative correlation between price per m2 and age.
Coefficient of determination: 0.05302. The Multiple R-squared tells us that 5.3% of the variability in the price is explained by age.
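As a sketch, the fitted regression function can be used to predict the price per m2 of a hypothetical 10-year-old apartment:
predict(fit1, newdata = data.frame(Age = 10)) # = 2185.455 - 8.975 * 10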
#install.packages("car")
library(car)
scatterplotMatrix(mydata_2[ , c(-4,-5,-6,-7)],
smooth = FALSE)
When we want to check for a potential problem with multicollinearity, we need to examine the relationships between the explanatory variables; in our case the correlation between Age and Distance. The scatter plot above shows only a slightly positive relationship between Age and Distance, i.e. no strong correlation between them, so we can conclude that in our case there is no potential problem with multicollinearity.
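The visual impression can be backed with the correlation coefficient between the two explanatory variables (a one-line sketch; a value near 0 supports the conclusion):
cor(mydata_2$Age, mydata_2$Distance) # near 0 => no multicollinearity concern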
fit2 <- lm(Price ~ Age + Distance,
data = mydata_2)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
With the VIF statistic we can easily check for multicollinearity. A common rule is that the VIF values of the explanatory variables should be less than 5 and that the mean VIF should be as close to 1 as possible. In our case both values are well below 5 and the mean VIF is very close to 1, so we can again conclude that there is no problem with multicollinearity.
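The VIF value can also be reconstructed by hand from the formula VIF = 1 / (1 - R^2), where R^2 comes from regressing one explanatory variable on the other (a sketch; r2_age is a hypothetical helper name):
r2_age <- summary(lm(Age ~ Distance, data = mydata_2))$r.squared # R^2 of Age on Distance
1 / (1 - r2_age) # should match vif(fit2) for Age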
mydata_2$StdResid <- round(rstandard(fit2), 3)
mydata_2$CooksD <- round(cooks.distance(fit2), 3)
head(mydata_2)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## 1 7 28 1640 0 1 No Yes -0.665
## 2 18 1 2800 1 0 Yes No 1.783
## 3 7 28 1660 0 0 No No -0.594
## 4 28 29 1850 0 1 No Yes 0.754
## 5 18 18 1640 1 1 Yes Yes -1.073
## 6 28 12 1770 0 1 No Yes -0.778
## CooksD
## 1 0.007
## 2 0.030
## 3 0.006
## 4 0.008
## 5 0.005
## 6 0.005
hist(mydata_2$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
breaks = seq( from = -3, to = 3, by = 1))
hist(mydata_2$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
From the histogram of standardized residuals we can see that no residual is greater than 3 or smaller than -3, which means that at first sight we do not have any outliers. However, the histogram of Cook's distances shows one value with a noticeably higher Cook's distance than the others, which suggests that this observation has a larger influence than the rest. I check this observation below.
head(mydata_2[order(-mydata_2$CooksD), ], 10) # Ten units with the highest Cook's distance
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## 38 5 45 2180 1 1 Yes Yes 2.577
## 55 43 37 1740 0 0 No No 1.445
## 33 2 11 2790 1 0 Yes No 2.051
## 53 7 2 1760 0 1 No Yes -2.152
## 22 37 3 2540 1 1 Yes Yes 1.576
## 39 40 2 2400 0 1 No Yes 1.091
## 58 8 2 2820 1 0 Yes No 1.655
## 25 8 26 2300 1 1 Yes Yes 1.571
## 57 10 1 2810 0 0 No No 1.601
## 2 18 1 2800 1 0 Yes No 1.783
## CooksD
## 38 0.320
## 55 0.104
## 33 0.069
## 53 0.066
## 22 0.061
## 39 0.038
## 58 0.037
## 25 0.034
## 57 0.032
## 2 0.030
The output above shows that there is one observation that stands out not only by its Cook's distance but also by a standardized residual close to 3 (2.577). This confirms that there is one observation with high influence. (I removed the unit, but then I had a problem with the next exercise (an error was reported that I was not able to solve), so I initially decided not to remove the unit, so that I could do the next exercise.)
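A more systematic way to flag influential units (a sketch using the common 4/n rule of thumb for Cook's distance, not part of the original analysis; cutoff is a hypothetical helper name) would be:
cutoff <- 4 / nrow(mydata_2) # common rule of thumb for Cook's distance
mydata_2[mydata_2$CooksD > cutoff, c("Price", "StdResid", "CooksD")]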
mydata_2$StdFitted <- scale(fit2$fitted.values)
scatterplot(y = mydata_2$StdResid, x = mydata_2$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = TRUE,
smooth = FALSE)
The scatter plot shows that the variance of the standardized residuals is roughly constant, so there is no sign of heteroskedasticity.
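A formal check (a sketch; not run in the original analysis) could use the ncvTest() function from the already-loaded car package, which tests H0: constant error variance:
ncvTest(fit2) # non-constant variance score test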
hist(mydata_2$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
breaks = seq(from = -3, to = 3, by = 1))
From the histogram we can say that the standardized residuals are probably not normally distributed, but we should test this formally with the Shapiro-Wilk normality test.
H0: Standardized residuals are normally distributed
H1: Standardized residuals are not normally distributed
shapiro.test(mydata_2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata_2$StdResid
## W = 0.95303, p-value = 0.003645
We can reject H0 (p < 0.05) and conclude that the standardized residuals are not normally distributed.
mydata_2 <- mydata_2[!(mydata_2$CooksD == 0.320), ] # Removing the influential unit (Cook's distance = 0.320)
head(mydata_2[order(-mydata_2$CooksD), ])
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## 55 43 37 1740 0 0 No No 1.445
## 33 2 11 2790 1 0 Yes No 2.051
## 53 7 2 1760 0 1 No Yes -2.152
## 22 37 3 2540 1 1 Yes Yes 1.576
## 39 40 2 2400 0 1 No Yes 1.091
## 58 8 2 2820 1 0 Yes No 1.655
## CooksD StdFitted
## 55 0.104 -2.6533964
## 33 0.069 0.7902277
## 53 0.066 1.3743712
## 22 0.061 0.3416766
## 39 0.038 0.3291583
## 58 0.037 1.3426980
fit2 <- lm(Price ~ Age + Distance,
data = mydata_2)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
First I removed the unit with the Cook's distance of 0.320. After removing it, I re-estimated the model and used the summary function to present the new values.
Explanation of all the coefficients:
The regression function: Price = 2456.076 - 6.464 * Age - 22.955 * Distance
If age increases by 1 year, the price of an apartment will on average decrease by 6.464 EUR per m2, holding everything else constant.
If the distance from the city centre increases by 1 km, the price of an apartment will on average decrease by 22.955 EUR per m2, holding everything else constant.
The Multiple R-squared tells us that 48.38% of the variability in the price is explained by Age and Distance.
Linear regression function:
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata_2)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the analysis of variance table we can see that p < 0.05, so we can conclude that model fit3 fits the data better than model fit2.
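The F statistic can be reconstructed by hand from the table as F = ((RSS1 - RSS2) / Df) / (RSS2 / Res.Df2) (a one-line sketch using the values reported above):
((6176767 - 5654480) / 2) / (5654480 / 79) # = 3.6485, as reported above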
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFactorYes 167.531 62.864 2.665 0.00933 **
## BalconyFactorYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Explanation of the categorical factors:
Apartments with parking have, on average, a price per m2 that is 167.531 EUR higher, assuming all other variables are the same.
Apartments with a balcony have, on average, a price per m2 that is 15.207 EUR lower, assuming all other variables are the same; note, however, that this effect is not statistically significant (p = 0.798).
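As a sketch, the parking effect can be illustrated by predicting the prices of two hypothetical apartments that differ only in parking (the data frame newapts and its values are made up for illustration):
newapts <- data.frame(Age = 15, Distance = 10, # two identical apartments except for parking
ParkingFactor = factor(c("No", "Yes"), levels = c("No", "Yes")),
BalconyFactor = factor("No", levels = c("No", "Yes")))
diff(predict(fit3, newdata = newapts)) # difference = 167.531 EUR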
mydata_2$Fitted <- fitted.values(fit3)
mydata_2$Residuals <- residuals(fit3)
head(mydata_2[colnames(mydata_2) %in% c("Price", "Fitted", "Residuals")])
## Price Fitted Residuals
## 1 1640 1705.952 -65.95206
## 2 2800 2372.197 427.80292
## 3 1660 1721.159 -61.15894
## 4 1850 1563.431 286.56890
## 5 1640 2012.244 -372.24396
## 6 1770 1908.177 -138.17733
The residual is the actual value minus the fitted value. For apartment 2, for example, the residual, i.e. the difference between the actual price (2800) and the fitted price (2372.20), is 427.80 EUR.
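This figure can be verified by hand for apartment 2 (a one-line sketch; the row names were preserved through the earlier filtering):
mydata_2["2", "Price"] - mydata_2["2", "Fitted"] # 2800 - 2372.197 = 427.80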