Ian Gošnak, 19228866

Task 1

# data(package = .packages(all.available = TRUE)) = Code used to show all data sets available in my R studio, even the ones that aren't loaded at the moment <- This is how I found my data set in package "boot"

mydata <- boot::melanoma

head(mydata)

##   time status sex age year thickness ulcer
## 1   10      3   1  76 1972      6.76     1
## 2   30      3   1  56 1968      0.65     0
## 3   35      2   1  41 1977      1.34     0
## 4   99      3   0  71 1968      2.90     0
## 5  185      1   1  52 1965     12.08     1
## 6  204      1   1  28 1971      4.84     1

Description of variables:

time: Survival time in days since the operation
status: The patients status at the end of the study. 1 indicates that they had died from melanoma, 2 indicates that they were still alive and 3 indicates that they had died from causes unrelated to their melanoma.
sex: The patients sex; 1=male, 0=female.
age: Age in years at the time of the operation.
year: Year of operation.
thickness: Tumour thickness in mm.
ulcer: Indicator of ulceration; 1=present, 0=absent.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- dplyr::mutate(mydata, ID = row_number())

head((mydata), 3)

##   time status sex age year thickness ulcer ID
## 1   10      3   1  76 1972      6.76     1  1
## 2   30      3   1  56 1968      0.65     0  2
## 3   35      2   1  41 1977      1.34     0  3

In this step I added the variable “ID” to make tracking observations easier.

mydata <- mydata[c(8, 3, 4, 5, 6, 7, 1, 2)]

colnames(mydata) <- c("ID", "Sex", "Age", "YearOfOp", "Thickness", "Ulceration", "Time", "Status")

mydata$SexF <- factor(mydata$Sex,
                      levels = c(0, 1),
                      labels = c("Female", "Male"))

mydata$UlcerationF <- factor(mydata$Ulceration,
                            levels = c(0, 1),
                            labels = c("Absent", "Present"))

mydata$StatusF <- factor(mydata$Status,
                            levels = c(1, 2, 3),
                            labels = c("Died from Melanoma", "Alive", "Death from Othr"))

tail(mydata)

##      ID Sex Age YearOfOp Thickness Ulceration Time Status   SexF UlcerationF StatusF
## 200 200   0  19     1965      1.13          1 4479      2 Female     Present   Alive
## 201 201   1  29     1965      7.06          1 4492      2   Male     Present   Alive
## 202 202   0  40     1965      6.12          0 4668      2 Female      Absent   Alive
## 203 203   0  42     1965      0.48          0 4688      2 Female      Absent   Alive
## 204 204   0  50     1964      2.26          0 4926      2 Female      Absent   Alive
## 205 205   0  41     1962      2.90          0 5565      2 Female      Absent   Alive

Here I re-organised the column order and renamed to variables to make the data set more organised. I also added the Factor variables for the 3 categorical variables.

mydata <- mydata[c(-2, -4, -6, -8)]  # Removing the non Factor variables and variable YearOfOp since it is not necessarily useful (unless we would like to research the difference in effects of new technology)

mydata <- mydata[c(1, 5, 2, 3, 6, 4, 7)] # Reordering the variables

mydata1 <- mydata[!(mydata$StatusF == "Death from Othr"), ] 

summary(mydata1[7])

##                StatusF   
##  Died from Melanoma: 57  
##  Alive             :134  
##  Death from Othr   :  0

In this step I removed the variables which are not needed anymore, reordered the columns and created a new data set without the observations whose cause of death was other than melanoma (they are not relevant in the research). Summary of variable StatusF in the new data set shows us that there are no such observations present.

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

round(stat.desc(mydata[c(-1, -2, -5, -7)]), 2)

##                   Age Thickness       Time
## nbr.val        205.00    205.00     205.00
## nbr.null         0.00      0.00       0.00
## nbr.na           0.00      0.00       0.00
## min              4.00      0.10      10.00
## max             95.00     17.42    5565.00
## range           91.00     17.32    5555.00
## sum          10755.00    598.57  441324.00
## median          54.00      1.94    2005.00
## mean            52.46      2.92    2152.80
## SE.mean          1.16      0.21      78.37
## CI.mean.0.95     2.30      0.41     154.52
## var            277.95      8.76 1259020.14
## std.dev         16.67      2.96    1122.06
## coef.var         0.32      1.01       0.52

We can see that there were 205 observations in this data set. Of these 205 observations the minimum age of patients was 4yo and the max was 95yo, which makes the range of Age in the data set 91 years. The average age of the patients was 52.46 years and the median was 54 years. This means that 50% of the patients were older than 54 and 50% were younger. The average tumor thickness of the patients was 2.92mm. The 95% CI for the mean of tumor thickness is 0.41, so we can say with 95% certainty that the real population mean lies somewhere between 2.92mm +- 0.41mm.

summary(mydata[c(2, 5, 7)])

##      SexF      UlcerationF                StatusF   
##  Female:126   Absent :115   Died from Melanoma: 57  
##  Male  : 79   Present: 90   Alive             :134  
##                             Death from Othr   : 14

Here is the summary data for only the Factor variables which shows us the number of each category in the variables.

ggplot(mydata, aes(x = Thickness))+
  theme_bw()+
  geom_histogram(binwidth = 1, colour = "black", fill = "dark green")+
  ggtitle("Distribution of tumor thickness among patients")+
  ylab("Frequency")+
  xlab("Thickness of tumor in mm")

Here we have the distribution of tumor thickness values in mm for the data set. On the Histogram we can see that the thickness is positively or right skewed. The size of the bins is 1mm so we can see that the most frequent size of patients’ tumor was between 0.5mm and 1.5mm. This Histogram also leads us to believe that any tumor above 5mm is quite rare.

scatterplot(y = mydata$Thickness, x = mydata$Age,
            ylab = "Tumor thickness in mm",
            xlab = "Age",
            boxplots = FALSE,
            smooth = FALSE)

In this scatter plot I plotted the values of Age on the X axis and tumor thickness on the Y axis searching for a relationship. I predicted that there will be a positive correlation between age and thickness as the tumor gets the chance grow with age. My prediction was correct as we can see with the regression line on the scatter plot, the slope is positive.

ggplot(mydata1, aes(x = StatusF, y = Thickness, fill = SexF))+
  theme_bw()+
  geom_boxplot()+
  scale_fill_brewer(palette = "Dark2")+
  ylab("Tumor thickness in mm")+
  xlab("Status")+
  labs(fill = "Sex")

In this box plot I decided to compare the tumor thickness by “Status” and “Sex” to see if there are any differences in thickness of those patients who died and those that survived as well as between genders. For this box plot I used the data set without observations whose cause of death was not melanoma. We can see that the ones who died from melanoma had on average thicker tumors that the ones who did not. We can also depict from the graph that Male patients on average had thicker tumors than Female patients.

Task 2

mydata2 <- read.table("~/Desktop/IMB/R data/R Take Home Exam/Task 2/Body mass.csv", 
                      sep = ";",
                      dec = ",",
                      header = TRUE)

library(pastecs)
round(stat.desc(mydata2[-1]), 2)

##                 Mass
## nbr.val        50.00
## nbr.null        0.00
## nbr.na          0.00
## min            49.70
## max            83.20
## range          33.50
## sum          3143.80
## median         62.80
## mean           62.88
## SE.mean         0.85
## CI.mean.0.95    1.71
## var            36.14
## std.dev         6.01
## coef.var        0.10

ggplot(mydata2, aes(x = Mass))+
  theme_bw()+
  geom_histogram(binwidth = 5, colour = "black", fill = "light blue")+
  ggtitle("Distribution of body mass of 9th graders 2021/22")+
  ylab("Frequency")+
  xlab("Body mass in kg")

t.test(mydata2$Mass,
       mu = 59.5,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

With this t-test we researched two hypothesis. H0 = average weight of 9th graders in 2018/19 is equal to the avg weight of 9th graders in 2021/22, Ha = average weight of 9th graders in 2018/19 is NOT equal to the avg weight of 9th graders in 2021/22. Because we are checking equality the mean of 2021/22 can either be higher or lower than the mean of 2018/19 which makes this test a two tailed test. From observing the histogram we can already suspect that the two means are not equal as the most frequent values lay between 62.5 and 67.5kg. Conducting the t-test we can deduct that at a 5% level of significance (a = 5%) we can reject the null hypothesis. The p value for the t-test is 0.0002 or 0.02% which also equals to the percentage of us making a mistake and the null hypothesis being correct. Therefore we can safely say that the mean of 2018/19 in not equal to the mean of 2021/22, it is higher in 2021/22. The 95% CI also shows us that we can say with 95% certainty that the population mean of 2021/2022 lies between 61.17 and 64.58 which does not include 59.5 (2018/19 mean).

Task 3

Import the dataset Apartments.xlsx

#install.packages("readxl")
library(readxl)
mydata3 <- read_xlsx("~/Desktop/IMB/R data/R Take Home Exam/Task 3/Apartments.xlsx")

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors

mydata3$ParkingFactor <- factor(mydata3$Parking,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

mydata3$BalconyFactor <- factor(mydata3$Balcony,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

str(mydata3)

## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Age          : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance     : num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price        : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking      : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony      : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
##  $ ParkingFactor: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ BalconyFactor: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata3$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0 : mu = 1900 Ha : mu is not = 1900

I can conclude that the mean price of apartments is not equal to 1900 at a 5% significance level (a = 5%), therefore we can reject H0. The 95% CI for the apartment price is between 1937.443 and 2100.440 which means that we can say with 95% certainty that the true population means lies between those 2 values. This interval does not include 1900 which again lets us reject the null hypothesis at a 5% significance level.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
          data = mydata3) 

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

sqrt(summary(fit1)$r.squared)  # Coefficient of correlation (in absolute terms)

## [1] 0.230255

The regression function is constructed as Price = 2185.455 - 8.975 x Age

The intercept coefficient means that on average is Age is 0 and everything else stays the same the price will be 2185.455. The regression coefficient for Age tells us the if Age increases by 1, the price will on average decrease by 8.975 if all else stays equal. Both of the coefficients have a p value below 0.05 which leads us to reject the null hypothesis at each. The null hypothesis is that the coefficient in equal to 0 therefore we can say at a 5% level of significance, or even less with the intercept, that the coefficients are not equal to 0

The multiple r squared statistic is the coefficient of determination an it tells us that 5.302% of the variability in the price is explained by the x variables (in this case Age).

The square root of R-squared is the coefficient of correlation and it tells us the absolute correlation between the observed variables and the predicted variables, in this case it is 0.23.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

scatterplotMatrix(mydata3[ , 1:3],
                  smooth = FALSE)

There would be a potential problem with multicolinearity when the explanatory variables would be correlated with each other. In the case of the scatter plot we can deduct if this is the case by looking at the regression line on the scatter plot between the explanatory variables, in this case Age and Distance. If we look at the plot we can see that the slope of the line is close to 0 therefore meaning that there is no significant correlation between the two variables which tells us that there is not problem with multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
          data = mydata3) 

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

The VIF statistic checks for multicolinearity between the explanatory variables. The rule for the VIF statistic is that it should be less than 5 for all explanatory variables and the mean of the VIF statistic should be as close to one. If a variable has a VIF of higher than 5 this variable is causing the problem of multicolinearity and should be looked into/removed. In our case the explanatury variables are both less than 5 and the mean is close to 1 which tells us there is no problem with multycolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

mydata3$StdResid <- round(rstandard(fit2), 3) 
head(mydata3[order(-mydata3$StdResid), ])

## # A tibble: 6 × 8
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>
## 1     5       45  2180       1       1 Yes           Yes               2.58
## 2     2       11  2790       1       0 Yes           No                2.05
## 3    18        1  2800       1       0 Yes           No                1.78
## 4    18        1  2800       1       1 Yes           Yes               1.78
## 5     8        2  2820       1       0 Yes           No                1.66
## 6    10        1  2810       0       0 No            No                1.60

tail(mydata3[order(-mydata3$StdResid), ])

## # A tibble: 6 × 8
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>
## 1    24        5  1830       1       0 Yes           No               -1.19
## 2    14       16  1660       0       1 No            Yes              -1.26
## 3    13        8  1800       0       0 No            No               -1.38
## 4    12       14  1650       0       1 No            Yes              -1.50
## 5    12       14  1650       0       0 No            No               -1.50
## 6     7        2  1760       0       1 No            Yes              -2.15

mydata3$CooksD <- round(cooks.distance(fit2), 3)
head(mydata3[order(-mydata3$CooksD), ])

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes           Yes               2.58  0.32 
## 2    43       37  1740       0       0 No            No                1.44  0.104
## 3     2       11  2790       1       0 Yes           No                2.05  0.069
## 4     7        2  1760       0       1 No            Yes              -2.15  0.066
## 5    37        3  2540       1       1 Yes           Yes               1.58  0.061
## 6    40        2  2400       0       1 No            Yes               1.09  0.038

hist(mydata3$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

hist(mydata3$StdResid, 
     xlab = "Standardized Residuals", 
     ylab = "Frequency", 
     main = "Histogram of Standardized Residuals")

By looking at the standardized residuals we can see that none are above 3 or below -3, which would automatically categorize them as outlines and they would have been removed. However when looking at the cooks distance we can see that one observation in particular has a significantly higher cooks distance than all others. This means that it has a higher effect on the regression than other observations. When combining this fact with the facts that the standardized residual of this observation is very close to 3 (2.577) I decided to remove it from my data set. This will be done later when estimating fit 2 again in order to avoid problems when completing the next 2 exercises. In the next 2 exercises I will use the regression with the “outlier” included.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

mydata3$StdFittedValues <- scale(fit2$fitted.values)

scatterplot(y = mydata3$StdResid, x = mydata3$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Heteroskedasticity occurs when the variance is not constant. on this scatter plot we can see that the variance of standardized residuals is constant so therefore we can say there is no heteroskedasticity and there is homoskedasticity, which is not a problem.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata3$StdResid, 
     prob = TRUE,
     xlab = "Standardized residuals", 
     ylab = "Density", 
     main = "Histogram of standardized residuals")
curve(dnorm(x, mean = mean(mydata3$StdResid), sd = sd(mydata3$StdResid)), 
      add = TRUE,
      lwd = 2,
      col = "red")

shapiro.test(mydata3$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata3$StdResid
## W = 0.95303, p-value = 0.003645

From the histogram we can already suspect that the standardized residuals are not distributed normally because the shape of the distribution does not match the one of the normal distribution. When we formally test it with the Shapiro-Wilk normality test where the H0 : distribution is normal, and Ha : distribution is not normal we can reject the null hypothesis at a 5% significance level. This means that the distribution is not normal.

Estimate the fit2 again without potentially excluded cases and show the summary of the model. Explain all coefficients.

mydata3 <- mydata3[!(mydata3$CooksD == 0.320), ]

head(mydata3[order(-mydata3$CooksD), ])

## # A tibble: 6 × 10
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD StdFittedValues[,1]
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>               <dbl>
## 1    43       37  1740       0       0 No            No                1.44  0.104              -2.65 
## 2     2       11  2790       1       0 Yes           No                2.05  0.069               0.790
## 3     7        2  1760       0       1 No            Yes              -2.15  0.066               1.37 
## 4    37        3  2540       1       1 Yes           Yes               1.58  0.061               0.342
## 5    40        2  2400       0       1 No            Yes               1.09  0.038               0.329
## 6     8        2  2820       1       0 Yes           No                1.66  0.037               1.34

fit2 <- lm(Price ~ Age + Distance,
          data = mydata3) 

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

sqrt(summary(fit2)$r.squared)

## [1] 0.6955609

Here I finally removed the observation that I decided needed to be removed, more specifically the on with a cooks distance of 0.320 and standardized residual of 2.577. After that i checked if it was properly removed, and it was. I later estimated fit2 again to “update” the regression without the removed observation.

The regression functions is constructed as Price = 2456.076 - 6.464 x Age - 22.955 x Distance

The intercept coefficient means that on average is Age and Distance are 0 and everything else stays the same the price will be 2456.076. The regression coefficient for Age tells us the if Age increases by 1, the price will on average decrease by 6.464 if all else stays equal. The regression coefficient for Distance tells us the if Distance increases by 1, the price will on average decrease by 22.955 if all else stays equal. All 3 of the coefficients have a p value below 0.05 which leads us to reject the null hypothesis at each. The null hypothesis is that the coefficient in equal to 0 therefore we can say at a 5% level of significance, or even less with the intercept and reg coeff of Distance, that the coefficients are not equal to 0

The multiple r squared statistic is the coefficient of determination an it tells us that 48.38% of the variability in the price is explained by the x variables (in this case Age and Distance).

The square root of R-squared is the coefficient of correlation and it tells us the absolute correlation between the observed variables and the predicted variables, in this case it is 0.6955609.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = mydata3)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With the analysis of variance we can see that at a 5% level of significance we can conclude that the fit3 regression fits data better than fit2. This is because if the p value is lower than 0.05 anova says the more complex model is better (fit3) and if the p value is more than 0.05 anova says the less complex is better (fit2)

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2329.724     93.066  25.033  < 2e-16 ***
## Age                -5.821      3.074  -1.894  0.06190 .  
## Distance          -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFactorYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFactorYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

The regression coefficient for Parking tells us that is the apartment has a parking spot the price will on average be higher by 167.531 if all else stays the same. This coefficient also has a p value of below 0.01 so we can say at a 1% level of significance that the coefficient is not equal to 0. The coefficient for balcony has a p value of 0.79795 therefore we can not say that the coefficient is not equal to 0.

The null hypothesis for the test with the F-statistic is that the population coefficient of determination equals to 0 and the alternative hypothesis is that the population coefficient of determination is greater than 0.

Save fitted values and claculate the residual for apartment ID2.

mydata3$FittedValues <- fitted.values(fit3)
mydata3$Residuals <- residuals(fit3)

round(mydata3[2, 12], 2)

## # A tibble: 1 × 1
##   Residuals
##       <dbl>
## 1      428.

The residual for apartment ID2 is equal to 427.8 which means that the difference between the actual price and price calculated by the regression is 427.8, or that the actual price is higher by 427.8 that the estimated value by the regression.

R_Exam

2022-09-21