Task 1

Post-Coma Recovery of IQ

I did an analysis of two different data sets. I completed the TitanicSurvival analysis first but upon rereading the Task 1 instructions, I realised I needed more numeric variables, I redid it. But I liked the Titanic analysis so much, I decided to leave it in. :)

Importing data

data(package = .packages(all.available = TRUE))

I chose Wong, data set about Post-Coma Recovery of IQ.

library(carData)
PostComaIQ <- force(Wong) 
head(PostComaIQ)
##     id days duration  sex      age piq viq
## 1 3358   30        4 Male 20.67077  87  89
## 2 3535   16       17 Male 55.28816  95  77
## 3 3547   40        1 Male 55.91513  95 116
## 4 3592   13       10 Male 61.66461  59  73
## 5 3728   19        6 Male 30.12731  67  73
## 6 3790   13        3 Male 57.06229  76  69

A data frame with 331 observations on the following 11 variables:

  • id: patient ID number.
  • days: number of days post coma at which IQs were measured.
  • duration: duration of the coma in days.
  • sex: a factor with levels Female and Male.
  • age: in years at the time of injury.
  • piq: performance (i.e., mathematical) IQ.
  • viq: verbal IQ.

The summary of the data:

summary(PostComaIQ)
##        id            days            duration         sex           age        
##  Min.   : 405   Min.   :   13.0   Min.   :  0.0   Female: 71   Min.   : 6.513  
##  1st Qu.:3808   1st Qu.:   59.0   1st Qu.:  1.0   Male  :260   1st Qu.:21.737  
##  Median :5085   Median :  150.0   Median :  7.0                Median :26.877  
##  Mean   :4735   Mean   :  481.9   Mean   : 14.3                Mean   :31.853  
##  3rd Qu.:5712   3rd Qu.:  416.0   3rd Qu.: 16.0                3rd Qu.:40.923  
##  Max.   :7548   Max.   :11628.0   Max.   :255.0                Max.   :80.033  
##       piq              viq        
##  Min.   : 50.00   Min.   : 64.00  
##  1st Qu.: 77.00   1st Qu.: 85.00  
##  Median : 87.00   Median : 94.00  
##  Mean   : 87.56   Mean   : 94.96  
##  3rd Qu.: 97.00   3rd Qu.:105.00  
##  Max.   :133.00   Max.   :132.00

Data Manipulation

#Renaming a column
colnames(PostComaIQ)[4] <- "Gender"
#Creating new variables
PostComaIQ$Gender <- factor(PostComaIQ$Gender, 
                             levels = c("Male", "Female"), 
                             labels = c(0, 1))
#Deleting units with missing data
library(tidyr)
PostComaIQ <- PostComaIQ %>% drop_na()
#Creating a new data.frame based on conditions
PostComaIQ2 <- data.frame("Illness" = c("Parkinsons", "Melanoma", "Diabetes", "Cancer"), 
                     "Age" = c(30, 26, 16, 34),
                     "Gender" = c("F", "M", "M", "M"),
                     "Married" = c("YES", "NO", "YES", "NO"))

print(PostComaIQ2)
##      Illness Age Gender Married
## 1 Parkinsons  30      F     YES
## 2   Melanoma  26      M      NO
## 3   Diabetes  16      M     YES
## 4     Cancer  34      M      NO
#I want to see the average time patients were in a coma
mean(PostComaIQ$duration)
## [1] 14.29607
# I want to see the average time females spent in a coma
mean(PostComaIQ[PostComaIQ$Gender == 1, ]$duration)
## [1] 9.267606

Descriptive statistics

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
round(stat.desc(PostComaIQ), 2)
##                      id       days duration Gender      age      piq      viq
## nbr.val          331.00     331.00   331.00     NA   331.00   331.00   331.00
## nbr.null           0.00       0.00    46.00     NA     0.00     0.00     0.00
## nbr.na             0.00       0.00     0.00     NA     0.00     0.00     0.00
## min              405.00      13.00     0.00     NA     6.51    50.00    64.00
## max             7548.00   11628.00   255.00     NA    80.03   133.00   132.00
## range           7143.00   11615.00   255.00     NA    73.52    83.00    68.00
## sum          1567184.00  159493.00  4732.00     NA 10543.41 28981.00 31433.00
## median          5085.00     150.00     7.00     NA    26.88    87.00    94.00
## mean            4734.69     481.85    14.30     NA    31.85    87.56    94.96
## SE.mean           83.24      62.50     1.43     NA     0.76     0.83     0.77
## CI.mean.0.95     163.75     122.95     2.82     NA     1.50     1.64     1.52
## var          2293522.34 1292958.09   678.08     NA   192.32   228.96   197.49
## std.dev         1514.44    1137.08    26.04     NA    13.87    15.13    14.05
## coef.var           0.32       2.36     1.82     NA     0.44     0.17     0.15

Explanations:

  • The patients were in a coma for an average of 14.3 days.
  • The average time they waited before they took the IQ measurement was 481.85 days.
  • 50% of the patients were older than 26.88 years at the time of injury.
  • The youngest patient was 6.51 years old and the oldest was 80.03.
  • 50% of the patients scored 87 in performance (i.e., mathematical IQ) and 95 in verbal IQ.
  • The values of age are dispersed around the mean by 44%
  • The values of performance and verbal IQ are dispered around the mean by 17% and 15% respectively.

Graphing the distribution based on gender

library(ggplot2)
ggplot(PostComaIQ, aes(x=duration)) + 
  geom_histogram(binwidth = 10, colour="gray", fill = "dodgerblue") + 
  theme_dark() + 
  facet_wrap(~Gender, ncol = 1) + 
  ylab("Frequency")

So… this histogram is not pretty at all, because we have quite a few outliers. As we can see, most male patients were in a coma up to 30 days, most of them around 10 days or less, similar is true for female patients. But the data includes patients that were in a coma for 255, 180, quite a few for 130 days, so that elongates the histogram and makes it so severly skewed to the right.

To eliminate the outliers, I would have to use a histogram of Standardized residuals and Cooks distance function, find the outliers and eliminate them to get a histogram that is nice, with normally distributed data.

library(car) 
scatterplot(PostComaIQ$viq ~ PostComaIQ$piq,  
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "Tested verbal IQ",
            xlab = "Tested mathematical IQ") 

I wanted to see if the tested mathematical and verbal IQs have any correlation and as we can see, they have a strong correlation. Now I just want to check if the duration of the coma had any effect on the tested mathematical and verbal IQs:

library(car) 
scatterplot(PostComaIQ$duration ~ PostComaIQ$piq,  
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "Duration of the coma",
            xlab = "Tested mathematical IQ") 

library(car) 
scatterplot(PostComaIQ$duration ~ PostComaIQ$viq,  
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "Duration of the coma",
            xlab = "Tested verbal IQ") 

Again, we see the outliers at the top, but nevertheless, there is some correlation between the duration of the coma and the tested verbal and mathematical IQs.

library(ggplot2)
ggplot(PostComaIQ, aes(x=Gender, y=piq, fill=duration)) + 
  geom_boxplot() +
  scale_fill_brewer(palette="Spectral") + 
  xlab("Gender") + 
  ylab("Tested Mathematical IQ")
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

  labs(fill="Tested Mathematical IQ")
## $fill
## [1] "Tested Mathematical IQ"
## 
## attr(,"class")
## [1] "labels"

From the boxplot we can see that the Tested mathematical IQ did not vary in mean between the male (0 in data) patients and female (1 in data) patients. We can recognise that is true from psychological theory. We can also see that female patients scored a larger range on the Mathematical IQ test than male patients.

Titanic Survival data

I chose the data set TitanicSurvival, about the Survival of Passengers on the Titanic

library(carData)
mydata <- force(TitanicSurvival) 
head(mydata)
##                                 survived    sex     age passengerClass
## Allen, Miss. Elisabeth Walton        yes female 29.0000            1st
## Allison, Master. Hudson Trevor       yes   male  0.9167            1st
## Allison, Miss. Helen Loraine          no female  2.0000            1st
## Allison, Mr. Hudson Joshua Crei       no   male 30.0000            1st
## Allison, Mrs. Hudson J C (Bessi       no female 25.0000            1st
## Anderson, Mr. Harry                  yes   male 48.0000            1st

A data frame with 1309 observations on the following 4 variables. Explanation of the variables:

  • survived:no or yes.
  • sex: female or male
  • age:in years (and for some children, fractions of a year); age is missing for 263 of the passengers.
  • passengerClass: 1st, 2nd, or 3rd class.

The summary of the data:

summary(mydata)
##  survived      sex           age          passengerClass
##  no :809   female:466   Min.   : 0.1667   1st:323       
##  yes:500   male  :843   1st Qu.:21.0000   2nd:277       
##                         Median :28.0000   3rd:709       
##                         Mean   :29.8811                 
##                         3rd Qu.:39.0000                 
##                         Max.   :80.0000                 
##                         NA's   :263

Data Manipulation

#Renaming a column
colnames(mydata)[2] <- "Gender" 
#Creating new variables
mydata$survived <- factor(mydata$survived, 
                             levels = c("yes", "no"), 
                             labels = c("1", "0"))
mydata$Gender <- factor(mydata$Gender, 
                             levels = c("male", "female"), 
                             labels = c("1", "0"))
#Deleting units with missing data
library(tidyr)
mydata <- mydata %>% drop_na()
#Creating a new data.frame based on conditions
mydata2 <- data.frame("Ticket" = c("YES", "YES", "NO", "YES"), 
                     "Age" = c(30, 26, 16, 34),
                     "Gender" = c("F", "M", "M", "M"),
                     "Survived" = c("YES", "NO", "YES", "NO"))

print(mydata2)
##   Ticket Age Gender Survived
## 1    YES  30      F      YES
## 2    YES  26      M       NO
## 3     NO  16      M      YES
## 4    YES  34      M       NO
#I want to see the average age of people who survived
mean(mydata[mydata$survived == 1, ]$age)
## [1] 28.91823

Descriptive statistics

library(pastecs)
round(stat.desc(mydata), 2)
##          survived Gender      age passengerClass
## nbr.val        NA     NA  1046.00             NA
## nbr.null       NA     NA     0.00             NA
## nbr.na         NA     NA     0.00             NA
## min            NA     NA     0.17             NA
## max            NA     NA    80.00             NA
## range          NA     NA    79.83             NA
## sum            NA     NA 31255.67             NA
## median         NA     NA    28.00             NA
## mean           NA     NA    29.88             NA
## SE.mean        NA     NA     0.45             NA
## CI.mean        NA     NA     0.87             NA
## var            NA     NA   207.75             NA
## std.dev        NA     NA    14.41             NA
## coef.var       NA     NA     0.48             NA

Explanations:

  • The youngest passenger on the Titanic was 0.17 years old and the oldest was 80 years old. The range or the difference between the minimum and maximum age is 79.83 years.
  • The average age of passengers was 29.88 years.
  • 50% of passengers were older that 28 years.
  • The CI mean means that our true mean is on the interval (29.88 - 0.87, 29.88 + 0.87) or (29, 30.8)
  • The standard deviation or the dispersion of values is 14.41 years.
  • The coefficient of variation explains that the data varies from the mean by 0.48 years.

Graphing the distribution

library(ggplot2) #najbolj uporaben polotting system je ggplot
ggplot(mydata, aes(x=age)) + #daš eas(thetics) in daš noter, kar hočeš na x osi
  geom_histogram(binwidth = 5, colour="gray", fill = "dodgerblue") + 
  theme_dark() + 
  facet_wrap(~survived, ncol = 1) + #s tem ločimo glede na, v našem primeru, movie factors
  ylab("Frequency")

In this histogram it appears that the distributions of survivors (1 in data) and victims (0 in data) of the Titanic is skewed to the right, so the distribution is not normal. We cannot check that with the Shapiro-Wilk normality test, because the values are not numeric. If I changed them to ones and zeros, the test would still not work, because the values must be between 3 and 5000.

But, nonetheless, we can see that most victims were aged between 20 and 30, and the survivors were more evenly distributed.

library(ggplot2)
ggplot(mydata, aes(x=age, fill=survived)) +
  geom_histogram(position=position_dodge(2), binwidth = 5, colour = "gray") + 
  facet_wrap(~passengerClass, ncol = 1) + 
  ylab("Frequency") +
  labs(fill="survived") 

With this graph, I wanted to see if traveling in a specific class affected the survival rate (if they survived, that is 1 in data and if they died that is 0 in data). And, as we can see, the people who travelled in first class were luckier than the 3rd class passengers. I find this analysis very interesting. Honestly, because of these graphs, I started liking R a whole lot more. :)

library(ggplot2)
ggplot(mydata, aes(x=age)) +
  theme_bw() +
  geom_histogram(binwidth = 5, colour="gray") +
  facet_wrap(~Gender:survived) + 
  ylab("Frequency") +
  labs(fill="Survived")

I wanted to check if gender affected survival rate and as we can see, many more women (0 in data) survived (1 in data, 0 in column survived means they died) than men (1 in data). I believe that is because they had a policy of “Women and children first” while boarding the escape boats.

library(ggplot2)
ggplot(mydata, aes(x=Gender, y=age, fill=survived)) + 
  geom_boxplot() +
  scale_fill_brewer(palette="Spectral") + 
  xlab("Gender") + 
  labs(fill="Survied")

In this boxplot we can see that the mean of women (0 in data) that survived (1 in data) is higher than the mean of men (1 in data) that survived, and the mean of victims (0 in data) that were men is higher than the mean of victims, that were female. This encourages the finding we specified while observing the 3rd histogram above.

Task 2

library(readxl)
Business_School <- read_excel("~/Desktop/R Take Home Exam 2024/Task 2/Business School.xlsx")
View(Business_School)

Business_School <- as.data.frame(Business_School)

head(Business_School)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2 Computer Science            70.2      68.7             Yes
## 3          3          Finance            76.4      83.3              No
## 4          4         Business            82.6      88.7              No
## 5          5          Finance            76.9      75.4              No
## 6          6 Computer Science            83.3      82.1              No
##   Employability (Before) Employability (After) Status Annual Salary
## 1                    252                   276 Placed        111000
## 2                    101                   119 Placed        107000
## 3                    401                   462 Placed        109000
## 4                    287                   342 Placed        148000
## 5                    275                   347 Placed        255500
## 6                    254                   313 Placed        103500
colnames(Business_School)[2]<- "UndergradDegree" 
colnames(Business_School)[3]<- "UndergradGrade" 
colnames(Business_School)[4]<- "MBAGrade" 
colnames(Business_School)[5]<- "WorkExp" 
colnames(Business_School)[9]<- "AnnSalary"

head(Business_School)
##   Student ID  UndergradDegree UndergradGrade MBAGrade WorkExp
## 1          1         Business           68.4     90.2      No
## 2          2 Computer Science           70.2     68.7     Yes
## 3          3          Finance           76.4     83.3      No
## 4          4         Business           82.6     88.7      No
## 5          5          Finance           76.9     75.4      No
## 6          6 Computer Science           83.3     82.1      No
##   Employability (Before) Employability (After) Status AnnSalary
## 1                    252                   276 Placed    111000
## 2                    101                   119 Placed    107000
## 3                    401                   462 Placed    109000
## 4                    287                   342 Placed    148000
## 5                    275                   347 Placed    255500
## 6                    254                   313 Placed    103500
Business_School$UndergradDegree <- factor(Business_School$UndergradDegree,
                                          levels = c("Art", "Business", "Computer Science", "Engineering", "Finance"),
                                          labels = c("Art", "Business", "Computer Science", "Engineering", "Finance"))
library(ggplot2)
ggplot(Business_School, aes(x=UndergradDegree)) + 
  geom_bar() + 
  ylab("Frequency")

By a good margin, the most common Undergraduate Degree is Buisness, followed by Finance and Computer Science.

library(pastecs)
round(stat.desc(Business_School$AnnSalary))
##      nbr.val     nbr.null       nbr.na          min          max        range 
##          100            0            0        20000       340000       320000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##     10905800       103500       109058         4150         8235   1722373475 
##      std.dev     coef.var 
##        41501            0
options(scipen = 999)
library(ggplot2)
ggplot(Business_School, aes(x=AnnSalary)) + 
  geom_histogram() + 
  ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The minimal annual salary is 20000, the maximum is 340000. The average annual salary is 109058 and 50% od the MBA students have annual salaries higher than 103500. According to the histogram, the distribution seems to be skewed to the right, but we appear to have some outliers, so we should do the Shapiro-Wilks test to check.

mean(Business_School$MBAGrade)
## [1] 76.04055
sd(Business_School$MBAGrade)
## [1] 7.675114
t.test(Business_School$MBAGrade,
       mu = 74,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Business_School$MBAGrade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Based on the p-value (p = 0.00915), which is smaller that 0.05, we can reject the null hypothesis and say that the true mean is not equal to 74. Based on the 95 percent confidence interval, we can stipulate that the true mean is somewhere between 74.51764 and 77.56346.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("~/Desktop/R Take Home Exam 2024/Task 3/Apartments.xlsx")
View(Apartments)

Apartments <- as.data.frame(Apartments)

head(Apartments)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking,
                                          levels = c(0, 1),
                                          labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
                                          levels = c(0, 1),
                                          labels = c("No", "Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

mean(Apartments$Price)
## [1] 2018.941
sd(Apartments$Price)
## [1] 377.8417
t.test(Apartments$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Based on the p-value (p = 0.004731), which is smaller that 0.05, we can reject the null hypothesis and say that the true mean is not equal to 1900. Based on the 95 percent confidence interval, we can stipulate that the true mean is somewhere between 1937.443 and 2100.440.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
          data = Apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108 <0.0000000000000002 ***
## Age           -8.975      4.164  -2.156               0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanations:

  • estimate of regression coefficient: If the age of the apartment increases by 1 year ???
  • coefficient of correlation: If we want to see the correlation, we do the t-test (H0: beta1 = 0, H1: beta1 =/= 0), we get p < 0.05, so we reject H0.
  • coefficient of determination: 5.3 % of the variablity in price is explained by the linear effect of age, so this is a very bad model.

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(Apartments[ , c(1, 2, 3)], 
                  smooth = FALSE) 

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(Apartments[ , c(1, 2, 3)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

As we can see from the correlation matrix and the Pearson correlation coefficient, there is a weak correlation between Age and Distance and Age and Price, but there is a strong correlation between Price and Distance.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

Chris - mail!

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

hist(Apartments$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95303, p-value = 0.003645
hist(Apartments$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

As we can see, we have outliers. Let’s find the outliers and eliminate them.

head(Apartments[order(Apartments$StdResid),], 3) 
##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013
head(Apartments[order(-Apartments$CooksD),], 6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038

As we can see, there is a gap between aparment no. 53 and 13 and between no. 38, 55 and 33 (according to Cooks Distance). But, let’s remove aparment no. 38 and see if our model is going to be better.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Apartments <- Apartments %>%
  filter(!CooksD == 0.320)

Let’s see if it works better now:

fit3 <- lm(Price ~ Age + Distance,
           data = Apartments)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221 < 0.0000000000000002 ***
## Age           -6.464      3.159  -2.046                0.044 *  
## Distance     -22.955      2.786  -8.240     0.00000000000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 0.000000000002339
Apartments$StdResid <- round(rstandard(fit3), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit3), 3) #Cooks distances

hist(Apartments$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95649, p-value = 0.006355
hist(Apartments$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

We see the test is better, but we have to remove another outlier.

head(Apartments[order(Apartments$StdResid),], 3) 
##    Age Distance Price Parking Balcony StdResid CooksD
## 52   7        2  1760      No     Yes   -2.237  0.072
## 13  12       14  1650      No     Yes   -1.488  0.013
## 71  12       14  1650      No      No   -1.488  0.013
head(Apartments[order(-Apartments$CooksD),], 6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 54  43       37  1740      No      No    1.598  0.129
## 33   2       11  2790     Yes      No    2.225  0.084
## 52   7        2  1760      No     Yes   -2.237  0.072
## 22  37        3  2540     Yes     Yes    1.473  0.056
## 25   8       26  2300     Yes     Yes    1.825  0.052
## 57   8        2  2820     Yes      No    1.705  0.039
library(dplyr)

Apartments <- Apartments %>%
  filter(!CooksD == 0.129)

Let’s check again:

fit4 <- lm(Price ~ Age + Distance,
           data = Apartments)

summary(fit4)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684 < 0.0000000000000002 ***
## Age           -7.850      3.244  -2.420               0.0178 *  
## Distance     -23.945      2.826  -8.473    0.000000000000953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 0.000000000001173
Apartments$StdResid <- round(rstandard(fit4), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit4), 3) #Cooks distances

hist(Apartments$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95952, p-value = 0.01044
hist(Apartments$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

And as we see, this works way better. We eliminated the outliers one at a time and kept an eye on R-squared, which is now at almost 50%. That is a good outcome.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Apartments$StdFitted <- scale(fit4$fitted.values)

library(car)
scatterplot(y = Apartments$StdResid, x = Apartments$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

I think we have heteroskedasticity, at least it looks plausible according to the scatterplot, but, let’s check.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit4)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

The null hypothesis, that the variance of errors is constant, cannot be rejected (p-value is above 0.05). We therefore assume homoskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Apartments$StdResid <- round(rstandard(fit4), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit4), 3) #Cooks distances

hist(Apartments$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95952, p-value = 0.01044

To formally test the normal distribution of our standardized residuals, I did the Shapiro Wilks test. I postulated H0: The standardized residuals are normally distributed, H1: The standardized residuals are not normally distributed. Based on the p-value that is less than 0.05, we can reject the null hypothesis and assume the distribution is not normal.

Estimate the fit2 (In my data, the model without excluded units is fit4) again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit4)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684 < 0.0000000000000002 ***
## Age           -7.850      3.244  -2.420               0.0178 *  
## Distance     -23.945      2.826  -8.473    0.000000000000953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 0.000000000001173

Explanations:

  • estimate of regression coefficient: If the age of the appartment goes up by one year, the price of the appartment decreases by 7.850 euro/m^2. If the distance of the appartment increases by 1 km, the price of the appartment decreases by 23.945 euro/m^2
  • coefficient of correlation: We can see that there is a strong correlation between Distance and Price is strong (based on the t-test) and not so strong, but still statistically significant between Age and Price.
  • Multiple R-squared: 49.68% of variability of apartment price can be explained by the linear relationship between price, age and distance.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3 (I saved it to fit5, because otherwise I would have overriden my own data, so I saved them as fit3 and fit4).

fit5 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = Apartments)

With function anova check if model fit5 fits data better than model fit4.

anova(fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We postulate two hypothesis: H0: ∆ro^2=0 and H1: ∆ro^2>0. Based on Pr(>F) = 0.02813, which is smaller than 0.05, we can reject H0 and assume that Model 2 is statistically better.

Show the results of fit5 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit5)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185 < 0.0000000000000002 ***
## Age           -7.197      3.148  -2.286              0.02499 *  
## Distance     -21.241      2.911  -7.296       0.000000000214 ***
## ParkingYes   168.921     62.166   2.717              0.00811 ** 
## BalconyYes    -6.985     58.745  -0.119              0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 0.000000000001449

54% of variability in Apartment prices can be explained by Age, Distance, whether it has a parking space and/or a balcony.

Given the Age, Distance and Balcony, appartments that have a parking space on average sell for a 168.921 euro/m^2 higher price than appartments who don’t have parking.

We postulate two hypothesis: H0: ∆ro^2=0 and H1: ∆ro^2>0. Based on p-value > 0.001, which is smaller than 0.05, we can reject H0 and assume that Parking is statistically significant.

Because Pr(>abs(t))=0.90566, Balcony is not statistically relevant and we will not interpret it.

Save fitted values and calculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit5) 
Apartments$Residuals <- residuals(fit5) 

Apartments[2, c("Fitted", "Residuals")]
##     Fitted Residuals
## 2 2377.043  422.9572

The residual for apartment ID2 is 422.9572

The end :)