R take-home exam - Luka Mihailović Potrč

Introduction

For my take-home exam, I have selected a set of data that is already integrated in R - Edgar Anderson’s Iris Data from the “datasets” package. The data contains 5 variables, with one of them being categorical.

Two data sets exist for this data; I used the “iris” one.


Source:

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.

The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5.

Task 1

Explaining the data set and variables

I will first present a preview of the data using the "head" function and then explain the data set and its variables.

#data() #lists data sets available in loaded packages
#data(package = .packages(all.available = TRUE)) #lists data sets from all installed packages
mydata <- iris #store the built-in iris data frame in the object "mydata"
head(mydata)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Explanations of the data set:

  • the respective unit (ID) represents an iris flower

  • Sepal.Length: Measurement of the respective iris flower’s sepal length in centimeters

  • Sepal.Width: Measurement of the respective iris flower’s sepal width in centimeters

  • Petal.Length: Measurement of the respective iris flower’s petal length in centimeters

  • Petal.Width: Measurement of the respective iris flower’s petal width in centimeters

  • Species: Classification of the respective iris flower into one of the three species: Iris setosa, versicolor, and virginica.

Explanations of variables: The first four variables (the first four columns) deal with measurements of an iris' sepal and petal length and width. I believe all of these to be numerical variables measured on a ratio scale. This is because these measurements are scalable (we can say that one flower's petal is twice as long as another's) and because they have an absolute zero, in the sense that length and width cannot be negative (I realize this absolute-zero criterion is more of a rule of thumb and does not always apply, but still).

The fifth variable tells us which species the observed iris flower belongs to - setosa, versicolor, or virginica. In the first 6 units previewed above, all of the flowers belong to the setosa species, because the data set is sorted by species by default.
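
As a quick check of the categorical variable, we can count how many flowers belong to each species - the iris data set contains 50 flowers of each of the three species:

table(mydata$Species) #frequency of each species (50 flowers per species)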

Performing data manipulations

I will now perform some data manipulations. The variables above are named intuitively, but I will still make some changes for the sake of this exercise.

Example 1: I want to create a data set that does not include the “species” variable.

mydata2 <- mydata[ , -5] #Deleting the fifth column
head(mydata2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

Example 2: I want to create a data set that excludes the first three iris flowers measured (this scenario is invented, but it lets me show how to delete rows).

mydata3 <- mydata[ c(-1, -2, -3), ] #deleting the first three units with IDs 1, 2, and 3
head(mydata3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 7          4.6         3.4          1.4         0.3  setosa
## 8          5.0         3.4          1.5         0.2  setosa
## 9          4.4         2.9          1.4         0.2  setosa

Let us imagine we made an error in a measurement and discovered it later - say it happened with the first flower in the original data set: we recorded its petal length as 1.4 when it should have been 1.5, and its petal width as 0.2 when it should have been 0.3. We change those specific values like so:

mydata4 <- mydata
mydata4[ 1, 3] <- 1.5
mydata4[ 1, 4] <- 0.3
head(mydata4)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.5         0.3  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

I will not delete any data because none of it is NA (missing), and I will not add random values to a data set of 150 units just for the sake of the exercise, because that would not make much sense.

Let me instead do another data manipulation exercise: I will increase all petal length measurements by 1cm and store the result in a new variable called "PetLengthFix".

mydata5 <- mydata
mydata5$PetLengthFix <- mydata5$Petal.Length + 1 #increasing all petal length values by 1cm and storing them in a new variable called "PetLengthFix"
head(mydata5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species PetLengthFix
## 1          5.1         3.5          1.4         0.2  setosa          2.4
## 2          4.9         3.0          1.4         0.2  setosa          2.4
## 3          4.7         3.2          1.3         0.2  setosa          2.3
## 4          4.6         3.1          1.5         0.2  setosa          2.5
## 5          5.0         3.6          1.4         0.2  setosa          2.4
## 6          5.4         3.9          1.7         0.4  setosa          2.7

Let’s also rename some of the variables above for the sake of the exercise.

mydata6 <- mydata
colnames(mydata6) <- c("Sep Length", "Sep width", "Pet Length", "Pet Width", "Specie")
head(mydata6)
##   Sep Length Sep width Pet Length Pet Width Specie
## 1        5.1       3.5        1.4       0.2 setosa
## 2        4.9       3.0        1.4       0.2 setosa
## 3        4.7       3.2        1.3       0.2 setosa
## 4        4.6       3.1        1.5       0.2 setosa
## 5        5.0       3.6        1.4       0.2 setosa
## 6        5.4       3.9        1.7       0.4 setosa

Presenting descriptive statistics

Let us now take a look at some descriptive statistics. We can use different functions to do so; in class, we used the "mean" and "sd" functions to compute specific descriptive statistics, and "summary", "describe" and "stat.desc" to paint a broader overview. I will not be using all of them, since they serve a similar purpose, but will make use of some to show understanding.

If we wish to check the arithmetic mean of, for example, sepal length of irises in centimeters, we can do the following:

round(mean(mydata$Sepal.Length), 2)
## [1] 5.84

This tells us that on average, the observed irises’ sepals are 5.84cm long (rounding applies).

median(mydata$Petal.Width)
## [1] 1.3

Looking at the median above, we can say that half of the observed irises' petals are up to 1.3cm wide, and the other half are wider than 1.3cm.

I will now use the “describe” function for a broader overview of descriptive statistics for each of the variables.

library(psych)
describe(mydata)
##              vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31    -0.61 0.07
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31     0.14 0.04
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27    -1.42 0.14
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10    -1.36 0.06
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00    -1.52 0.07

I will now interpret three results for the sake of practice.

  • min of “Sepal.Width”: The narrowest sepal among the measured irises was 2cm wide.
  • max of “Petal.Width”: The widest petal among the measured irises was 2.5cm wide.
  • skew of “Sepal.Length”: When looking at all 150 measurements of the length of the irises’ sepals, we can see that the distribution is slightly right (positively) skewed. This means that our results tend to cluster slightly on the left and the distribution’s slope can be seen falling slightly on the right (the tail is longer on the right side).

Graphing the distribution

Let us first graph the distribution using a histogram. For practice, I will graph a histogram of the irises’ sepal length below.

hist(mydata$Sepal.Length,
     main = "Distribution of iris sepal length",
     xlab = "Sepal length",
     ylab = "Frequency",
     col = "darkred",
     border = "black",
     breaks = seq(from = 0, to = 10, by = 0.5))

As we can see, the histogram is not quite normally distributed - it is slightly right (positively) skewed, just as the "describe" function indicated above. The sepal lengths cluster around roughly 6cm, which matches the results from the "describe" function: the mean was 5.84cm and the median 5.80cm. The most heavily populated bins (the modal class) lie just around 6cm.
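
To read the modal class off the histogram more precisely, we can tabulate the same 0.5cm bins numerically (a quick check using base R's cut and table functions):

table(cut(mydata$Sepal.Length, breaks = seq(from = 0, to = 10, by = 0.5))) #counts per 0.5cm bin; the largest count marks the modal class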

Let us graph another histogram below, this time for petal width.

hist(mydata$Petal.Width,
     main = "Distribution of iris petal width",
     xlab = "Petal width",
     ylab = "Frequency",
     col = "darkred",
     border = "black",
     breaks = seq(from = 0, to = 3, by = 0.2))

As we can see from the graph above, the distribution is not the classic normal one. We have a lot of petal measurements with a width below 0.5cm. Because the distribution is fragmented, it is slightly harder to tell from the picture, but petal widths scatter around roughly 1.2-1.3cm - this matches the "describe" output above, which shows a mean of 1.20cm and a median of 1.30cm. The most measurements fell into the lowest bin, between 0 and 0.2cm.

We can argue that the distribution above is bimodal, as we can notice two peaks in the frequencies. This is likely due to differences in petal width between species, since one species can feature narrower or wider petals than the rest. We can see this intuitively by inspecting our "mydata" data set: in general, the setosa species has much narrower petals, followed by versicolor, while virginica's petals appear to be the widest.
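
Before confirming this graphically, here is a quick numerical check of that intuition (a small sketch using base R's aggregate function); the mean petal width should differ clearly between species:

aggregate(Petal.Width ~ Species, data = mydata, FUN = mean) #average petal width per species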

In order to confirm this notion, we can use the “ggplot” function to observe data separated by species. We do the following:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x=Petal.Width)) +
  geom_histogram(binwidth = 0.1, colour="black", fill="darkred") +
  facet_wrap(~Species, ncol = 1) + 
  xlab("Petal width") +
  ylab("Frequency")

As suspected, we can see that the reason behind the bimodal distribution regardless of species above is due to the width itself differing between various species. When accounting for that fact with the “ggplot” function, we can see that irises differ notably in the width of their petals based on what species they belong to.

Let’s also draw our boxplots for petal width accounting for species with the “ggplot” function. I could draw a single boxplot for all units of irises, but seeing as their petal width differs notably with species, I believe it makes more sense to draw the boxplots separately.

ggplot(mydata, aes(x=Species, y=Petal.Width)) +
  geom_boxplot() +
  xlab("Species") +
  ylab("Petal Width")

The boxplots above contain a lot of information.

Firstly, let’s talk about what we can derive from a boxplot. The minimum and the maximum measurements are shown by the horizontal lines both above and below the box. As such, we can see, for example, that the minimal petal width of a plant of the setosa species is 0.1cm and the maximal petal width of a plant of the virginica species is 2.5cm.

The medians are represented by the bold lines inside the boxes - we can see that the median is around 0.2cm for the plants of the setosa species, around 1.4cm for the plants of the versicolor species, and 2.0cm for the plants of the virginica species.

The borders of the boxes themselves tell us about the first (the bottom border) and the third (the upper border) quartiles of petal width of the respective species. We can, for example, see, that the third quartile of petal width for the versicolor species is 1.5cm. This means that 75% of those units have a petal width of up to 1.5cm, with the rest of them having a petal width above 1.5cm.
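
To back up these readings with exact numbers, we can compute the quartiles of petal width for each species (a quick sketch using base R's tapply and quantile functions):

tapply(mydata$Petal.Width, mydata$Species, quantile) #min, quartiles, median and max of petal width per species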

We can also notice two outliers in petal width with the setosa species - one has a petal width of 0.5cm and the other has a petal width of 0.6. But it is not sensible to exclude those measurements just because their dispersion around the mean is greater; they still fit into the broader story of iris petal width.

In order to have another grasp of the situation, let’s also show the differences in petal width between different iris species using the function “describeBy”.

library(psych)
describeBy(mydata$Petal.Width, mydata$Species)
## 
##  Descriptive statistics by group 
## group: setosa
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 50 0.25 0.11    0.2    0.24   0 0.1 0.6   0.5 1.18     1.26 0.01
## ------------------------------------------------------------------------------------------ 
## group: versicolor
##    vars  n mean  sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 50 1.33 0.2    1.3    1.32 0.22   1 1.8   0.8 -0.03    -0.59 0.03
## ------------------------------------------------------------------------------------------ 
## group: virginica
##    vars  n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 50 2.03 0.27      2    2.03 0.3 1.4 2.5   1.1 -0.12    -0.75 0.04

As we can see, the setosa species’ petals are on average 0.25cm wide with the middle value of 0.2cm. With the versicolor species, they are on average 1.33cm wide with the middle value of 1.3cm, and with the virginica species, they are on average 2.03 cm wide with the middle value of 2cm. We can also observe the minimum and maximum values differ quite notably along with certain other results such as skewness, but I will not go into overt detail of that. The point of the “describeBy” function was simply to back up my interpretations of the histograms and boxplots above with specific numbers as proof.


Task 2

We will now analyze the data provided in a separate Excel sheet. The data are from the school year 2021/2022 and examine the effect of online schooling on ninth-graders by measuring their body weight.

Let’s have a look at the data below.

mydatax <- read.table("./Body mass.csv", 
                     header=TRUE, 
                     sep=";", 
                     dec=",")
head(mydatax, 3)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5

Descriptive statistics

In this first part, we will show the descriptive statistics of the data about body mass and additionally portray its distribution in a histogram. Since we have already used the “describe” function, let us opt for “stat.desc” this time around. We will exclude the ID column from stat.desc because we don’t want descriptive statistics for the corresponding “order number” of an observed unit - it would tell us nothing.

library(pastecs)
round(stat.desc(mydatax[ , -1]), 2)
##      nbr.val     nbr.null       nbr.na          min          max        range          sum       median         mean 
##        50.00         0.00         0.00        49.70        83.20        33.50      3143.80        62.80        62.88 
##      SE.mean CI.mean.0.95          var      std.dev     coef.var 
##         0.85         1.71        36.14         6.01         0.10

As we can see, the data range from 49.70 to 83.20kg. The middle value is 62.80kg and the arithmetic mean is 62.88kg. Since these two coincide so well, outliers are unlikely: were there notable outliers, the mean would very likely be pulled away from the median (unless the outliers' influences on the mean cancelled out). In addition, the coefficient of variation is rather low at 0.10; if there were notable outliers, it would quite likely be higher.

We can even construct a confidence interval manually at the default alpha of 0.05 with the data above, by adding and subtracting the CI.mean.0.95 value (1.71) to and from our arithmetic mean. Our 95% CI for the population mean is then 61.17 < Mu < 64.59. We can be fairly confident that the population mean falls within this interval estimate.
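
The same interval can be reproduced directly from the sample values (a minimal sketch using the t-distribution quantile, which is also where the CI.mean.0.95 value above comes from):

m  <- mean(mydatax$Mass)
se <- sd(mydatax$Mass) / sqrt(length(mydatax$Mass)) #standard error of the mean
round(m + c(-1, 1) * qt(0.975, df = length(mydatax$Mass) - 1) * se, 2) #lower and upper bound of the 95% CI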

Next, let's draw a histogram of our data. Although it may make the histogram appear smaller, I will structure the x-axis to begin at 0 and end at 100, because starting the scale at 40 or so can be visually misleading.

hist(mydatax$Mass, 
     main = "Distribution of ninth-graders' weight",
     xlab = "Weight",
     ylab = "Frequency",
     col= "darkred",
     border = "black",
     breaks = seq(from = 0, to = 100, by = 10))

As we can see from the histogram above, most ninth-graders' weight falls between 60kg and 70kg (the modal class). The distribution is slightly positively (right) skewed.

Hypothesis testing, result interpretation and effect size

In order to check the effect of online schooling on ninth-graders' weight, we have been provided with the average weight from the school year 2018/2019, against which we will compare our results to see whether we can claim they are different (higher) with statistical significance. The average weight in 2018/2019 was 59.5kg.

Our hypotheses are as follows:

H0: Mu = 59.5kg

H1: Mu =/= 59.5kg

Let’s do the test below.

t.test(mydatax$Mass,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydatax$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

We can see a rather high t-value, which already suggests the null hypothesis is unlikely to hold. Indeed, the p-value is lower than 0.001, making it certainly lower than the 0.05 threshold for rejecting the null hypothesis. We reject H0 at p<0.001 and accept H1. We conclude that, on average, the weight of ninth-graders in the school year 2021/2022 changed, with statistical significance, compared to the school year 2018/2019. The weight increased.

The estimate of the mean (Y_hat) is 62.876kg.

We can also show this using the 95% confidence interval, which yields the same results as the one I computed manually a few paragraphs above (with some differences due to rounding):

61.17 < Mu < 64.58, alpha = 0.05

Because 59.5 is not included in the interval, we can reject H0 in favor of H1.

Now, let’s check whether the effect we discovered is statistically large or small.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
cohens_d(mydatax$Mass, mu = 59.5)
## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.

Our Cohen’s d value is 0.56. Let’s check how large or small that is by interpreting it using R. I will use the interpretation rules by Sawilowsky from 2009 that we mentioned in class as the most frequently used in economics.

interpret_cohens_d(0.56, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)

We can see that by the rules we applied, the effect perceived in the increase of weight is estimated as medium.


Task 3

I will copy and paste the structure, including the sub-questions from "Apartments.Rmd", here in order to make reading more intuitive.

Import the dataset Apartments.xlsx

library(readxl)
mydatay <- read_xlsx("./Apartments.xlsx")
mydatay <- as.data.frame(mydatay) 
head(mydatay)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years

  • Distance: The distance from city center in km

  • Price: Price per m2

  • Parking: 0-No, 1-Yes

  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydatay$ParkingFactor <- factor(mydatay$Parking, 
                             levels = c(1, 0), 
                             labels = c("Y", "N"))

mydatay$BalconyFactor <- factor(mydatay$Balcony, 
                             levels = c(1, 0), 
                             labels = c("Y", "N"))
mydatay <- as.data.frame(mydatay) 
head(mydatay)
##   Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 1   7       28  1640       0       1             N             Y
## 2  18        1  2800       1       0             Y             N
## 3   7       28  1660       0       0             N             N
## 4  28       29  1850       0       1             N             Y
## 5  18       18  1640       1       1             Y             Y
## 6  28       12  1770       0       1             N             Y

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

H0: Mu_price = 1900EUR

H1: Mu_price =/= 1900 EUR

t.test(mydatay$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydatay$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Because the p-value is lower than 0.05, we can reject the null hypothesis in favor of the alternative. We reject H0 at p<0.05 and accept H1. We conclude that, on average, the price per m2 of apartments in the Ljubljana region differs from 1900EUR. It is higher.

The estimate of the mean (Y_hat) is 2018.941EUR

We can also show this using the 95% confidence interval:

1937 < Mu < 2100, alpha = 0.05

Because 1900 is not included in the interval, we can reject H0 in favor of H1.
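
If we only want the interval itself, we can extract it directly from the test object (a small sketch; the conf.int component of the t.test result stores the two bounds):

t.test(mydatay$Price, mu = 1900, alternative = "two.sided")$conf.int #only the 95% confidence interval for the mean price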

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

Let’s first design a scatterplot with the regression line.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(mydatay$Price ~ mydatay$Age,  
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "Price per m2 in EUR",
            xlab = "Age in years")

Now that we have the scatterplot, let's get the summary of the regression model.

fit1 <- lm(Price ~ Age,
          data = mydatay)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydatay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Let’s look at our results. Our regression equation is as follows:

Price per m2_bar = 2185.5 - 9 * Age

Coefficients have the following meanings:

2185.5 is the intercept - this shows us the expected price per m2 of an apartment that is new (0 years of age).

-9 is the b1 coefficient - it tells us that if the age of an apartment increases by 1 year, the price of its m2 decreases by 9EUR on average.

The coefficient of determination (R_squared) tells us that 5.30% of the variability in prices per m2 of apartments is explained by variability in their age.
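
To illustrate how the estimated regression equation is used, below is a minimal sketch predicting the price per m2 for a hypothetical apartment age of 10 years (the value 10 is chosen purely for illustration):

predict(fit1, newdata = data.frame(Age = 10)) #expected price per m2 of a hypothetical 10-year-old apartment, about 2185.5 - 9 * 10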

The t-test for the explanatory variable of age explains the following.

The hypotheses:

H0: B1 = 0

H1: B1 =/= 0

p=0.034

We test whether B1 (the coefficient of Age) has a statistically significant effect on the price per m2. Since the p-value is lower than 0.05, we accept H1 and claim that the slope B1 differs from 0 (p=0.034). We rejected H0 at p<0.05. Age affects price per m2 with statistical significance.

If beta 1 (B1) were zero, it would mean that age has no impact on price per m2 - the regression line would be horizontal.

Finally, let’s compute and explain the correlation coefficient.

cor(mydatay$Price, mydatay$Age)
## [1] -0.230255

We can see that there is a weak negative linear relationship (absolute value between 0.1 and 0.3) between price per m2 and age.
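
As a consistency check, in a simple regression the coefficient of determination is simply the squared correlation coefficient:

cor(mydatay$Price, mydatay$Age)^2 #squared correlation; equals the R-squared of fit1, about 0.053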

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

library(car)
scatterplotMatrix(mydatay[ , c(-4, -5, -6, -7)],
                  smooth = FALSE)

On the diagonal we can see distributions for Age, Distance, and Price.

The distribution of Age is right-skewed but still somewhat resembles a normal distribution. We can see that more apartments are younger rather than older.

The distribution of Distance is quite right-skewed, suggesting that a lot of the apartments cluster around the city center. It also looks slightly bimodal, which may be explained by apartments being clustered around the city center and the suburbs, but less so in between.

The distribution of Price is also right-skewed and looks slightly bimodal, suggesting that the most frequently occurring prices per m2 are on the lower or middle end (relative to the other apartment prices).

To check for multicollinearity, we observe the relationship between the explanatory variables - that is, Distance and Age. They appear very weakly positively correlated, so judging from the scatterplot matrix, I do not think multicollinearity should be a problem.

We cannot read the exact strength of the relationships between the observed variables from the matrix above. In order to be more certain about multicollinearity, let's take a look at the exact correlations below.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydatay[ , c(-4, -5, -6, -7)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

Multicollinearity is a phenomenon of strong correlation among the explanatory variables themselves. In our case, the explanatory variables are Age and Distance. We can see that these two feature a very weak positive correlation of 0.04. With such a weak relationship, I doubt multicollinearity will prove an issue.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydatay)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydatay)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

The function is as follows:

Price per m2_bar = 2460.1 - 7.9 * Age - 20.7 * Distance

We expect the price per m2 of a new apartment positioned directly in the center to be 2460.1EUR.

If age increases by 1 year, the price of an apartment will on average decrease by 7.9EUR per m2 (p=0.016), controlling for distance (assuming distance remains constant).

If distance from the city center increases by 1km, the price of an apartment will on average decrease by 20.7EUR per m2 (p<0.001), controlling for age (assuming age remains constant).

43.96% of the variability of an apartment’s price per m2 can be explained by variability of its age and distance from the city center.

We can see that both coefficients have a statistically significant effect on an apartment’s price per m2, because both p-values are below 0.05.

Check the multicollinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

As discussed two questions above, when we observed the scatterplot matrix and the correlation between Age and Distance, we did not suspect multicollinearity to be an issue between these two explanatory variables. We have now confirmed this by computing the variance inflation factor (VIF) for each of them.

As a rule of thumb, a VIF value that exceeds 5 indicates a problematic amount of collinearity - this is the threshold we mentioned in class. Since both VIF values are below 5, and the mean VIF is about 1, we can say that there is no troubling multicollinearity in our data.
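
For intuition, the VIF can also be computed by hand using the formula VIF = 1 / (1 - R^2), where R^2 comes from regressing one explanatory variable on the other (a minimal sketch):

r2_age <- summary(lm(Age ~ Distance, data = mydatay))$r.squared #R-squared of regressing Age on Distance
1 / (1 - r2_age) #manual VIF for Age; should match the vif() output above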

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Here we will take a look at the distribution of our residual values to see whether there are any outliers in our data. We will also use a Shapiro-Wilk test to check whether the standardized residuals are distributed normally.

Residual: e = y - y_hat. We then standardize the residuals and use them to look for outliers. The standardized value of a residual should lie between -3 and +3; otherwise the unit is treated as an outlier and is a candidate for removal, because under the standard normal distribution a value more than 3 standard deviations from the mean falls outside roughly 99.7% of the distribution.

Cook's distance is the second diagnostic. We need it to find influential units - those with a large impact on the estimated model.
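
For intuition, here is a minimal sketch of how a standardized residual is obtained from the raw residual and the unit's leverage (hat value); the result should agree with what rstandard() returns below:

e_std <- residuals(fit2) / (summary(fit2)$sigma * sqrt(1 - hatvalues(fit2))) #raw residual divided by its estimated standard error
head(cbind(manual = e_std, rstandard = rstandard(fit2)), 3) #the two columns should agree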

mydatay$StdResid <- round(rstandard(fit2), 3) #Adding standardized residuals to our data set
mydatay$CooksD <- round(cooks.distance(fit2), 3) #Adding cook's distances to our data set

Let’s first check for standardized residuals by plotting a histogram to see their distribution.

hist(mydatay$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col= "darkred",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 0.5))

We can see that no unit falls outside the designated range. The distribution does not look quite normal, but there are no standardized residuals beyond -3 and 3. This means we do not need to remove any units from our data as outliers.

Let’s formally test for the normality of our distribution of standardized residuals with a Shapiro-Wilk normality test.

shapiro.test(mydatay$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydatay$StdResid
## W = 0.95303, p-value = 0.003645

H0: The standardized residuals are normally distributed

H1: The standardized residuals are not normally distributed

Since the p-value is lower than 0.05, we reject H0. We conclude that the standardized residuals above are not normally distributed (p=0.004).

Because our sample size (n) is much larger than 30, the departure from normality is not a big concern - by the central limit theorem, the sampling distribution of the estimates is approximately normal even if the residuals are not. Using a t-test on our data should therefore not be a problem, and we leave the data as they are, even though the residuals are not normally distributed.

Let’s move on to Cook’s distances.

hist(mydatay$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency",
     col= "darkred",
     border = "black",
     main = "Histogram of Cooks distances")

Generally speaking, we want to see that the values on the histogram of Cook’s distances are below 1 in order to be able to say that none of them have a significantly huge impact upon the data set.

However, we can observe that, relatively speaking, there is a value with a much greater impact on the data set than the others - it is seen in the histogram above as falling between 0.30 and 0.35. While still below 1, this value has a relatively big impact on the data compared to the others, all of which are below 0.15. We will identify the unit(s) with large impact and possibly remove them.

Let's display the units with the largest values of Cook's distance below. We are looking for the largest values because we wish to identify the units that make the graph above right-skewed - the extreme values with large impact. We can do this in many ways; I will use the function below, with an explanation in the comment.

head(mydatay[order(-mydatay$CooksD), ], 4) #show the 4 units with the largest Cook's distance values (those in the long right tail of the histogram above); the minus sign in front of "mydatay$CooksD" sorts in descending order
##    Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## 38   5       45  2180       1       1             Y             Y    2.577  0.320
## 55  43       37  1740       0       0             N             N    1.445  0.104
## 33   2       11  2790       1       0             Y             N    2.051  0.069
## 53   7        2  1760       0       1             N             Y   -2.152  0.066

As we can see above, the apartment with ID 38 stands out with a Cook's distance of 0.320. It is the unit we observed in the histogram above as the one with a relatively larger impact on the data set. Let's remove it from the data set.

We can remove a unit in a number of ways. I will use the "filter" function below. Looking at the "mydatay" data set and sorting by price, we can see that this unit is the only one with a price of 2180, so I can safely use Price as the filter variable when removing it below.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydatay <- mydatay %>%
    filter(Price != 2180) #keep every unit whose Price is not 2180, i.e. drop the single influential unit identified above

As we can see, we only have 84 units now. The unit with the largest Cook's distance has been removed because of its big influence on the data set.

Let’s plot the Cook’s distances histogram again and see the effect.

hist(mydatay$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency",
     col= "darkred",
     border = "black",
     main = "Histogram of Cooks distances")

As we can see, there are still some values with a relatively big impact on our data set. Let's order the data by Cook's distance again.

head(mydatay[order(-mydatay$CooksD), ], 4)
##    Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## 54  43       37  1740       0       0             N             N    1.445  0.104
## 33   2       11  2790       1       0             Y             N    2.051  0.069
## 52   7        2  1760       0       1             N             Y   -2.152  0.066
## 22  37        3  2540       1       1             Y             Y    1.576  0.061

The unit still causing trouble is the one with a CooksD value of 0.104 and a price of 1740. Let's remove it and check the results. We can use Price as the filter variable again because, filtering by price, we can see in our data that this unit is the only one with a Price of 1740.

Strictly speaking, we don't have to remove that unit; we are technically losing data by doing so, because it is not that dramatically different from the rest. But I want to see what happens, and besides, this is practice. Let's remove it and see.

mydatay <- mydatay %>%
    filter(Price != 1740) #drop the second influential unit, the only one with a price of 1740

The unit has been removed and we are left with only 83 units. Filtering "mydatay" by price confirms that both influential units, with prices of 2180 and 1740, have been removed from the data set.

Let’s try the Cook’s distances histogram yet once again.

hist(mydatay$CooksD, 
     xlab = "Cooks distance",
     ylab = "Frequency",
     col= "darkred",
     border = "black",
     main = "Histogram of Cooks distances")

The distribution looks much more balanced now. We can see there are still some units with a relatively larger impact on the data set, but I will not remove those. I will explain why below.

head(mydatay[order(-mydatay$CooksD), ], 4)
##    Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## 33   2       11  2790       1       0             Y             N    2.051  0.069
## 52   7        2  1760       0       1             N             Y   -2.152  0.066
## 22  37        3  2540       1       1             Y             Y    1.576  0.061
## 38  40        2  2400       0       1             N             Y    1.091  0.038

If we look at this table, we can see that the bracket in the histogram between 0.06 and 0.07 is actually made up of 3 units that have CooksD values larger than 0.06.

My reasoning is as follows: I have already removed the 2 units with the largest impact on the data. Given that our sample size was originally only 85 and is now 83, I will not continue removing units, because every removal reduces how much information the data set - and consequently the regression - contains. Removing three more units would shrink the data set further and would ignore the "story" those units have to tell, so I will not be removing them. Besides, their impact is not that much greater than the rest: the two units I removed had Cook's distances of 0.320 and 0.104, whereas these three range between 0.06 and 0.07 - not impactful enough, relative to the rest, for me to consider trimming the data set further.

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance,
           data = mydatay)

mydatay$StdFitted <- scale(fit2$fitted.values)

library(ggplot2)
ggplot(mydatay, aes(y=StdResid, x=StdFitted)) + 
  geom_point() +
  ylab("Standardized residuals") +
  xlab("Standardized fitted values") +
  theme_minimal()

Observing the scatterplot above, we might argue for some nonlinearity. If we were worried about nonlinearity, we would go back to the beginning and check for nonlinear relationships between the dependent and the explanatory variables; but the scatterplot matrix did not suggest any, so we will not dwell on this too much.

Based on the plot above, we might also argue that we see some heteroskedasticity.

Let's run a Breusch-Pagan test for heteroskedasticity to confirm.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

As we can see, the p-value is 0.052, which is not smaller than 0.05. As such, we do not have enough evidence to reject the null hypothesis.

We retain the null hypothesis and conclude that we lack sufficient evidence to claim that the variance is anything other than constant. This means heteroskedasticity does not appear to be a problem in our data (p=0.052), although we are quite close to the 0.05 margin. This could be due to my removing the second influential unit above - had I removed only one, the p-value here might have been somewhat larger and I would be even more comfortable claiming homoskedasticity.

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

We have already done this above, but I will show it again for easier reading, since we have now removed two units. Let's first plot the histogram of standardized residuals.

hist(mydatay$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col= "darkred",
     border = "black",
     breaks = seq(from = -3, to = 3, by = 0.5))

As said above, the histogram does not appear normal at a glance. But since no standardized residual values fall outside of -3 and 3, I cannot mark any units as outliers and will not be removing them.

And to confirm whether our distribution is in fact not normal, a Shapiro-Wilk test:

shapiro.test(mydatay$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydatay$StdResid
## W = 0.94963, p-value = 0.002636

H0: The standardized residuals are normally distributed

H1: The standardized residuals are not normally distributed

Since the p-value of 0.003 is lower than 0.05, we reject H0. We conclude that the standardized residuals are indeed not normally distributed.

As explained above, we can rely on the central limit theorem because our sample size is larger than 30, so the non-normal distribution of residuals is not a major concern.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

Let’s now show the summary again with the two units removed.

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydatay)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

As we can see, the numbers changed somewhat. The intercept was 2460.1 before; it is now 2490.1. The Age coefficient was -7.9; it is now slightly smaller in magnitude but still rounds to -7.9. The Distance coefficient was -20.7; it is now -23.9. R_squared changed from 43.96% to 49.68%. The p-value of Age increased slightly, and the p-value of Distance got even lower.

The function is as follows:

Price per m2_bar = 2490.1 - 7.9 * Age - 23.9 * Distance

We now expect the price per m2 of a new apartment positioned directly in the center to be 2490.1EUR (p<0.001).

If age increases by 1 year, the price of an apartment will still on average decrease by 7.9EUR per m2 (p=0.02), controlling for all other explanatory variables.

If distance from the city center increases by 1km, the price of an apartment will now on average decrease by 23.9EUR per m2 (p<0.001), controlling for all other explanatory variables.

49.68% of the variability of an apartment’s price per m2 can be explained by variability of its age and distance from the city center.

We can see that both coefficients of Age and Distance have a statistically significant effect on an apartment’s price per m2, because both p-values are below 0.05.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

To add categorical variables, we use dummy coding. We already converted Parking and Balcony into the factors ParkingFactor and BalconyFactor with levels "Y" and "N", reflecting whether an apartment has parking or a balcony (Y) or not (N). Let's include them in the regression function below.

We factored our categorical variables like so:

mydatay$ParkingFactor <- factor(mydatay$Parking, levels = c(1, 0), labels = c("Y", "N"))

mydatay$BalconyFactor <- factor(mydatay$Balcony, levels = c(1, 0), labels = c("Y", "N"))

Because we listed "Y" first in both cases, "Y" is the reference level and is coded 0 in the dummy variable, while "N" is coded 1. This means that a dummy value of 0 indicates the apartment has parking/a balcony, and a dummy value of 1 indicates it does NOT. Thus, we are comparing the price of an apartment without parking/a balcony to the price of an apartment with it.
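
We can verify this coding directly with the contrasts function (a small check):

contrasts(mydatay$ParkingFactor) #dummy coding: reference level "Y" is coded 0, "N" is coded 1
contrasts(mydatay$BalconyFactor)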

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = mydatay)

I will show the summary of the model fit3 below at the appropriate sub-question later.

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypotheses of this test are as follows:

H0: Model 1 (“fit2”) is better

H1: Model 2 (“fit3”) is better

p = 0.03

Because the p-value is lower than 0.05, we can reject the null hypothesis. We accept the alternative hypothesis, which states that model 2 - in our case, the model "fit3" - fits the data better than the model "fit2".

In conclusion, the model fit3 is better. It explains more variability in price (r_squared will be higher).

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydatay)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2520.836     81.736  30.841  < 2e-16 ***
## Age              -7.197      3.148  -2.286  0.02499 *  
## Distance        -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFactorN -168.921     62.166  -2.717  0.00811 ** 
## BalconyFactorN    6.985     58.745   0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

As we said above, the dummy value of "1" represents NOT having parking/a balcony. R indicates this by appending the non-reference level to the variable name in the output (for example, BalconyFactor"N").

We can observe that including Parking and Balcony slightly increased the p-values of Age and Distance, although both remain statistically significant.

The multiple r-squared increased as expected - it was 49.68% before, and is now 54.08%. This means that now, 54.08% of the variability in the dependent variable is explained by variability in our explanatory variables.

Let’s explain our results a bit.

Firstly, we can see that Age, Distance, and Parking are all statistically significant when it comes to explaining variability in price per m2. Balcony, on the other hand, is not statistically significant (p=0.91, which is more than 0.05), so we will not interpret its coefficient further.

Let’s explain our significant variables, their coefficients and p-values, and the r-squared value.

AGE:

H0: Beta1 = 0

H1: Beta1 =/= 0

p=0.025

We see that Age has a statistically significant effect on apartment price per m2 (0.025<0.05).

Explanation of b1 = -7.2: If the age of an apartment increases by 1 year, its price will on average decrease by 7.2EUR per m2 (p=0.025), controlling for all other explanatory variables (assuming they are constant).

DISTANCE:

H0: Beta2 = 0

H1: Beta2 =/= 0

p<0.001

We see that distance from the city center has a statistically significant effect on apartment price per m2 (p<0.001, which is below 0.05).

Explanation of b2 = -21.2: If the distance of an apartment from the city center increases by 1km, its price will on average decrease by 21.2EUR per m2 (p<0.001), controlling for all other explanatory variables (assuming they are constant).

PARKING:

H0: Beta3 = 0

H1: Beta3 =/= 0

p=0.008

We see that parking has a statistically significant effect on an apartment’s price per m2.

Explanation of b3 = -168.9: Assuming that all other variables are the same, the apartments without parking will on average have a lower price by 168.9EUR per m2 compared to the apartments with parking (p=0.008).
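
To make the parking dummy concrete, here is a minimal sketch comparing the predicted prices of two otherwise identical hypothetical apartments, one with and one without parking (the Age, Distance and Balcony values are chosen purely for illustration):

#hypothetical apartments: 10 years old, 5km from the center, with a balcony; only parking differs
new_apts <- data.frame(Age = 10,
                       Distance = 5,
                       ParkingFactor = factor(c("Y", "N"), levels = c("Y", "N")),
                       BalconyFactor = factor("Y", levels = c("Y", "N")))
diff(predict(fit3, newdata = new_apts)) #prediction without parking minus prediction with parking; equals the ParkingFactorN coefficient of about -169EUR per m2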

Let’s also explain the F-statistic computed above.

This is the test of the overall significance of the regression. It checks whether the population coefficient of determination (Ro2) is actually higher than zero.

Test of significance of regression

H0: Ro2 = 0

H1: Ro2 > 0

F= 22.97, p<0.001

We reject H0 at p<0.001. We can say that the population coefficient of determination is higher than 0, meaning that we found a linear relationship between the price per m2 of an apartment and the explanatory variables included in the model.

Next: how strong is the correlation between all of the variables in the model (dependent and explanatory)? Let's compute the multiple correlation coefficient.

sqrt(summary(fit3)$r.squared)
## [1] 0.7354143

Multiple coefficient of correlation = 0.74

There is a strong linear relationship between price per m2 of an apartment and all explanatory variables included in the model.

Save fitted values and calculate the residual for apartment ID2.

Firstly, let’s create variables called “Fitted” and “Residuals” to display in the table.

mydatay$Fitted <- fitted.values(fit3)

mydatay$Residuals <- residuals(fit3)

head(mydatay[colnames(mydatay) %in% c("Price", "Fitted", "Residuals")])
##   Price   Fitted  Residuals
## 1  1640 1706.795  -66.79523
## 2  2800 2377.043  422.95718
## 3  1660 1713.780  -53.78016
## 4  1850 1534.428  315.57226
## 5  1640 2008.963 -368.96320
## 6  1770 1895.522 -125.52243

We can see that the actual price per m2 of the apartment ID2 is 2800EUR. The fitted value (the price forecast based on our model) is 2377.04EUR. The residual (the difference) is 422.96EUR per m2.
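
As a final check, the residual for apartment ID2 can be recomputed by hand (a small sketch; the two removals above did not affect the second row, whose price of 2800 is visible in the table):

mydatay$Price[2] - mydatay$Fitted[2] #actual price minus fitted value for apartment ID2, about 422.96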