R take-home exam

Lara Mahkovec

Task 1

mydata2 <- read.table("./tips.csv",
                   header = TRUE,
                   sep = ",",
                   dec = "." )

1. Explanation of the data set

head(mydata2)

##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

The data set on which the Task 1 is completed was taken from kaggle.com (name of the data set “A Waiter’s Tips”, author of the data is Joe Young, the observations for the data set were collected 7 years ago). The observations were gathered by a waiter, who was collecting information about his tips for a period of several months. Data set includes 244 observations and 7 (initial) variables.

The explanation of the data set:

total_bill: the amount of bill in dollars
tip: the amount of tip in dollars
sex: gender the bill payer
smoker: whether there were smokers in the party (table)
day: day of the week
time: time of the day
size: size of the party (how many people were sitting at the table)

2. Data manipulations

mydata2$sex_F <- factor(mydata2$sex,
                      levels = c("Male", "Female"),
                      labels = c("M", "F"))

Creating a new variable sex_F, which is a categorical variable, based on the variable sex, which is text data - character (converting it into a factor). New categorical variable has two labels, “M” (which is “Male” in original variable) and “F” (which is “Female” in original variable).

mydata2$smoker_F <- factor (mydata2$smoker,
                           levels = c("No", "Yes"),
                           labels = c("No", "Yes"))

Creating a new variable smoker_F, which is a categorical variable, based on variable smoker, which is text data - character (converting it into a factor). The labels of the new categorical variable are the same as the levels of the original variable.

colnames(mydata2)[7] <- "size_of_the_party"

Renaming the variable size (which is in seventh column) to variable “size_of_the_party”.

mydata3 <- mydata2[-244 , c(1, 2, 8)]

tail(mydata3, 8)

##     total_bill  tip sex_F
## 236      10.07 1.25     M
## 237      12.60 1.00     M
## 238      32.83 1.17     M
## 239      35.83 4.67     F
## 240      29.03 5.92     M
## 241      27.18 2.00     F
## 242      22.67 2.00     M
## 243      17.82 1.75     M

Creating a new data frame called mydata3, which contains only the variables total_bill, tip and sex_F (variables in columns 1, 2, 8) and excludes the person in 244 row (exclusion of 244 row). The last 8 observations of the data frame mydata3 are presented above.

mydata3[236, 1] <- 12.50
tail(mydata3, 8)

##     total_bill  tip sex_F
## 236      12.50 1.25     M
## 237      12.60 1.00     M
## 238      32.83 1.17     M
## 239      35.83 4.67     F
## 240      29.03 5.92     M
## 241      27.18 2.00     F
## 242      22.67 2.00     M
## 243      17.82 1.75     M

With the latter code the individual value was changed by specifying the exact position of the value (in this case the 236th row and 1st column). This results in the value of the bill for 236th observation being changed from $10.07 to $12.50.

3. Descriptive statistics

summary(mydata2[ , -c(3, 4, 5, 6)])

##    total_bill         tip         size_of_the_party sex_F   smoker_F 
##  Min.   : 3.07   Min.   : 1.000   Min.   :1.00      M:157   No :151  
##  1st Qu.:13.35   1st Qu.: 2.000   1st Qu.:2.00      F: 87   Yes: 93  
##  Median :17.80   Median : 2.900   Median :2.00                       
##  Mean   :19.79   Mean   : 2.998   Mean   :2.57                       
##  3rd Qu.:24.13   3rd Qu.: 3.562   3rd Qu.:3.00                       
##  Max.   :50.81   Max.   :10.000   Max.   :6.00

In the table above the descriptive statistics for three numerical variables and two categorical variables are presented.

Variable total_bill:

the arithmetic mean is $19.79, meaning that on average people in the restaurant spent $19.79 per one bill;
the median is $17.80, meaning that 50% of people in the restaurant spent $17.80 or less per bill, and 50% of people spent more than $17.80 per bill;
1st quartile is $13.35, meaning that 25% of people in the restaurant spent $13.35 or less per bill, and 75% of people in the restaurant spent more than $13.35 per bill;
3rd quartile is $24.13, meaning that 75% of people in the restaurant spent $24.13 or less per bill, and 25% of people in the restaurant spent more than $24.13 per bill;
min. is $3.07, meaning that the minimum spent amount per bill in the restaurant was $3.07;
max. is $50.81, meaning that the maximum spent amount per bill in the restaurant was $50.81.

Variable tip:

the arithmetic mean is $2.998, meaning that on average people in the restaurant tipped the waiter $2.998 per one bill;
the median is $2.90, meaning that 50% of people in the restaurant tipped the waiter $2.90 or less per bill, and 50% of people tipped the waiter more than $2.90 per bill;
1st quartile is $2.00, meaning that 25% of people in the restaurant tipped the waiter $2.00 or less per bill, and 75% of people in the restaurant tipped the waiter more than $2.00 per bill;
max. is $10.00, meaning that the maximum amount of tip received by the waiter was $10.00.

Variable size_of_the_party:

median is 2.00, meaning that 50% of people came to the restaurant in the groups of 2 or less people (sat at the same table), and 50% of people came to the restaurant in groups larger than 2 people;
max is 6.00, meaning that the largest group of people that came to the restaurant together was a group of 6 people;
3rd quartile is 3.00, meaning that 75% of people came to the restaurant in groups of 3 people or less, and 25% of people came to the restaurant in groups of more than 3 people.

library(psych)
describeBy(x = mydata2$tip,
group = mydata2$sex_F)

## 
##  Descriptive statistics by group 
## group: M
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 157 3.09 1.49      3     2.9 1.48   1  10     9 1.56     3.54 0.12
## ------------------------------------------------------------------------------------------ 
## group: F
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 87 2.83 1.16   2.75    2.75 1.11   1 6.5   5.5 0.66     0.09 0.12

Descriptive statistics can also be shown by groups (in upper case by genders). Therefore, the comparison about the variable tip based on genders can be done. From the results it can be seen that males (arithmetic mean $3.09) on average tip more than females (arithmetic mean $2.83). Standard deviation of 1.16 for females and 1.49 for males indicates that females have less variation around the mean than males (tips are more consistent). Regarding the skewness, distribution of tips of males is more positively skewed (skew = 1.56) than the one of females (skew = 0.66).

4. Graphical representation

hist(mydata2$tip, 
     main = "Distribution of the variable tip", 
     col = "steelblue1",
     border = "magenta3",
     ylab = "Frequency", 
     xlab = "Size of a tip (in dollars)", 
     breaks = seq(0, 10, 1),
     right = FALSE)

The distribution of the variable tip (how much people in the restaurant tipped the waiter) is asymmetrical to the right, meaning that people rather tipped the waiter in small values, than in bigger values.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata2, aes(x = tip, fill = sex)) + 
geom_histogram(position = position_dodge(2), #Type of graph 
binwidth = 3,  
colour = "black") +  
ylab("Frequency") + #y-axis name 
labs(fill = "Gender") + #Name of legend 
theme_minimal()

Looking at the histogram, showing the comparison of males and females regarding the size of the tip given to the waiter, it can be concluded that males gave higher tips than females.

library(ggplot2)
ggplot(mydata2, aes(x=tip)) + #variable tip on x-axis
  geom_histogram(binwidth = 0.5, colour="darkolivegreen3", fill = "hotpink1") +
  ylab("Frequency") +       #variable Frequency on y-axis
  facet_wrap(~day, ncol = 1)  #creating separate histograms for each day

With the above histograms, the comparison of tips per day was made:

Friday has low frequency, with tips being evenly distributed between 2 and 5, without strong peaks;
Saturday has the highest frequency, with majority of tips being distributed between 2 and 5, nevertheless some outliers (higher values of tips) are present;
Sunday has also a high frequency, with distribution from around 2 to 6, being a little more spread out than for Saturday, but not having as many outliers as Saturday;
Thursday has lower frequency than Saturday and Sunday, with majority of tips being between values 1 and 3, with only a few higher ones.

Most tipping data is based on Saturday and Sunday (highest frequencies). Tips are generally small but Saturday has the clearest peak and Sunday has a little more spread data.

library(ggplot2)
ggplot(mydata2, aes(x=tip)) + 
  geom_histogram(binwidth = 1, colour="olivedrab3", fill = "mediumorchid1") +
  ylab("Frequency") +
  facet_wrap(~smoker_F, ncol = 1) +
  theme_minimal()

In the above histograms comparison of tips based on a person being a smoker (yes) or a person not being a smoker (no) is made:

non-smokers: majority of tips fall between values 2 and 4, with a high concentration between 2 and 3, very few values are above 6;
smokers: the pattern is similar but with a little more spread out values, indicating that this group has potential outliers.

library(ggplot2)
ggplot(mydata2, aes(x=tip, fill=sex_F)) +
  geom_histogram(position="dodge", binwidth = 1, colour="gray") +
  facet_wrap(~time, ncol = 1) + 
  ylab("Frequency") +
  labs(fill="Gender")

As there are not any notable differences about tipping for smokers/non-smokers, the next step was to examine the differences among time of the day (having only two options for better visibility of the results). Moreover, the gender was also included as additional variable for comparison:

Dinner: majority of tips is between 2 and 4 for both genders. Males (orange bars) have higher frequency, suggesting that men tend to give tips more often or in larger values with a few possible larger values of tip being only from men;
Lunch: tips are compared to dinner less frequent and smaller, with majority of tips being between 2 and 3. Gender comparison for lunch is more similar than for dinner.

Overall, people tended to tip more and slightly higher at dinner, as well as males on average left higher tips than females, especially at dinner (longer right tale for men distribution).

library(ggplot2) 
ggplot(mydata2, aes(y = tip, x = total_bill)) +  
geom_point() +
theme_minimal()

With the scatterplot the relationship between the value of the bill (x-axis) and the value of the tip (y-axis) can be seen. From the results presented above it can be said that value of bills and tips have a positive relationship (people tended to give a higher tip to the waiter if the value of their bill was higher).

library(ggplot2) 
ggplot(mydata2, aes(y = tip, x = sex)) + 
geom_boxplot()

boxplot(tip ~ sex,
        data = mydata2,
        main = "Tip by Gender",
        ylab = "Value of tip",
        col = c("lightpink", "darkslategray2"),
        border = c("lightpink3", "darkslategray"))

As it can be seen from the two boxplots above Female median tip is around 2.7 and Male median tip is around 3. Therefore, on average males tend to leave slightly higher tips. Both genders show similar range (Interquartile Range, IQR), but male tips have a wider range (indicating more variation). Regarding the outliers, both groups have outliers but Male tips have more extreme values (above 6, some even around 10). To conclude, based on the above boxplot males tend to tip slightly more than females, as well as males tipping behavior is more variable with some extreme values of tips. On the other hand, females are less extreme and more consistent.

Task 2

library(readxl)
mydata4 <- read_xlsx("./Business_School_Task2.xlsx")

mydata4 <- as.data.frame(mydata4)

head(mydata4)

##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before) Employability (After)
## 1          1         Business            68.4      90.2              No                    252                   276
## 2          2 Computer Science            70.2      68.7             Yes                    101                   119
## 3          3          Finance            76.4      83.3              No                    401                   462
## 4          4         Business            82.6      88.7              No                    287                   342
## 5          5          Finance            76.9      75.4              No                    275                   347
## 6          6 Computer Science            83.3      82.1              No                    254                   313
##   Status Annual Salary
## 1 Placed        111000
## 2 Placed        107000
## 3 Placed        109000
## 4 Placed        148000
## 5 Placed        255500
## 6 Placed        103500

1. Distribution of undergrad degrees

colnames(mydata4)[2] <- "Undergrad_Degree" #renaming the variable in second column

library(ggplot2)

ggplot(mydata4, aes(x = `Undergrad_Degree`)) + #setting the x-axis to variable Undergrad_Degree
  geom_bar(fill = "pink", color = "blue") + #creating a bar plot of frequencies for each category of variable Undergrad_Degree, bars are filled in pink and outlined in blue color
  ylab("Frequency") + #setting the y-axis to frequency - count of students in each degree
  theme_light() + #changing the plot style to a light theme for better readability
   labs (title = "Distribution of Undergraduate Degrees", #adding title to a plot
        x = "Undergraduate Degree")+ #explicitly labeling the x-axis
  geom_text(stat = "count", #adding the count numbers on top of each bar
            aes(label = after_stat(count)),
            vjust = -0.3) #slightly shifting the labels above the top of the bars so they don’t overlap.

As it can be seen from the graph above, the most common undergrad degree among MBA students is Business with a frequency of 35 students.

2. Descriptive statistics and distribution of Annual Salary

library(pastecs) 
round(stat.desc(mydata4[ , 9]), 1)

##      nbr.val     nbr.null       nbr.na          min          max        range          sum       median         mean 
##        100.0          0.0          0.0      20000.0     340000.0     320000.0   10905800.0     103500.0     109058.0 
##      SE.mean CI.mean.0.95          var      std.dev     coef.var 
##       4150.1       8234.8 1722373474.7      41501.5          0.4

summary(mydata4[ , 9])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

Above, two representations of descriptive statistics of the variable Annual Salary can be seen:

nbr.val of 100: there are 100 observations in this data set;
there are no missing values (nbr.na = 0) or zero salaries values (nbr.null = 0) for the variable Annual Salary;
minimum salary is 20,000€ and maximum salary is 340,000€, with a corresponding range of 320,000€;
median of 103,500€, meaning that 50% of MBA students have an annual salary of 103,500€ or less, and 50% of them have a higher annual salary;
mean of 109,058€, meaning that on average the annual salary of MBA students is 109,058€;
3rd quartile of 124,000€, meaning that 75% of MBA students have an annual salary of 124,000€ or less, and 25% have a annual salary higher than 124,000€;
std.dev of 41,501.5, meaning that on average salaries deviate about 41,500€ around the mean.

colnames(mydata4)[9] <- "Annual_Salary"

ggplot(mydata4, aes(x = Annual_Salary)) +
  geom_histogram(binwidth = 20000, fill = "lightblue", color = "purple") + #grouping salaries into bins of width 20,000
  scale_x_continuous(labels = scales::comma) + 
  labs(title = "Distribution of Annual Salary",
       x = "Annual Salary") +
         theme_minimal()

The histogram above is right - skewed (positively skewed), having a long tail to the right, meaning that most of MBA students have lower salaries, with a few of them having very high salaries. The tallest bar is around 90,000€ - 110,000€, indicating that the majority of the salaries cluster in this range.

3. Hypothesis testing

colnames(mydata4)[4] <- "MBA_Grade"

shapiro.test(mydata4$MBA_Grade)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata4$MBA_Grade
## W = 0.98799, p-value = 0.5078

In order to perform t-test, the distribution of the analyzed variable needs to be checked. Above the Shapiro-Wilk test is performed to check whether the distribution of the variable MBA_grade is normally distribute or not.

Null hypothesis: the distribution of the variable MBA_Grade is normal

Alternative hypothesis: the distribution of the variable MBA_Grade is not normal

From the normality test above it can be seen that p-value is 0.5078, which means it is larger than the significance level of 0.05, therefore we cannot reject the null hypothesis and it can be concluded that MBA grades are approximately normally distributed, so performing the t-test is appropriate.

t.test(mydata4$MBA_Grade, 
       mu = 74, 
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata4$MBA_Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Here the null hypothesis is as following: 𝐻0:𝜇MBA Grade = 74

Therefore, the alternative hypothesis is: 𝐻1:𝜇MBA Grade ≠ 74

As it can be seen from results above, the p-value is 0.00915, meaning it is smaller than significance level of 0.05, therefore the null hypothesis can be rejected - the sample mean of MBA grade is different from 74. It is 76.04, which is higher than 74. The 95% confidence interval [74.52, 77.56] confirms that the true mean is likely above 74.

mean_grade <- mean(mydata4$MBA_Grade)
sd_grade   <- sd(mydata4$MBA_Grade)
cohens_d   <- (mean_grade - 74) / sd_grade
cohens_d

## [1] 0.2658658

The latter code calculates the effect size of the difference between observed values of variable MBA_grade and the hypothesized mean of 74.

According to Cohen’s benchmarks:

d < 0.2 - Very small
0.2 <= d < 0.5 - Small
0.5 <= d < 0.8 - Medium
d >= 0.8 - Large

it can be concluded that the effect size is small, meaning that the difference is statistically significant but not very large in practical terms (the average grade is only modestly higher than 74 in practical terms).

Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("./Apartments_Task 3.xlsx")

mydata <- as.data.frame(mydata)

head(mydata)

##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata$Parking_F <- factor (mydata$Parking,
                           levels = c("0", "1"),
                           labels = c("No", "Yes"))

mydata$Balcony_F <- factor (mydata$Balcony,
                           levels = c("0", "1"),
                           labels = c("No", "Yes"))

head(mydata)

##   Age Distance Price Parking Balcony Parking_F Balcony_F
## 1   7       28  1640       0       1        No       Yes
## 2  18        1  2800       1       0       Yes        No
## 3   7       28  1660       0       0        No        No
## 4  28       29  1850       0       1        No       Yes
## 5  18       18  1640       1       1       Yes       Yes
## 6  28       12  1770       0       1        No       Yes

With the above code, the two categorical variables (Parking, Balcony) were changed into factors by creating two additional variables (Parking_F, Balcony_F), which are factors.

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata$Price, 
       mu = 1900, 
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Here the null hypothesis is as following: 𝐻0:Mu_Price = 1900 EUR

Therefore, the alternative hypothesis is: 𝐻1:Mu_Price ≠ 1900 EUR

As it can be seen from results above, the p-value is 0.004731, meaning it is smaller than significance level of 0.05, therefore the null hypothesis can be rejected - the sample arithmetic mean of variable Price per m2 is different from 1900€. It is 2018.94€, which is higher than 1900€. The 95% confidence interval [1937.44, 2100.44] confirms that the true mean of price per m2 is likely above 1900€.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
data = mydata) 

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

corr_coeff <- cor(mydata$Age, mydata$Price,
                  method = "pearson")

print(corr_coeff)

## [1] -0.230255

From the results above the simple linear regression function can be fond, specifically Price = 2185.46−8.98×Age, meaning that the predicted average price per m2 when Age = 0 is about 2185€, and the slope of -8.98 indicates that for each additional year of age of an apartment, the price per m2 on average decreases by 9€.

From p-value of 0.03401 (which is smaller than the significance level of 0.05), it can be concluded that regression coefficient is statistically significant (the coefficients are not equal to 0, rejecting the null hypothesis) and due to its negative sign of slope it can be concluded that age of the apartments has a negative effect on price per m2.

Coefficient of correlation, which is calculated by taking square root of R² = 0.053, which is approximately 0.23 and adding negative sign (since the slope is negative), it can be concluded that there is a weak negative correlation between age of apartment and price per m2.

Coefficient of determination is R² = 0.053, meaning that only 5.3% od the variation in price per m2 can be explained by the age of apartment. The other 95% of variability is due to other factors, which are not included in fit1 model.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(mydata[c("Price", "Age", "Distance")],  
smooth = FALSE)

Multicolinearity arises when explanatory variables are strongly correlated with each other. In the scatterplot matrix two explanatory variables are presented (Age and Distance). It can be seen that Age and Distance have almost no relationship (scatterplot is flat, no clear slope). Therefore, it can be concluded that multicolinearity is not a problem here, as Age and Distance are not strongly correlated.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
data = mydata) 

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

library(car) 
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

VIF is used to evaluate the strength of the correlation between explanatory variables. The higher the VIF is, the stronger the correlation among explanatory variables is. The value 5 can be used as a cut-off value. In the results above, it can be seen that both values (for Age and for Distance) of VIF are very close to 1.0, therefore there is no evidence of multicolinearity between Distance and Age (these two variables can be included in the model together without concern).

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata$StdResid <- round(rstandard(fit2), 3)  

head(mydata[order(mydata$StdResid),],10)

##    Age Distance Price Parking Balcony Parking_F Balcony_F StdResid
## 53   7        2  1760       0       1        No       Yes   -2.152
## 13  12       14  1650       0       1        No       Yes   -1.499
## 72  12       14  1650       0       0        No        No   -1.499
## 20  13        8  1800       0       0        No        No   -1.381
## 35  14       16  1660       0       1        No       Yes   -1.261
## 36  24        5  1830       1       0       Yes        No   -1.189
## 54  30       17  1560       0       0        No        No   -1.101
## 5   18       18  1640       1       1       Yes       Yes   -1.073
## 64  18       18  1640       1       1       Yes       Yes   -1.073
## 28  18       19  1620       1       0       Yes        No   -1.071

hist(mydata$StdResid,  
xlab = "Standardized residuals",  
ylab = "Frequency",  
main = "Histogram of standardized residuals",
col = "palegreen",
border = "blue3",
 xlim = c(-3, 3))

Based on the distribution of standardized residuals, the distribution of errors in the population can be predicted. In the histogram above it can be seen that residuals are slightly asymmetrically distributed to the right. The Shapiro-Wilk test gives the p-value of 0.003645, which is smaller than the chosen significance level (0.05), therefore we can reject the null-hypothesis that in the population errors are normally distributed, but in practice regression is usually robust if deviations are not extreme. The histogram can be also used to determine if any of the units are outliers. Since all values are between -3 and +3, which are normally used to detect outliers, it can be concluded that there are no problems with outliers (no strong outliers need to be removed).

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95303, p-value = 0.003645

mydata$CooksD <- round(cooks.distance(fit2), 3) 
hist(mydata$CooksD,  
xlab = "Cooks distance",  
ylab = "Frequency",  
main = "Histogram of Cooks distances",
col = "mediumpurple1")

Cooks distance is a value above 0, where a higher number means a larger impact. From the histogram above it can be seen that a least one unit has a value that is significantly greater than the values of other units (latter can be seen from by a gap between 0.15 and 0.30).

mydata$ID <- 1:nrow(mydata)

The new variable called “ID”, which is a sequence of all observations, was created in order to be able to exclude some particular observations based on their Cooks distance.

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 6)

##    ID CooksD
## 38 38  0.320
## 55 55  0.104
## 33 33  0.069
## 53 53  0.066
## 22 22  0.061
## 39 39  0.038

With the code above observations were ordered based on their Cooks distance (showing the 6 observations with the biggest Cooks distances). As it can be seen the first observation has a very large Cooks distance compared to other observations. Therefore, it can be concluded that the first observation in the above table is the unit with large influence. Consequently, the row 38 is going to be removed at the end of exercises including fit2.

Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance,  
data = mydata) 
mydata$StdResid <- round(rstandard(fit2), 3)  
mydata$StdFittedValues <- scale(fit2$fitted.values) 
library(car) 
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues, 
ylab = "Standardized residuals", 
xlab = "Standardized fitted values", 
boxplots = FALSE, 
regLine = FALSE, 
smooth = FALSE)

The points in the scatter plot of the standardized residuals against the standardized fitted values should be randomly distributed in a horizontal band of constant variability. If the variability changes, this is called heteroskedasticity. In the scatterplot above it can be seen that the variance of standard residuals is constant (random distribution of points), therefore it can be concluded that no strong evidence of heteroskedasticity is present (it is a case of homoskedasticity).

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

 hist(mydata$StdRes,
 main = "Histogram of standardized residuals",
 xlab = "Standardized residuals",
 ylab = "Density",
 col = "royalblue1",
 prob = TRUE,
 xlim = c(-3, 3))

 curve(dnorm(x, mean = mean(mydata$StdRes), sd = sd(mydata$StdRes)),
 col = "sienna3",
 lwd = 2,
 add = TRUE)

As it can be seen from the histogram above, the distribution of the standardized residuals is not completely normal but it is slightly asymmetrical to the right (as it can be also seen from the added normal distribution curve). To clarify latter statement, the Shapiro-Wilk normality test was performed in the next step.

 shapiro.test(mydata$StdRes)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdRes
## W = 0.95303, p-value = 0.003645

Here the null hypothesis is as following: 𝐻0:Distribution of standardized residuals is normal

Therefore, the alternative hypothesis is: 𝐻1:Distribution of standardized residuals is not normal

As it can be seen from results above, the p-value is 0.003645, meaning it is smaller than significance level of 0.05, therefore the null hypothesis can be rejected - the distribution of standardized residuals is not normal.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

mydata <- mydata[!(mydata$ID == 38), ]

fit2 <- lm(Price ~ Age + Distance,
data = mydata) 

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

sqrt(summary(fit2)$r.squared)

## [1] 0.6955609

Firstly, the observation with its Cooks distances of 0.320 was excluded based on the variable ID (which was 38 for this observation). With the function summary new regression values without latter one observation was calculated.

New regression function is as follows: Price = 2456.076 - 6.464 x Age - 22.955 x Distance

The first value is estimate of the regression constant, which is 2456.076. It indicates the average value of the dependent variable (in this case Price), when all explanatory variables are equal to 0.

Estimate of regression coefficient: the regression coefficient for Age means that for any additional one year of age of an apartment (while everything else stays the same), price per m2 will on average decrease by 6.464€. The regression coefficient for Distance means that if Distance increases by 1 unit, while everything stays the same, then the price per m2 will on average decrease by 22.955€.

From the above results it can be also seen that p-value is less than 0.05 (significance level), therefore the null hypothesis (coefficients are equal to 0) can be rejected at the 95% significance level.

Coefficient of correlation: which is calculated by taking square root of R² = 0.4838, which is approximately 0.6956, it can be concluded that Price, Age and Distance are positively and strongly correlated.

Coefficient of determination is R² = 0.4838, meaning that 48.38% of the variation in price per m2 can be explained by the variables Age and Distance.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking_F + Balcony_F,
data = mydata)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking_F + Balcony_F
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of Variance is used when we want to test hypothesis about equality of three or more arithmetic means of independent samples.

Null hypothesis: H0:Δ𝜌2 = 0

Alternative hypothesis: H1:Δ𝜌2 > 0

The null hypothesis says that the difference between the population multiple coefficients of determination is equal to 0, which would indicate that both models (fit2 and fit3) fit the data equally well. The alternative hypothesis says that the difference is positive, meaning that the population multiple coefficient of determination of model fit3 is higher than that of model fit2. The results above indicates that model 2 (fit3) fits the data better (p < 0.05).

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking_F + Balcony_F, 
##     data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2329.724     93.066  25.033  < 2e-16 ***
## Age            -5.821      3.074  -1.894  0.06190 .  
## Distance      -20.279      2.886  -7.026 6.66e-10 ***
## Parking_FYes  167.531     62.864   2.665  0.00933 ** 
## Balcony_FYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

The regression coefficient for Parking means that if an apartment has parking, with everything else staying the same, it has on average 167.53€ higher price per m2 than the one without. The coefficient for Balcony means that if an apartment has a balcony, with everything else staying the same, the price per m2 on average decreases by 15.207€.

F-statistic hypothesis:

H0: population coefficient of determination = 0
H1: population coefficient of determination > 0

Save fitted values and claculate the residual for apartment ID2.

mydata$FittedValues <- fitted.values(fit3)
mydata$Residuals <- residuals(fit3)

head(mydata[ , colnames(mydata) %in% c("ID", "Price", "FittedValues", "Residuals")])

##   Price ID FittedValues  Residuals
## 1  1640  1     1705.952  -65.95206
## 2  2800  2     2372.197  427.80292
## 3  1660  3     1721.159  -61.15894
## 4  1850  4     1563.431  286.56890
## 5  1640  5     2012.244 -372.24396
## 6  1770  6     1908.177 -138.17733

Residual of 427.8 for the ID2 means that the actual price of this apartment (which is 2800€) is by 427.8€ higher than the estimated price from regression (which is 2372.2€)