TASK 1: ANALYSIS OF YOUR OWN DATA SET - Airquality

First I spent some time finding a data set that I liked. Already on our first day of learning R, I found the airquality data set, and I have been practising and improving my R skills on it ever since. In the tasks below, I will show what I have learned and how I would use this knowledge in my day-to-day research.

First things first: importing the data set. I chose one of the data sets built into R, called airquality, which contains New York air quality measurements.

It has 153 observations and 6 variables.

# Finding available data sets
data(package = .packages(all.available = TRUE))
# Choosing and importing my chosen data set 

mydata_Airquality <- airquality

Explanation of variables

According to the R documentation, the data set contains daily air quality measurements in New York from May to September 1973:

  • Ozone: mean ozone concentration in parts per billion (ppb)
  • Solar.R: solar radiation in Langleys
  • Wind: average wind speed in miles per hour (mph)
  • Temp: maximum daily temperature in degrees Fahrenheit
  • Month: month of the observation (1-12)
  • Day: day of the month (1-31)

Based on this, we need to perform some manipulations on the data to prepare it for analysis. This is also visible in the following summary:

summary(mydata_Airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

We can clearly see that the data needs some adjustments, especially for the variables Month and Day, plus a few smaller changes to other variables that will allow a better analysis of the data. My steps are as follows.

Removing a variable

mydata2_Airquality <- mydata_Airquality[, !colnames(mydata_Airquality) %in% "Day"]

I decided to remove the last column, Day, because it was not of significant value for the purpose of my analysis. I removed it by name with the colnames function rather than by position: if I had referred to the column only by its number (e.g. column 6), re-running the chunk would delete whatever column happened to be in that position. Removing it by name also leaves me the freedom to add variables in the future; a name-based alternative is sketched below.
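
For comparison, a minimal sketch of another name-based way to drop the same column (the object name mydata2_alt is only illustrative), with the fragile position-based version shown as a comment:

# Base R alternative: drop the Day column by name with subset()
mydata2_alt <- subset(mydata_Airquality, select = -Day)

# Position-based removal is fragile: re-running the chunk would keep
# deleting whatever happens to sit in column 6 at that moment.
# mydata2_fragile <- mydata_Airquality[, -6]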

Renaming a variable

For a better understanding of what the column represents, I decided to rename it. I was also careful to write the new name as a single word, to avoid possible errors in the upcoming steps.

colnames(mydata2_Airquality)[2] <- "SolarRadiation"

head(mydata2_Airquality)
##   Ozone SolarRadiation Wind Temp Month
## 1    41            190  7.4   67     5
## 2    36            118  8.0   72     5
## 3    12            149 12.6   74     5
## 4    18            313 11.5   62     5
## 5    NA             NA 14.3   56     5
## 6    28             NA 14.9   66     5

Change variable Month into factor

mydata2_Airquality$Month <- factor(mydata2_Airquality$Month,
                         levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                         labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

Because it does not make sense to calculate basic descriptive statistics (e.g. mean = 6.993) for the values of the variable Month, I converted our fifth and last column into a factor. I also replaced the month numbers with short month names for easier understanding of the data and a clearer visual presentation later on.

Cleaning the data of missing values (NA)

After the first look at my data I noticed that some variables contained missing values (NA), so I decided to remove those rows. The first summary already shows how many there are (37 in Ozone and 7 in SolarRadiation), which gives a first impression of how much this will affect my data.

library(tidyr)
mydataAirquality_clean <- drop_na(mydata2_Airquality)

summary(mydataAirquality_clean)
##      Ozone       SolarRadiation       Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
##                                                                 
##      Month   
##  Sep    :29  
##  Jul    :26  
##  May    :24  
##  Aug    :23  
##  Jun    : 9  
##  Jan    : 0  
##  (Other): 0

We can see that we are left with 111 observations. We can also see that the summary for the variable Month is now cleaner and makes much more sense.

We can also see that all of the observations come from only 5 out of 12 months. There are also some changes in the sample statistics, so I would like to explain one of the new, corrected values for each variable:

  1. Ozone median = 31: In the airquality data set, half of the observations have an ozone level of 31 ppb or below, and half are above 31 ppb.

  2. SolarRadiation mean = 184.8: On average, the observations in this data set recorded solar radiation of 184.8 Langleys.

  3. Wind 1st Qu. = 7.40: The first quartile (Q1) of the variable Wind is 7.4 mph, meaning that 25% of the observations have a wind speed of 7.4 mph or less, and 75% have more than 7.4 mph.

  4. Temp Max. = 97: The maximum temperature among the observations was 97 degrees Fahrenheit.

  5. Month Sep = 29: 29 observations were taken in September.
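
The values interpreted above can be checked directly with base R functions (a small sketch on the cleaned data set):

median(mydataAirquality_clean$Ozone)                 # 31 ppb
mean(mydataAirquality_clean$SolarRadiation)          # about 184.8 Langleys
quantile(mydataAirquality_clean$Wind, probs = 0.25)  # 7.4 mph
max(mydataAirquality_clean$Temp)                     # 97 degrees F
table(mydataAirquality_clean$Month)["Sep"]           # 29 observations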

Adding a new variable about the Ozone level

# Create a new variable 'ozone_level' based on the specified ozone advisory levels
mydataAirquality_clean$Ozone_level <- cut(mydataAirquality_clean$Ozone, 
                                   breaks = c(0, 54, 70, 85, 105, 200, Inf), 
                                   labels = c("Good", "Moderate", 
                                              "Unhealthy for Sensitive Groups", 
                                              "Unhealthy", 
                                              "Very Unhealthy", 
                                              "Hazardous"),
                                   right = TRUE)

# Print out the first few rows to check the result
head(mydataAirquality_clean)
##   Ozone SolarRadiation Wind Temp Month Ozone_level
## 1    41            190  7.4   67   May        Good
## 2    36            118  8.0   72   May        Good
## 3    12            149 12.6   74   May        Good
## 4    18            313 11.5   62   May        Good
## 5    23            299  8.6   65   May        Good
## 6    19             99 13.8   59   May        Good

I decided to research the ozone levels a bit more. I added a new variable that tells me which observations have acceptable ozone levels and which do not. I used the criteria from https://www.nps.gov/subjects/air/humanhealth-ozone.htm.

Criteria (ozone concentration in ppb, matching the breaks used above):

  • 0-54: Good
  • 55-70: Moderate
  • 71-85: Unhealthy for Sensitive Groups
  • 86-105: Unhealthy
  • 106-200: Very Unhealthy
  • above 200: Hazardous

Creation of a new data frame with Ozone above 70 ppb (unhealthy levels)

I decided to create a new data frame that includes only the observations with unhealthy ozone levels (above 70 ppb). Because ozone has a big impact on air quality and human health, I also wanted to look at the highest values, in my case the top 10.

mydata_Ozone <- mydataAirquality_clean[mydataAirquality_clean$Ozone >= 71,]

head(mydata_Ozone[order(-mydata_Ozone$Ozone), ], 10)
##    Ozone SolarRadiation Wind Temp Month    Ozone_level
## 77   168            238  3.4   81   Aug Very Unhealthy
## 34   135            269  4.1   84   Jul Very Unhealthy
## 63   122            255  4.0   89   Aug Very Unhealthy
## 80   118            225  2.3   94   Aug Very Unhealthy
## 23   115            223  5.7   79   May Very Unhealthy
## 65   110            207  8.0   90   Aug Very Unhealthy
## 53   108            223  8.0   85   Jul Very Unhealthy
## 40    97            267  6.3   92   Jul      Unhealthy
## 41    97            272  5.7   92   Jul      Unhealthy
## 83    96            167  6.9   91   Sep      Unhealthy

Luckily, we have no Hazardous cases, but there were some Very Unhealthy and Unhealthy measurements, which should raise concerns about the air quality in New York.

Because I already assigned the ozone levels as labels, the new variable is recognised as a factor, so I can immediately produce a short summary of it:

summary(mydataAirquality_clean$Ozone_level)
##                           Good                       Moderate 
##                             80                              7 
## Unhealthy for Sensitive Groups                      Unhealthy 
##                             12                              5 
##                 Very Unhealthy                      Hazardous 
##                              7                              0

Separating descriptive statistics by groups of units

I decided to compute separate descriptive statistics of Ozone for each month. Note that the months without observations are still present as empty factor levels, so they appear as NULL in the output below.

library(psych)
describeBy(mydataAirquality_clean$Ozone, group = mydataAirquality_clean$Month)
## 
##  Descriptive statistics by group 
## group: Jan
## NULL
## ---------------------------------------------------- 
## group: Feb
## NULL
## ---------------------------------------------------- 
## group: Mar
## NULL
## ---------------------------------------------------- 
## group: Apr
## NULL
## ---------------------------------------------------- 
## group: May
##    vars  n  mean    sd median trimmed  mad min max range skew
## X1    1 24 24.12 22.89     18    20.7 12.6   1 115   114 2.53
##    kurtosis   se
## X1     7.53 4.67
## ---------------------------------------------------- 
## group: Jun
##    vars n  mean    sd median trimmed   mad min max range skew
## X1    1 9 29.44 18.21     23   29.44 14.83  12  71    59 1.13
##    kurtosis   se
## X1     0.21 6.07
## ---------------------------------------------------- 
## group: Jul
##    vars  n  mean    sd median trimmed   mad min max range skew
## X1    1 26 59.12 31.64     60   58.05 31.13   7 135   128 0.29
##    kurtosis  se
## X1    -0.49 6.2
## ---------------------------------------------------- 
## group: Aug
##    vars  n mean    sd median trimmed   mad min max range skew
## X1    1 23   60 41.77     45   56.42 41.51   9 168   159 0.77
##    kurtosis   se
## X1    -0.21 8.71
## ---------------------------------------------------- 
## group: Sep
##    vars  n  mean    sd median trimmed   mad min max range skew
## X1    1 29 31.45 24.14     23   28.36 13.34   7  96    89 1.45
##    kurtosis   se
## X1     0.97 4.48
## ---------------------------------------------------- 
## group: Oct
## NULL
## ---------------------------------------------------- 
## group: Nov
## NULL
## ---------------------------------------------------- 
## group: Dec
## NULL

Presentation of descriptive statistics of the final data set

summary(mydataAirquality_clean)
##      Ozone       SolarRadiation       Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
##                                                                 
##      Month                            Ozone_level
##  Sep    :29   Good                          :80  
##  Jul    :26   Moderate                      : 7  
##  May    :24   Unhealthy for Sensitive Groups:12  
##  Aug    :23   Unhealthy                     : 5  
##  Jun    : 9   Very Unhealthy                : 7  
##  Jan    : 0   Hazardous                     : 0  
##  (Other): 0

The summary and explanations stay the same as before. I would just add that in this new summary we can also see how often each ozone level occurs among the 111 observations. The most frequent level, with 80 observations, is the Good, healthy level of ozone.
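
The counts can also be expressed as shares of the 111 observations (a short sketch):

# Relative frequencies of the ozone levels
round(prop.table(table(mydataAirquality_clean$Ozone_level)), 2)
# "Good" accounts for roughly 80/111, i.e. about 72% of the observations.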

Histogram using function hist

This is the histogram showing the distribution of Solar Radiation that I created on the very first day we started working in R. I wanted to delete it, because I prefer ggplot for plotting distributions, but I decided to keep it to show different ways of creating a histogram.

hist(mydataAirquality_clean$SolarRadiation, 
     main = "Distribution of Solar Radiation", 
     ylab = "Frequency", 
     xlab = "Solar Radiation", 
     col = "pink",
     breaks = seq(from = 0, to = 400, by = 10))

We can see that the distribution is asymmetric, skewed to the left.

Creating a histogram using ggplot function

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydataAirquality_clean, aes(x=`Ozone`)) + 
  geom_histogram(binwidth = 10, colour="white", fill = "orchid1") + 
  theme_dark() + 
  ylab("Frequency")

This is my new histogram, created with ggplot, presenting the frequency distribution of Ozone. I also added some visual improvements.

Here we can see that the distribution is asymmetric, skewed to the right. We can also see some observations with very high values that could affect our further analysis.
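
The visual impression of asymmetry can be quantified with a skewness coefficient; since the psych package is already loaded above, its skew() function can be used (a minimal sketch):

psych::skew(mydataAirquality_clean$SolarRadiation)  # negative: skewed to the left
psych::skew(mydataAirquality_clean$Ozone)           # positive: skewed to the right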

Boxplot with boxplot function

boxplot(mydataAirquality_clean$Ozone)

As I already mentioned under the previous histogram, we have some potential outliers, observations with very high values, which are clearly visible in this boxplot.

This boxplot also confirms that the Ozone distribution is asymmetric, skewed to the right.

Boxplot with ggplot

library(ggplot2) 

ggplot(mydataAirquality_clean, aes(x = Ozone, y = Month, fill = Month)) + 
  geom_boxplot() +
  scale_fill_brewer(palette = "Spectral") + 
  xlab("Ozone") +
  labs(fill="Month")

I decided to create an additional, more advanced boxplot with ggplot. In this boxplot I show Ozone by Month, since I knew from the earlier steps that only five months contain observations. It shows the quartiles and some potential outliers for each individual month.

We can see that the highest ozone levels occurred in July and August, with August having by far the highest observed values, while the months with the lowest ozone levels are May and September. We can also spot some potential outliers in May, June and September, and the monthly distributions are asymmetric.

Scatterplot with ggplot

library(ggplot2)
ggplot(mydataAirquality_clean, aes(x=Ozone, y=Wind)) +
  geom_point(colour = "coral1") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

While researching the effects of our variables on ozone levels, I found it particularly interesting that wind can have a great effect on them, so I examined this with the scatterplot above. Wind can either disperse or concentrate pollutants; our scatterplot supports the former.

We can see that the correlation is negative: the higher the wind speed, the lower the ozone level.
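
A quick sketch that puts a number on this relationship, using the Pearson correlation between Ozone and Wind in the cleaned data:

cor.test(mydataAirquality_clean$Ozone, mydataAirquality_clean$Wind)
# A negative estimate confirms the downward trend seen in the scatterplot.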

With this, I am concluding Task 1 and moving on to Task 2.

TASK 2: MBA

First, we need to import the .xlsx data as follows:

# Importing .xlsx data

library(readxl)
mydataBS <- read_excel("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/R PROGRAM - BOOTCAMP/TAKE HOME EXAM/Task 2/Business School.xlsx")
View(mydataBS)

mydataBS <- as.data.frame(mydataBS)

head(mydataBS)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade
## 1          1         Business            68.4      90.2
## 2          2 Computer Science            70.2      68.7
## 3          3          Finance            76.4      83.3
## 4          4         Business            82.6      88.7
## 5          5          Finance            76.9      75.4
## 6          6 Computer Science            83.3      82.1
##   Work Experience Employability (Before) Employability (After) Status
## 1              No                    252                   276 Placed
## 2             Yes                    101                   119 Placed
## 3              No                    401                   462 Placed
## 4              No                    287                   342 Placed
## 5              No                    275                   347 Placed
## 6              No                    254                   313 Placed
##   Annual Salary
## 1        111000
## 2        107000
## 3        109000
## 4        148000
## 5        255500
## 6        103500

To avoid potential problems later, I will rename the second column to a one-word name, which makes the rest of the analysis easier. I did not do this for every column, because it is not strictly needed, only recommended; how to work with columns whose names contain several words is shown in the short example below and in the following steps.
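
For reference, a column whose name contains spaces can still be used directly by wrapping the name in backticks (a small sketch):

mean(mydataBS$`Undergrad Grade`)   # backticks let us refer to the multi-word column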

# Renaming a variable 

colnames(mydataBS)[2] <- "UndergradDegree" 

head(mydataBS)
##   Student ID  UndergradDegree Undergrad Grade MBA Grade
## 1          1         Business            68.4      90.2
## 2          2 Computer Science            70.2      68.7
## 3          3          Finance            76.4      83.3
## 4          4         Business            82.6      88.7
## 5          5          Finance            76.9      75.4
## 6          6 Computer Science            83.3      82.1
##   Work Experience Employability (Before) Employability (After) Status
## 1              No                    252                   276 Placed
## 2             Yes                    101                   119 Placed
## 3              No                    401                   462 Placed
## 4              No                    287                   342 Placed
## 5              No                    275                   347 Placed
## 6              No                    254                   313 Placed
##   Annual Salary
## 1        111000
## 2        107000
## 3        109000
## 4        148000
## 5        255500
## 6        103500

# Factoring the UndergradDegree variable

mydataBS$UndergradDegree <- factor(mydataBS$UndergradDegree,
                                   levels = c("Art","Business", "Computer Science", "Engineering", "Finance"),
                                   labels = c("Art","Business", "Computer Science", "Engineering", "Finance"))

To support my further analysis, I converted the UndergradDegree variable into a factor. I kept the names of the degrees; this step only changes how R treats the UndergradDegree values.

library(ggplot2)
ggplot(mydataBS, aes(x=UndergradDegree)) + 
  geom_bar(colour="white", fill = "plum") + 
  theme_dark() + 
  ylab("Frequency")

Based on this bar chart, we can see that the most common degree is Business and the least common is Art.

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
round(stat.desc(mydataBS$`Annual Salary`))
##      nbr.val     nbr.null       nbr.na          min          max 
##          100            0            0        20000       340000 
##        range          sum       median         mean      SE.mean 
##       320000     10905800       103500       109058         4150 
## CI.mean.0.95          var      std.dev     coef.var 
##         8235   1722373475        41501            0
options(scipen = 999)

library(ggplot2)
ggplot(mydataBS, aes(x=`Annual Salary`)) + 
  geom_histogram(binwidth = 20000, colour="gray", fill = "dodgerblue") + 
  theme_dark() + 
  ylab("Frequency")

Based on the visual presentation of our data in the histogram above, I would conclude that the distribution is asymmetric, skewed to the right. I also notice some possible outliers with very high values that would probably need further investigation. Without these outliers, I might be willing to say that Annual Salary is approximately normally distributed.

Some basic statistics about MBA Grade, to give us an impression of the data for the upcoming test:

mean(mydataBS$`MBA Grade`)
## [1] 76.04055
sd(mydataBS$`MBA Grade`)
## [1] 7.675114

t-test

With the following t-test, I wanted to test these hypotheses:

H0: 𝜇MBA Grade = 74

H1: 𝜇MBA Grade ≠ 74

# Performing a two-sided t-test with H0: mu_MBAgrade = 74

t.test(mydataBS$`MBA Grade`,
       mu = 74,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydataBS$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Based on the p-value (p = 0.009), we can reject our null hypothesis H0: 𝜇MBA Grade = 74 and conclude, as stated in H1, that the true mean is not equal to 74. Based on the 95 percent confidence interval calculated above, we estimate that the true mean lies somewhere between 74.52 and 77.56 (the sample mean of 76.04 indeed falls inside this interval).
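
As a sanity check, the t statistic can also be reproduced by hand from its definition, t = (sample mean - hypothesised mean) / (sd / sqrt(n)); a small sketch:

x <- mydataBS$`MBA Grade`
(mean(x) - 74) / (sd(x) / sqrt(length(x)))   # about 2.66, matching the t.test() output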

Calculating the effect size

# Calculating the effect size 

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydataBS$`MBA Grade`, mu=74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.
# Interpreting Cohen's d with the rules of Sawilowsky (2009)

effectsize::interpret_cohens_d(0.27, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
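
Before interpreting the result, the effect size can also be reproduced by hand; for a one-sample test, Cohen's d is the mean difference divided by the sample standard deviation (a minimal sketch):

x <- mydataBS$`MBA Grade`
(mean(x) - 74) / sd(x)   # about 0.27, matching cohens_d() above
# Equivalently, d = t / sqrt(n) for the one-sample case.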

I calculated Cohen's d to measure the effect size, which is 0.27 in our case. A Cohen's d of 0.27 is a small effect, suggesting that the deviation from 74 may not be interesting enough to pursue further. With this, I am concluding Task 2.

TASK 3

Import the dataset Apartments.xlsx

library(readxl)
mydataAP <- read_excel("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/R PROGRAM - BOOTCAMP/TAKE HOME EXAM/Task 3/Apartments.xlsx")
View(mydataAP)

mydataAP <- as.data.frame(mydataAP)

head(mydataAP)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m^2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydataAP$Parking <- factor(mydataAP$Parking,
                                   levels = c("0","1"),
                                   labels = c("No","Yes"))
mydataAP$Balcony <- factor(mydataAP$Balcony,
                                   levels = c("0","1"),
                                   labels = c("No","Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

First I calculated some basic statistics to get an idea of what to expect as a result for this task.

mean(mydataAP$Price)
## [1] 2018.941
sd(mydataAP$Price)
## [1] 377.8417

Then I continued with a t-test:

t.test(mydataAP$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydataAP$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Based on the calculated p-value (p = 0.004731, rounded to 0.005), we can reject the null hypothesis (H0: Mu_Price = 1900 EUR) and conclude that the true mean price is not equal to 1900 EUR.

Based on the 95 percent confidence interval calculated above, we estimate that the true mean of the variable Price lies somewhere between 1937.44 and 2100.44 EUR.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = mydataAP)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydataAP)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108 <0.0000000000000002 ***
## Age           -8.975      4.164  -2.156               0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Explanations:

  1. Estimate of the regression coefficient (-8.975): If the age of an apartment increases by 1 year, its price decreases on average by about 8.98 EUR per m^2.

  2. Coefficient of correlation (r = -0.23): The correlation between Age and Price is weak and negative, r = -sqrt(0.05302) ≈ -0.23. The t-test of H0: beta1 = 0 against H1: beta1 ≠ 0 gives p = 0.034 (p < 0.05), so we reject H0, meaning that Age has a statistically significant effect on Price.

  3. Coefficient of determination (0.05302): 5.3% of the variability of Price is explained by the linear effect of Age. Based on this very small coefficient of determination, I would conclude that this is not a good model.
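
A short cross-check of the link between these quantities (a sketch): in simple regression, the correlation coefficient equals the signed square root of the coefficient of determination.

-sqrt(summary(fit1)$r.squared)      # about -0.23 (negative sign taken from the slope)
cor(mydataAP$Price, mydataAP$Age)   # the same value computed directly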

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine whether there is a potential problem with multicollinearity.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(mydataAP[ , c(1, 2, 3)], 
                  smooth = FALSE)

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydataAP[ , c(1,2,3)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

Based on the scatterplot matrix and the correlation matrix above (Pearson correlation coefficients), there is a fairly strong negative correlation between Price and Distance (-0.63), a weak correlation between Age and Price (-0.23), and practically no correlation between Age and Distance (0.04).

Multicollinearity concerns the correlation between the explanatory variables. Since Age and Distance are almost uncorrelated (0.04, p = 0.70), I would not expect a multicollinearity problem in a model with these two predictors; the strong Price-Distance correlation simply suggests that Distance is a useful predictor of Price.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydataAP)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Check the multicollinearity with the VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

I noticed that the calculated VIF statistic is exactly the same for both variables, Age and Distance. This is expected: with only two explanatory variables, each VIF equals 1 / (1 - r^2), where r is the correlation between them.

Because both VIF values are far below 5 and very close to 1, I conclude that the explanatory variables are not strongly correlated, so I keep both of them; we did not detect multicollinearity.
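
A minimal sketch verifying this relationship numerically:

r <- cor(mydataAP$Age, mydataAP$Distance)
1 / (1 - r^2)   # about 1.0018, matching vif(fit2)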

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

#Standardized residuals
mydataAP$StdResid <- round(rstandard(fit2), 3) 

#Cooks distances
mydataAP$CooksD <- round(cooks.distance(fit2), 3)
hist(mydataAP$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydataAP$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydataAP$StdResid
## W = 0.95303, p-value = 0.003645
hist(mydataAP$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

We cannot spot any outliers among the standardized residuals (no observation falls outside +/- 3), but the gaps in the histogram of Cook's distances suggest some units with a large influence. In the next steps I will examine their impact and eliminate them. I will first order the data to see which values of the standardized residuals and Cook's distances might be problematic, and take a closer look at them.
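
The visual inspection can also be complemented with simple numeric checks (a sketch; the cut-offs |standardized residual| > 3 and Cook's distance > 4/n are common rules of thumb, not part of the original output):

which(abs(mydataAP$StdResid) > 3)             # no outliers by this rule
which(mydataAP$CooksD > 4 / nrow(mydataAP))   # candidate high-influence units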

head(mydataAP[order(mydataAP$StdResid), ], 3)
##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013
head(mydataAP[order(-mydataAP$StdResid), ], 3)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 33   2       11  2790     Yes      No    2.051  0.069
## 2   18        1  2800     Yes      No    1.783  0.030
head(mydataAP[order(-mydataAP$CooksD), ], 6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038

As we can see in the histogram and the ordered Cook's distances above, there is a gap between apartments 53 and 13, and another one among apartments 38, 55 and 33.

Because the gap is largest between apartments 38 and 55, let's first remove apartment 38 and see whether it improves the model.

mydataAP$ID <- 1:85

mydataAP <- mydataAP[, c("ID", setdiff(names(mydataAP), "ID"))]

To remove the apartment in a more robust way, I added a new ID variable. I also moved this column so that it is the first column in the data set.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydataAP_new <- mydataAP %>%
  filter(ID != 38)

We can see that mydataAP_new now has one observation fewer: it is the data set without apartment 38.

Now let's check whether this new data set is better. To follow the instructions of Task 3, I created fit2_2, an intermediate model, because I wanted to show my whole thought process before moving on to fit3.

fit2_2 <- lm(Price ~ Age + Distance,
           data = mydataAP_new)

summary(fit2_2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221 < 0.0000000000000002 ***
## Age           -6.464      3.159  -2.046                0.044 *  
## Distance     -22.955      2.786  -8.240     0.00000000000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 0.000000000002339

The R-squared is higher, which suggests that we improved our model. But let's continue with the checks.

mydataAP_new$StdResid <- round(rstandard(fit2_2), 3) #Standardized residuals
mydataAP_new$CooksD <- round(cooks.distance(fit2_2), 3) #Cooks distances

hist(mydataAP_new$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydataAP_new$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydataAP_new$StdResid
## W = 0.95649, p-value = 0.006355
hist(mydataAP_new$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

Because there are still some gaps, I decided to adjust the data a bit more by also removing the next potentially problematic observation, the one with ID 55:

library(dplyr)
mydataAP_new2 <- mydataAP_new %>%
  filter(ID != 55)
fit2_3 <- lm(Price ~ Age + Distance,
           data = mydataAP_new2)

summary(fit2_3)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP_new2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684 < 0.0000000000000002 ***
## Age           -7.850      3.244  -2.420               0.0178 *  
## Distance     -23.945      2.826  -8.473    0.000000000000953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 0.000000000001173
mydataAP_new2$StdResid <- round(rstandard(fit2_3), 3) #Standardized residuals
mydataAP_new2$CooksD <- round(cooks.distance(fit2_3), 3) #Cooks distances

hist(mydataAP_new2$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

We can see that there are no more gaps in the histogram of standardized residuals. Although I never removed an observation for being problematic in terms of standardized residuals (none exceeded +/- 3), removing the observations with large Cook's distances also tidied up this histogram.

shapiro.test(mydataAP_new2$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydataAP_new2$StdResid
## W = 0.95952, p-value = 0.01044
hist(mydataAP_new2$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

With the process above we eliminated the units with a large influence, and the gaps in both histograms are mostly gone.

There is still one gap left in the histogram of Cook's distances, but I decided not to eliminate that observation, since it does not differ as markedly as the ones I removed before.

Also, because the R-squared increased again, I would conclude that this model is better than the previous ones.

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

mydataAP_new2$StdFitted <- scale(fit2_3$fitted.values)

library(car)
scatterplot(y = mydataAP_new2$StdResid, x = mydataAP_new2$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Based on the scatterplot above, I suspect this might be a case of heteroskedasticity; I will confirm or reject this suspicion in the next step.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2_3)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

Based on the calculated p-value (p = 0.052), which is slightly above the significance level of 0.05, we fail to reject H0. There is therefore no evidence of heteroskedasticity (i.e., the variance appears constant), so we assume homoskedasticity.

To conclude, my earlier suspicion is not confirmed.

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

Although I already showed this graph in my previous steps, because it is part of my usual workflow for such exercises, I present the same histogram here again.

hist(mydataAP_new2$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydataAP_new2$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydataAP_new2$StdResid
## W = 0.95952, p-value = 0.01044

I tested the normality of the standardized residuals with the Shapiro-Wilk test. My hypotheses were:

H0: The standardized residuals are normally distributed

H1: The standardized residuals are not normally distributed

Based on the calculated p-value (p = 0.01), which is below the significance level of 0.05, we reject the null hypothesis and conclude that the distribution is not normal, in line with H1.
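
As a graphical complement to the formal test, a normal Q-Q plot of the standardized residuals can be drawn (a minimal sketch):

qqnorm(mydataAP_new2$StdResid)
qqline(mydataAP_new2$StdResid)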

Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients.

summary(fit2_3)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP_new2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684 < 0.0000000000000002 ***
## Age           -7.850      3.244  -2.420               0.0178 *  
## Distance     -23.945      2.826  -8.473    0.000000000000953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 0.000000000001173

Explanation of coefficients:

  1. Age (-7.850): If the age increases by 1 year, the price decreases on average by about 7.85 EUR per m^2, holding Distance constant.

  2. Distance (-23.945): If the distance from the city centre increases by 1 km, the price decreases on average by about 23.95 EUR per m^2, holding Age constant.

  3. Multiple R-squared (0.4968): 49.68% of the variability of apartment prices can be explained by the linear effects of Age and Distance.

Estimate the linear regression function Price = f(Age, Distance, Parking, Balcony). Be careful to correctly include the categorical variables. Save the result in an object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = mydataAP_new2)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2_3, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With anova() we test the following two hypotheses:

  • H0: Δro^2 = 0 (the additional variables do not improve the model)

  • H1: Δro^2 ≠ 0 (the additional variables improve the model)

Based on the analysis of variance table, we reject the null hypothesis at p = 0.028 (p < 0.05), which means we can conclude that Model 2 (fit3) fits the data significantly better.
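
The F statistic in the table can be reproduced by hand from the residual sums of squares (a small sketch): F = ((RSS of fit2_3 - RSS of fit3) / 2) / (RSS of fit3 / 78).

((5982100 - 5458696) / 2) / (5458696 / 78)   # about 3.74, matching the anova() output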

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydataAP_new2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185 < 0.0000000000000002 ***
## Age           -7.197      3.148  -2.286              0.02499 *  
## Distance     -21.241      2.911  -7.296       0.000000000214 ***
## ParkingYes   168.921     62.166   2.717              0.00811 ** 
## BalconyYes    -6.985     58.745  -0.119              0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 0.000000000001449

The F-statistic at the bottom of the summary output tests whether the model as a whole explains any variability in Price:

  • H0: beta_Age = beta_Distance = beta_Parking = beta_Balcony = 0 (equivalently, ro^2 = 0)

  • H1: at least one of these coefficients is different from 0

Because the corresponding p-value is far below 0.05 (p < 0.001), we reject H0 and conclude that the model as a whole is statistically significant.

We can also see the improvement in the increased R-squared.

Explanations:

  1. Multiple R-squared (0.5408): 54% of the variability in apartment prices can be explained by Age, Distance, and whether the apartment has a parking space and/or a balcony.

  2. Regression coefficient for Parking (168.921): Holding Age, Distance and Balcony constant, apartments with a parking space are on average 168.92 EUR per m^2 more expensive than apartments without one. Based on the p-value of 0.008 (p < 0.05), we can reject H0 and conclude that Parking is statistically significant.

  3. Regression coefficient for Balcony (-6.985): The p-value for Balcony is 0.906, well above 0.05, so the variable Balcony is not statistically significant and I will therefore not interpret its coefficient.

Save the fitted values and calculate the residual for the apartment with ID 2.

mydataAP_new2$Fitted <- fitted.values(fit3)
mydataAP_new2$Residuals <- residuals(fit3)


mydataAP_new2[ mydataAP_new2$ID == 2, c("Fitted", "Residuals")] #Selection of variables to be displayed
##     Fitted Residuals
## 2 2377.043  422.9572

The residual for the apartment with ID 2 is 422.96, meaning the model underestimates its price by about 423 EUR per m^2.
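
A quick by-hand verification (sketch): the residual is simply the observed price minus the fitted value for the same apartment.

obs <- mydataAP_new2[mydataAP_new2$ID == 2, ]
obs$Price - obs$Fitted   # 2800 - 2377.043 = 422.957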