mydata2 <- read.table("./tips.csv",
header = TRUE,
sep = ",",
dec = "." )
head(mydata2)
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
The data set on which the Task 1 is completed was taken from kaggle.com (name of the data set “A Waiter’s Tips”, author of the data is Joe Young, the observations for the data set were collected 7 years ago). The observations were gathered by a waiter, who was collecting information about his tips for a period of several months. Data set includes 244 observations and 7 (initial) variables.
The explanation of the data set:
total_bill: the amount of bill in dollars
tip: the amount of tip in dollars
sex: gender the bill payer
smoker: whether there were smokers in the party (table)
day: day of the week
time: time of the day
size: size of the party (how many people were sitting at the table)
mydata2$sex_F <- factor(mydata2$sex,
levels = c("Male", "Female"),
labels = c("M", "F"))
Creating a new variable sex_F, which is a categorical variable, based on the variable sex, which is text data - character (converting it into a factor). New categorical variable has two labels, “M” (which is “Male” in original variable) and “F” (which is “Female” in original variable).
mydata2$smoker_F <- factor (mydata2$smoker,
levels = c("No", "Yes"),
labels = c("No", "Yes"))
Creating a new variable smoker_F, which is a categorical variable, based on variable smoker, which is text data - character (converting it into a factor). The labels of the new categorical variable are the same as the levels of the original variable.
colnames(mydata2)[7] <- "size_of_the_party"
Renaming the variable size (which is in seventh column) to variable “size_of_the_party”.
mydata3 <- mydata2[-244 , c(1, 2, 8)]
tail(mydata3, 8)
## total_bill tip sex_F
## 236 10.07 1.25 M
## 237 12.60 1.00 M
## 238 32.83 1.17 M
## 239 35.83 4.67 F
## 240 29.03 5.92 M
## 241 27.18 2.00 F
## 242 22.67 2.00 M
## 243 17.82 1.75 M
Creating a new data frame called mydata3, which contains only the variables total_bill, tip and sex_F (variables in columns 1, 2, 8) and excludes the person in 244 row (exclusion of 244 row). The last 8 observations of the data frame mydata3 are presented above.
mydata3[236, 1] <- 12.50
tail(mydata3, 8)
## total_bill tip sex_F
## 236 12.50 1.25 M
## 237 12.60 1.00 M
## 238 32.83 1.17 M
## 239 35.83 4.67 F
## 240 29.03 5.92 M
## 241 27.18 2.00 F
## 242 22.67 2.00 M
## 243 17.82 1.75 M
With the latter code the individual value was changed by specifying the exact position of the value (in this case the 236th row and 1st column). This results in the value of the bill for 236th observation being changed from $10.07 to $12.50.
summary(mydata2[ , -c(3, 4, 5, 6)])
## total_bill tip size_of_the_party sex_F smoker_F
## Min. : 3.07 Min. : 1.000 Min. :1.00 M:157 No :151
## 1st Qu.:13.35 1st Qu.: 2.000 1st Qu.:2.00 F: 87 Yes: 93
## Median :17.80 Median : 2.900 Median :2.00
## Mean :19.79 Mean : 2.998 Mean :2.57
## 3rd Qu.:24.13 3rd Qu.: 3.562 3rd Qu.:3.00
## Max. :50.81 Max. :10.000 Max. :6.00
In the table above the descriptive statistics for three numerical variables and two categorical variables are presented.
Variable total_bill:
the arithmetic mean is $19.79, meaning that on average people in the restaurant spent $19.79 per one bill;
the median is $17.80, meaning that 50% of people in the restaurant spent $17.80 or less per bill, and 50% of people spent more than $17.80 per bill;
1st quartile is $13.35, meaning that 25% of people in the restaurant spent $13.35 or less per bill, and 75% of people in the restaurant spent more than $13.35 per bill;
3rd quartile is $24.13, meaning that 75% of people in the restaurant spent $24.13 or less per bill, and 25% of people in the restaurant spent more than $24.13 per bill;
min. is $3.07, meaning that the minimum spent amount per bill in the restaurant was $3.07;
max. is $50.81, meaning that the maximum spent amount per bill in the restaurant was $50.81.
Variable tip:
the arithmetic mean is $2.998, meaning that on average people in the restaurant tipped the waiter $2.998 per one bill;
the median is $2.90, meaning that 50% of people in the restaurant tipped the waiter $2.90 or less per bill, and 50% of people tipped the waiter more than $2.90 per bill;
1st quartile is $2.00, meaning that 25% of people in the restaurant tipped the waiter $2.00 or less per bill, and 75% of people in the restaurant tipped the waiter more than $2.00 per bill;
max. is $10.00, meaning that the maximum amount of tip received by the waiter was $10.00.
Variable size_of_the_party:
median is 2.00, meaning that 50% of people came to the restaurant in the groups of 2 or less people (sat at the same table), and 50% of people came to the restaurant in groups larger than 2 people;
max is 6.00, meaning that the largest group of people that came to the restaurant together was a group of 6 people;
3rd quartile is 3.00, meaning that 75% of people came to the restaurant in groups of 3 people or less, and 25% of people came to the restaurant in groups of more than 3 people.
library(psych)
describeBy(x = mydata2$tip,
group = mydata2$sex_F)
##
## Descriptive statistics by group
## group: M
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 157 3.09 1.49 3 2.9 1.48 1 10 9 1.56 3.54 0.12
## ------------------------------------------------------------------------------------------
## group: F
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 87 2.83 1.16 2.75 2.75 1.11 1 6.5 5.5 0.66 0.09 0.12
Descriptive statistics can also be shown by groups (in upper case by genders). Therefore, the comparison about the variable tip based on genders can be done. From the results it can be seen that males (arithmetic mean $3.09) on average tip more than females (arithmetic mean $2.83). Standard deviation of 1.16 for females and 1.49 for males indicates that females have less variation around the mean than males (tips are more consistent). Regarding the skewness, distribution of tips of males is more positively skewed (skew = 1.56) than the one of females (skew = 0.66).
hist(mydata2$tip,
main = "Distribution of the variable tip",
col = "steelblue1",
border = "magenta3",
ylab = "Frequency",
xlab = "Size of a tip (in dollars)",
breaks = seq(0, 10, 1),
right = FALSE)
The distribution of the variable tip (how much people in the restaurant tipped the waiter) is asymmetrical to the right, meaning that people rather tipped the waiter in small values, than in bigger values.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata2, aes(x = tip, fill = sex)) +
geom_histogram(position = position_dodge(2), #Type of graph
binwidth = 3,
colour = "black") +
ylab("Frequency") + #y-axis name
labs(fill = "Gender") + #Name of legend
theme_minimal()
Looking at the histogram, showing the comparison of males and females regarding the size of the tip given to the waiter, it can be concluded that males gave higher tips than females.
library(ggplot2)
ggplot(mydata2, aes(x=tip)) + #variable tip on x-axis
geom_histogram(binwidth = 0.5, colour="darkolivegreen3", fill = "hotpink1") +
ylab("Frequency") + #variable Frequency on y-axis
facet_wrap(~day, ncol = 1) #creating separate histograms for each day
With the above histograms, the comparison of tips per day was made:
Friday has low frequency, with tips being evenly distributed between 2 and 5, without strong peaks;
Saturday has the highest frequency, with majority of tips being distributed between 2 and 5, nevertheless some outliers (higher values of tips) are present;
Sunday has also a high frequency, with distribution from around 2 to 6, being a little more spread out than for Saturday, but not having as many outliers as Saturday;
Thursday has lower frequency than Saturday and Sunday, with majority of tips being between values 1 and 3, with only a few higher ones.
Most tipping data is based on Saturday and Sunday (highest frequencies). Tips are generally small but Saturday has the clearest peak and Sunday has a little more spread data.
library(ggplot2)
ggplot(mydata2, aes(x=tip)) +
geom_histogram(binwidth = 1, colour="olivedrab3", fill = "mediumorchid1") +
ylab("Frequency") +
facet_wrap(~smoker_F, ncol = 1) +
theme_minimal()
In the above histograms comparison of tips based on a person being a smoker (yes) or a person not being a smoker (no) is made:
non-smokers: majority of tips fall between values 2 and 4, with a high concentration between 2 and 3, very few values are above 6;
smokers: the pattern is similar but with a little more spread out values, indicating that this group has potential outliers.
library(ggplot2)
ggplot(mydata2, aes(x=tip, fill=sex_F)) +
geom_histogram(position="dodge", binwidth = 1, colour="gray") +
facet_wrap(~time, ncol = 1) +
ylab("Frequency") +
labs(fill="Gender")
As there are not any notable differences about tipping for smokers/non-smokers, the next step was to examine the differences among time of the day (having only two options for better visibility of the results). Moreover, the gender was also included as additional variable for comparison:
Dinner: majority of tips is between 2 and 4 for both genders. Males (orange bars) have higher frequency, suggesting that men tend to give tips more often or in larger values with a few possible larger values of tip being only from men;
Lunch: tips are compared to dinner less frequent and smaller, with majority of tips being between 2 and 3. Gender comparison for lunch is more similar than for dinner.
Overall, people tended to tip more and slightly higher at dinner, as well as males on average left higher tips than females, especially at dinner (longer right tale for men distribution).
library(ggplot2)
ggplot(mydata2, aes(y = tip, x = total_bill)) +
geom_point() +
theme_minimal()
With the scatterplot the relationship between the value of the bill (x-axis) and the value of the tip (y-axis) can be seen. From the results presented above it can be said that value of bills and tips have a positive relationship (people tended to give a higher tip to the waiter if the value of their bill was higher).
library(ggplot2)
ggplot(mydata2, aes(y = tip, x = sex)) +
geom_boxplot()
boxplot(tip ~ sex,
data = mydata2,
main = "Tip by Gender",
ylab = "Value of tip",
col = c("lightpink", "darkslategray2"),
border = c("lightpink3", "darkslategray"))
As it can be seen from the two boxplots above Female median tip is around 2.7 and Male median tip is around 3. Therefore, on average males tend to leave slightly higher tips. Both genders show similar range (Interquartile Range, IQR), but male tips have a wider range (indicating more variation). Regarding the outliers, both groups have outliers but Male tips have more extreme values (above 6, some even around 10). To conclude, based on the above boxplot males tend to tip slightly more than females, as well as males tipping behavior is more variable with some extreme values of tips. On the other hand, females are less extreme and more consistent.
library(readxl)
mydata4 <- read_xlsx("./Business_School_Task2.xlsx")
mydata4 <- as.data.frame(mydata4)
head(mydata4)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before) Employability (After)
## 1 1 Business 68.4 90.2 No 252 276
## 2 2 Computer Science 70.2 68.7 Yes 101 119
## 3 3 Finance 76.4 83.3 No 401 462
## 4 4 Business 82.6 88.7 No 287 342
## 5 5 Finance 76.9 75.4 No 275 347
## 6 6 Computer Science 83.3 82.1 No 254 313
## Status Annual Salary
## 1 Placed 111000
## 2 Placed 107000
## 3 Placed 109000
## 4 Placed 148000
## 5 Placed 255500
## 6 Placed 103500
colnames(mydata4)[2] <- "Undergrad_Degree" #renaming the variable in second column
library(ggplot2)
ggplot(mydata4, aes(x = `Undergrad_Degree`)) + #setting the x-axis to variable Undergrad_Degree
geom_bar(fill = "pink", color = "blue") + #creating a bar plot of frequencies for each category of variable Undergrad_Degree, bars are filled in pink and outlined in blue color
ylab("Frequency") + #setting the y-axis to frequency - count of students in each degree
theme_light() + #changing the plot style to a light theme for better readability
labs (title = "Distribution of Undergraduate Degrees", #adding title to a plot
x = "Undergraduate Degree")+ #explicitly labeling the x-axis
geom_text(stat = "count", #adding the count numbers on top of each bar
aes(label = after_stat(count)),
vjust = -0.3) #slightly shifting the labels above the top of the bars so they don’t overlap.
As it can be seen from the graph above, the most common undergrad degree among MBA students is Business with a frequency of 35 students.
library(pastecs)
round(stat.desc(mydata4[ , 9]), 1)
## nbr.val nbr.null nbr.na min max range sum median mean
## 100.0 0.0 0.0 20000.0 340000.0 320000.0 10905800.0 103500.0 109058.0
## SE.mean CI.mean.0.95 var std.dev coef.var
## 4150.1 8234.8 1722373474.7 41501.5 0.4
summary(mydata4[ , 9])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
Above, two representations of descriptive statistics of the variable Annual Salary can be seen:
nbr.val of 100: there are 100 observations in this data set;
there are no missing values (nbr.na = 0) or zero salaries values (nbr.null = 0) for the variable Annual Salary;
minimum salary is 20,000€ and maximum salary is 340,000€, with a corresponding range of 320,000€;
median of 103,500€, meaning that 50% of MBA students have an annual salary of 103,500€ or less, and 50% of them have a higher annual salary;
mean of 109,058€, meaning that on average the annual salary of MBA students is 109,058€;
3rd quartile of 124,000€, meaning that 75% of MBA students have an annual salary of 124,000€ or less, and 25% have a annual salary higher than 124,000€;
std.dev of 41,501.5, meaning that on average salaries deviate about 41,500€ around the mean.
colnames(mydata4)[9] <- "Annual_Salary"
ggplot(mydata4, aes(x = Annual_Salary)) +
geom_histogram(binwidth = 20000, fill = "lightblue", color = "purple") + #grouping salaries into bins of width 20,000
scale_x_continuous(labels = scales::comma) +
labs(title = "Distribution of Annual Salary",
x = "Annual Salary") +
theme_minimal()
The histogram above is right - skewed (positively skewed), having a long tail to the right, meaning that most of MBA students have lower salaries, with a few of them having very high salaries. The tallest bar is around 90,000€ - 110,000€, indicating that the majority of the salaries cluster in this range.
colnames(mydata4)[4] <- "MBA_Grade"
shapiro.test(mydata4$MBA_Grade)
##
## Shapiro-Wilk normality test
##
## data: mydata4$MBA_Grade
## W = 0.98799, p-value = 0.5078
In order to perform t-test, the distribution of the analyzed variable needs to be checked. Above the Shapiro-Wilk test is performed to check whether the distribution of the variable MBA_grade is normally distribute or not.
Null hypothesis: the distribution of the variable MBA_Grade is normal
Alternative hypothesis: the distribution of the variable MBA_Grade is not normal
From the normality test above it can be seen that p-value is 0.5078, which means it is larger than the significance level of 0.05, therefore we cannot reject the null hypothesis and it can be concluded that MBA grades are approximately normally distributed, so performing the t-test is appropriate.
t.test(mydata4$MBA_Grade,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata4$MBA_Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Here the null hypothesis is as following: 𝐻0:𝜇MBA Grade = 74
Therefore, the alternative hypothesis is: 𝐻1:𝜇MBA Grade ≠ 74
As it can be seen from results above, the p-value is 0.00915, meaning it is smaller than significance level of 0.05, therefore the null hypothesis can be rejected - the sample mean of MBA grade is different from 74. It is 76.04, which is higher than 74. The 95% confidence interval [74.52, 77.56] confirms that the true mean is likely above 74.
mean_grade <- mean(mydata4$MBA_Grade)
sd_grade <- sd(mydata4$MBA_Grade)
cohens_d <- (mean_grade - 74) / sd_grade
cohens_d
## [1] 0.2658658
The latter code calculates the effect size of the difference between observed values of variable MBA_grade and the hypothesized mean of 74.
According to Cohen’s benchmarks:
d < 0.2 - Very small
0.2 <= d < 0.5 - Small
0.5 <= d < 0.8 - Medium
d >= 0.8 - Large
it can be concluded that the effect size is small, meaning that the difference is statistically significant but not very large in practical terms (the average grade is only modestly higher than 74 in practical terms).
library(readxl)
mydata <- read_xlsx("./Apartments_Task 3.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata$Parking_F <- factor (mydata$Parking,
levels = c("0", "1"),
labels = c("No", "Yes"))
mydata$Balcony_F <- factor (mydata$Balcony,
levels = c("0", "1"),
labels = c("No", "Yes"))
head(mydata)
## Age Distance Price Parking Balcony Parking_F Balcony_F
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
With the above code, the two categorical variables (Parking, Balcony) were changed into factors by creating two additional variables (Parking_F, Balcony_F), which are factors.
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Here the null hypothesis is as following: 𝐻0:Mu_Price = 1900 EUR
Therefore, the alternative hypothesis is: 𝐻1:Mu_Price ≠ 1900 EUR
As it can be seen from results above, the p-value is 0.004731, meaning it is smaller than significance level of 0.05, therefore the null hypothesis can be rejected - the sample arithmetic mean of variable Price per m2 is different from 1900€. It is 2018.94€, which is higher than 1900€. The 95% confidence interval [1937.44, 2100.44] confirms that the true mean of price per m2 is likely above 1900€.
fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
corr_coeff <- cor(mydata$Age, mydata$Price,
method = "pearson")
print(corr_coeff)
## [1] -0.230255
From the results above the simple linear regression function can be fond, specifically Price = 2185.46−8.98×Age, meaning that the predicted average price per m2 when Age = 0 is about 2185€, and the slope of -8.98 indicates that for each additional year of age of an apartment, the price per m2 on average decreases by 9€.
From p-value of 0.03401 (which is smaller than the significance level of 0.05), it can be concluded that regression coefficient is statistically significant (the coefficients are not equal to 0, rejecting the null hypothesis) and due to its negative sign of slope it can be concluded that age of the apartments has a negative effect on price per m2.
Coefficient of correlation, which is calculated by taking square root of R² = 0.053, which is approximately 0.23 and adding negative sign (since the slope is negative), it can be concluded that there is a weak negative correlation between age of apartment and price per m2.
Coefficient of determination is R² = 0.053, meaning that only 5.3% od the variation in price per m2 can be explained by the age of apartment. The other 95% of variability is due to other factors, which are not included in fit1 model.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata[c("Price", "Age", "Distance")],
smooth = FALSE)
Multicolinearity arises when explanatory variables are strongly correlated with each other. In the scatterplot matrix two explanatory variables are presented (Age and Distance). It can be seen that Age and Distance have almost no relationship (scatterplot is flat, no clear slope). Therefore, it can be concluded that multicolinearity is not a problem here, as Age and Distance are not strongly correlated.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
VIF is used to evaluate the strength of the correlation between explanatory variables. The higher the VIF is, the stronger the correlation among explanatory variables is. The value 5 can be used as a cut-off value. In the results above, it can be seen that both values (for Age and for Distance) of VIF are very close to 1.0, therefore there is no evidence of multicolinearity between Distance and Age (these two variables can be included in the model together without concern).
mydata$StdResid <- round(rstandard(fit2), 3)
head(mydata[order(mydata$StdResid),],10)
## Age Distance Price Parking Balcony Parking_F Balcony_F StdResid
## 53 7 2 1760 0 1 No Yes -2.152
## 13 12 14 1650 0 1 No Yes -1.499
## 72 12 14 1650 0 0 No No -1.499
## 20 13 8 1800 0 0 No No -1.381
## 35 14 16 1660 0 1 No Yes -1.261
## 36 24 5 1830 1 0 Yes No -1.189
## 54 30 17 1560 0 0 No No -1.101
## 5 18 18 1640 1 1 Yes Yes -1.073
## 64 18 18 1640 1 1 Yes Yes -1.073
## 28 18 19 1620 1 0 Yes No -1.071
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
col = "palegreen",
border = "blue3",
xlim = c(-3, 3))
Based on the distribution of standardized residuals, the distribution of errors in the population can be predicted. In the histogram above it can be seen that residuals are slightly asymmetrically distributed to the right. The Shapiro-Wilk test gives the p-value of 0.003645, which is smaller than the chosen significance level (0.05), therefore we can reject the null-hypothesis that in the population errors are normally distributed, but in practice regression is usually robust if deviations are not extreme. The histogram can be also used to determine if any of the units are outliers. Since all values are between -3 and +3, which are normally used to detect outliers, it can be concluded that there are no problems with outliers (no strong outliers need to be removed).
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95303, p-value = 0.003645
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances",
col = "mediumpurple1")
Cooks distance is a value above 0, where a higher number means a larger impact. From the histogram above it can be seen that a least one unit has a value that is significantly greater than the values of other units (latter can be seen from by a gap between 0.15 and 0.30).
mydata$ID <- 1:nrow(mydata)
The new variable called “ID”, which is a sequence of all observations, was created in order to be able to exclude some particular observations based on their Cooks distance.
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 6)
## ID CooksD
## 38 38 0.320
## 55 55 0.104
## 33 33 0.069
## 53 53 0.066
## 22 22 0.061
## 39 39 0.038
With the code above observations were ordered based on their Cooks distance (showing the 6 observations with the biggest Cooks distances). As it can be seen the first observation has a very large Cooks distance compared to other observations. Therefore, it can be concluded that the first observation in the above table is the unit with large influence. Consequently, the row 38 is going to be removed at the end of exercises including fit2.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The points in the scatter plot of the standardized residuals against the standardized fitted values should be randomly distributed in a horizontal band of constant variability. If the variability changes, this is called heteroskedasticity. In the scatterplot above it can be seen that the variance of standard residuals is constant (random distribution of points), therefore it can be concluded that no strong evidence of heteroskedasticity is present (it is a case of homoskedasticity).
hist(mydata$StdRes,
main = "Histogram of standardized residuals",
xlab = "Standardized residuals",
ylab = "Density",
col = "royalblue1",
prob = TRUE,
xlim = c(-3, 3))
curve(dnorm(x, mean = mean(mydata$StdRes), sd = sd(mydata$StdRes)),
col = "sienna3",
lwd = 2,
add = TRUE)
As it can be seen from the histogram above, the distribution of the standardized residuals is not completely normal but it is slightly asymmetrical to the right (as it can be also seen from the added normal distribution curve). To clarify latter statement, the Shapiro-Wilk normality test was performed in the next step.
shapiro.test(mydata$StdRes)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdRes
## W = 0.95303, p-value = 0.003645
Here the null hypothesis is as following: 𝐻0:Distribution of standardized residuals is normal
Therefore, the alternative hypothesis is: 𝐻1:Distribution of standardized residuals is not normal
As it can be seen from results above, the p-value is 0.003645, meaning it is smaller than significance level of 0.05, therefore the null hypothesis can be rejected - the distribution of standardized residuals is not normal.
mydata <- mydata[!(mydata$ID == 38), ]
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
Firstly, the observation with its Cooks distances of 0.320 was excluded based on the variable ID (which was 38 for this observation). With the function summary new regression values without latter one observation was calculated.
New regression function is as follows: Price = 2456.076 - 6.464 x Age - 22.955 x Distance
The first value is estimate of the regression constant, which is 2456.076. It indicates the average value of the dependent variable (in this case Price), when all explanatory variables are equal to 0.
Estimate of regression coefficient: the regression coefficient for Age means that for any additional one year of age of an apartment (while everything else stays the same), price per m2 will on average decrease by 6.464€. The regression coefficient for Distance means that if Distance increases by 1 unit, while everything stays the same, then the price per m2 will on average decrease by 22.955€.
From the above results it can be also seen that p-value is less than 0.05 (significance level), therefore the null hypothesis (coefficients are equal to 0) can be rejected at the 95% significance level.
Coefficient of correlation: which is calculated by taking square root of R² = 0.4838, which is approximately 0.6956, it can be concluded that Price, Age and Distance are positively and strongly correlated.
Coefficient of determination is R² = 0.4838, meaning that 48.38% of the variation in price per m2 can be explained by the variables Age and Distance.
fit3 <- lm(Price ~ Age + Distance + Parking_F + Balcony_F,
data = mydata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking_F + Balcony_F
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance is used when we want to test hypothesis about equality of three or more arithmetic means of independent samples.
Null hypothesis: H0:Δ𝜌2 = 0
Alternative hypothesis: H1:Δ𝜌2 > 0
The null hypothesis says that the difference between the population multiple coefficients of determination is equal to 0, which would indicate that both models (fit2 and fit3) fit the data equally well. The alternative hypothesis says that the difference is positive, meaning that the population multiple coefficient of determination of model fit3 is higher than that of model fit2. The results above indicates that model 2 (fit3) fits the data better (p < 0.05).
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking_F + Balcony_F,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## Parking_FYes 167.531 62.864 2.665 0.00933 **
## Balcony_FYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
The regression coefficient for Parking means that if an apartment has parking, with everything else staying the same, it has on average 167.53€ higher price per m2 than the one without. The coefficient for Balcony means that if an apartment has a balcony, with everything else staying the same, the price per m2 on average decreases by 15.207€.
F-statistic hypothesis:
H0: population coefficient of determination = 0
H1: population coefficient of determination > 0
mydata$FittedValues <- fitted.values(fit3)
mydata$Residuals <- residuals(fit3)
head(mydata[ , colnames(mydata) %in% c("ID", "Price", "FittedValues", "Residuals")])
## Price ID FittedValues Residuals
## 1 1640 1 1705.952 -65.95206
## 2 2800 2 2372.197 427.80292
## 3 1660 3 1721.159 -61.15894
## 4 1850 4 1563.431 286.56890
## 5 1640 5 2012.244 -372.24396
## 6 1770 6 1908.177 -138.17733
Residual of 427.8 for the ID2 means that the actual price of this apartment (which is 2800€) is by 427.8€ higher than the estimated price from regression (which is 2372.2€)