First I spent some time finding a data set that I liked. Already on day one of learning R, I found the airquality data set, which I have been using to practise and improve my R skills. In the upcoming tasks, I will show what I learned and how I would use this knowledge in my day-to-day research.
First things first: importing the data set. I chose one of the data sets built into R, called airquality, which contains New York air quality measurements.
It has 153 observations and 6 variables.
# Finding available data sets
data(package = .packages(all.available = TRUE))
# Choosing and importing my chosen data set
mydata_Airquality <- force(airquality)
Explanation of variables
Ozone: numeric Ozone (ppb)
Solar.R: numeric solar radiation (Langleys)
Wind: numeric Wind (mph)
Temp: numeric Temperature (degrees F)
Month: numeric Month (1–12)
Day: numeric Day of month (1–31)
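A quick structural check can confirm the 153 observations and the types of the six variables (a small sketch, output not shown):
# Sketch: confirm the dimensions and variable types of the imported data
dim(mydata_Airquality)   # should be 153 rows and 6 columns
str(mydata_Airquality)   # all six variables are numeric/integer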
Based on the data, we can see that some manipulation is needed before the analysis. This is also visible in the following summary:
summary(mydata_Airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
We can clearly see that the data needs to be adjusted, especially the variables Month and Day, along with some smaller adjustments to the other variables, to allow a better analysis. My steps continue as follows.
mydata2_Airquality <- mydata_Airquality[, !colnames(mydata_Airquality) %in% "Day"]
I decided to remove the last column, Day, because it was not relevant for the purpose of my analysis. I used the colnames() function rather than a numeric index (e.g. column number 6), because deleting by position would remove whichever column happened to be sixth every time the chunk is run. Matching by name keeps the chunk safe to re-run and leaves me the freedom to add variables in the future.
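To illustrate the difference (a small sketch with illustrative object names, not part of the main analysis): removing by position deletes whatever currently sits in that slot, while removing by name only ever targets Day.
# Sketch: position-based removal targets whichever column is currently 6th (fragile on re-runs)
drop_by_position <- mydata_Airquality[ , -6]
# Name-based removal only ever targets the column literally called "Day" (safe to re-run)
drop_by_name <- mydata_Airquality[ , !colnames(mydata_Airquality) %in% "Day"]
identical(drop_by_position, drop_by_name)   # TRUE here, because Day happens to be the 6th column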
For a better understanding of what the column represents, I renamed it. I was careful to write the new name as a single word to avoid possible errors in the upcoming steps.
colnames(mydata2_Airquality)[2] <- "SolarRadiation"
head(mydata2_Airquality)
## Ozone SolarRadiation Wind Temp Month
## 1 41 190 7.4 67 5
## 2 36 118 8.0 72 5
## 3 12 149 12.6 74 5
## 4 18 313 11.5 62 5
## 5 NA NA 14.3 56 5
## 6 28 NA 14.9 66 5
mydata2_Airquality$Month <- factor(mydata2_Airquality$Month,
levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
Because it does not make sense to calculate basic descriptive statistics (e.g. mean = 6.993) for the variable Month without treating it as a factor, I converted the fifth and last column to a factor. I also replaced each month number with the month's short name for easier understanding of the data and a clearer visual presentation later on.
After the first look at my data I noticed that some variables had missing values (NA), so I decided to remove those rows. The first summary already shows exactly how many there are (37 in Ozone and 7 in SolarRadiation), which gives a first impression of how much this will affect my data.
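Before dropping them, the missing values can also be counted exactly per column (a small sketch):
# Sketch: count the missing values per column before removing them
colSums(is.na(mydata2_Airquality))   # 37 in Ozone, 7 in SolarRadiation, 0 elsewhere
# Note: fewer than 37 + 7 rows are dropped if some rows are missing both values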
library(tidyr)
mydataAirquality_clean <- drop_na(mydata2_Airquality)
summary(mydataAirquality_clean)
## Ozone SolarRadiation Wind Temp
## Min. : 1.0 Min. : 7.0 Min. : 2.30 Min. :57.00
## 1st Qu.: 18.0 1st Qu.:113.5 1st Qu.: 7.40 1st Qu.:71.00
## Median : 31.0 Median :207.0 Median : 9.70 Median :79.00
## Mean : 42.1 Mean :184.8 Mean : 9.94 Mean :77.79
## 3rd Qu.: 62.0 3rd Qu.:255.5 3rd Qu.:11.50 3rd Qu.:84.50
## Max. :168.0 Max. :334.0 Max. :20.70 Max. :97.00
##
## Month
## Sep :29
## Jul :26
## May :24
## Aug :23
## Jun : 9
## Jan : 0
## (Other): 0
We can see that we are left with 111 observations, and that the summary for the variable Month is now much cleaner and more meaningful.
We can also see that all of the observations fall in only 5 of the 12 months. The sample statistics have also changed slightly, so I interpret one updated value for each variable:
Ozone median = 31: Half of the observations have an ozone level of 31 ppb or below, and half are above 31 ppb.
SolarRadiation mean = 184.8: On average, the observations recorded 184.8 lang of solar radiation.
Wind 1st Qu. = 7.4: The first quartile (Q1) of Wind is 7.4, meaning that 25% of the observations have a wind speed of 7.4 mph or less, and 75% have more than 7.4 mph.
Temp Max. = 97: The maximum temperature among the observations was 97 degrees F.
Month Sep = 29: 29 observations were taken in September.
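These quoted values can be reproduced directly from the cleaned data (a small sketch using base functions):
# Sketch: reproduce the quoted statistics from the cleaned data
median(mydataAirquality_clean$Ozone)                  # 31
mean(mydataAirquality_clean$SolarRadiation)           # about 184.8
quantile(mydataAirquality_clean$Wind, probs = 0.25)   # 7.4
max(mydataAirquality_clean$Temp)                      # 97
table(mydataAirquality_clean$Month)["Sep"]            # 29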
# Create a new variable 'Ozone_level' based on the ozone advisory levels
mydataAirquality_clean$Ozone_level <- cut(mydataAirquality_clean$Ozone,
breaks = c(0, 54, 70, 85, 105, 200, Inf),
labels = c("Good", "Moderate",
"Unhealthy for Sensitive Groups",
"Unhealthy",
"Very Unhealthy",
"Hazardous"),
right = TRUE)
# Print out the first few rows to check the result
head(mydataAirquality_clean)
## Ozone SolarRadiation Wind Temp Month Ozone_level
## 1 41 190 7.4 67 May Good
## 2 36 118 8.0 72 May Good
## 3 12 149 12.6 74 May Good
## 4 18 313 11.5 62 May Good
## 5 23 299 8.6 65 May Good
## 6 19 99 13.8 59 May Good
I decided to research ozone levels a bit more. I added a new variable that tells me which observations have acceptable ozone levels and which do not. I used the criteria from the website https://www.nps.gov/subjects/air/humanhealth-ozone.htm.
Criteria:
Good (0–54 ppb)
Moderate (55–70 ppb)
Unhealthy for Sensitive Groups (71–85 ppb)
Unhealthy (86–105 ppb)
Very Unhealthy (106–200 ppb)
Hazardous (201+ ppb)
I decided to create a new data frame that includes only observations with unhealthy ozone levels (71 ppb or more). Because ozone has a big impact on air quality and human health, I also wanted to look at the highest values, in my case the top 10.
mydata_Ozone <- mydataAirquality_clean[mydataAirquality_clean$Ozone >= 71,]
head(mydata_Ozone[order(-mydata_Ozone$Ozone), ], 10)
## Ozone SolarRadiation Wind Temp Month Ozone_level
## 77 168 238 3.4 81 Aug Very Unhealthy
## 34 135 269 4.1 84 Jul Very Unhealthy
## 63 122 255 4.0 89 Aug Very Unhealthy
## 80 118 225 2.3 94 Aug Very Unhealthy
## 23 115 223 5.7 79 May Very Unhealthy
## 65 110 207 8.0 90 Aug Very Unhealthy
## 53 108 223 8.0 85 Jul Very Unhealthy
## 40 97 267 6.3 92 Jul Unhealthy
## 41 97 272 5.7 92 Jul Unhealthy
## 83 96 167 6.9 91 Sep Unhealthy
Luckily, we have no Hazardous cases, but there were some Very Unhealthy and Unhealthy measurements, which should raise concerns about the air quality in New York.
Because I assigned the ozone levels as labels, the new variable is stored as a factor, so I can immediately produce a short summary of it:
summary(mydataAirquality_clean$Ozone_level)
## Good Moderate
## 80 7
## Unhealthy for Sensitive Groups Unhealthy
## 12 5
## Very Unhealthy Hazardous
## 7 0
I decided to compute separate descriptive statistics for each month:
library(psych)
describeBy(mydataAirquality_clean$Ozone, group = mydataAirquality_clean$Month)
##
## Descriptive statistics by group
## group: Jan
## NULL
## ----------------------------------------------------
## group: Feb
## NULL
## ----------------------------------------------------
## group: Mar
## NULL
## ----------------------------------------------------
## group: Apr
## NULL
## ----------------------------------------------------
## group: May
## vars n mean sd median trimmed mad min max range skew
## X1 1 24 24.12 22.89 18 20.7 12.6 1 115 114 2.53
## kurtosis se
## X1 7.53 4.67
## ----------------------------------------------------
## group: Jun
## vars n mean sd median trimmed mad min max range skew
## X1 1 9 29.44 18.21 23 29.44 14.83 12 71 59 1.13
## kurtosis se
## X1 0.21 6.07
## ----------------------------------------------------
## group: Jul
## vars n mean sd median trimmed mad min max range skew
## X1 1 26 59.12 31.64 60 58.05 31.13 7 135 128 0.29
## kurtosis se
## X1 -0.49 6.2
## ----------------------------------------------------
## group: Aug
## vars n mean sd median trimmed mad min max range skew
## X1 1 23 60 41.77 45 56.42 41.51 9 168 159 0.77
## kurtosis se
## X1 -0.21 8.71
## ----------------------------------------------------
## group: Sep
## vars n mean sd median trimmed mad min max range skew
## X1 1 29 31.45 24.14 23 28.36 13.34 7 96 89 1.45
## kurtosis se
## X1 0.97 4.48
## ----------------------------------------------------
## group: Oct
## NULL
## ----------------------------------------------------
## group: Nov
## NULL
## ----------------------------------------------------
## group: Dec
## NULL
summary(mydataAirquality_clean)
## Ozone SolarRadiation Wind Temp
## Min. : 1.0 Min. : 7.0 Min. : 2.30 Min. :57.00
## 1st Qu.: 18.0 1st Qu.:113.5 1st Qu.: 7.40 1st Qu.:71.00
## Median : 31.0 Median :207.0 Median : 9.70 Median :79.00
## Mean : 42.1 Mean :184.8 Mean : 9.94 Mean :77.79
## 3rd Qu.: 62.0 3rd Qu.:255.5 3rd Qu.:11.50 3rd Qu.:84.50
## Max. :168.0 Max. :334.0 Max. :20.70 Max. :97.00
##
## Month Ozone_level
## Sep :29 Good :80
## Jul :26 Moderate : 7
## May :24 Unhealthy for Sensitive Groups:12
## Aug :23 Unhealthy : 5
## Jun : 9 Very Unhealthy : 7
## Jan : 0 Hazardous : 0
## (Other): 0
The summary and explanations stay the same as before. I would just add that this new summary also shows how many times each ozone level occurs among the 111 observations. Most often, 80 times, we observe a Good, healthy level of ozone.
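To express those counts as shares of the 111 observations, a quick sketch:
# Sketch: relative frequencies of the ozone levels
round(prop.table(table(mydataAirquality_clean$Ozone_level)), 2)   # "Good" is 80/111, roughly 72%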
This is the histogram showing the distribution of Solar Radiation that I created on the very first day we started working in R. I wanted to delete it, because I prefer ggplot for plotting distributions, but I decided to keep it anyway to show different ways of creating a histogram.
hist(mydataAirquality_clean$SolarRadiation,
main = "Distribution of Solar Radiation",
ylab = "Frequency",
xlab = "Solar Radiation",
col = "pink",
breaks = seq(from = 0, to = 400, by = 10))
We can spot an asymmetrical, left-skewed distribution.
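To back up this visual impression with a number, the skewness can be computed with the psych package that is already loaded above (a sketch; a negative value would indicate left skew):
# Sketch: quantify the asymmetry seen in the histogram
psych::skew(mydataAirquality_clean$SolarRadiation)   # negative => left-skewed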
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydataAirquality_clean, aes(x=`Ozone`)) +
geom_histogram(binwidth = 10, colour="white", fill = "orchid1") +
theme_dark() +
ylab("Frequency")
This is my new histogram, created with ggplot, showing the frequency distribution of Ozone. I also added some visual improvements.
Here we can spot an asymmetrical, right-skewed distribution. We can also see some observations with very high values that could affect our further research.
boxplot(mydataAirquality_clean$Ozone)
As already mentioned under the previous histogram, we have some potential outliers, observations with very high values, that are also clearly visible in this boxplot.
The boxplot also confirms that the Ozone distribution is right-skewed.
library(ggplot2)
ggplot(mydataAirquality_clean, aes(x = Ozone, y = Month, fill = Month)) +
geom_boxplot() +
scale_fill_brewer(palette = "Spectral") +
xlab("Ozone") +
labs(fill="Month")
I decided to make an additional, more advanced boxplot with ggplot. In this boxplot I show Ozone by Month, since I knew from the earlier steps that only five months are present. The boxplot shows the quartiles and potential outliers for each individual month.
We can see that the highest ozone levels occurred in July and August, with August having by far the highest single values, while the months with the lowest ozone levels are May and September. We can also spot some potential outliers in May, June and September, and the monthly distributions are asymmetrical.
library(ggplot2)
ggplot(mydataAirquality_clean, aes(x=Ozone, y=Wind)) +
geom_point(colour = "coral1") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
After reading about the effects of our variables on ozone levels, I found it particularly interesting that wind can strongly influence ozone, so I examined this with the scatterplot above. Wind can either disperse or concentrate pollutants; our scatterplot supports the former.
We can see that the relationship is negative: the higher the wind speed, the lower the ozone level.
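The strength of this relationship can be quantified with the Pearson correlation coefficient (a small sketch):
# Sketch: quantify the negative relationship seen in the scatterplot
cor(mydataAirquality_clean$Ozone, mydataAirquality_clean$Wind)
cor.test(mydataAirquality_clean$Ozone, mydataAirquality_clean$Wind)   # tests H0: correlation = 0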
And with this scatterplot I am concluding Task 1, moving on to Task 2.
First, we need to import the .xlsx data as follows:
# Importing .xlsx data
library(readxl)
mydataBS <- read_excel("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/R PROGRAM - BOOTCAMP/TAKE HOME EXAM/Task 2/Business School.xlsx")
View(mydataBS)
mydataBS <- as.data.frame(mydataBS)
head(mydataBS)
## Student ID Undergrad Degree Undergrad Grade MBA Grade
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## Work Experience Employability (Before) Employability (After) Status
## 1 No 252 276 Placed
## 2 Yes 101 119 Placed
## 3 No 401 462 Placed
## 4 No 287 342 Placed
## 5 No 275 347 Placed
## 6 No 254 313 Placed
## Annual Salary
## 1 111000
## 2 107000
## 3 109000
## 4 148000
## 5 255500
## 6 103500
To avoid potential problems later, I will rename the second column to a one-word title so the analysis is easier to continue. I did not do this for every column, since it is not required, only recommended. In the next few tasks you will also see how to work with columns whose names contain spaces (using backticks).
# Renaming a variable
colnames(mydataBS)[2] <- "UndergradDegree"
head(mydataBS)
## Student ID UndergradDegree Undergrad Grade MBA Grade
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## Work Experience Employability (Before) Employability (After) Status
## 1 No 252 276 Placed
## 2 Yes 101 119 Placed
## 3 No 401 462 Placed
## 4 No 287 342 Placed
## 5 No 275 347 Placed
## 6 No 254 313 Placed
## Annual Salary
## 1 111000
## 2 107000
## 3 109000
## 4 148000
## 5 255500
## 6 103500
#Factoring UndergradDegree variable
mydataBS$UndergradDegree <- factor(mydataBS$UndergradDegree,
levels = c("Art","Business", "Computer Science", "Engineering", "Finance"),
labels = c("Art","Business", "Computer Science", "Engineering", "Finance"))
To support the later analysis, I factored the UndergradDegree variable. I kept the degree names and only changed how R treats the UndergradDegree values.
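Because the labels are identical to the levels, the same result could also be obtained with a shorter call (a sketch, assuming all five degrees occur in the data):
# Sketch: equivalent, shorter factoring; levels default to the sorted unique values
mydataBS$UndergradDegree <- factor(mydataBS$UndergradDegree)
levels(mydataBS$UndergradDegree)   # "Art" "Business" "Computer Science" "Engineering" "Finance"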
library(ggplot2)
ggplot(mydataBS, aes(x=UndergradDegree)) +
geom_bar(colour="white", fill = "plum") +
theme_dark() +
ylab("Frequency")
Based on this bar chart, we can see that the most common degree is Business and the least common is Art.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
round(stat.desc(mydataBS$`Annual Salary`))
## nbr.val nbr.null nbr.na min max
## 100 0 0 20000 340000
## range sum median mean SE.mean
## 320000 10905800 103500 109058 4150
## CI.mean.0.95 var std.dev coef.var
## 8235 1722373475 41501 0
options(scipen = 999)
library(ggplot2)
ggplot(mydataBS, aes(x=`Annual Salary`)) +
geom_histogram(binwidth = 20000, colour="gray", fill = "dodgerblue") +
theme_dark() +
ylab("Frequency")
Based on the visual presentation of the data in the histogram above, I would conclude that the distribution is right-skewed. I also notice some possible outliers with very high values that would probably need further investigation. Without these outliers, I might be willing to say that Annual Salary is approximately normally distributed.
Some basic statistics about MBA Grade to give us an impression of the data for the upcoming task:
mean(mydataBS$`MBA Grade`)
## [1] 76.04055
sd(mydataBS$`MBA Grade`)
## [1] 7.675114
With the upcoming t-test, I wanted to test the following hypotheses:
H0: 𝜇MBA Grade = 74
H1: 𝜇MBA Grade ≠ 74
# Performing a two-sided t-test with H0: mu_MBAgrade = 74
t.test(mydataBS$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydataBS$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Based on the p-value (p = 0.009 < 0.05), we can reject the null hypothesis H0: 𝜇MBA Grade = 74 and conclude that the true mean is not equal to 74, as stated in H1. Based on the 95 percent confidence interval calculated above, we estimate that the true mean lies somewhere between 74.52 and 77.56 (the sample mean of 76.04 indeed falls inside this interval).
# Calculating the effect size
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydataBS$`MBA Grade`, mu=74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
# Interpreting Cohen's d with the rules of Sawilowsky (2009)
effectsize::interpret_cohens_d(0.27, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
I calculated Cohen's d to measure the effect size, which is 0.27 in our case. A Cohen's d of 0.27 indicates a small effect, suggesting that the difference from 74, although statistically significant, may not be practically important enough to pursue further. With this, I am concluding Task 2.
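As a cross-check, the one-sample Cohen's d is simply the mean difference divided by the sample standard deviation, which reproduces the reported 0.27 (a sketch):
# Sketch: manual Cohen's d for the one-sample case, d = (sample mean - mu0) / sample sd
(mean(mydataBS$`MBA Grade`) - 74) / sd(mydataBS$`MBA Grade`)   # approx. 0.27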
library(readxl)
mydataAP <- read_excel("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/R PROGRAM - BOOTCAMP/TAKE HOME EXAM/Task 3/Apartments.xlsx")
View(mydataAP)
mydataAP <- as.data.frame(mydataAP)
head(mydataAP)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description of the variables (based on the data shown above and how they are used later): Age is the age of the apartment in years, Distance is a distance in km, Price is the apartment price in EUR, and Parking and Balcony are 0/1 indicators that I convert to No/Yes factors below.
mydataAP$Parking <- factor(mydataAP$Parking,
levels = c("0","1"),
labels = c("No","Yes"))
mydataAP$Balcony <- factor(mydataAP$Balcony,
levels = c("0","1"),
labels = c("No","Yes"))
First I calculated some basic statistics to get an idea of what to expect in this task.
mean(mydataAP$Price)
## [1] 2018.941
sd(mydataAP$Price)
## [1] 377.8417
Then I continued with a t-test:
t.test(mydataAP$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydataAP$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Based on the calculated p-value (p = 0.005 < 0.05), we can reject the null hypothesis (H0: 𝜇Price = 1900 EUR) and conclude that the true mean price is not equal to 1900.
Based on the 95 percent confidence interval calculated above, we estimate that the true mean of Price lies somewhere between 1937.4 and 2100.4.
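The interval reported by t.test() can also be reconstructed by hand from the mean, the standard error and the t quantile (a small sketch):
# Sketch: reconstruct the 95% confidence interval for the mean price by hand
n  <- length(mydataAP$Price)                # 85 observations
se <- sd(mydataAP$Price) / sqrt(n)          # standard error of the mean
mean(mydataAP$Price) + c(-1, 1) * qt(0.975, df = n - 1) * se   # about 1937.4 to 2100.4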
fit1 <- lm(Price ~ Age,
data = mydataAP)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydataAP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <0.0000000000000002 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanations:
Estimate of the regression coefficient for Age (-8.975): If the age increases by 1 year, the price decreases on average by 8.975 EUR, according to this simple model.
p-value for the Age coefficient (p = 0.034): We performed a t-test of H0: beta1 = 0 against H1: beta1 ≠ 0. With p = 0.034 (p < 0.05) we reject H0, meaning that age has a statistically significant linear effect on price.
Coefficient of determination (0.05302): 5.3% of the variability of Price is explained by the linear effect of Age. Based on this very small coefficient of determination, I would conclude that this is not a good model.
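To accompany the point estimates, 95% confidence intervals for the coefficients can be obtained directly (a sketch, output not shown):
# Sketch: 95% confidence intervals for the intercept and the Age coefficient
confint(fit1, level = 0.95)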
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydataAP[ , c(1, 2, 3)],
smooth = FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydataAP[ , c(1,2,3)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
Based on the scatterplot matrix and the Pearson correlation coefficients calculated above, there is a fairly strong negative correlation between Price and Distance (-0.63), a weak correlation between Age and Price (-0.23), and practically no correlation between Age and Distance (0.04).
Multicollinearity concerns correlation among the explanatory variables, so the relevant pair is Age and Distance; because they are almost uncorrelated, I do not expect multicollinearity to be a problem, which I verify with the VIF statistics below.
fit2 <- lm(Price ~ Age + Distance,
data = mydataAP)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
At first I was surprised that the calculated VIF statistic is exactly the same for both variables, Age and Distance, but with only two explanatory variables this is always the case: both VIFs equal 1/(1 − r²), where r is the correlation between the two predictors.
Because both VIF values are far below 5 and very close to 1, I conclude that the explanatory variables are not strongly correlated, so I keep both of them; we did not detect multicollinearity.
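A small sketch verifies this relationship on our data:
# Sketch: with two explanatory variables, both VIFs equal 1 / (1 - r^2)
r <- cor(mydataAP$Age, mydataAP$Distance)
1 / (1 - r^2)   # matches the value reported by vif(fit2), about 1.0018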
#Standardized residuals
mydataAP$StdResid <- round(rstandard(fit2), 3)
#Cooks distances
mydataAP$CooksD <- round(cooks.distance(fit2), 3)
hist(mydataAP$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataAP$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataAP$StdResid
## W = 0.95303, p-value = 0.003645
hist(mydataAP$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
We cannot spot any outliers among the standardized residuals (no observation falls outside ±3), but the gaps in the histogram of Cook's distances suggest some units with large influence. In the next steps I will examine their impact and remove them. I will first order the data by standardized residuals and by Cook's distance to identify and take a closer look at the potentially problematic observations.
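As an additional cross-check (not part of the gap-based approach used here), a common rule of thumb flags observations whose Cook's distance exceeds 4/n:
# Sketch: rule-of-thumb check, flag observations with Cook's distance above 4/n
threshold <- 4 / nrow(mydataAP)
which(mydataAP$CooksD > threshold)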
head(mydataAP[order(mydataAP$StdResid), ], 3)
## Age Distance Price Parking Balcony StdResid CooksD
## 53 7 2 1760 No Yes -2.152 0.066
## 13 12 14 1650 No Yes -1.499 0.013
## 72 12 14 1650 No No -1.499 0.013
head(mydataAP[order(-mydataAP$StdResid), ], 3)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 33 2 11 2790 Yes No 2.051 0.069
## 2 18 1 2800 Yes No 1.783 0.030
head(mydataAP[order(-mydataAP$CooksD), ], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 39 40 2 2400 No Yes 1.091 0.038
As we can see in the histogram and the ordered Cook's distances above, there is a gap between apartments 53 and 13 and between apartments 38, 55 and 33.
Because the gap is largest between apartments 38 and 55, let's first remove apartment 38 and see whether it improves the analysis.
mydataAP$ID <- 1:85
mydataAP <- mydataAP[, c("ID", setdiff(names(mydataAP), "ID"))]
To remove the apartment in a more robust way, I added a new ID variable. I also moved this column so it is the first column in the data set.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydataAP_new <- mydataAP %>%
  filter(ID != 38)
We can see that mydataAP_new now has one observation less: it is the data set without apartment 38.
Now let's check whether this new data set gives a better model. To follow the instructions of Task 3, I created fit2_2, an intermediate 'sub'-fit, because I wanted to show my whole thought process before moving on to fit3.
fit2_2 <- lm(Price ~ Age + Distance,
data = mydataAP_new)
summary(fit2_2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 0.0000000000000002 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 0.00000000000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 0.000000000002339
The R-squared is higher, which suggests that the model has improved. But let's continue with the checks.
mydataAP_new$StdResid <- round(rstandard(fit2_2), 3) #Standardized residuals
mydataAP_new$CooksD <- round(cooks.distance(fit2_2), 3) #Cooks distances
hist(mydataAP_new$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataAP_new$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataAP_new$StdResid
## W = 0.95649, p-value = 0.006355
hist(mydataAP_new$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
Because there are still some gaps, I decided to adjust the data a bit more by also removing the next potentially problematic observation, the one with ID 55:
library(dplyr)
mydataAP_new2 <- mydataAP_new %>%
  filter(ID != 55)
fit2_3 <- lm(Price ~ Age + Distance,
data = mydataAP_new2)
summary(fit2_3)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP_new2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 0.0000000000000002 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 0.000000000000953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 0.000000000001173
mydataAP_new2$StdResid <- round(rstandard(fit2_3), 3) #Standardized residuals
mydataAP_new2$CooksD <- round(cooks.distance(fit2_3), 3) #Cooks distances
hist(mydataAP_new2$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
We can see that there are no more gaps in the histogram of standardized residuals. Although I did not remove any observation because of its standardized residual (no observation exceeded ±3), removing the observations flagged by Cook's distance also cleaned up this graph.
shapiro.test(mydataAP_new2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataAP_new2$StdResid
## W = 0.95952, p-value = 0.01044
hist(mydataAP_new2$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
With the process above we eliminated the most influential units, and both histograms are now much more regular, without the large gaps seen before.
One smaller gap does remain in the histogram of Cook's distances, but I decided not to remove that observation, since it does not stand out nearly as much as the ones removed earlier.
Based on the R-squared, which increased again, I would also conclude that this model is better than the previous ones.
mydataAP_new2$StdFitted <- scale(fit2_3$fitted.values)
library(car)
scatterplot(y = mydataAP_new2$StdResid, x = mydataAP_new2$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Based on the pattern in the scatterplot above, I suspect heteroskedasticity, and I will confirm or reject this suspicion in the next step.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2_3)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
Based on the calculated p-value p = 0.052, which is slightly above the significance level of 0.05, we fail to reject H0, suggesting there is no evidence of heteroskedasticity (i.e., the variance is constant) and we therefore assume homoskedasticity.
So my earlier suspicion of heteroskedasticity is not supported.
Although I already showed this histogram in a previous step, it is part of my standard process for this type of exercise, so I present it here again.
hist(mydataAP_new2$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataAP_new2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataAP_new2$StdResid
## W = 0.95952, p-value = 0.01044
I tested the normality of the standardized residuals with the Shapiro-Wilk test. My hypotheses were:
H0: The standardized residuals are normally distributed
H1: The standardized residuals are not normally distributed
Based on the calculated p-value (p = 0.01), which is below the 0.05 significance level, we reject the null hypothesis and conclude that the standardized residuals are not normally distributed, as stated in H1.
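A Q-Q plot is a complementary visual check of the same assumption (a sketch, not part of the original output):
# Sketch: visual check of residual normality with a Q-Q plot
qqnorm(mydataAP_new2$StdResid)
qqline(mydataAP_new2$StdResid)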
summary(fit2_3)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataAP_new2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 0.0000000000000002 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 0.000000000000953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 0.000000000001173
Explanation of coefficients:
Age (-7.850): If the age increases by 1 year, the price decreases on average by 7.850 EUR, holding Distance constant.
Distance (-23.945): If the distance increases by 1 km, the price decreases on average by 23.945 EUR, holding Age constant.
Multiple R-squared (0.4968): 49.68% of the variability of apartment prices is explained by the linear effect of Age and Distance.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = mydataAP_new2)
anova(fit2_3, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With anova(), we test the following hypotheses:
H0: Δρ² = 0
H1: Δρ² ≠ 0
Based on the Analysis of Variance table, we can reject the null hypothesis (p = 0.028 < 0.05), which means that Model 2 (fit3) explains significantly more variability and is statistically better.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydataAP_new2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 0.0000000000000002 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 0.000000000214 ***
## ParkingYes 168.921 62.166 2.717 0.00811 **
## BalconyYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 0.000000000001449
Looking at the summary of fit3, the overall F-test (p < 0.001) again confirms that the extended model is statistically significant. We can also see the improvement in the increased R-squared.
Explanations:
Multiple R-squared (0.5408): 54% of the variability in apartment prices can be explained by age, distance, and whether the apartment has a parking space and/or a balcony.
Regression coefficient for Parking (168.921): Holding Age, Distance and Balcony constant, apartments with a parking space sell on average for 168.921 EUR more than apartments without one.
Based on the p-value of 0.008 (p < 0.05), we can reject H0 and conclude that the Parking coefficient is statistically significant.
mydataAP_new2$Fitted <- fitted.values(fit3)
mydataAP_new2$Residuals <- residuals(fit3)
mydataAP_new2[ mydataAP_new2$ID == 2, c("Fitted", "Residuals")] #Selection of variables to be displayed
## Fitted Residuals
## 2 2377.043 422.9572
The residual for the apartment with ID 2 is 422.96: its actual price (2800) is about 423 EUR above the price predicted by the model (2377.04).
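As a cross-check, this fitted value can be reconstructed by hand from the fit3 coefficients and the apartment's observed values (Age 18, Distance 1, Parking Yes, Balcony No), a small sketch:
# Sketch: reconstruct the fitted value and residual for apartment ID 2 from the coefficients
b <- coef(fit3)
fitted_id2 <- b["(Intercept)"] + b["Age"] * 18 + b["Distance"] * 1 +
  b["ParkingYes"] * 1 + b["BalconyYes"] * 0
fitted_id2          # about 2377.04
2800 - fitted_id2   # residual, about 422.96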