I did an analysis of two different data sets. I completed the TitanicSurvival analysis first but upon rereading the Task 1 instructions, I realised I needed more numeric variables, I redid it. But I liked the Titanic analysis so much, I decided to leave it in. :)
Importing data
data(package = .packages(all.available = TRUE))
I chose Wong, data set about Post-Coma Recovery of IQ.
library(carData)
PostComaIQ <- force(Wong)
head(PostComaIQ)
## id days duration sex age piq viq
## 1 3358 30 4 Male 20.67077 87 89
## 2 3535 16 17 Male 55.28816 95 77
## 3 3547 40 1 Male 55.91513 95 116
## 4 3592 13 10 Male 61.66461 59 73
## 5 3728 19 6 Male 30.12731 67 73
## 6 3790 13 3 Male 57.06229 76 69
A data frame with 331 observations on the following 11 variables:
The summary of the data:
summary(PostComaIQ)
## id days duration sex age
## Min. : 405 Min. : 13.0 Min. : 0.0 Female: 71 Min. : 6.513
## 1st Qu.:3808 1st Qu.: 59.0 1st Qu.: 1.0 Male :260 1st Qu.:21.737
## Median :5085 Median : 150.0 Median : 7.0 Median :26.877
## Mean :4735 Mean : 481.9 Mean : 14.3 Mean :31.853
## 3rd Qu.:5712 3rd Qu.: 416.0 3rd Qu.: 16.0 3rd Qu.:40.923
## Max. :7548 Max. :11628.0 Max. :255.0 Max. :80.033
## piq viq
## Min. : 50.00 Min. : 64.00
## 1st Qu.: 77.00 1st Qu.: 85.00
## Median : 87.00 Median : 94.00
## Mean : 87.56 Mean : 94.96
## 3rd Qu.: 97.00 3rd Qu.:105.00
## Max. :133.00 Max. :132.00
Data Manipulation
#Renaming a column
colnames(PostComaIQ)[4] <- "Gender"
#Creating new variables
PostComaIQ$Gender <- factor(PostComaIQ$Gender,
levels = c("Male", "Female"),
labels = c(0, 1))
#Deleting units with missing data
library(tidyr)
PostComaIQ <- PostComaIQ %>% drop_na()
#Creating a new data.frame based on conditions
PostComaIQ2 <- data.frame("Illness" = c("Parkinsons", "Melanoma", "Diabetes", "Cancer"),
"Age" = c(30, 26, 16, 34),
"Gender" = c("F", "M", "M", "M"),
"Married" = c("YES", "NO", "YES", "NO"))
print(PostComaIQ2)
## Illness Age Gender Married
## 1 Parkinsons 30 F YES
## 2 Melanoma 26 M NO
## 3 Diabetes 16 M YES
## 4 Cancer 34 M NO
#I want to see the average time patients were in a coma
mean(PostComaIQ$duration)
## [1] 14.29607
# I want to see the average time females spent in a coma
mean(PostComaIQ[PostComaIQ$Gender == 1, ]$duration)
## [1] 9.267606
Descriptive statistics
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
round(stat.desc(PostComaIQ), 2)
## id days duration Gender age piq viq
## nbr.val 331.00 331.00 331.00 NA 331.00 331.00 331.00
## nbr.null 0.00 0.00 46.00 NA 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 NA 0.00 0.00 0.00
## min 405.00 13.00 0.00 NA 6.51 50.00 64.00
## max 7548.00 11628.00 255.00 NA 80.03 133.00 132.00
## range 7143.00 11615.00 255.00 NA 73.52 83.00 68.00
## sum 1567184.00 159493.00 4732.00 NA 10543.41 28981.00 31433.00
## median 5085.00 150.00 7.00 NA 26.88 87.00 94.00
## mean 4734.69 481.85 14.30 NA 31.85 87.56 94.96
## SE.mean 83.24 62.50 1.43 NA 0.76 0.83 0.77
## CI.mean.0.95 163.75 122.95 2.82 NA 1.50 1.64 1.52
## var 2293522.34 1292958.09 678.08 NA 192.32 228.96 197.49
## std.dev 1514.44 1137.08 26.04 NA 13.87 15.13 14.05
## coef.var 0.32 2.36 1.82 NA 0.44 0.17 0.15
Explanations:
Graphing the distribution based on gender
library(ggplot2)
ggplot(PostComaIQ, aes(x=duration)) +
geom_histogram(binwidth = 10, colour="gray", fill = "dodgerblue") +
theme_dark() +
facet_wrap(~Gender, ncol = 1) +
ylab("Frequency")
So… this histogram is not pretty at all, because we have quite a few
outliers. As we can see, most male patients were in a coma up to 30
days, most of them around 10 days or less, similar is true for female
patients. But the data includes patients that were in a coma for 255,
180, quite a few for 130 days, so that elongates the histogram and makes
it so severly skewed to the right.
To eliminate the outliers, I would have to use a histogram of Standardized residuals and Cooks distance function, find the outliers and eliminate them to get a histogram that is nice, with normally distributed data.
library(car)
scatterplot(PostComaIQ$viq ~ PostComaIQ$piq,
smooth = FALSE,
boxplots = FALSE,
ylab = "Tested verbal IQ",
xlab = "Tested mathematical IQ")
I wanted to see if the tested mathematical and verbal IQs have any
correlation and as we can see, they have a strong correlation. Now I
just want to check if the duration of the coma had any effect on the
tested mathematical and verbal IQs:
library(car)
scatterplot(PostComaIQ$duration ~ PostComaIQ$piq,
smooth = FALSE,
boxplots = FALSE,
ylab = "Duration of the coma",
xlab = "Tested mathematical IQ")
library(car)
scatterplot(PostComaIQ$duration ~ PostComaIQ$viq,
smooth = FALSE,
boxplots = FALSE,
ylab = "Duration of the coma",
xlab = "Tested verbal IQ")
Again, we see the outliers at the top, but nevertheless, there is some
correlation between the duration of the coma and the tested verbal and
mathematical IQs.
library(ggplot2)
ggplot(PostComaIQ, aes(x=Gender, y=piq, fill=duration)) +
geom_boxplot() +
scale_fill_brewer(palette="Spectral") +
xlab("Gender") +
ylab("Tested Mathematical IQ")
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
labs(fill="Tested Mathematical IQ")
## $fill
## [1] "Tested Mathematical IQ"
##
## attr(,"class")
## [1] "labels"
From the boxplot we can see that the Tested mathematical IQ did not vary in mean between the male (0 in data) patients and female (1 in data) patients. We can recognise that is true from psychological theory. We can also see that female patients scored a larger range on the Mathematical IQ test than male patients.
I chose the data set TitanicSurvival, about the Survival of Passengers on the Titanic
library(carData)
mydata <- force(TitanicSurvival)
head(mydata)
## survived sex age passengerClass
## Allen, Miss. Elisabeth Walton yes female 29.0000 1st
## Allison, Master. Hudson Trevor yes male 0.9167 1st
## Allison, Miss. Helen Loraine no female 2.0000 1st
## Allison, Mr. Hudson Joshua Crei no male 30.0000 1st
## Allison, Mrs. Hudson J C (Bessi no female 25.0000 1st
## Anderson, Mr. Harry yes male 48.0000 1st
A data frame with 1309 observations on the following 4 variables. Explanation of the variables:
The summary of the data:
summary(mydata)
## survived sex age passengerClass
## no :809 female:466 Min. : 0.1667 1st:323
## yes:500 male :843 1st Qu.:21.0000 2nd:277
## Median :28.0000 3rd:709
## Mean :29.8811
## 3rd Qu.:39.0000
## Max. :80.0000
## NA's :263
Data Manipulation
#Renaming a column
colnames(mydata)[2] <- "Gender"
#Creating new variables
mydata$survived <- factor(mydata$survived,
levels = c("yes", "no"),
labels = c("1", "0"))
mydata$Gender <- factor(mydata$Gender,
levels = c("male", "female"),
labels = c("1", "0"))
#Deleting units with missing data
library(tidyr)
mydata <- mydata %>% drop_na()
#Creating a new data.frame based on conditions
mydata2 <- data.frame("Ticket" = c("YES", "YES", "NO", "YES"),
"Age" = c(30, 26, 16, 34),
"Gender" = c("F", "M", "M", "M"),
"Survived" = c("YES", "NO", "YES", "NO"))
print(mydata2)
## Ticket Age Gender Survived
## 1 YES 30 F YES
## 2 YES 26 M NO
## 3 NO 16 M YES
## 4 YES 34 M NO
#I want to see the average age of people who survived
mean(mydata[mydata$survived == 1, ]$age)
## [1] 28.91823
Descriptive statistics
library(pastecs)
round(stat.desc(mydata), 2)
## survived Gender age passengerClass
## nbr.val NA NA 1046.00 NA
## nbr.null NA NA 0.00 NA
## nbr.na NA NA 0.00 NA
## min NA NA 0.17 NA
## max NA NA 80.00 NA
## range NA NA 79.83 NA
## sum NA NA 31255.67 NA
## median NA NA 28.00 NA
## mean NA NA 29.88 NA
## SE.mean NA NA 0.45 NA
## CI.mean NA NA 0.87 NA
## var NA NA 207.75 NA
## std.dev NA NA 14.41 NA
## coef.var NA NA 0.48 NA
Explanations:
Graphing the distribution
library(ggplot2) #najbolj uporaben polotting system je ggplot
ggplot(mydata, aes(x=age)) + #daš eas(thetics) in daš noter, kar hočeš na x osi
geom_histogram(binwidth = 5, colour="gray", fill = "dodgerblue") +
theme_dark() +
facet_wrap(~survived, ncol = 1) + #s tem ločimo glede na, v našem primeru, movie factors
ylab("Frequency")
In this histogram it appears that the distributions of survivors (1 in
data) and victims (0 in data) of the Titanic is skewed to the right, so
the distribution is not normal. We cannot check that with the
Shapiro-Wilk normality test, because the values are not numeric. If I
changed them to ones and zeros, the test would still not work, because
the values must be between 3 and 5000.
But, nonetheless, we can see that most victims were aged between 20 and 30, and the survivors were more evenly distributed.
library(ggplot2)
ggplot(mydata, aes(x=age, fill=survived)) +
geom_histogram(position=position_dodge(2), binwidth = 5, colour = "gray") +
facet_wrap(~passengerClass, ncol = 1) +
ylab("Frequency") +
labs(fill="survived")
With this graph, I wanted to see if traveling in a specific class affected the survival rate (if they survived, that is 1 in data and if they died that is 0 in data). And, as we can see, the people who travelled in first class were luckier than the 3rd class passengers. I find this analysis very interesting. Honestly, because of these graphs, I started liking R a whole lot more. :)
library(ggplot2)
ggplot(mydata, aes(x=age)) +
theme_bw() +
geom_histogram(binwidth = 5, colour="gray") +
facet_wrap(~Gender:survived) +
ylab("Frequency") +
labs(fill="Survived")
I wanted to check if gender affected survival rate and as we can see,
many more women (0 in data) survived (1 in data, 0 in column survived
means they died) than men (1 in data). I believe that is because they
had a policy of “Women and children first” while boarding the escape
boats.
library(ggplot2)
ggplot(mydata, aes(x=Gender, y=age, fill=survived)) +
geom_boxplot() +
scale_fill_brewer(palette="Spectral") +
xlab("Gender") +
labs(fill="Survied")
In this boxplot we can see that the mean of women (0 in data) that
survived (1 in data) is higher than the mean of men (1 in data) that
survived, and the mean of victims (0 in data) that were men is higher
than the mean of victims, that were female. This encourages the finding
we specified while observing the 3rd histogram above.
library(readxl)
Business_School <- read_excel("~/Desktop/R Take Home Exam 2024/Task 2/Business School.xlsx")
View(Business_School)
Business_School <- as.data.frame(Business_School)
head(Business_School)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability (Before) Employability (After) Status Annual Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
colnames(Business_School)[2]<- "UndergradDegree"
colnames(Business_School)[3]<- "UndergradGrade"
colnames(Business_School)[4]<- "MBAGrade"
colnames(Business_School)[5]<- "WorkExp"
colnames(Business_School)[9]<- "AnnSalary"
head(Business_School)
## Student ID UndergradDegree UndergradGrade MBAGrade WorkExp
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability (Before) Employability (After) Status AnnSalary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
Business_School$UndergradDegree <- factor(Business_School$UndergradDegree,
levels = c("Art", "Business", "Computer Science", "Engineering", "Finance"),
labels = c("Art", "Business", "Computer Science", "Engineering", "Finance"))
library(ggplot2)
ggplot(Business_School, aes(x=UndergradDegree)) +
geom_bar() +
ylab("Frequency")
By a good margin, the most common Undergraduate Degree is Buisness,
followed by Finance and Computer Science.
library(pastecs)
round(stat.desc(Business_School$AnnSalary))
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
options(scipen = 999)
library(ggplot2)
ggplot(Business_School, aes(x=AnnSalary)) +
geom_histogram() +
ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The minimal annual salary is 20000, the maximum is 340000. The average
annual salary is 109058 and 50% od the MBA students have annual salaries
higher than 103500. According to the histogram, the distribution seems
to be skewed to the right, but we appear to have some outliers, so we
should do the Shapiro-Wilks test to check.
mean(Business_School$MBAGrade)
## [1] 76.04055
sd(Business_School$MBAGrade)
## [1] 7.675114
t.test(Business_School$MBAGrade,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Business_School$MBAGrade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Based on the p-value (p = 0.00915), which is smaller that 0.05, we can reject the null hypothesis and say that the true mean is not equal to 74. Based on the 95 percent confidence interval, we can stipulate that the true mean is somewhere between 74.51764 and 77.56346.
library(readxl)
Apartments <- read_excel("~/Desktop/R Take Home Exam 2024/Task 3/Apartments.xlsx")
View(Apartments)
Apartments <- as.data.frame(Apartments)
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Apartments$Parking <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
mean(Apartments$Price)
## [1] 2018.941
sd(Apartments$Price)
## [1] 377.8417
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Based on the p-value (p = 0.004731), which is smaller that 0.05, we can reject the null hypothesis and say that the true mean is not equal to 1900. Based on the 95 percent confidence interval, we can stipulate that the true mean is somewhere between 1937.443 and 2100.440.
fit1 <- lm(Price ~ Age,
data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <0.0000000000000002 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Explanations:
library(car)
scatterplotMatrix(Apartments[ , c(1, 2, 3)],
smooth = FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(Apartments[ , c(1, 2, 3)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
As we can see from the correlation matrix and the Pearson correlation coefficient, there is a weak correlation between Age and Distance and Age and Price, but there is a strong correlation between Price and Distance.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Chris - mail!
Apartments$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95303, p-value = 0.003645
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
As we can see, we have outliers. Let’s find the outliers and eliminate
them.
head(Apartments[order(Apartments$StdResid),], 3)
## Age Distance Price Parking Balcony StdResid CooksD
## 53 7 2 1760 No Yes -2.152 0.066
## 13 12 14 1650 No Yes -1.499 0.013
## 72 12 14 1650 No No -1.499 0.013
head(Apartments[order(-Apartments$CooksD),], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 39 40 2 2400 No Yes 1.091 0.038
As we can see, there is a gap between aparment no. 53 and 13 and between no. 38, 55 and 33 (according to Cooks Distance). But, let’s remove aparment no. 38 and see if our model is going to be better.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Apartments <- Apartments %>%
filter(!CooksD == 0.320)
Let’s see if it works better now:
fit3 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 0.0000000000000002 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 0.00000000000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 0.000000000002339
Apartments$StdResid <- round(rstandard(fit3), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit3), 3) #Cooks distances
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95649, p-value = 0.006355
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
We see the test is better, but we have to remove another outlier.
head(Apartments[order(Apartments$StdResid),], 3)
## Age Distance Price Parking Balcony StdResid CooksD
## 52 7 2 1760 No Yes -2.237 0.072
## 13 12 14 1650 No Yes -1.488 0.013
## 71 12 14 1650 No No -1.488 0.013
head(Apartments[order(-Apartments$CooksD),], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 54 43 37 1740 No No 1.598 0.129
## 33 2 11 2790 Yes No 2.225 0.084
## 52 7 2 1760 No Yes -2.237 0.072
## 22 37 3 2540 Yes Yes 1.473 0.056
## 25 8 26 2300 Yes Yes 1.825 0.052
## 57 8 2 2820 Yes No 1.705 0.039
library(dplyr)
Apartments <- Apartments %>%
filter(!CooksD == 0.129)
Let’s check again:
fit4 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit4)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 0.0000000000000002 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 0.000000000000953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 0.000000000001173
Apartments$StdResid <- round(rstandard(fit4), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit4), 3) #Cooks distances
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95952, p-value = 0.01044
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
And as we see, this works way better. We eliminated the outliers one at
a time and kept an eye on R-squared, which is now at almost 50%. That is
a good outcome.
Apartments$StdFitted <- scale(fit4$fitted.values)
library(car)
scatterplot(y = Apartments$StdResid, x = Apartments$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
I think we have heteroskedasticity, at least it looks plausible
according to the scatterplot, but, let’s check.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit4)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
The null hypothesis, that the variance of errors is constant, cannot be rejected (p-value is above 0.05). We therefore assume homoskedasticity.
Apartments$StdResid <- round(rstandard(fit4), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit4), 3) #Cooks distances
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95952, p-value = 0.01044
To formally test the normal distribution of our standardized residuals, I did the Shapiro Wilks test. I postulated H0: The standardized residuals are normally distributed, H1: The standardized residuals are not normally distributed. Based on the p-value that is less than 0.05, we can reject the null hypothesis and assume the distribution is not normal.
summary(fit4)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 0.0000000000000002 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 0.000000000000953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 0.000000000001173
Explanations:
fit5 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
anova(fit4, fit5)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We postulate two hypothesis: H0: ∆ro^2=0 and H1: ∆ro^2>0. Based on Pr(>F) = 0.02813, which is smaller than 0.05, we can reject H0 and assume that Model 2 is statistically better.
summary(fit5)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 0.0000000000000002 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 0.000000000214 ***
## ParkingYes 168.921 62.166 2.717 0.00811 **
## BalconyYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 0.000000000001449
54% of variability in Apartment prices can be explained by Age, Distance, whether it has a parking space and/or a balcony.
Given the Age, Distance and Balcony, appartments that have a parking space on average sell for a 168.921 euro/m^2 higher price than appartments who don’t have parking.
We postulate two hypothesis: H0: ∆ro^2=0 and H1: ∆ro^2>0. Based on p-value > 0.001, which is smaller than 0.05, we can reject H0 and assume that Parking is statistically significant.
Because Pr(>abs(t))=0.90566, Balcony is not statistically relevant and we will not interpret it.
Apartments$Fitted <- fitted.values(fit5)
Apartments$Residuals <- residuals(fit5)
Apartments[2, c("Fitted", "Residuals")]
## Fitted Residuals
## 2 2377.043 422.9572
The residual for apartment ID2 is 422.9572