library(carData)
data(Wong)
library(carData)
mydata <- force(Wong)
head(mydata)
## id days duration sex age piq viq
## 1 3358 30 4 Male 20.67077 87 89
## 2 3535 16 17 Male 55.28816 95 77
## 3 3547 40 1 Male 55.91513 95 116
## 4 3592 13 10 Male 61.66461 59 73
## 5 3728 19 6 Male 30.12731 67 73
## 6 3790 13 3 Male 57.06229 76 69
Explanation of dataset: This data observes recovery of IQ after comas of different duration on 200 patients, where:
There is 331 observations and 200 patients observed, because patients were periodically administered an IQ test, however, the average number of measurements per patient is small (1.7 IQ measurements per patient on average).
Source of this dataset is from Cardata package, with a dataset name “Wong”. The source of this datast is:
Wong, P. P., Monette, G., and Weiner, N. I. (2001) Mathematical models of cognitive recovery. Brain Injury, 15, 519–530.
summary(mydata) #Summary of all variables, but id calculations don't make any sense, so I will remove them
## id days duration sex
## Min. : 405 Min. : 13.0 Min. : 0.0 Female: 71
## 1st Qu.:3808 1st Qu.: 59.0 1st Qu.: 1.0 Male :260
## Median :5085 Median : 150.0 Median : 7.0
## Mean :4735 Mean : 481.9 Mean : 14.3
## 3rd Qu.:5712 3rd Qu.: 416.0 3rd Qu.: 16.0
## Max. :7548 Max. :11628.0 Max. :255.0
## age piq viq
## Min. : 6.513 Min. : 50.00 Min. : 64.00
## 1st Qu.:21.737 1st Qu.: 77.00 1st Qu.: 85.00
## Median :26.877 Median : 87.00 Median : 94.00
## Mean :31.853 Mean : 87.56 Mean : 94.96
## 3rd Qu.:40.923 3rd Qu.: 97.00 3rd Qu.:105.00
## Max. :80.033 Max. :133.00 Max. :132.00
mydata2 <- mydata[-1] #Removing the ID
summary(mydata2)
## days duration sex age
## Min. : 13.0 Min. : 0.0 Female: 71 Min. : 6.513
## 1st Qu.: 59.0 1st Qu.: 1.0 Male :260 1st Qu.:21.737
## Median : 150.0 Median : 7.0 Median :26.877
## Mean : 481.9 Mean : 14.3 Mean :31.853
## 3rd Qu.: 416.0 3rd Qu.: 16.0 3rd Qu.:40.923
## Max. :11628.0 Max. :255.0 Max. :80.033
## piq viq
## Min. : 50.00 Min. : 64.00
## 1st Qu.: 77.00 1st Qu.: 85.00
## Median : 87.00 Median : 94.00
## Mean : 87.56 Mean : 94.96
## 3rd Qu.: 97.00 3rd Qu.:105.00
## Max. :133.00 Max. :132.00
We can see, that:
colnames(mydata) <- c("Patient", "Days from coma","Coma duration", "Sex", "Age", "IQ Test Performance", "Verbal IQ performance") #Changing the column names for clearer variable descriptions
head(mydata)
## Patient Days from coma Coma duration Sex Age
## 1 3358 30 4 Male 20.67077
## 2 3535 16 17 Male 55.28816
## 3 3547 40 1 Male 55.91513
## 4 3592 13 10 Male 61.66461
## 5 3728 19 6 Male 30.12731
## 6 3790 13 3 Male 57.06229
## IQ Test Performance Verbal IQ performance
## 1 87 89
## 2 95 77
## 3 95 116
## 4 59 73
## 5 67 73
## 6 76 69
library(tidyr)
mydata3 <- drop_na(mydata)
#I wanted to remove any non available data from data set, but there were none, so the amount of the observations remained the same
sorted_data <- mydata[order(-mydata$"Days from coma"), ]
head(sorted_data) #ordering the data by the descending duration from coma
## Patient Days from coma Coma duration Sex Age
## 330 5142 11628 57 Male 16.432580
## 331 5964 11038 0 Male 12.544830
## 329 2773 7631 42 Male 6.513347
## 328 3835 4933 14 Male 25.993160
## 326 1939 3864 130 Male 28.273790
## 325 651 3412 21 Male 22.006840
## IQ Test Performance Verbal IQ performance
## 330 101 95
## 331 71 73
## 329 88 103
## 328 91 88
## 326 88 105
## 325 68 92
I noticed there are many big numbers of days from coma and possible errors based on patients ages, so I will add a column of days transformed into years for a better representation of the duration
sorted_data$"Years from coma" <- sorted_data$"Days from coma" / 365.25 #Dividing the days to get out the number of years from coma to be able to compare the years from coma and age
new_order_data <- sorted_data[c("Patient", "Days from coma", "Years from coma", "Age", "Sex", "Coma duration", "IQ Test Performance", "Verbal IQ performance" )] #Sorting the columns so that the years from coma and age are next to each other
new_order_data$"Years from coma" <- round(new_order_data$"Years from coma", 0)
new_order_data$"Age" <- round(new_order_data$"Age", 0)
head(new_order_data) #rounding up the numbers for better visibility
## Patient Days from coma Years from coma Age Sex Coma duration
## 330 5142 11628 32 16 Male 57
## 331 5964 11038 30 13 Male 0
## 329 2773 7631 21 7 Male 42
## 328 3835 4933 14 26 Male 14
## 326 1939 3864 11 28 Male 130
## 325 651 3412 9 22 Male 21
## IQ Test Performance Verbal IQ performance
## 330 101 95
## 331 71 73
## 329 88 103
## 328 91 88
## 326 88 105
## 325 68 92
As seen in the first top three patients, the number of years from coma exceed their age, so now I want to find out how many patients there are whose duration from coma exceeds their age
new_order_data$Comparison <- new_order_data$"Years from coma" > new_order_data$"Age" #Comparing the age and years from coma, to see where years from coma exceeds the age
head(new_order_data[order(-new_order_data$"Comparison"), ]) #Ordering the dataset so that I see the values with exceeding duration first
## Patient Days from coma Years from coma Age Sex Coma duration
## 330 5142 11628 32 16 Male 57
## 331 5964 11038 30 13 Male 0
## 329 2773 7631 21 7 Male 42
## 328 3835 4933 14 26 Male 14
## 326 1939 3864 11 28 Male 130
## 325 651 3412 9 22 Male 21
## IQ Test Performance Verbal IQ performance Comparison
## 330 101 95 TRUE
## 331 71 73 TRUE
## 329 88 103 TRUE
## 328 91 88 FALSE
## 326 88 105 FALSE
## 325 68 92 FALSE
From this data set, I see first three patient observations (With value=true when checking if years from coma exceed age) have an error in data, because their age cannot be lower than duration from their coma. Because of this, I will delete these 3 observations.
cleared_data <- new_order_data[c(-1,-2,-3),-9] #Removing the first three observations
head(cleared_data)
## Patient Days from coma Years from coma Age Sex Coma duration
## 328 3835 4933 14 26 Male 14
## 326 1939 3864 11 28 Male 130
## 325 651 3412 9 22 Male 21
## 327 2600 3337 9 44 Male 9
## 47 2600 3333 9 44 Male 9
## 321 1939 3111 9 28 Male 130
## IQ Test Performance Verbal IQ performance
## 328 91 88
## 326 88 105
## 325 68 92
## 327 101 84
## 47 86 80
## 321 88 111
Now, we have a data set with no more errors.
Let’s start comparing variables between each other to see their correlation.
variable1 <- cleared_data$"Days from coma"
variable2 <- cleared_data$"Age"
variable3 <- cleared_data$"Sex"
variable4 <- cleared_data$"Coma duration"
variable5 <- cleared_data$"IQ Test Performance"
variable6 <- cleared_data$"Verbal IQ performance"
#naming the variables I assume could have significant correlation between each other
correlation_coefficient <- cor(variable4, variable5)
print(correlation_coefficient) #checking the correlation between coma duration and IQ test performance
## [1] -0.1786684
No significant correlation between coma duration and IQ test performance found.
correlation_coefficient <- cor(variable4, variable6)
print(correlation_coefficient) #checking the correlation between coma duration and verbal IQ performance
## [1] -0.003885065
No significant correlation between coma duration and verbal IQ performance found.
correlation_coefficient <- cor(variable6, variable5)
print(correlation_coefficient) #checking correlation between IQ test and verbal IQ performance
## [1] 0.6504822
Strong positive correlation between IQ test and verbal IQ performance found, meaning the higher the score patient had on IQ test, the better they scored in verbal IQ performance.
correlation_coefficient <- cor(variable1, variable5)
print(correlation_coefficient) #checking correlation between duration from the coma and IQ test score
## [1] 0.07337349
No significant correlation between duration from the coma and IQ test score found.
library(psych)
describeBy(cleared_data, cleared_data$`Sex`)
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad
## Patient 1 71 5054.93 1373.10 5208 5224.11 932.56
## Days from coma 2 71 316.54 439.47 135 219.67 131.95
## Years from coma 3 71 0.79 1.29 0 0.51 0.00
## Age 4 71 30.86 12.25 27 28.93 7.41
## Sex* 5 71 1.00 0.00 1 1.00 0.00
## Coma duration 6 71 9.27 13.18 4 6.11 4.45
## IQ Test Performance 7 71 89.18 18.00 87 87.96 17.79
## Verbal IQ performance 8 71 94.35 14.25 92 94.54 17.79
## min max range skew kurtosis se
## Patient 1075 7309 6234 -1.25 1.60 162.96
## Days from coma 18 2259 2241 2.28 5.42 52.16
## Years from coma 0 6 6 1.94 3.73 0.15
## Age 17 77 60 1.52 2.01 1.45
## Sex* 1 1 0 NaN NaN 0.00
## Coma duration 0 68 68 2.36 5.65 1.56
## IQ Test Performance 50 133 83 0.48 -0.06 2.14
## Verbal IQ performance 64 131 67 0.03 -0.81 1.69
## ----------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad
## Patient 1 257 4647.49 1543.79 5009 4748.16 1298.76
## Days from coma 2 257 415.26 694.00 159 247.62 173.46
## Years from coma 3 257 1.05 1.98 0 0.57 0.00
## Age 4 257 32.35 14.21 27 30.62 10.38
## Sex* 5 257 2.00 0.00 2 2.00 0.00
## Coma duration 6 257 15.47 28.41 7 9.52 8.90
## IQ Test Performance 7 257 87.12 14.28 87 86.84 14.83
## Verbal IQ performance 8 257 95.19 14.03 94 94.68 14.83
## min max range skew kurtosis se
## Patient 405 7548 7143 -0.63 -0.17 96.30
## Days from coma 13 4933 4920 3.27 12.59 43.29
## Years from coma 0 14 14 3.15 12.16 0.12
## Age 16 80 64 0.97 -0.14 0.89
## Sex* 2 2 0 NaN NaN 0.00
## Coma duration 0 255 255 4.58 26.81 1.77
## IQ Test Performance 50 130 80 0.19 -0.14 0.89
## Verbal IQ performance 64 132 68 0.30 -0.49 0.88
From this data, we can see that on average, coma duration for males is 15,47 days, while for women it is only 9,27 days.
We can also see that females did slightly better in IQ test performance, while males did better in verbal IQ performances.
There is also a visible difference in kurtosis of coma duration, with females’ kurtosis being 5.65 and male 26.81, meaning the distribution of females’ distribution of coma duration has moderately heavy tails, while males’ has extremely heavy tails, meaning the distribution is much more dispersed and has very high probability of extreme values in tails, meaning it is more leptokurtic than the female distribution of coma duration. To be able to explain this better, I will make some box plot and histogram graphs.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
boxplot(`Coma duration` ~ Sex, data = cleared_data,
main = "Coma Duration by Gender",
xlab = "Gender",
ylab = "Coma Duration in days",
ylim = c(0, 100),
border = c("red", "blue"), #Setting the border color for boxes
col = c("pink", "lightblue"), #Setting the fill color for boxes
outline = TRUE)
#Putting Coma duration and Gender variables in boxplot to analyse the difference between genders.
In this boxplot, I see a lot of outliers (more for males than females, like explained before), so I will take z-value of 2 (Any data that are more than 2 standard deviations away from the mean, as a standard threshold use).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
z_threshold <- 2 #Setting the z-value threshold
filtered_data <- cleared_data %>%
filter(abs(scale(`Coma duration`)) < z_threshold)
## Warning: Using one column matrices in `filter()` was deprecated in dplyr
## 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning
## was generated.
boxplot(`Coma duration` ~ Sex, data = filtered_data,
main = "Coma Duration by Gender",
xlab = "Gender",
ylab = "Coma Duration in days",
ylim = c(0, 60),
border = c("red", "blue"), # Setting the border color for boxes
col = c("pink", "lightblue"), # Setting the fill color for boxes
outline = TRUE) #Adding the z-threshold filter to the boxplot and adjusting the y axis maximum for better visibility
From the adjusted box plot, we can see that males indeed have longer comas on average than females despite removing the outliers. It is also important to note there are also more males than males in this data set, possibly because they go in comas more often than females.
coma_duration_male <- filtered_data$`Coma duration`[filtered_data$Sex == "Male"]
coma_duration_female <- filtered_data$`Coma duration`[filtered_data$Sex == "Female"]
#Extracting the 'Coma duration' variable by 'Sex'
hist(coma_duration_male, probability = TRUE,col= rgb(0, 0, 1, 0.5), main = "Histogram and Normal Distribution (Male)",
xlab = "Coma Duration", ylab = "Density") # Creating separate histograms for male and female data, setting the transparency of the columns to 0,5, so that both can be seen
hist(coma_duration_female, probability = TRUE, col = rgb(1, 0, 1, 0.5), main = "Histogram and Normal Distribution (Female)",
xlab = "Coma Duration", ylab = "Density", add = TRUE)
legend("topright", legend = c("Male", "Female"), fill = c("blue", "pink"))
We can see our data is not normally distributed, as it is positively skewed (skewed to the right) for both male and female gender. Since the transparency columns are hard to read, I want to make another graph with separate distributions, based on gender.
This skewness makes sense in practice, because the coma durations start at 0 days.
par(mfrow = c(1, 2),mar = c(5, 4, 4, 2) + 0.1) # Setting up a layout with two plots side by side
hist(coma_duration_male, probability = TRUE, col = rgb(0, 0, 1, 0.5), main = "Histogram (Male)",
xlab = "Coma Duration", ylab = "Density") # Adjusting the transparency
hist(coma_duration_female, probability = TRUE, col = rgb(1, 0, 1, 0.5), main = "Histogram (Female)",
xlab = "Coma Duration", ylab = "Density") # Adjusting the transparency
# Reset the layout to default
par(mfrow = c(1, 1))
Despite x and y axis not being the same, we see male and female distribution being very similar to eachother.
task2 <- read.table("~/IMB/Bootcamp/HOMEWORK/Body mass.csv",
header = TRUE,
sep=";",
dec=",") #Creating the dataset
summary(task2)
## ID Mass
## Min. : 1.00 Min. :49.70
## 1st Qu.:13.25 1st Qu.:60.23
## Median :25.50 Median :62.80
## Mean :25.50 Mean :62.88
## 3rd Qu.:37.75 3rd Qu.:64.50
## Max. :50.00 Max. :83.20
From the data summary we can see that the average weight of ninth grade students from 2018/2019 being 59.5kg increased in 2021/2022 to 62.88kgs.
We can see the minimum weight of a student is 49.7 kgs and maximum weight is 83.20 kgs. We also see that 25% of the students weighed 60.23 kgs or less and 75% of the students weighed 64.50kgs or less.
library(dplyr)
#To see if there are any more descriptive statistics we can get from this dataset
summary(task2$Mass) #Calculating summary statistics for Mass as a variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
mean_mass <- mean(task2$Mass)
median_mass <- median(task2$Mass)
sd_mass <- sd(task2$Mass)
min_mass <- min(task2$Mass)
max_mass <- max(task2$Mass) #Calculating mean, median, standard deviation, minimum, and maximum
cat("Mean Mass:", mean_mass, "\n")
## Mean Mass: 62.876
cat("Median Mass:", median_mass, "\n")
## Median Mass: 62.8
cat("Standard Deviation:", sd_mass, "\n")
## Standard Deviation: 6.011403
cat("Minimum Mass:", min_mass, "\n")
## Minimum Mass: 49.7
cat("Maximum Mass:", max_mass, "\n") #Printing the calculated statistics
## Maximum Mass: 83.2
Additionally, the standard deviation in mass is 6kgs.
library(dplyr)
hist(task2$Mass,
main = "Histogram of Body Mass",
xlab = "Body Mass in Kilograms",
ylab = "Frequency",
col = "blue", # Color of the bars
border = "black", # Color of the bar borders
breaks = 20,)
curve(dnorm(x, mean = mean(task2$Mass), sd = sd(task2$Mass)),
col = "red", # Color of the curve
lwd = 2,# Line width of the curve
add = TRUE) #Adding a normal distribution line to the graph, but it's not working because of different y axis (density and frequency), so the normal distribution curve is not accurate because of different values, so I am adding a separate normal distribution line separately with a density y axis.
curve(dnorm(x, mean = mean(task2$Mass), sd = sd(task2$Mass)),
main = "Normal distribution line based on our dataset",
xlab= "Body Mass in Kilograms",
ylab= "Density",
col = "red",
lwd = 2,
add = 50)
#Creating a histogram of onrmal distribution line based on the mean and standard deviation of our dataset
# Creating the histogram with density-based y-axis, because the original one had frequency y axis.
hist(task2$Mass,
main = "Histogram of Body Mass",
xlab = "Body Mass in Kilograms",
ylab = "Density", # Changing the y-axis label to "Density"
col = "blue",
border = "black",
breaks = 20,
probability = TRUE) # Setting probability to TRUE for density-based y-axis
curve(dnorm(x, mean = mean(task2$Mass), sd = sd(task2$Mass)),
col = "red",
lwd = 2,
add = TRUE) #Adding the normal distribution curve
library(psych)
describeBy(task2$Mass)
## Warning in describeBy(task2$Mass): no grouping variable requested
## vars n mean sd median trimmed mad min max range skew
## X1 1 50 62.88 6.01 62.8 62.56 3.34 49.7 83.2 33.5 0.85
## kurtosis se
## X1 2.11 0.85
Based on this descriptive statistics values and the histogram, we know that:
hypothesized_mean <- 59.5 # Hypothesized population mean
sample_mean <- 62.88
standard_error <- 0.85 # Sample mean and standard error from the dataset
sample_size <- 50 # Sample size (number of observations)
degrees_of_freedom <- sample_size - 1 # Degrees of freedom (n - 1 for a t-test)
t_statistic <- (sample_mean - hypothesized_mean) / (standard_error / sqrt(sample_size)) # Calculating the t-statistic
p_value <- 2 * pt(-abs(t_statistic), df = degrees_of_freedom) # Calculating the p-value for a two-tailed test
alpha <- 0.05 # Significance level (alpha)
if (p_value < alpha) {
cat("Reject the null hypothesis (H0). The sample mean is statistically significantly different from 59.5.\n")
} else {
cat("Fail to reject the null hypothesis (H0). There is not enough evidence to conclude that the sample mean is different from 59.5.\n")
} # Checking if the p-value is less than the significance level
## Reject the null hypothesis (H0). The sample mean is statistically significantly different from 59.5.
cat("t-statistic:", round(t_statistic, 2), "\n")
## t-statistic: 28.12
cat("p-value:", round(p_value, 4), "\n") # Displaying the t-statistic and p-value
## p-value: 0
library(ggplot2)
t.test(task2$Mass, mu=59.5, alternative = "two.sided")
##
## One Sample t-test
##
## data: task2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
H0: μ is equal to 59.5kgs H1: μ is not equal to 59.5kgs
With significance level 5%, meaning alpha=0.05 p-value =0.000234, less than 0.001
We reject the null hypothesis with 95% confidence interval, at p<0.001, because the sample mean is statistically significantly different from the mean=59.5 kgs.
95% CI for arithmetic mean 61.17<Mu<64.58 Because 59.5kg is not included in the interval, we reject the HO at 5%.
sample_mean <- 62.88
standard_error <- 0.85 # Sample mean and standard error from the dataset
hypothesized_mean <- 59.5 # Hypothesized population mean
cohen_d <- (sample_mean - hypothesized_mean) / standard_error # Calculating Cohen's d
print(cohen_d)
## [1] 3.976471
Cohen’s d value of 3.98 suggests a large effect size. This means that the observed difference between hypothesized and actual mean is significant. It’s positive, meaning the sample mean is higher than the hypothesized one.
library(readxl)
Apartments <- read_excel("~/IMB/Bootcamp/HOMEWORK/Apartments.xlsx")
View(Apartments)
Apartments$Parking <- factor(Apartments$Parking)
Apartments$Balcony <- factor(Apartments$Balcony)
library(psych)
describeBy(Apartments$Price)
## Warning in describeBy(Apartments$Price): no grouping variable
## requested
## vars n mean sd median trimmed mad min max range skew
## X1 1 85 2018.94 377.84 1950 1990.29 429.95 1400 2820 1420 0.54
## kurtosis se
## X1 -0.69 40.98
sample_mean <- 2018.94
sample_sd <- 377.84
sample_size <- 85
hypothesized_mean <- 1900
t_statistic <- (sample_mean - hypothesized_mean) / (sample_sd / sqrt(sample_size)) # Calculating the t-statistic
degrees_of_freedom <- sample_size - 1 # Calculating degrees of freedom
p_value <- 2 * pt(-abs(t_statistic), df = degrees_of_freedom) # Calculating the two-tailed p-value
# Significance level (alpha)
alpha <- 0.05
if (p_value < alpha) {
cat("Reject the null hypothesis (H0). The sample mean is statistically significantly different from 1900 EUR.\n")
} else {
cat("Fail to reject the null hypothesis (H0). There is not enough evidence to conclude that the sample mean is different from 1900 EUR.\n")
}
## Reject the null hypothesis (H0). The sample mean is statistically significantly different from 1900 EUR.
cat("t-statistic:", round(t_statistic, 2), "\n")
## t-statistic: 2.9
cat("p-value:", round(p_value, 4), "\n") # Displaying the t-statistic and p-value
## p-value: 0.0047
library(ggplot2)
t.test(Apartments$Price, mu=1900, alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We reject the null hypothesis (H0) at p-value. The sample mean is statistically significantly different from 1900 EUR. We can say with 95 CI, that 1937.44<Mu<2100.44
fit1 <- lm(Price ~ Age, data = Apartments) # Fitting the simple linear regression model
summary(fit1) # Printing a summary of the regression results
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Regression coefficient: For every additional year of age of the apartment, the price per m2 decreases for 8.98€. Coefficient of correlation: Is close to 0, meaning the age of the apartment does not predict the variability of price per m2. Coefficient of determination: 5.3% of variability of price per m2 is explained by linear effect of the age of the apartment.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
pairs(Apartments[, c("Price", "Age", "Distance")])
cor_matrix <- cor(Apartments[, c("Price", "Age", "Distance")])
print(cor_matrix)
## Price Age Distance
## Price 1.0000000 -0.23025502 -0.63102794
## Age -0.2302550 1.00000000 0.04290813
## Distance -0.6310279 0.04290813 1.00000000
Correlation between independent variables Distance and Age are 0.043, meaning there is no potential problem of multicollinearity and they both have unique effects on the dependant varible, Price.
fit2 <- lm(Price ~ Age + Distance, data = Apartments) # the multiple linear regression model
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Price=2460.10-7.93* Age -20.67 *Distance
vif_values <- vif(fit2)
print(vif_values)
## Age Distance
## 1.001845 1.001845
Because the VIF is much lower than 5, we don’t need to remove any of the variables as VIFs indicate no risk of multicolinearity
std_residuals <- rstandard(fit2) # Calculating standardized residuals
Apartments$Std_Residuals <- std_residuals # Adding standardized residuals to the dataset
cook_distances <- cooks.distance(fit2) # Calculating Cook's distances
Apartments$Cook_Distances <- cook_distances # Adding Cook's distances to the dataset
problematic_units <- which(abs(std_residuals) > 2 | cook_distances > 4/nrow(Apartments)) # Identifying units with standardized residuals > 2 or Cook's distances > 4/n
problematic_units # printing the problematic units
## 22 33 38 53 55
## 22 33 38 53 55
filtered_data2 <- Apartments[-problematic_units, ] # Removing problematic units from the dataset
(filtered_data2)
## # A tibble: 80 × 7
## Age Distance Price Parking Balcony Std_Residuals Cook_Distances
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 28 1640 0 1 -0.665 0.00739
## 2 18 1 2800 1 0 1.78 0.0304
## 3 7 28 1660 0 0 -0.594 0.00588
## 4 28 29 1850 0 1 0.754 0.00830
## 5 18 18 1640 1 1 -1.07 0.00511
## 6 28 12 1770 0 1 -0.778 0.00490
## 7 14 20 1850 0 1 -0.302 0.000548
## 8 18 6 1970 1 1 -0.787 0.00378
## 9 22 7 2270 1 0 0.455 0.00129
## 10 25 2 2570 1 0 1.24 0.0167
## # ℹ 70 more rows
plot(fitted(fit2, scale = 1), rstandard(fit2),
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
main = "Scatterplot of Standardized Residuals vs. Standardized Fitted Values") # Creating a scatterplot of standardized residuals vs. standardized fitted values
abline(h = 0, col = "blue", lty = 2) # Adding a horizontal line at y = 0 for reference
Potential heteroskedasticity is not clear from the scatterplot graph, so I will check with Breusch Pagan test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
Because p>0.005, we cannot reject the null hypothesys that states the variance is constant, meaning there is risk for potential heteroskedasticity.
hist(rstandard(fit2), probability = TRUE,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals") # Creating a histogram of standardized residuals
curve(dnorm(x, mean = mean(rstandard(fit2)), sd = sd(rstandard(fit2))),
col = "blue", lwd = 2, add = TRUE) # Adding a density curve (normal distribution)
It doesn’t seem normally distributed, so I will run the Shapiro test as well.
shapiro_test <- shapiro.test(rstandard(fit2))
shapiro_test
##
## Shapiro-Wilk normality test
##
## data: rstandard(fit2)
## W = 0.95306, p-value = 0.00366
Because p<0.005, we reject the null hyphothesys of normal distribution.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
fit2 <- lm(Price ~ Age + Distance, data = Apartments) # Fitting the multiple linear regression model
cooksd <- cooks.distance(fit2) # Calculating Cook's distances
cooksd_threshold <- 4 / length(cooksd) # threshold for Cook's distances
influential_observations <- which(cooksd > cooksd_threshold) # Identifying influential observations
filtered_data <- Apartments[-influential_observations, ] # Creating a new dataset without influential observations
fit2_filtered <- lm(Price ~ Age + Distance, data = filtered_data) # Fitting the multiple linear regression model without influential observations
summary(fit2_filtered)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = filtered_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
If Age of the apartment increases by 1 year, the price per m2 on average decreases by -8.67€, assuming that all other explanatory variables, included in the model, are constant.
If distance from the city center increases by 1km, the price per m2 on average decreases by -24.06€, assuming that all other explanatory variables, included in the model, are constant.
library(dplyr)
Apartments <- Apartments %>%
mutate(Parking = factor(Parking),
Balcony = factor(Balcony)) # Creating dummy variables for Parking and Balcony
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments) # Fitting the multiple linear regression model with dummy variables
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1 196.168 62.868 3.120 0.00251 **
## Balcony1 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Price=2301-6.78Age -18.05Distance +196.17Parking + 1.94Balcony
anova_result <- anova(fit2, fit3) # Comparing fit3 and fit2 with ANOVA
print(anova_result)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Model 1 is better H1: Model 2 is better
p=0.1
We reject the null hypothesys, because p<0.05. This means model 2, the more complex model, is better, because it explains more variability in price per m2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1 196.168 62.868 3.120 0.00251 **
## Balcony1 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Assuming that all other variables are the same, the apartment with parking has on average by 196.17€ higher price compared to the apartments without the parking (p < 0.05)
We cannot assume that when all other variables are the same, the apartment with balcony has on average by 1.94€ higher price compared to the apartments without the balcony, because p=97 is not smaller than alpha=0.05.
Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[colnames(Apartments)%in% c( "Price","Fitted","Residuals" )])
## # A tibble: 6 × 3
## Price Fitted Residuals
## <dbl> <dbl> <dbl>
## 1 1640 1751. -111.
## 2 2800 2357. 443.
## 3 1660 1749. -88.8
## 4 1850 1590. 260.
## 5 1640 2053. -413.
## 6 1770 1897. -127.
Residual for the apartment ID2 is 442.59€, which is how much our model’s price per m2 for apartment ID2 prediction deviated from its actual observed price per m2.