Lana Rakovec

Task 1

library(carData)
data(Wong)
mydata <- Wong
head(mydata)
##     id days duration  sex      age piq viq
## 1 3358   30        4 Male 20.67077  87  89
## 2 3535   16       17 Male 55.28816  95  77
## 3 3547   40        1 Male 55.91513  95 116
## 4 3592   13       10 Male 61.66461  59  73
## 5 3728   19        6 Male 30.12731  67  73
## 6 3790   13        3 Male 57.06229  76  69

Explanation of the dataset: this data set tracks the recovery of IQ after comas of different durations in 200 patients.

There are 331 observations on 200 patients, because patients were periodically administered an IQ test; however, the average number of measurements per patient is small (about 1.7 IQ measurements per patient).
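
These counts can be checked directly from the data (a minimal sketch using the id column):

nrow(mydata) #total number of IQ measurements (331)
length(unique(mydata$id)) #number of distinct patients (200)
nrow(mydata) / length(unique(mydata$id)) #average number of measurements per patient (about 1.7)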

The data come from the carData package, where the dataset is named “Wong”. The original source is:

Wong, P. P., Monette, G., and Weiner, N. I. (2001) Mathematical models of cognitive recovery. Brain Injury, 15, 519–530.

summary(mydata) #Summary of all variables; the summary statistics for id are meaningless, so I will remove that column
##        id            days            duration         sex     
##  Min.   : 405   Min.   :   13.0   Min.   :  0.0   Female: 71  
##  1st Qu.:3808   1st Qu.:   59.0   1st Qu.:  1.0   Male  :260  
##  Median :5085   Median :  150.0   Median :  7.0               
##  Mean   :4735   Mean   :  481.9   Mean   : 14.3               
##  3rd Qu.:5712   3rd Qu.:  416.0   3rd Qu.: 16.0               
##  Max.   :7548   Max.   :11628.0   Max.   :255.0               
##       age              piq              viq        
##  Min.   : 6.513   Min.   : 50.00   Min.   : 64.00  
##  1st Qu.:21.737   1st Qu.: 77.00   1st Qu.: 85.00  
##  Median :26.877   Median : 87.00   Median : 94.00  
##  Mean   :31.853   Mean   : 87.56   Mean   : 94.96  
##  3rd Qu.:40.923   3rd Qu.: 97.00   3rd Qu.:105.00  
##  Max.   :80.033   Max.   :133.00   Max.   :132.00
mydata2 <- mydata[-1] #Removing the ID
summary(mydata2)
##       days            duration         sex           age        
##  Min.   :   13.0   Min.   :  0.0   Female: 71   Min.   : 6.513  
##  1st Qu.:   59.0   1st Qu.:  1.0   Male  :260   1st Qu.:21.737  
##  Median :  150.0   Median :  7.0                Median :26.877  
##  Mean   :  481.9   Mean   : 14.3                Mean   :31.853  
##  3rd Qu.:  416.0   3rd Qu.: 16.0                3rd Qu.:40.923  
##  Max.   :11628.0   Max.   :255.0                Max.   :80.033  
##       piq              viq        
##  Min.   : 50.00   Min.   : 64.00  
##  1st Qu.: 77.00   1st Qu.: 85.00  
##  Median : 87.00   Median : 94.00  
##  Mean   : 87.56   Mean   : 94.96  
##  3rd Qu.: 97.00   3rd Qu.:105.00  
##  Max.   :133.00   Max.   :132.00

With the ID column removed, the summary now contains only meaningful variables. Next, I rename the columns to give clearer variable descriptions.

colnames(mydata) <- c("Patient", "Days from coma","Coma duration", "Sex", "Age", "IQ Test Performance", "Verbal IQ performance") #Changing the column names for clearer variable descriptions
head(mydata) 
##   Patient Days from coma Coma duration  Sex      Age
## 1    3358             30             4 Male 20.67077
## 2    3535             16            17 Male 55.28816
## 3    3547             40             1 Male 55.91513
## 4    3592             13            10 Male 61.66461
## 5    3728             19             6 Male 30.12731
## 6    3790             13             3 Male 57.06229
##   IQ Test Performance Verbal IQ performance
## 1                  87                    89
## 2                  95                    77
## 3                  95                   116
## 4                  59                    73
## 5                  67                    73
## 6                  76                    69
library(tidyr)
mydata3 <- drop_na(mydata)
#I wanted to remove any missing (NA) values from the data set, but there were none, so the number of observations stayed the same
sorted_data <- mydata[order(-mydata$"Days from coma"), ]
head(sorted_data) #ordering the data by descending number of days from coma
##     Patient Days from coma Coma duration  Sex       Age
## 330    5142          11628            57 Male 16.432580
## 331    5964          11038             0 Male 12.544830
## 329    2773           7631            42 Male  6.513347
## 328    3835           4933            14 Male 25.993160
## 326    1939           3864           130 Male 28.273790
## 325     651           3412            21 Male 22.006840
##     IQ Test Performance Verbal IQ performance
## 330                 101                    95
## 331                  71                    73
## 329                  88                   103
## 328                  91                    88
## 326                  88                   105
## 325                  68                    92

I noticed some very large values of days from coma and possible errors when compared with the patients’ ages, so I will add a column with days converted to years for a better representation of the elapsed time.

sorted_data$"Years from coma" <- sorted_data$"Days from coma" / 365.25 #Dividing the days to get out the number of years from coma to be able to compare the years from coma and age
new_order_data <- sorted_data[c("Patient", "Days from coma", "Years from coma", "Age", "Sex", "Coma duration", "IQ Test Performance", "Verbal IQ performance" )] #Sorting the columns so that the years from coma and age are next to each other
new_order_data$"Years from coma" <- round(new_order_data$"Years from coma", 0)
new_order_data$"Age" <- round(new_order_data$"Age", 0)
head(new_order_data) #rounding up the numbers for better visibility
##     Patient Days from coma Years from coma Age  Sex Coma duration
## 330    5142          11628              32  16 Male            57
## 331    5964          11038              30  13 Male             0
## 329    2773           7631              21   7 Male            42
## 328    3835           4933              14  26 Male            14
## 326    1939           3864              11  28 Male           130
## 325     651           3412               9  22 Male            21
##     IQ Test Performance Verbal IQ performance
## 330                 101                    95
## 331                  71                    73
## 329                  88                   103
## 328                  91                    88
## 326                  88                   105
## 325                  68                    92

As seen in the top three patients, the number of years since the coma exceeds their age, which is impossible. I therefore want to find out how many patients there are whose time since the coma exceeds their age.

new_order_data$Comparison <- new_order_data$"Years from coma" > new_order_data$"Age" #Comparing the age and years from coma, to see where years from coma exceeds the age
head(new_order_data[order(-new_order_data$"Comparison"), ]) #Ordering the dataset so that I see the values with exceeding duration first
##     Patient Days from coma Years from coma Age  Sex Coma duration
## 330    5142          11628              32  16 Male            57
## 331    5964          11038              30  13 Male             0
## 329    2773           7631              21   7 Male            42
## 328    3835           4933              14  26 Male            14
## 326    1939           3864              11  28 Male           130
## 325     651           3412               9  22 Male            21
##     IQ Test Performance Verbal IQ performance Comparison
## 330                 101                    95       TRUE
## 331                  71                    73       TRUE
## 329                  88                   103       TRUE
## 328                  91                    88      FALSE
## 326                  88                   105      FALSE
## 325                  68                    92      FALSE

From this output, I see that the first three observations (those with value TRUE when checking whether years from coma exceed age) contain a data error, because a patient’s age cannot be lower than the time elapsed since their coma. Because of this, I will delete these 3 observations.

cleared_data <- new_order_data[c(-1,-2,-3),-9] #Removing the first three observations
head(cleared_data)
##     Patient Days from coma Years from coma Age  Sex Coma duration
## 328    3835           4933              14  26 Male            14
## 326    1939           3864              11  28 Male           130
## 325     651           3412               9  22 Male            21
## 327    2600           3337               9  44 Male             9
## 47     2600           3333               9  44 Male             9
## 321    1939           3111               9  28 Male           130
##     IQ Test Performance Verbal IQ performance
## 328                  91                    88
## 326                  88                   105
## 325                  68                    92
## 327                 101                    84
## 47                   86                    80
## 321                  88                   111

Now we have a data set without these impossible values.
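
An equivalent, position-independent way to drop those rows would be to filter directly on the Comparison flag (a minimal sketch; it should give the same result as the positional removal above):

cleared_data_alt <- new_order_data[!new_order_data$Comparison, ] #keeping only rows where years from coma do not exceed age
cleared_data_alt$Comparison <- NULL #dropping the helper column
nrow(cleared_data_alt) #should equal nrow(cleared_data)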

Let’s compare the variables with one another to see how strongly they correlate.

variable1 <- cleared_data$"Days from coma"
variable2 <- cleared_data$"Age"
variable3 <- cleared_data$"Sex"
variable4 <- cleared_data$"Coma duration"
variable5 <- cleared_data$"IQ Test Performance"
variable6 <- cleared_data$"Verbal IQ performance"
#assigning short names to the variables I expect might be meaningfully correlated with each other
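
Before looking at individual pairs, all pairwise correlations between the numeric columns can also be computed in a single call (a sketch, assuming the column names used above):

numeric_cols <- sapply(cleared_data, is.numeric) #identifying the numeric columns (everything except Sex)
round(cor(cleared_data[, numeric_cols]), 2) #pairwise correlation matrix, rounded for readability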

correlation_coefficient <- cor(variable4, variable5)
print(correlation_coefficient) #checking the correlation between coma duration and IQ test performance
## [1] -0.1786684

Only a weak negative correlation (r = -0.18) between coma duration and IQ test performance was found.

correlation_coefficient <- cor(variable4, variable6)
print(correlation_coefficient) #checking the correlation between coma duration and verbal IQ performance
## [1] -0.003885065

Practically no correlation (r ≈ 0) between coma duration and verbal IQ performance was found.

correlation_coefficient <- cor(variable6, variable5)
print(correlation_coefficient) #checking correlation between IQ test and verbal IQ performance
## [1] 0.6504822

A fairly strong positive correlation (r = 0.65) between IQ test and verbal IQ performance was found, meaning patients who scored higher on the IQ test also tended to score higher on verbal IQ.

correlation_coefficient <- cor(variable1, variable5)
print(correlation_coefficient) #checking correlation between duration from the coma and IQ test score
## [1] 0.07337349

Only a negligible correlation (r = 0.07) between the time elapsed since the coma and the IQ test score was found.
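
The coefficients above are only descriptive; if a formal significance test were desired, cor.test() could be used for any pair (a sketch for coma duration and IQ test performance):

cor.test(variable4, variable5) #tests H0: the true correlation between coma duration and IQ test performance is zero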

library(psych)
describeBy(cleared_data, cleared_data$`Sex`)
## 
##  Descriptive statistics by group 
## group: Female
##                       vars  n    mean      sd median trimmed    mad
## Patient                  1 71 5054.93 1373.10   5208 5224.11 932.56
## Days from coma           2 71  316.54  439.47    135  219.67 131.95
## Years from coma          3 71    0.79    1.29      0    0.51   0.00
## Age                      4 71   30.86   12.25     27   28.93   7.41
## Sex*                     5 71    1.00    0.00      1    1.00   0.00
## Coma duration            6 71    9.27   13.18      4    6.11   4.45
## IQ Test Performance      7 71   89.18   18.00     87   87.96  17.79
## Verbal IQ performance    8 71   94.35   14.25     92   94.54  17.79
##                        min  max range  skew kurtosis     se
## Patient               1075 7309  6234 -1.25     1.60 162.96
## Days from coma          18 2259  2241  2.28     5.42  52.16
## Years from coma          0    6     6  1.94     3.73   0.15
## Age                     17   77    60  1.52     2.01   1.45
## Sex*                     1    1     0   NaN      NaN   0.00
## Coma duration            0   68    68  2.36     5.65   1.56
## IQ Test Performance     50  133    83  0.48    -0.06   2.14
## Verbal IQ performance   64  131    67  0.03    -0.81   1.69
## ---------------------------------------------------- 
## group: Male
##                       vars   n    mean      sd median trimmed     mad
## Patient                  1 257 4647.49 1543.79   5009 4748.16 1298.76
## Days from coma           2 257  415.26  694.00    159  247.62  173.46
## Years from coma          3 257    1.05    1.98      0    0.57    0.00
## Age                      4 257   32.35   14.21     27   30.62   10.38
## Sex*                     5 257    2.00    0.00      2    2.00    0.00
## Coma duration            6 257   15.47   28.41      7    9.52    8.90
## IQ Test Performance      7 257   87.12   14.28     87   86.84   14.83
## Verbal IQ performance    8 257   95.19   14.03     94   94.68   14.83
##                       min  max range  skew kurtosis    se
## Patient               405 7548  7143 -0.63    -0.17 96.30
## Days from coma         13 4933  4920  3.27    12.59 43.29
## Years from coma         0   14    14  3.15    12.16  0.12
## Age                    16   80    64  0.97    -0.14  0.89
## Sex*                    2    2     0   NaN      NaN  0.00
## Coma duration           0  255   255  4.58    26.81  1.77
## IQ Test Performance    50  130    80  0.19    -0.14  0.89
## Verbal IQ performance  64  132    68  0.30    -0.49  0.88

From this data, we can see that the average coma duration for males is 15.47 days, while for females it is only 9.27 days.

We can also see that females did slightly better in IQ test performance, while males did better in verbal IQ performances.

There is also a visible difference in the kurtosis of coma duration: 5.65 for females and 26.81 for males. The female distribution of coma duration has moderately heavy tails, while the male distribution has extremely heavy tails, i.e. it is far more leptokurtic, with a much higher probability of extreme values. To illustrate this better, I will make some box plots and histograms.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
boxplot(`Coma duration` ~ Sex, data = cleared_data, 
        main = "Coma Duration by Gender",
        xlab = "Gender", 
        ylab = "Coma Duration in days",
        ylim = c(0, 100),
        border = c("red", "blue"),  #Setting the border color for boxes 
        col = c("pink", "lightblue"),  #Setting the fill color for boxes 
        outline = TRUE)

#Putting Coma duration and Gender variables in boxplot to analyse the difference between genders.

In this boxplot I see many outliers (more for males than for females, as explained before), so I will apply a z-score threshold of 2: any observation more than 2 standard deviations away from the mean is treated as an outlier.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
z_threshold <- 2 #Setting the z-value threshold
filtered_data <- cleared_data %>%
  filter(abs(as.numeric(scale(`Coma duration`))) < z_threshold) #keeping only observations within 2 standard deviations of the mean, using a one-dimensional logical condition so dplyr does not warn about one-column matrices
boxplot(`Coma duration` ~ Sex, data = filtered_data, 
        main = "Coma Duration by Gender",
        xlab = "Gender", 
        ylab = "Coma Duration in days",
        ylim = c(0, 60),
        border = c("red", "blue"),  # Setting the border color for boxes
        col = c("pink", "lightblue"),  # Setting the fill color for boxes 
        outline = TRUE) #Adding the z-threshold filter to the boxplot and adjusting the y axis maximum for better visibility

From the adjusted box plot, we can see that males indeed have longer comas on average than females, even after removing the outliers. It is also important to note that there are far more males than females in this data set, possibly because males suffer coma-inducing injuries more often than females.
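
The imbalance between the sexes can be confirmed with a simple count (a sketch):

table(cleared_data$Sex) #number of female and male observations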

coma_duration_male <- filtered_data$`Coma duration`[filtered_data$Sex == "Male"]
coma_duration_female <- filtered_data$`Coma duration`[filtered_data$Sex == "Female"]
#Extracting the 'Coma duration' variable by 'Sex'

hist(coma_duration_male, probability = TRUE, col = rgb(0, 0, 1, 0.5), main = "Histograms of Coma Duration (Male and Female)",
     xlab = "Coma Duration", ylab = "Density") # Creating overlaid histograms for male and female data, with bar transparency set to 0.5 so that both can be seen
hist(coma_duration_female, probability = TRUE, col = rgb(1, 0, 1, 0.5),
     xlab = "Coma Duration", ylab = "Density", add = TRUE) # Adding the female histogram on top of the male one

legend("topright", legend = c("Male", "Female"), fill = c("blue", "pink"))

We can see our data is not normally distributed: it is positively skewed (skewed to the right) for both males and females. Since the overlapping transparent bars are hard to read, I will make another graph with separate distributions by sex.

This skewness makes sense in practice, because the coma durations start at 0 days.

par(mfrow = c(1, 2),mar = c(5, 4, 4, 2) + 0.1) # Setting up a layout with two plots side by side


hist(coma_duration_male, probability = TRUE, col = rgb(0, 0, 1, 0.5), main = "Histogram (Male)",
     xlab = "Coma Duration", ylab = "Density") # Adjusting the transparency

hist(coma_duration_female, probability = TRUE, col = rgb(1, 0, 1, 0.5), main = "Histogram (Female)",
     xlab = "Coma Duration", ylab = "Density") # Adjusting the transparency

# Reset the layout to default
par(mfrow = c(1, 1))

Although the x and y axes are not on the same scale, the male and female distributions look very similar to each other.
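
To compare the two distributions on identical axes, a faceted ggplot could be used instead of two base-R histograms (a sketch, using the filtered_data object from above):

library(ggplot2)
ggplot(filtered_data, aes(x = `Coma duration`, fill = Sex)) +
  geom_histogram(bins = 20, colour = "black") +
  facet_wrap(~ Sex) + #one panel per sex, with shared x and y axes
  labs(title = "Coma duration by sex", x = "Coma duration in days", y = "Count")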

Task 2

task2 <- read.table("~/IMB/Bootcamp/HOMEWORK/Body mass.csv",
                    header = TRUE,
                    sep=";",
                    dec=",") #Creating the dataset
summary(task2)
##        ID             Mass      
##  Min.   : 1.00   Min.   :49.70  
##  1st Qu.:13.25   1st Qu.:60.23  
##  Median :25.50   Median :62.80  
##  Mean   :25.50   Mean   :62.88  
##  3rd Qu.:37.75   3rd Qu.:64.50  
##  Max.   :50.00   Max.   :83.20

From the data summary we can see that the average weight of ninth-grade students, which was 59.5 kg in 2018/2019, increased to 62.88 kg in the 2021/2022 sample.

We can see that the minimum weight of a student is 49.7 kg and the maximum is 83.2 kg. We also see that 25% of the students weighed 60.23 kg or less and 75% weighed 64.5 kg or less.

library(dplyr)
   #To see if there are any more descriptive statistics we can get from this dataset


summary(task2$Mass) #Calculating summary statistics for Mass as a variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.70   60.23   62.80   62.88   64.50   83.20
mean_mass <- mean(task2$Mass)
median_mass <- median(task2$Mass)
sd_mass <- sd(task2$Mass)
min_mass <- min(task2$Mass)
max_mass <- max(task2$Mass) #Calculating mean, median, standard deviation, minimum, and maximum


cat("Mean Mass:", mean_mass, "\n")
## Mean Mass: 62.876
cat("Median Mass:", median_mass, "\n")
## Median Mass: 62.8
cat("Standard Deviation:", sd_mass, "\n")
## Standard Deviation: 6.011403
cat("Minimum Mass:", min_mass, "\n")
## Minimum Mass: 49.7
cat("Maximum Mass:", max_mass, "\n") #Printing the calculated statistics
## Maximum Mass: 83.2

Additionally, the standard deviation of mass is about 6.01 kg.

library(dplyr)   


hist(task2$Mass, 
     main = "Histogram of Body Mass",
     xlab = "Body Mass in Kilograms",
     ylab = "Frequency",
     col = "blue",  # Color of the bars
     border = "black",  # Color of the bar borders
     breaks = 20) 


curve(dnorm(x, mean = mean(task2$Mass), sd = sd(task2$Mass)), 
      col = "red",     # Color of the curve
      lwd = 2,         # Line width of the curve
      add = TRUE)      # Adding a normal distribution curve on top of the frequency histogram; the scales do not match (density vs. frequency), so the curve looks wrong here and I will draw it again separately on a density axis.

curve(dnorm(x, mean = mean(task2$Mass), sd = sd(task2$Mass)),
      from = min(task2$Mass), to = max(task2$Mass),
      main = "Normal distribution line based on our dataset",
      xlab = "Body Mass in Kilograms",
      ylab = "Density",
      col = "red",
      lwd = 2) # Drawing the curve as a stand-alone plot over the observed mass range

#Creating a normal distribution curve based on the mean and standard deviation of our dataset
# Re-creating the histogram with a density-based y-axis, because the original one used a frequency y-axis.
hist(task2$Mass, 
     main = "Histogram of Body Mass",
     xlab = "Body Mass in Kilograms",
     ylab = "Density",  # Changing the y-axis label to "Density"
     col = "blue",  
     border = "black",  
     breaks = 20,
     probability = TRUE)  # Setting probability to TRUE for density-based y-axis
curve(dnorm(x, mean = mean(task2$Mass), sd = sd(task2$Mass)), 
      col = "red",     
      lwd = 2,
      add = TRUE) #Adding the normal distribution curve

library(psych)
describeBy(task2$Mass)
## Warning in describeBy(task2$Mass): no grouping variable requested
##    vars  n  mean   sd median trimmed  mad  min  max range skew
## X1    1 50 62.88 6.01   62.8   62.56 3.34 49.7 83.2  33.5 0.85
##    kurtosis   se
## X1     2.11 0.85

Based on these descriptive statistics and the histogram, the distribution of body mass is moderately right-skewed (skew = 0.85) with mean 62.88 kg, so I will now test whether the population mean differs from the hypothesized 59.5 kg.

hypothesized_mean <- 59.5 # Hypothesized population mean


sample_mean <- 62.88
standard_error <- 0.85 # Sample mean and standard error from the dataset


sample_size <- 50 # Sample size (number of observations)


degrees_of_freedom <- sample_size - 1 # Degrees of freedom (n - 1 for a t-test)


t_statistic <- (sample_mean - hypothesized_mean) / standard_error # Calculating the t-statistic (the standard error already accounts for the sample size, so it is not divided by sqrt(n) again)


p_value <- 2 * pt(-abs(t_statistic), df = degrees_of_freedom) # Calculating the p-value for a two-tailed test


alpha <- 0.05 # Significance level (alpha)


if (p_value < alpha) {
  cat("Reject the null hypothesis (H0). The sample mean is statistically significantly different from 59.5.\n")
} else {
  cat("Fail to reject the null hypothesis (H0). There is not enough evidence to conclude that the sample mean is different from 59.5.\n")
} # Checking if the p-value is less than the significance level
## Reject the null hypothesis (H0). The sample mean is statistically significantly different from 59.5.
cat("t-statistic:", round(t_statistic, 2), "\n")
## t-statistic: 28.12
cat("p-value:", round(p_value, 4), "\n") # Displaying the t-statistic and p-value
## p-value: 0
library(ggplot2)
t.test(task2$Mass, mu=59.5, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  task2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

H0: μ = 59.5 kg; H1: μ ≠ 59.5 kg

At the 5% significance level (alpha = 0.05), the p-value of 0.000234 is less than 0.001.

We therefore reject the null hypothesis at p < 0.001: the sample mean is statistically significantly different from 59.5 kg.

The 95% confidence interval for the mean is 61.17 < μ < 64.58. Because 59.5 kg is not included in this interval, we also reject H0 at the 5% level.
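
The confidence interval and exact p-value can also be pulled straight out of the t.test object (a sketch):

tt <- t.test(task2$Mass, mu = 59.5, alternative = "two.sided")
tt$conf.int #95% confidence interval for the mean
tt$p.value #exact p-value of the test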

sample_mean <- 62.88
sample_sd <- 6.01 # Sample mean and standard deviation from the dataset (Cohen's d uses the standard deviation, not the standard error)


hypothesized_mean <- 59.5 # Hypothesized population mean


cohen_d <- (sample_mean - hypothesized_mean) / sample_sd # Calculating Cohen's d


print(cohen_d) 
## [1] 0.562396

Cohen’s d of about 0.56 suggests a medium effect size: the observed mean lies a bit more than half a standard deviation above the hypothesized mean. It is positive, meaning the sample mean is higher than the hypothesized one.

Task 3

Import the dataset Apartments.xlsx
library(readxl)
Apartments <- read_excel("~/IMB/Bootcamp/HOMEWORK/Apartments.xlsx")
View(Apartments)
Description:
  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking)

Apartments$Balcony <- factor(Apartments$Balcony)

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

library(psych)
describeBy(Apartments$Price)
## Warning in describeBy(Apartments$Price): no grouping variable
## requested
##    vars  n    mean     sd median trimmed    mad  min  max range skew
## X1    1 85 2018.94 377.84   1950 1990.29 429.95 1400 2820  1420 0.54
##    kurtosis    se
## X1    -0.69 40.98
sample_mean <- 2018.94
sample_sd <- 377.84
sample_size <- 85
hypothesized_mean <- 1900


t_statistic <- (sample_mean - hypothesized_mean) / (sample_sd / sqrt(sample_size)) # Calculating the t-statistic


degrees_of_freedom <- sample_size - 1 # Calculating degrees of freedom


p_value <- 2 * pt(-abs(t_statistic), df = degrees_of_freedom) # Calculating the two-tailed p-value

# Significance level (alpha)
alpha <- 0.05


if (p_value < alpha) {
  cat("Reject the null hypothesis (H0). The sample mean is statistically significantly different from 1900 EUR.\n")
} else {
  cat("Fail to reject the null hypothesis (H0). There is not enough evidence to conclude that the sample mean is different from 1900 EUR.\n")
}
## Reject the null hypothesis (H0). The sample mean is statistically significantly different from 1900 EUR.
cat("t-statistic:", round(t_statistic, 2), "\n")
## t-statistic: 2.9
cat("p-value:", round(p_value, 4), "\n") # Displaying the t-statistic and p-value
## p-value: 0.0047
library(ggplot2)
t.test(Apartments$Price, mu=1900, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We reject the null hypothesis (H0) at p = 0.0047 < 0.05: the sample mean is statistically significantly different from 1900 EUR. With 95% confidence, 1937.44 < μ < 2100.44.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments) # Fitting the simple linear regression model


summary(fit1) # Printing a summary of the regression results
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Regression coefficient: for every additional year of age of the apartment, the price per m2 decreases on average by 8.98 €. Coefficient of correlation: r = -0.23 (the negative square root of R-squared), a weak negative linear association between age and price per m2. Coefficient of determination: 5.3% of the variability of price per m2 is explained by the linear effect of the age of the apartment.
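
The correlation coefficient quoted above can be obtained directly; in simple regression its absolute value equals the square root of R-squared, with the sign of the slope (a sketch):

cor(Apartments$Price, Apartments$Age) #Pearson correlation between price and age
-sqrt(summary(fit1)$r.squared) #same value recovered from R-squared; negative because the slope is negative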

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
pairs(Apartments[, c("Price", "Age", "Distance")])

cor_matrix <- cor(Apartments[, c("Price", "Age", "Distance")])
print(cor_matrix)
##               Price         Age    Distance
## Price     1.0000000 -0.23025502 -0.63102794
## Age      -0.2302550  1.00000000  0.04290813
## Distance -0.6310279  0.04290813  1.00000000

The correlation between the independent variables Distance and Age is 0.043, so there is no potential problem of multicollinearity and each has its own distinct effect on the dependent variable, Price.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments) # the multiple linear regression model

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Price = 2460.10 - 7.93*Age - 20.67*Distance
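
As an illustration of using this equation, predict() can evaluate it for a hypothetical apartment (the Age and Distance values below are made up purely for illustration):

new_apartment <- data.frame(Age = 10, Distance = 5) #hypothetical apartment, 10 years old, 5 km from the centre
predict(fit2, newdata = new_apartment) #predicted price per m2 from the fitted equation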

Check the multicollinearity with VIF statistics. Explain the findings.

vif_values <- vif(fit2)
print(vif_values)
##      Age Distance 
## 1.001845 1.001845

Because both VIF values are much lower than 5, we don’t need to remove any of the variables; the VIFs indicate no risk of multicollinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

std_residuals <- rstandard(fit2) # Calculating standardized residuals


Apartments$Std_Residuals <- std_residuals # Adding standardized residuals to the dataset


cook_distances <- cooks.distance(fit2) # Calculating Cook's distances


Apartments$Cook_Distances <- cook_distances # Adding Cook's distances to the dataset



problematic_units <- which(abs(std_residuals) > 2 | cook_distances > 4/nrow(Apartments)) # Identifying units with standardized residuals > 2 or Cook's distances > 4/n


problematic_units # printing the problematic units
## 22 33 38 53 55 
## 22 33 38 53 55
filtered_data2 <- Apartments[-problematic_units, ] # Removing problematic units from the dataset
(filtered_data2)
## # A tibble: 80 × 7
##      Age Distance Price Parking Balcony Std_Residuals Cook_Distances
##    <dbl>    <dbl> <dbl> <fct>   <fct>           <dbl>          <dbl>
##  1     7       28  1640 0       1              -0.665       0.00739 
##  2    18        1  2800 1       0               1.78        0.0304  
##  3     7       28  1660 0       0              -0.594       0.00588 
##  4    28       29  1850 0       1               0.754       0.00830 
##  5    18       18  1640 1       1              -1.07        0.00511 
##  6    28       12  1770 0       1              -0.778       0.00490 
##  7    14       20  1850 0       1              -0.302       0.000548
##  8    18        6  1970 1       1              -0.787       0.00378 
##  9    22        7  2270 1       0               0.455       0.00129 
## 10    25        2  2570 1       0               1.24        0.0167  
## # ℹ 70 more rows

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

plot(scale(fitted(fit2)), rstandard(fit2), 
     xlab = "Standardized Fitted Values", 
     ylab = "Standardized Residuals", 
     main = "Scatterplot of Standardized Residuals vs. Standardized Fitted Values") # Creating a scatterplot of standardized residuals vs. standardized fitted values


abline(h = 0, col = "blue", lty = 2) # Adding a horizontal line at y = 0 for reference

Potential heteroskedasticity is not obvious from the scatterplot, so I will check with the Breusch–Pagan test.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##        Test Summary         
##  ---------------------------
##  DF            =    1 
##  Chi2          =    0.968106 
##  Prob > Chi2   =    0.325153

Because p>0.005, we cannot reject the null hypothesys that states the variance is constant, meaning there is risk for potential heteroskedasticity.
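
As a cross-check, the same kind of test is available as bptest() in the lmtest package (a sketch; it should lead to the same conclusion):

library(lmtest)
bptest(fit2) #studentized Breusch-Pagan test; H0: the error variance is constant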

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(rstandard(fit2), probability = TRUE, 
     main = "Histogram of Standardized Residuals", 
     xlab = "Standardized Residuals") # Creating a histogram of standardized residuals


curve(dnorm(x, mean = mean(rstandard(fit2)), sd = sd(rstandard(fit2))), 
      col = "blue", lwd = 2, add = TRUE) # Adding a density curve (normal distribution)

The residuals do not look normally distributed, so I will also run the Shapiro–Wilk test.
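
Before the formal test, a Q–Q plot is another common visual check of normality that could be added alongside the histogram (a sketch):

qqnorm(rstandard(fit2), main = "Normal Q-Q plot of standardized residuals")
qqline(rstandard(fit2), col = "blue", lwd = 2) #reference line; points far from it indicate departures from normality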

shapiro_test <- shapiro.test(rstandard(fit2))

shapiro_test
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(fit2)
## W = 0.95306, p-value = 0.00366

Because p<0.005, we reject the null hyphothesys of normal distribution.

Estimate fit2 again without the potentially problematic units and show the summary of the model. Explain all coefficients.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
fit2 <- lm(Price ~ Age + Distance, data = Apartments) # Fitting the multiple linear regression model

cooksd <- cooks.distance(fit2) # Calculating Cook's distances

cooksd_threshold <- 4 / length(cooksd) # threshold for Cook's distances 

influential_observations <- which(cooksd > cooksd_threshold) # Identifying influential observations

filtered_data <- Apartments[-influential_observations, ] # Creating a new dataset without influential observations

fit2_filtered <- lm(Price ~ Age + Distance, data = filtered_data) # Fitting the multiple linear regression model without influential observations

summary(fit2_filtered)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = filtered_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

If the age of the apartment increases by 1 year, the price per m2 on average decreases by 8.67 €, holding all other explanatory variables in the model constant.

If the distance from the city centre increases by 1 km, the price per m2 on average decreases by 24.06 €, holding all other explanatory variables in the model constant.
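
Confidence intervals for these coefficients can be obtained with confint() (a sketch):

confint(fit2_filtered, level = 0.95) #95% confidence intervals for the intercept, Age and Distance coefficients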

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

library(dplyr)

Apartments <- Apartments %>%
  mutate(Parking = factor(Parking),
         Balcony = factor(Balcony)) # Ensuring Parking and Balcony are factors so that lm() creates dummy variables for them

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments) # Fitting the multiple linear regression model with dummy variables

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Price = 2301.67 - 6.80*Age - 18.05*Distance + 196.17*Parking1 + 1.94*Balcony1

With function anova check if model fit3 fits data better than model fit2.

anova_result <- anova(fit2, fit3) # Comparing fit3 and fit2 with ANOVA

print(anova_result)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: the additional coefficients (Parking, Balcony) are zero, i.e. the simpler model 1 is sufficient. H1: at least one of the additional coefficients differs from zero, i.e. model 2 fits better.

p = 0.0101

We reject the null hypothesis because p < 0.05. This means model 2, the more complex model, fits the data better: it explains significantly more variability in price per m2.
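
A complementary way to compare the two models is with information criteria (a sketch; the model with the lower value is preferred):

AIC(fit2, fit3) #Akaike information criterion for both models
BIC(fit2, fit3) #Bayesian information criterion, which penalizes extra parameters more heavily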

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Holding all other variables constant, an apartment with parking has on average a 196.17 € higher price per m2 than an apartment without parking (p < 0.05).

The estimated effect of a balcony is only 1.94 € and is not statistically significant (p = 0.974 > 0.05), so we cannot claim that, all else equal, apartments with a balcony have a different price per m2 than those without one.

The F-statistic at the bottom of the output tests H0: all slope coefficients (Age, Distance, Parking, Balcony) are equal to zero against H1: at least one of them differs from zero. With p = 1.849e-11, we reject H0, so the model as a whole explains a significant part of the variability in price per m2.

Save fitted values and calculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[colnames(Apartments)%in% c( "Price","Fitted","Residuals" )])
## # A tibble: 6 × 3
##   Price Fitted Residuals
##   <dbl>  <dbl>     <dbl>
## 1  1640  1751.    -111. 
## 2  2800  2357.     443. 
## 3  1660  1749.     -88.8
## 4  1850  1590.     260. 
## 5  1640  2053.    -413. 
## 6  1770  1897.    -127.

The residual for apartment ID 2 is 442.59 €, which is the amount by which the observed price per m2 of apartment ID 2 exceeds the price per m2 predicted by our model.
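
The same residual can be pulled out directly by row position (a sketch, assuming apartment ID 2 corresponds to the second row of the dataset):

Apartments$Residuals[2] #residual for the apartment in row 2
Apartments$Price[2] - Apartments$Fitted[2] #the same value computed by hand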