Maja Zupan

Task 1

data("USArrests")

Source of data: World Almanac and Book of Facts 1975. (Crime rates).

Statistical Abstracts of the United States 1975, p.20, (Urban rates), possibly available as https://books.google.ch/books?id=zl9qAAAAMAAJ&pg=PA20.

This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

Let’s first create a summary:

A data frame with 50 observations on 4 variables.
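
A quick structural check (a minimal sketch; output omitted here):

str(USArrests)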

To make it clearer, I am going to create a new data frame with reordered columns.

mydata <- USArrests[ , c(1, 2, 4, 3)]

I would also like the state names to be shown next to the data, so I will create a new column, Country, containing the names of the US states (the observational units here are states, although the column is named Country).

mydata$Country <- rownames(mydata)

rownames(mydata) <- NULL
#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

I will create a new data frame with the additional variable (Country) and the columns in my preferred order.

mydata2 <- mydata %>%
  select(Country, UrbanPop, Murder, Assault, Rape)

Now, we can create a summary of the data we are analysing.

summary(mydata2)
##    Country             UrbanPop         Murder          Assault           Rape      
##  Length:50          Min.   :32.00   Min.   : 0.800   Min.   : 45.0   Min.   : 7.30  
##  Class :character   1st Qu.:54.50   1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:15.07  
##  Mode  :character   Median :66.00   Median : 7.250   Median :159.0   Median :20.10  
##                     Mean   :65.54   Mean   : 7.788   Mean   :170.8   Mean   :21.23  
##                     3rd Qu.:77.75   3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:26.18  
##                     Max.   :91.00   Max.   :17.400   Max.   :337.0   Max.   :46.00

Explanation of the variables used in the analysis:

- Country = name of the US state (the units in this data set are states, not countries)
- UrbanPop = percent of the population living in urban areas
- Murder = number of murder arrests per 100,000 residents
- Assault = number of assault arrests per 100,000 residents
- Rape = number of rape arrests per 100,000 residents

Descriptive statistics for each numeric column of the data frame:

#install.packages("psych")
library(psych)
describe(mydata2[ , -1])
##          vars  n   mean    sd median trimmed    mad  min   max range  skew kurtosis    se
## UrbanPop    1 50  65.54 14.47  66.00   65.88  17.79 32.0  91.0  59.0 -0.21    -0.87  2.05
## Murder      2 50   7.79  4.36   7.25    7.53   5.41  0.8  17.4  16.6  0.37    -0.95  0.62
## Assault     3 50 170.76 83.34 159.00  168.48 110.45 45.0 337.0 292.0  0.22    -1.15 11.79
## Rape        4 50  21.23  9.37  20.10   20.36   8.60  7.3  46.0  38.7  0.75     0.08  1.32

Explanation of different values:

Murder:

- Mean: On average, there are 7.79 murder arrests per 100,000 residents across US states.
- Min: The lowest murder arrest rate among US states is 0.8 per 100,000 residents.

Assault:

- Median: 50% of states have 159 or fewer assault arrests per 100,000 residents.
- SD (standard deviation): On average, the assault arrest rates deviate from their mean by approximately 83.34 arrests per 100,000 residents.

Rape:

- Skewness: A skewness of 0.75 suggests a moderate positive skew, implying relatively more high values or outliers on the right side of the distribution. In practical terms, some states have noticeably higher rape arrest rates per 100,000 residents than the majority. The distribution is not symmetric and may be influenced by relatively high values or outliers on the right side.
- Max: The highest number of rape arrests per 100,000 residents among US states is 46.

subset_row <- mydata2[mydata2$Rape == 46.0, ]

subset_row
##    Country UrbanPop Murder Assault Rape
## 28  Nevada       81   12.2     252   46

We can see that the state with the highest number of rape arrests per 100,000 residents is Nevada (46).
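
A value-independent way to locate the same row, in case the maximum ever changes (a sketch):

mydata2[which.max(mydata2$Rape), ] #row with the highest Rape arrest rate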

#install.packages("ggplot2")
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
#Creating histograms for each variable
hist(mydata2$Murder,
     main = "Histogram of Murder Arrest Rate",
     xlab = "Murder Arrests per 100,000 residents",
     ylab = "Frequency",
     breaks = seq(from = 0, to = 18, by = 2))

hist(mydata2$Assault,
     main = "Histogram of Assault Arrest Rate",
     xlab = "Assault Arrests per 100,000 residents",
     ylab = "Frequency",
     breaks = seq(from = 0, to = 400, by = 50))

hist(mydata2$Rape,
     main = "Histogram of Rape Arrest Rate",
     xlab = "Rape Arrests per 100,000 residents",
     ylab = "Frequency",
     breaks = seq(from = 0, to = 50, by = 5))

Interpretation of Histogram of Murder Arrest Rate: the values do not look normally distributed (the distribution appears bimodal) and are slightly asymmetric to the right (positive skewness).

Interpretation of Histogram of Assault Arrest Rate: the values do not look normally distributed (the distribution appears bimodal).

Interpretation of Histogram of Rape Arrest Rate: the values look roughly normally distributed, but slightly asymmetric to the right (positive skewness).
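
A normal Q-Q plot gives a complementary visual check of normality (a sketch for the Murder variable; the same idea applies to the other variables):

qqnorm(mydata2$Murder, main = "Q-Q Plot of Murder Arrest Rate")
qqline(mydata2$Murder)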

Let’s formally check the normality of the Murder variable with the Shapiro-Wilk test:

shapiro_test_result <- shapiro.test(mydata2$Murder)
print(shapiro_test_result)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$Murder
## W = 0.95703, p-value = 0.06674

H0: the variable Murder is normally distributed
H1: the variable Murder is NOT normally distributed

Because the p-value (0.067) is greater than 0.05 (assuming a 0.05 significance level), we cannot reject the null hypothesis.

We also have a sample larger than 30 (n = 50), so by the central limit theorem the sampling distribution of the mean is approximately normal and mild departures from normality are not a major concern; this does not, however, guarantee that the data themselves are normally distributed.
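
The same test can be applied to the remaining variables in one call (a sketch; p-values not shown here):

sapply(mydata2[c("Assault", "Rape")], function(x) shapiro.test(x)$p.value)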

Let’s create a regression model for each variable:

lm_model_Murder <- lm(Murder ~ UrbanPop, data = mydata2)

lm_model_Assault <- lm(Assault ~ UrbanPop, data = mydata2)

lm_model_Rape <- lm(Rape ~ UrbanPop, data = mydata2)

It would be interesting to compare the numbers of murder, assault, and rape arrests in US states (per 100,000 residents) against the percentage of urban population. Therefore, we will create scatterplots and add a regression line to each.

ggplot(mydata2, aes(x = UrbanPop, y = Murder)) +
  geom_point(color = "cyan3") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Number of Murder Arrests based on the percentage of Urban Population",
       x = "Urban Population in %",
       y = "Number of Murder Arrests per 100,000 residents")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(mydata2, aes(x = UrbanPop, y = Assault)) +
  geom_point(color = "deeppink1") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Number of Assault Arrests based on the percentage of Urban Population",
       x = "Urban Population in %",
       y = "Number of Assault Arrests per 100,000 residents")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(mydata2, aes(x = UrbanPop, y = Rape)) +
  geom_point(color = "orchid4") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Number of Rape Arrests based on the percentage of Urban Population",
       x = "Urban Population in %",
       y = "Number of Rape Arrests per 100,000 residents")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation of the scatterplot and regression line for murder arrests against the percentage of urban population in US states: there is a slightly positive linear relationship between the variables, meaning that as the percentage of urban population increases, the number of murder arrests per 100,000 residents increases only slightly. There are potential outliers.

Interpretation of the scatterplot and regression line for assault arrests against the percentage of urban population in US states: there is a positive linear relationship between the variables, meaning that as the percentage of urban population increases, the number of assault arrests per 100,000 residents increases.

Interpretation of the scatterplot and regression line for rape arrests against the percentage of urban population in US states: there appears to be a potential violation of homoscedasticity (the spread of the points is not constant); we check this below with the Breusch-Pagan test. There are potential outliers.
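
A sketch of that check, assuming the lmtest package is available (the Breusch-Pagan null hypothesis is that the error variance is constant):

#install.packages("lmtest")
library(lmtest)
bptest(lm_model_Rape)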

To get better results, we should deal with outliers and units that have a large influence on the estimated regression function. First, we calculate the standardized residuals and add the values to the data frame mydata2:

std_residuals_Assault <- rstandard(lm_model_Assault)
mydata2$std_res_Assault <- std_residuals_Assault

std_residuals_Murder <- rstandard(lm_model_Murder)
mydata2$std_res_Murder <- std_residuals_Murder

std_residuals_Rape <- rstandard(lm_model_Rape)
mydata2$std_res_Rape <- std_residuals_Rape

We now winsorize the outliers by capping the standardized residuals at ±3:

# Cap (winsorize) the standardized residuals at +/- 3; values beyond
# that range are set to the boundary itself
mydata2$std_res_Murder  <- pmax(pmin(mydata2$std_res_Murder,  3), -3)
mydata2$std_res_Assault <- pmax(pmin(mydata2$std_res_Assault, 3), -3)
mydata2$std_res_Rape    <- pmax(pmin(mydata2$std_res_Rape,    3), -3)

We will round the values of standardized residuals:

mydata2$std_res_Assault <- round(mydata2$std_res_Assault, digits = 2)
mydata2$std_res_Murder <- round(mydata2$std_res_Murder, digits = 2)
mydata2$std_res_Rape <- round(mydata2$std_res_Rape, digits = 2)
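
A quick sanity check that no capped value lies outside the bounds (a sketch):

sapply(mydata2[c("std_res_Murder", "std_res_Assault", "std_res_Rape")], range)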

Let’s also calculate Cook’s distances for each variable:

# influence.measures() has no $cooks.distance element (that returns NULL);
# cooks.distance() computes the distances directly
cooks_dis_Murder <- cooks.distance(lm_model_Murder)

cooks_dis_Rape <- cooks.distance(lm_model_Rape)

cooks_dis_Assault <- cooks.distance(lm_model_Assault)

With the distances in hand, we can check whether any unit exceeds a common cut-off (for example 4/n, or 1 for a stricter rule). Values below the cut-off are not influential in the context of the regression analysis, and there is no need to be concerned about their impact on the model.
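
A sketch of such a check for the Murder model, using the 4/n cut-off:

n <- nrow(mydata2)
which(cooks_dis_Murder > 4 / n) #influential units, if any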

lm_model_all <- lm(UrbanPop ~ Murder + Assault + Rape, data = mydata2)
summary(lm_model_all)
## 
## Call:
## lm(formula = UrbanPop ~ Murder + Assault + Rape, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.456  -6.950   0.077   7.770  25.221 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.84187    4.82483  10.952 2.09e-14 ***
## Murder      -1.41154    0.71954  -1.962   0.0559 .  
## Assault      0.05190    0.04161   1.247   0.2186    
## Rape         0.69841    0.26776   2.608   0.0122 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.08 on 46 degrees of freedom
## Multiple R-squared:  0.2337, Adjusted R-squared:  0.1837 
## F-statistic: 4.676 on 3 and 46 DF,  p-value: 0.006208

Interpretation of some values: Estimated coefficient for Murder (-1.412): on average, an increase of one murder arrest per 100,000 residents is associated with a 1.412 percentage-point decrease in urban population, holding the other variables constant.

Estimated coefficient for Assault (0.052): on average, an increase of one assault arrest per 100,000 residents is associated with a 0.052 percentage-point increase in urban population, holding the other variables constant; this coefficient is not statistically significant (p = 0.219).

Standard error for Murder (0.720): this measures the uncertainty in the Murder coefficient estimate; a large standard error relative to the estimate indicates an imprecise coefficient.

H0: the regression coefficient of Murder is 0 (murder arrests per 100,000 residents are not linearly associated with the percentage of urban population)
H1: the regression coefficient of Murder is not 0

alpha = 0.05, p = 0.056

Conclusion: we cannot reject the null hypothesis, because p = 0.056 is greater than alpha.
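
Confidence intervals for the coefficients make the borderline Murder result easy to see, since its interval should just barely cover 0 (a sketch):

confint(lm_model_all, level = 0.95)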


### Task 2

A researcher examined the effects of online schooling on body weight in ninth graders (Body mass.csv in Task 2 folder). Based on data from the 2018/2019 school year, the average weight was 59.5 kg. At the beginning of the 2021/2022 school year, researchers randomly selected 50 ninth graders and measured their weight.

mydata_T2 <- read.table("~/R data/TakeHomeExam/Body mass.csv", header = TRUE, sep = ";", dec = ",") 
head(mydata_T2)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
library(psych)
describe(mydata_T2$Mass)
##    vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 50 62.88 6.01   62.8   62.56 3.34 49.7 83.2  33.5 0.85     2.11 0.85

Interpretation of results:

- Mean: The average weight of the 50 randomly selected ninth graders in the 2021/2022 school year was 62.88 kg.
- Standard deviation: On average, the weights of the ninth graders deviate from the mean (62.88 kg) by approximately 6.01 kg.
- Median: 50% of the ninth graders weighed 62.8 kg or less, and 50% weighed more.
- Min: The lightest ninth grader in 2021/2022 weighed 49.7 kg.
- Max: The heaviest ninth grader in 2021/2022 weighed 83.2 kg.
- Range: The difference between the maximum and minimum weight is 33.5 kg.
- Standard error: The standard error of 0.85 quantifies the sampling variability of the mean weight.

library(ggplot2)
hist(mydata_T2$Mass,
     main = "Histogram of Online Schooling 
     effect on Body Mass of Ninth graders in 2021/2022",
     xlab = "Body Mass",
     ylab = "Frequency",
     breaks = seq(from = 40, to = 90, by = 4))

It looks like the values are roughly symmetric and approximately normally distributed, with a slight right tail.
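
Given the skewness (0.85) and kurtosis (2.11) reported by describe(), a formal normality check may still be prudent before the t-test (a sketch; output omitted):

shapiro.test(mydata_T2$Mass)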

hypothesized_mean <- 59.5
result <- t.test(mydata_T2$Mass, mu = hypothesized_mean)
print(result)
## 
##  One Sample t-test
## 
## data:  mydata_T2$Mass
## t = 3.98, df = 49, p-value = 0.0002
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.17 64.59
## sample estimates:
## mean of x 
##     62.88

H0: 𝜇 = 59.5
H1: 𝜇 ≠ 59.5

alpha = 0.05, p < 0.001

At the 5% significance level we reject the null hypothesis (p < 0.001): the average body mass of ninth graders in 2021/2022 differs from 59.5 kg.

mean_21_22 <- mean(mydata_T2$Mass)
mean_18_19 <- 59.5

sd_21_22 <- sd(mydata_T2$Mass)

cohen_d <- (mean_21_22 - mean_18_19) / sd_21_22

cat("Cohen's d:", cohen_d, "\n")
## Cohen's d: 0.5615994

Explanation: A Cohen’s d of 0.56 indicates a medium effect size by conventional benchmarks, meaning that the difference between the average body mass of ninth graders in 2021/2022 and the 2018/2019 reference value is substantial and not just a result of random variability. The data are consistent with online schooling being associated with higher body mass in ninth graders, although this observational comparison cannot by itself establish causation.
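
As a cross-check, the same one-sample effect size can be computed with a helper function; this sketch assumes the lsr package is available:

#install.packages("lsr")
library(lsr)
cohensD(mydata_T2$Mass, mu = 59.5)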


### Task 3

You analyze the price per m2 for a sample of apartments in the Ljubljana region (Apartments.xlsx in Task 3 folder). Follow the questions in R Markdown included in the folder.

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("~/R data/TakeHomeExam/Apartments.xlsx")
View(Apartments)

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking, 
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

Apartments$Balcony <- factor(Apartments$Balcony, 
                          levels = c(0, 1),
                          labels = c("No", "Yes"))
head(Apartments)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl> <fct>   <fct>  
## 1     7       28  1640 No      Yes    
## 2    18        1  2800 Yes     No     
## 3     7       28  1660 No      No     
## 4    28       29  1850 No      Yes    
## 5    18       18  1640 Yes     Yes    
## 6    28       12  1770 No      Yes
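
A quick check that the factor conversion worked and how the categories are distributed (a sketch; counts omitted here):

summary(Apartments[c("Parking", "Balcony")])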

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

hypothesized_mean <- 1900
result <- t.test(Apartments$Price, mu = hypothesized_mean)
print(result)
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0: mu = 1900 eur
H1: mu != 1900 eur

alpha = 0.05, p = 0.0047

At the 5% significance level we can reject the null hypothesis (p = 0.0047 < 0.05). The average price of apartments is not equal to 1900 eur.

mean(Apartments$Price)
## [1] 2018.941

The average price of the apartments is 2018.94 eur per m2.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Estimate of the regression coefficient: on average, the price decreases by approximately 8.975 eur per m2 for each additional year of apartment age.

Coefficient of correlation: r = -sqrt(0.05302) ≈ -0.23 (negative because the slope is negative), indicating a weak negative linear relationship between Age and Price.

Coefficient of determination (multiple R-squared): R-squared ≈ 0.053, so Age explains only about 5.3% of the variability in apartment prices; a higher value would indicate a stronger relationship. The adjusted R-squared (0.042) additionally penalizes model complexity. The low R-squared together with the significant Age coefficient suggests that, while the relationship between age and price is statistically significant, age alone is a weak predictor of apartment prices and other factors also matter.
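
The correlation coefficient can also be computed directly; its square should match the R-squared above (a sketch):

cor(Apartments$Age, Apartments$Price) #expected to be about -0.23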

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

pairs(Apartments[c("Price", "Age", "Distance")])

There is no problem with multicollinearity: the scatterplot between the explanatory variables Age and Distance shows no clear linear pattern (the points are scattered and do not form a line).
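
A correlation matrix gives a numeric complement to the visual check (a sketch):

cor(Apartments[c("Price", "Age", "Distance")])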

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with VIF statistics. Explain the findings.

#install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
vif_values <- vif(fit2)
print(vif_values)
##      Age Distance 
## 1.001845 1.001845

Both VIF values are close to 1 and well below the usual threshold of 5, which means there is essentially no multicollinearity between the explanatory variables in the regression model.

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3) 
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

head(Apartments[order(Apartments$StdResid),], 6) #Six units with lowest values of stand. residuals
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     7        2  1760 No      Yes        -2.15  0.066
## 2    12       14  1650 No      Yes        -1.50  0.013
## 3    12       14  1650 No      No         -1.50  0.013
## 4    13        8  1800 No      No         -1.38  0.012
## 5    14       16  1660 No      Yes        -1.26  0.008
## 6    24        5  1830 Yes     No         -1.19  0.012
head(Apartments[order(-Apartments$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32 
## 2    43       37  1740 No      No          1.44  0.104
## 3     2       11  2790 Yes     No          2.05  0.069
## 4     7        2  1760 No      Yes        -2.15  0.066
## 5    37        3  2540 Yes     Yes         1.58  0.061
## 6    40        2  2400 No      Yes         1.09  0.038
library(dplyr)
# Note: filtering on Age alone removes every apartment aged 5 or 7 years
# (six units in total), not only the two units flagged above; a more
# targeted filter would use thresholds on StdResid or CooksD
Apartments <- Apartments %>%
  filter(!Age %in% c(7, 5))
head(Apartments[order(Apartments$StdResid),], 6) #Six units with lowest values of stand. residuals
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1    12       14  1650 No      Yes        -1.50  0.013
## 2    12       14  1650 No      No         -1.50  0.013
## 3    13        8  1800 No      No         -1.38  0.012
## 4    14       16  1660 No      Yes        -1.26  0.008
## 5    24        5  1830 Yes     No         -1.19  0.012
## 6    30       17  1560 No      No         -1.10  0.012
head(Apartments[order(-Apartments$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1    43       37  1740 No      No          1.44  0.104
## 2     2       11  2790 Yes     No          2.05  0.069
## 3    37        3  2540 Yes     Yes         1.58  0.061
## 4    40        2  2400 No      Yes         1.09  0.038
## 5     8        2  2820 Yes     No          1.66  0.037
## 6     8       26  2300 Yes     Yes         1.57  0.034

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

std_residuals <- scale(residuals(fit2))     # residuals centered and scaled to unit variance
std_fitted_values <- scale(fitted(fit2))    # fitted values centered and scaled likewise

plot(std_fitted_values, std_residuals,
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Scatterplot of Standardized Residuals vs.
     Standardized Fitted Values")

abline(h = 0, col = "red", lty = 2)

The spread of the residuals is roughly constant across the standardized fitted values, so there is no evident violation of homoscedasticity.
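
The visual impression can be backed by a formal test; this sketch assumes the lmtest package (Breusch-Pagan test, H0: constant error variance):

library(lmtest)
bptest(fit2)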

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.93482, p-value = 0.0005609

H0: the standardized residuals are normally distributed
H1: the standardized residuals are not normally distributed

At p < 0.001 we reject the null hypothesis: we cannot claim that the standardized residuals are normally distributed.

On the graph, the distribution looks bimodal and asymmetric to the right.
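
A Q-Q plot makes the departure from normality easy to see (a sketch):

qqnorm(Apartments$StdResid)
qqline(Apartments$StdResid)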

Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -427.09 -227.61  -61.36  212.35  565.71 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2492.357     76.921  32.401  < 2e-16 ***
## Age           -7.836      3.319  -2.361   0.0208 *  
## Distance     -22.946      2.897  -7.919 1.57e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.3 on 76 degrees of freedom
## Multiple R-squared:  0.4978, Adjusted R-squared:  0.4846 
## F-statistic: 37.67 on 2 and 76 DF,  p-value: 4.294e-12

Explanation of coefficients: if the age of an apartment increases by one year, the price decreases on average by 7.836 eur per m2 (p = 0.0208), assuming all other explanatory variables included in the model are held constant. If the distance between the apartment and the city center increases by one kilometer, the price decreases on average by 22.946 eur per m2 (p < 0.001), under the same assumption.
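
Confidence intervals put margins of error around these effects (a sketch; values not shown here):

confint(fit2, level = 0.95)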

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save it in an object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -416.05 -211.78  -23.77  268.37  523.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2376.705     98.508  24.127  < 2e-16 ***
## Age           -6.935      3.290  -2.108   0.0384 *  
## Distance     -20.836      3.015  -6.912 1.44e-09 ***
## ParkingYes   139.959     65.316   2.143   0.0354 *  
## BalconyYes    -7.205     61.651  -0.117   0.9073    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.6 on 74 degrees of freedom
## Multiple R-squared:  0.5276, Adjusted R-squared:  0.502 
## F-statistic: 20.66 on 4 and 74 DF,  p-value: 1.832e-11

With the anova function, check if model fit3 fits the data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     76 5717496                           
## 2     74 5378761  2    338735 2.3301 0.1044

H0: the coefficients of the added variables (Parking and Balcony) are both 0, i.e. fit2 is sufficient
H1: at least one of the added coefficients differs from 0, i.e. fit3 fits the data better

p = 0.1044 > 0.05

We cannot reject the null hypothesis at p = 0.1044: adding Parking and Balcony does not significantly improve the model.
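
An information-criterion comparison points the same way; the model with the lower AIC is preferred (a sketch; values not shown here):

AIC(fit2, fit3)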

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -416.05 -211.78  -23.77  268.37  523.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2376.705     98.508  24.127  < 2e-16 ***
## Age           -6.935      3.290  -2.108   0.0384 *  
## Distance     -20.836      3.015  -6.912 1.44e-09 ***
## ParkingYes   139.959     65.316   2.143   0.0354 *  
## BalconyYes    -7.205     61.651  -0.117   0.9073    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269.6 on 74 degrees of freedom
## Multiple R-squared:  0.5276, Adjusted R-squared:  0.502 
## F-statistic: 20.66 on 4 and 74 DF,  p-value: 1.832e-11

If an apartment has parking, its estimated average price is 139.959 eur per m2 higher than that of a comparable apartment without parking, holding the other variables constant (p = 0.0354). If an apartment has a balcony, its estimated average price is 7.205 eur per m2 lower than that of a comparable apartment without a balcony, holding the other variables constant; this effect is not statistically significant (p = 0.907).

H0: all regression coefficients in the model are equal to 0
H1: at least one regression coefficient in the model is not equal to 0

At p < 0.001 we can reject the null hypothesis and conclude that at least one of the predictors in the model has a statistically significant effect on the price. The F-statistic tests whether the regression model as a whole is useful for predicting the dependent variable; in our case, it indicates that the model is useful.

Save fitted values and calculate the residual for apartment ID2.

fitted_values <- predict(fit3)
# The Apartments data has no ID column; column 1 is Age, so this matches the
# apartment with Age == 2, which is treated as apartment ID2 here
apartment_id2_index <- which(Apartments[ , 1] == 2)
fitted_value_id2 <- fitted_values[apartment_id2_index]
residual_id2 <- Apartments$Price[apartment_id2_index] - fitted_value_id2
cat("Fitted Value for Apartment ID:", fitted_value_id2, "\n")
## Fitted Value for Apartment ID: 2273.602
cat("Residual for Apartment ID2:", residual_id2, "\n")
## Residual for Apartment ID2: 516.3984

The actual price of apartment ID2 is 516.40 eur per m2 higher than what the regression model predicts for that apartment based on its features (age, distance, parking, balcony). This positive residual suggests that apartment ID2 is more expensive than the model expects, given its characteristics and the relationships captured by the model.
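
The fitted model can also be used for prediction on new data. A sketch with a hypothetical apartment (the predictor values below are made up for illustration):

new_apt <- data.frame(Age = 10, Distance = 5,
                      Parking = factor("Yes", levels = c("No", "Yes")),
                      Balcony = factor("No", levels = c("No", "Yes")))
predict(fit3, newdata = new_apt) #predicted price per m2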