R take-home exam

Task 1

Larisa Omanović

This dataset comprises demographic and academic information of individuals, including gender, age, field of study, year of study, and CGPA. Additionally, it includes marital status and responses to questions related to mental health, such as depression, anxiety, panic attacks, and whether individuals sought specialist treatment. I am going to analyze the interplay between academic performance, demographics, and mental health factors among students or individuals in various fields of study.

healthdata <- read.csv("healthdata.csv")
healthdata2 <- healthdata [1:50, ] #I took the first 50 rows.

healthdata2 <- healthdata2[ , -c(1, 7, 9, 10, 11)] #I lowered the number of columns as some were not important, others were similar (eg. anxiety and panic attack).

colnames(healthdata2) <- c("gender", "age", "course", "study year", "CGPA", "depression") #Renamed the columns.

healthdata2$course[healthdata2$course %in% c("Pendidikan islam", "Irkhs", "KIRKHS")] <- "Islamic education" #I changed this specific data.

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

healthdata2 <- healthdata2 %>%
  mutate(course = case_when(
    course %in% c("Engineering", "Engine", "KOE") ~ "Engineering",
    course %in% c("Mathemathics", "ENM") ~ "Maths",
    course %in% c("BCS", "BIT") ~ "Computer science", 
    course %in% c("Business", "Human Resources", "Accounting", "Banking Studies", "Business Administration", "KENMS") ~ "Business/Economics",
    course %in% c("Laws", "Law") ~ "Law",
    course %in% c("Psychology", "Marine science", "Islamic education") ~ "Social Sciences",
    course %in% c("Usuluddin", "Marine", "TAASL") ~ "Humanities",
    TRUE ~ "Other")) #I have regrouped the courses so I can get analyse more accurately.

#install.packages("tidyr")
library(tidyr)
healthdata2 <- drop_na(healthdata2, age) #I have removed the missing unit.

library(dplyr)
healthdata2 <- healthdata2 %>%
  mutate(CGPA = case_when(
    CGPA == '0 - 1.99' ~ 1,
    CGPA == '2.00 - 2.49' ~ 2,
    CGPA == '2.50 - 2.99' ~ 3,
    CGPA == '3.00 - 3.49' ~ 4,
    CGPA == '3.50 - 4.00' ~ 5)) #I regrouped the CGPAs.

healthdata2[7, 5] <- 5 #Changed the value manually according to the data set.

library(stringr)
healthdata2$"study year" <- str_to_title(healthdata2$"study year") #I unified the writings.
healthdata2$`study year` <- as.numeric(str_extract(healthdata2$`study year`, "\\d+"))

print(healthdata2)

##    gender age             course study year CGPA depression
## 1  Female  18        Engineering          1    4        Yes
## 2    Male  21    Social Sciences          2    4         No
## 3    Male  19   Computer science          1    4        Yes
## 4  Female  22                Law          3    4        Yes
## 5    Male  23              Maths          4    4         No
## 6    Male  19        Engineering          2    5         No
## 7  Female  23    Social Sciences          2    5        Yes
## 8  Female  18   Computer science          1    5         No
## 9  Female  19 Business/Economics          2    3         No
## 10   Male  18    Social Sciences          1    5         No
## 11 Female  20    Social Sciences          1    5         No
## 12 Female  24        Engineering          3    5        Yes
## 13 Female  18   Computer science          1    4        Yes
## 14   Male  19        Engineering          1    4         No
## 15 Female  18 Business/Economics          2    5         No
## 16   Male  24   Computer science          3    5         No
## 17 Female  24              Other          3    4         No
## 18 Female  24              Maths          4    4        Yes
## 19 Female  20   Computer science          2    5         No
## 20 Female  18    Social Sciences          2    5        Yes
## 21 Female  19        Engineering          1    4         No
## 22 Female  18        Engineering          2    4         No
## 23 Female  24   Computer science          1    5         No
## 24 Female  24        Engineering          1    4         No
## 25 Female  23   Computer science          3    5        Yes
## 26 Female  18 Business/Economics          1    5         No
## 27 Female  19        Engineering          1    5         No
## 28   Male  18        Engineering          2    4        Yes
## 29 Female  24   Computer science          3    5        Yes
## 30 Female  24   Computer science          4    5         No
## 31 Female  23 Business/Economics          2    4         No
## 32   Male  18   Computer science          2    4         No
## 33   Male  19   Computer science          1    5         No
## 34   Male  18   Computer science          2    5        Yes
## 35 Female  19   Computer science          1    4        Yes
## 36 Female  18        Engineering          1    2         No
## 37 Female  18                Law          3    4        Yes
## 38 Female  19   Computer science          1    3        Yes
## 39 Female  18    Social Sciences          1    5         No
## 40 Female  24        Engineering          2    3        Yes
## 41 Female  24   Computer science          3    4         No
## 42 Female  22        Engineering          4    5         No
## 43 Female  20              Other          2    4        Yes
## 44   Male  23         Humanities          2    5         No
## 45   Male  18   Computer science          1    5         No
## 46 Female  19        Engineering          1    5         No
## 47 Female  18        Engineering          4    5         No
## 48   Male  24   Computer science          2    4        Yes
## 49 Female  24   Computer science          3    5         No

This is now the cleaned data. Now I start with the analysis.

First I will calculate the average age participants and then the median:

mean(healthdata2$age)

## [1] 20.53061

round(mean(healthdata2$age), 0)

## [1] 21

The average age rounded to whole years is 21.

median(healthdata2$age)

## [1] 19

The most common age of participants is 19.

About CGPA:

round(summary(healthdata2$CGPA), 2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    4.00    5.00    4.41    5.00    5.00

This shows us that less than 50 or 50% of participants have a CPGA of 5.00. The result of the 1st Quartile is 4.00, which means 25% or less of the participants have a CGPA of 4.00.

#install.packages("pastecs")
library(pastecs)

## Warning: package 'pastecs' was built under R version 4.2.3

## 
## Attaching package: 'pastecs'

## The following object is masked from 'package:tidyr':
## 
##     extract

## The following objects are masked from 'package:dplyr':
## 
##     first, last

numericaldata <- healthdata2[, c("age", "study year", "CGPA")]
round(stat.desc(numericaldata), 1)

##                 age study year  CGPA
## nbr.val        49.0       49.0  49.0
## nbr.null        0.0        0.0   0.0
## nbr.na          0.0        0.0   0.0
## min            18.0        1.0   2.0
## max            24.0        4.0   5.0
## range           6.0        3.0   3.0
## sum          1006.0       98.0 216.0
## median         19.0        2.0   5.0
## mean           20.5        2.0   4.4
## SE.mean         0.4        0.1   0.1
## CI.mean.0.95    0.7        0.3   0.2
## var             6.4        1.0   0.5
## std.dev         2.5        1.0   0.7
## coef.var        0.1        0.5   0.2

The number of valid observations for each variable was 49. There were no missing values for any of the variable. No NA`s neither. The range between max and min values are seen in the row named range. SE.mean represents the error of the mean. For example if we would repeat sample 49 observations from column Age and calculate the mean, those would be approximately +/- 0.4 units of the mean I have calculated now. The Cl.mean.0.095 shows us that for example CGPA, which has variable +/- 0.2, is with 95% confidence likely to fall within the range 4.2 to 4.6. Std.dev or standard deviation shows the average distance between our values (CGPA is 0.7, which means values in data set typically differ from the mean by 0.7 units).

library(psych)

## Warning: package 'psych' was built under R version 4.2.3

describe (healthdata2)

##             vars  n  mean   sd median trimmed  mad min max range  skew kurtosis
## gender*        1 49  1.29 0.46      1    1.24 0.00   1   2     1  0.92    -1.18
## age            2 49 20.53 2.53     19   20.44 1.48  18  24     6  0.36    -1.67
## course*        3 49  3.47 2.20      3    3.27 1.48   1   8     7  1.09    -0.24
## study year     4 49  2.00 1.00      2    1.90 1.48   1   4     3  0.61    -0.80
## CGPA           5 49  4.41 0.70      5    4.51 0.00   2   5     3 -1.08     1.06
## depression*    6 49  1.37 0.49      1    1.34 0.00   1   2     1  0.53    -1.75
##               se
## gender*     0.07
## age         0.36
## course*     0.31
## study year  0.14
## CGPA        0.10
## depression* 0.07

The “gender” variable has a moderate positive skew, indicating that it is right-skewed, with a longer tail on the right side of the distribution. The kurtosis value of -1.18 suggests that the distribution has slightly heavier tails compared to a normal distribution. The “CGPA” variable demonstrates a moderate negative skew, indicating that it is left-skewed with a longer tail on the left side. The kurtosis value of 1.06 suggests that the distribution has slightly heavier tails compared to a normal distribution.

Now I will graph the distribution of the variables:

hist(healthdata2$age,
     main = "Age spread",
     ylab = "frequency",
     xlab = "age",
     breaks = seq(from = 18, to = 24, by = 1), col = "blue")

We see that this histogram has two peaks - is bimodal. That means there are two groups that have the most frequency of occurrence - group 18 and 24.

gendercount <- table(healthdata2$gender)
pie(gendercount, labels = paste(names(gendercount), ": ", gendercount), main = "Gender distribution", col = c("pink", "blue"))

We can see that the majority of students in this data set were women.

healthdata2$depression <- ifelse(healthdata2$depression == "Yes", 1, 0)

#install.packages("ggplot2")
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

scatter_plot <- ggplot(healthdata2, aes(x = CGPA, y = depression)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "pink") + 
  labs(x = "CGPA", y = "Depression") +
  theme_minimal()
print(scatter_plot)

## `geom_smooth()` using formula 'y ~ x'

Graph shows that bigger the GPA less depressed the students are.

healthdata2$depression <- ifelse(healthdata2$depression == 1, "Yes", "No")

library(ggplot2)
library(dplyr)
summary_data <- healthdata2 %>%
  group_by(gender, depression) %>%
  summarise(count = n(), .groups = "drop")
ggplot(summary_data, aes(x = gender, y = count, fill = depression)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Gender and Depression",
       x = "Gender",
       y = "Count") +
  scale_fill_manual(values = c("No" = "blue", "Yes" = "red")) +
  theme_minimal()

ggplot(healthdata2, aes(x = gender, fill = depression)) +
  geom_bar(position = "fill") +
  labs(title = "Ratio of Depression by Gender", y = "Proportion", fill = "Depression") +
  scale_fill_manual(values = c("Yes" = "red", "No" = "blue")) +
  theme_minimal()

From this ratio we can see that females are more inclined to be depressed than male population.

library(ggplot2)
course_depression <- as.data.frame(table(healthdata2$course, healthdata2$depression))
colnames(course_depression) <- c("Course", "Depression", "Count")
course_depression <- course_depression %>%
  group_by(Course) %>%
  mutate(Proportion = Count / sum(Count))

ggplot(course_depression, aes(x = Course, y = Proportion, fill = Depression)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Proportion of Students with Depression by Course",
    x = "Course",
    y = "Proportion") +
  scale_fill_manual(values = c("Yes" = "red", "No" = "blue")) +
  theme_minimal() +
  coord_flip()

We can see that among the most depressed are the law students, followed by maths and other.

ggplot(healthdata2, aes(x = depression, y = CGPA, fill = depression)) +
  geom_boxplot() +
  labs(
    title = "Depression vs. CGPA",
    x = "Depression",
    y = "CGPA") +
  scale_fill_manual(values = c("No" = "lightblue", "Yes" = "pink")) +
  theme_minimal()

Both depressed and not depressed students have CGPA values from 3-5. But the difference we see is in the central tendency. The line, which shows the median, falls at 5 for those not depressed, and for those depressed at 4. So as we can see, in average the lesser the CGPA, more depressed students are. This also shows to the skewness of the data. Because the median with depressed students is at 1st quartile, the data is left-skewed. Also the whiskers are not that long, which shows we have a more concentrated data distribution. We have one outlier - 2 CGPA.

ggplot(healthdata2, aes(x = depression, y = CGPA, fill = depression)) +
  geom_boxplot(outlier.shape = NA) +
  labs(
    title = "Depression vs. CGPA",
    x = "Depression",
    y = "CGPA") +
  scale_fill_manual(values = c("No" = "lightblue", "Yes" = "pink")) +
  theme_minimal()

Here I took out the outlier that extends beyond the whiskers, because it does not carry any valuable information for my data. Identifying the outliers is important for understanding, that there are some extreme values in my dataset. The fact that there was only one outlier and short whisks suggest that this is high quality data.

ggplot(healthdata2, aes(x=age, y= CGPA)) +
  geom_point(size = 2, color = "blue") +
  labs(
    title = "Scatterplot of Age vs. CGPA",
    x = "Age",
    y = "CGPA")

There does not seem to be a correlation between age and CGPA because the data is not clustered along an obvious line.

Task 2

massdata <- read.csv("massdata.csv", sep = ";")
massdata$Mass <- as.numeric(gsub(",",".",massdata$Mass))
str(massdata)

## 'data.frame':    50 obs. of  2 variables:
##  $ ID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Mass: num  62.1 64.5 56.5 53.4 61.3 62.2 62.7 64.5 59.5 68.9 ...

summary(massdata$Mass)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.70   60.23   62.80   62.88   64.50   83.20

library(ggplot2)
histogram <- ggplot(massdata, aes(x=Mass)) + geom_histogram(binwidth = 2, fill = "red", color = "black") + labs( title = "Body Mass Distribution", x = "Body Mass", y = "Frequency")

print(histogram)

library(ggplot2)
hist(massdata$Mass,
     main="Body Mass Distribution",
     ylab = "Frequency",
     xlab = "Body Mass",)

Mass <- c(62.1, 64.5, 56.5, 53.4, 61.3, 62.2, 62.7, 64.5, 59.5, 68.9,
  64.5, 62.8, 63.7, 52.3, 64.1, 64.2, 64.2, 60.2, 59.6, 60.9,
  54.2, 72.1, 60.3, 63.1, 59.8, 49.7, 58.6, 58.7, 62, 65.7,
  64.2, 62.7, 78.5, 61.6, 60.4, 63.3, 69.3, 63, 60.8, 64.9,
  62.8, 57.3, 69.3, 62.8, 62.9, 53.6, 65.2, 75.3, 83.2, 66.4)

t.test(massdata$Mass,
       mu = 59.5,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  massdata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

In the output of the t-test we got the p-value = 0.000234, which is lower than than 0.05. Because of this we can reject the null hypothesis. This means that the online schooling may have had and impact on the body mass of ninth graders.

library(effectsize)

## Warning: package 'effectsize' was built under R version 4.2.3

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

cohens_d (massdata$Mass, mu = 59.5)

## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.

interpret_cohens_d(0.52)

## [1] "medium"
## (Rules: cohen1988)

This value indicates that there is moderate impact of online schooling on the weight of ninth graders. This is because the medium effect size is from 0.5 to 0.8.

Task 3

Import the dataset Apartments.xlsx

#install.packages("openxlsx")
library(openxlsx)

## Warning: package 'openxlsx' was built under R version 4.2.3

Apartments <- read.xlsx ("Apartments.xlsx")
Apartments <- as.data.frame(Apartments)

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking)

Apartments$Balcony <- factor(Apartments$Balcony)

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments$Price,
       mu = 1900,
       alternative ="two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

With the p - value lower than 0.05 we can reject the H0 hypothesis.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

(fit1 <- lm(data = Apartments, Price ~ Age))

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Coefficients:
## (Intercept)          Age  
##    2185.455       -8.975

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

save(fit1, file = "fit1.RData")

Regression coefficient (Age): For each additional year of age of an apartment the estimated price is expected to decrease by approximately $8.975.

Coefficient of correlation: This cannot be determined we need an additional test.

Coefficient of determination (R-squared): Multiple R-squared value 0.05302 indicates that 5.30% of the variability in apartment prices can be explained by the age of the apartments.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

#install.packages("car")
library(car)

## Warning: package 'car' was built under R version 4.2.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.2.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplotMatrix(~ Price + Age + Distance, data = Apartments)

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

(fit2 <- lm(data = Apartments, Price ~ Age + Distance))

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Coefficients:
## (Intercept)          Age     Distance  
##    2460.101       -7.934      -20.667

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

save(fit2, file = "fit2.RData")

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Here we have overall VIF values close to 1 and this indicates low multicollinearity, these two variables are not highly correlated with each other.

#another option
library(mctest)
mctest(fit2)

## 
## Call:
## omcdiag(mod = mod, Inter = TRUE, detr = detr, red = red, conf = conf, 
##     theil = theil, cn = cn)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.9982         0
## Farrar Chi-Square:         0.1520         0
## Red Indicator:             0.0429         0
## Sum of Lambda Inverse:     2.0037         0
## Theil's Method:           -0.4359         0
## Condition Number:          5.0320         0
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

imcdiag(fit2)

## 
## Call:
## imcdiag(mod = fit2)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##             VIF    TOL     Wi  Fi Leamer   CVIF Klein  IND1 IND2
## Age      1.0018 0.9982 0.1531 Inf 0.9991 1.0231     0 0.012    1
## Distance 1.0018 0.9982 0.1531 Inf 0.9991 1.0231     0 0.012    1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## * all coefficients have significant t-ratios
## 
## R-square of y on all x: 0.4396 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Here we can also see that both variables are close to 1 which suggests low multicollinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

library(car)
standardized_residuals <- resid(fit2) / sqrt(1-hatvalues(fit2))

cooks_distance <- cooks.distance(fit2)

residuals_cooks <- data.frame(Standardized_residuals = standardized_residuals, Cooks_Distances = cooks_distance)

head(residuals_cooks)

##   Standardized_residuals Cooks_Distances
## 1              -190.4841     0.007386569
## 2               510.5413     0.030365432
## 3              -169.9897     0.005882612
## 4               215.9729     0.008299153
## 5              -307.3056     0.005112584
## 6              -222.5976     0.004900891

cook_threshold <- 4 / length(cooks_distance)
outliers <- which(cooks_distance > cook_threshold)

Aparments_clean <- Apartments[-outliers,]

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

standardized_residuals <- resid(fit2) /sqrt(1 - hatvalues(fit2))
standardized_fitted_values <- fitted(fit2) / sqrt(1 - hatvalues(fit2))

plot(
  x = standardized_fitted_values,
  y = standardized_residuals,
  xlab = "Standardized Fitted Values",
  ylab = "Standardized Residuals",
  main = "Scatterplot of Standardized Residuals vs. Standardized Fitted Values")

Here we can see heteroskedasticity, because ass fitted values increase the spread of residuals widens.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(standardized_residuals, main = "Histogram of Standardized Residuals", xlab = "Standardized Residuals")

qqnorm(standardized_residuals)
qqline(standardized_residuals)

shapiro.test(standardized_residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  standardized_residuals
## W = 0.95306, p-value = 0.00366

Here we see that the p- value is lower than 0.05 so we can say that the residuals are not normally distributed. This can also be seen from the Q-Q plot and the histogram.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2_clean <- lm(Price ~ Age + Distance, data = Apartments)

summary(fit2_clean)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Regression coefficient (Age): For each additional year of age of an apartment, the estimated price of the apartment is expected to drop by $7.934.

Regression coefficient (Distance): For each additional unit of distance the estimated price of the apartment is expected to drop by $20.667.

Coefficient of determination (R-squared): Multiple R-squared = Indicates that 43.96% of the variability in apartment prices is explained by the age and the distance of the apartments. The adjusted R-Squared = It accounts for the number of predictors in the model.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

Apartments$Parking <- factor(Apartments$Parking, levels = c(0, 1))
Apartments$Balcony <- factor(Apartments$Balcony, levels = c(0, 1))

(fit3 <- lm(data = Apartments, Price ~ Age + Distance + Parking + Balcony))

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Coefficients:
## (Intercept)          Age     Distance     Parking1     Balcony1  
##    2301.667       -6.799      -18.045      196.168        1.935

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

save(fit3, file = "fit3.RData")

With function anova check if model fit3 fits data better than model fit2.

anova <- anova(fit2,fit3)

print(anova)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A p-value lower than 0.05 in the model fit3 suggests that this model fits the data significantly better than fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1     196.168     62.868   3.120  0.00251 ** 
## Balcony1       1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Regression coefficient (Parking1): It represents the difference in the estimated mean price of apartments between those with parking (Parking1 = 1) and those without parking (Parking1 = 0). In this case the estimated mean price of apartments with parking is expected to be $196.168 higher than the estimated mean price of apartments without parking.

Regression coefficient (Balcony1): represents the difference in the estimated mean price of apartments between those with a balcony (Balcony1 = 1) and those without a balcony (Balcony1 = 0), in this case, the estimated mean price of apartments with a balcony is expected to be $1.935 higher than the estimated mean price of apartments without a balcony.

H0: the model with these predictors is not a better fit for the data than a model with no predictors.

H1: The model with these predictors is a better fit for the data than a model with no predictors.

In this case where the p-value is significant we would reject the null hypothesis.

Save fitted values and claculate the residual for apartment ID2.

save_fitted_values <- fitted(fit3)

apartment_id <- 2
observed_value <- Apartments$Price[Apartments$ApartmentID == apartment_id]

residual <- observed_value - save_fitted_values[Apartments$ApartmentID == apartment_id]

residual

## numeric(0)

R take-home exam Omanović

Larisa Omanović

2023-09-21

R take-home exam

Task 1

Larisa Omanović

Task 2

Task 3

Import the dataset Apartments.xlsx

Change categorical variables into factors.

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

Chech the multicolinearity with VIF statistics. Explain the findings.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

With function anova check if model fit3 fits data better than model fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

Save fitted values and claculate the residual for apartment ID2.