TASK1

#install.packages("readr")
library(readr)

mydata <- read.table("./Sales.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".")

head(mydata)

##    Brands       Models        Colors Memory Storage Camera Rating Selling.Price
## 1 SAMSUNG GALAXY M31S   Mirage Black   8 GB  128 GB    Yes    4.3         19330
## 2   Nokia          3.2         Steel   2 GB   16 GB    Yes    3.8         10199
## 3  realme          C2  Diamond Black   2 GB            Yes    4.4          6999
## 4 Infinix       Note 5      Ice Blue   4 GB   64 GB    Yes    4.2         12999
## 5   Apple   iPhone 11          Black    4GB   64 GB    Yes    4.6         49900
## 6  GIONEE         L800         Black   8 MB   16 MB    Yes    4.0          2199
##   Original.Price               Mobile Discount discount.percentage
## 1          20999 SAMSUNG GALAXY M31S      1669            7.947998
## 2          10199            Nokia 3.2        0            0.000000
## 3           7999           realme C2      1000           12.501563
## 4          12999       Infinix Note 5        0            0.000000
## 5          49900     Apple iPhone 11         0            0.000000
## 6           2199          GIONEE L800        0            0.000000

str(mydata)

## 'data.frame':    3114 obs. of  12 variables:
##  $ Brands             : chr  "SAMSUNG" "Nokia" "realme" "Infinix" ...
##  $ Models             : chr  "GALAXY M31S " "3.2" "C2 " "Note 5" ...
##  $ Colors             : chr  "Mirage Black" "Steel" "Diamond Black" "Ice Blue" ...
##  $ Memory             : chr  "8 GB" "2 GB" "2 GB" "4 GB" ...
##  $ Storage            : chr  "128 GB" "16 GB" "" "64 GB" ...
##  $ Camera             : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Rating             : num  4.3 3.8 4.4 4.2 4.6 4 NA 4.6 4.2 4.3 ...
##  $ Selling.Price      : int  19330 10199 6999 12999 49900 2199 99900 42999 20400 21736 ...
##  $ Original.Price     : int  20999 10199 7999 12999 49900 2199 99900 47900 20400 22999 ...
##  $ Mobile             : chr  "SAMSUNG GALAXY M31S " "Nokia 3.2" "realme C2 " "Infinix Note 5" ...
##  $ Discount           : int  1669 0 1000 0 0 0 0 4901 0 1263 ...
##  $ discount.percentage: num  7.95 0 12.5 0 0 ...

Variables

Brands: The brand name of the smartphone (categorical).
Models: The model name or version of the smartphone (categorical).
Colors: The color of the smartphone (categorical).
Memory: The amount of RAM (Random Access Memory) in the smartphone (categorical)
Storage: The internal storage capacity of the smartphone (categorical)
Camera: Whether the smartphone has a camera or not (categorical, Yes/No).
Rating: The average user rating of the smartphone, typically on a scale from 1 to 5 (numeric).
Selling.Price: The price at which the smartphone was sold (numeric).
Original.Price: The original price of the smartphone before any discounts (numeric).
Mobile: The concatenated brand and model of the smartphone (categorical).
Discount: The absolute amount of discount offered on the smartphone (numeric).
discount.percentage: The percentage discount on the smartphone (numeric).

set.seed(123)


mydata_sample <- mydata[sample(nrow(mydata), 20), ]


head(mydata_sample)

##       Brands       Models        Colors Memory               Storage Camera
## 2463  realme         C15     Power Blue   3 GB                 32 GB    Yes
## 2511     HTC   Desire 820  Saffron Grey   2 GB                 16 GB    Yes
## 2227  Lenovo          A7           Blue   2 GB                 32 GB    Yes
## 526  SAMSUNG Guru Music 2         Black        Expandable Upto 16 GB    Yes
## 195    Nokia            5        Silver   3 GB                 16 GB    Yes
## 2986    POCO           X2 Matrix Purple   8 GB                256 GB    Yes
##      Rating Selling.Price Original.Price               Mobile Discount
## 2463    4.3          9999          11999          realme C15      2000
## 2511    4.1         25999          25999       HTC Desire 820        0
## 2227    4.0          6950           8490           Lenovo A7      1540
## 526     4.3          1625           1625 SAMSUNG Guru Music 2        0
## 195     4.1          7139           7139              Nokia 5        0
## 2986    4.5         21499          22999              POCO X2     1500
##      discount.percentage
## 2463           16.668056
## 2511            0.000000
## 2227           18.138987
## 526             0.000000
## 195             0.000000
## 2986            6.522023

Random sample of 20 rows from my original dataset.

Data manipulation

head(mydata_sample)

##       Brands       Models        Colors Memory               Storage Camera
## 2463  realme         C15     Power Blue   3 GB                 32 GB    Yes
## 2511     HTC   Desire 820  Saffron Grey   2 GB                 16 GB    Yes
## 2227  Lenovo          A7           Blue   2 GB                 32 GB    Yes
## 526  SAMSUNG Guru Music 2         Black        Expandable Upto 16 GB    Yes
## 195    Nokia            5        Silver   3 GB                 16 GB    Yes
## 2986    POCO           X2 Matrix Purple   8 GB                256 GB    Yes
##      Rating Selling.Price Original.Price               Mobile Discount
## 2463    4.3          9999          11999          realme C15      2000
## 2511    4.1         25999          25999       HTC Desire 820        0
## 2227    4.0          6950           8490           Lenovo A7      1540
## 526     4.3          1625           1625 SAMSUNG Guru Music 2        0
## 195     4.1          7139           7139              Nokia 5        0
## 2986    4.5         21499          22999              POCO X2     1500
##      discount.percentage
## 2463           16.668056
## 2511            0.000000
## 2227           18.138987
## 526             0.000000
## 195             0.000000
## 2986            6.522023

colnames(mydata_sample)[which(names(mydata_sample) == "Selling.Price")] <- "SalePrice"
colnames(mydata_sample)[which(names(mydata_sample) == "Original.Price")] <- "OriginalPrice"
colnames(mydata_sample)

##  [1] "Brands"              "Models"              "Colors"             
##  [4] "Memory"              "Storage"             "Camera"             
##  [7] "Rating"              "SalePrice"           "OriginalPrice"      
## [10] "Mobile"              "Discount"            "discount.percentage"

filtered_data <- mydata_sample[mydata_sample$Rating > 4 & mydata_sample$Memory == "4 GB", ]
print(filtered_data)

##      Brands       Models        Colors Memory Storage Camera Rating SalePrice
## 1268  Apple   iPhone XS     Space Grey   4 GB   64 GB    Yes    4.7     62999
## 1115 realme         C15     Power Blue   4 GB   64 GB    Yes    4.4     10999
## 953  realme          C35 Glowing Black   4 GB  128 GB    Yes    4.5     12999
## 348  Xiaomi Redmi 6 Pro      Lake Blue   4 GB   64 GB    Yes    4.4     11999
##      OriginalPrice              Mobile Discount discount.percentage
## 1268         89900    Apple iPhone XS     26901            29.92325
## 1115         12999         realme C15      2000            15.38580
## 953          14999          realme C35     2000            13.33422
## 348          11999 Xiaomi Redmi 6 Pro         0             0.00000

mydata_sample$DiscountedPrice <- (mydata_sample$OriginalPrice - mydata_sample$SalePrice) / mydata_sample$OriginalPrice * 100


head(filtered_data)

##      Brands       Models        Colors Memory Storage Camera Rating SalePrice
## 1268  Apple   iPhone XS     Space Grey   4 GB   64 GB    Yes    4.7     62999
## 1115 realme         C15     Power Blue   4 GB   64 GB    Yes    4.4     10999
## 953  realme          C35 Glowing Black   4 GB  128 GB    Yes    4.5     12999
## 348  Xiaomi Redmi 6 Pro      Lake Blue   4 GB   64 GB    Yes    4.4     11999
##      OriginalPrice              Mobile Discount discount.percentage
## 1268         89900    Apple iPhone XS     26901            29.92325
## 1115         12999         realme C15      2000            15.38580
## 953          14999          realme C35     2000            13.33422
## 348          11999 Xiaomi Redmi 6 Pro         0             0.00000

head(mydata_sample)

##       Brands       Models        Colors Memory               Storage Camera
## 2463  realme         C15     Power Blue   3 GB                 32 GB    Yes
## 2511     HTC   Desire 820  Saffron Grey   2 GB                 16 GB    Yes
## 2227  Lenovo          A7           Blue   2 GB                 32 GB    Yes
## 526  SAMSUNG Guru Music 2         Black        Expandable Upto 16 GB    Yes
## 195    Nokia            5        Silver   3 GB                 16 GB    Yes
## 2986    POCO           X2 Matrix Purple   8 GB                256 GB    Yes
##      Rating SalePrice OriginalPrice               Mobile Discount
## 2463    4.3      9999         11999          realme C15      2000
## 2511    4.1     25999         25999       HTC Desire 820        0
## 2227    4.0      6950          8490           Lenovo A7      1540
## 526     4.3      1625          1625 SAMSUNG Guru Music 2        0
## 195     4.1      7139          7139              Nokia 5        0
## 2986    4.5     21499         22999              POCO X2     1500
##      discount.percentage DiscountedPrice
## 2463           16.668056       16.668056
## 2511            0.000000        0.000000
## 2227           18.138987       18.138987
## 526             0.000000        0.000000
## 195             0.000000        0.000000
## 2986            6.522023        6.522023

Created New Variable: A new variable DiscountedPrice was created to show the discount percentage based on the difference between the original and selling prices.

Renamed Variables: The columns Selling.Price and Original.Price were renamed to SalePrice and OriginalPrice for better clarity.

Filtered Data: A new dataset was created containing only smartphones with a rating above 4 and memory of at least 4 GB.

Descripitve statistics

summary(mydata_sample[, c("Rating", "SalePrice", "OriginalPrice")])

##      Rating        SalePrice      OriginalPrice   
##  Min.   :3.900   Min.   :  1625   Min.   :  1625  
##  1st Qu.:4.100   1st Qu.:  9499   1st Qu.: 10747  
##  Median :4.300   Median : 15494   Median : 16495  
##  Mean   :4.300   Mean   : 27114   Mean   : 30556  
##  3rd Qu.:4.425   3rd Qu.: 26999   3rd Qu.: 27749  
##  Max.   :4.700   Max.   :157999   Max.   :179999

Short Code we used

mean_sale_price <- mean(filtered_data$SalePrice, na.rm = TRUE)
median_sale_price <- median(filtered_data$SalePrice, na.rm = TRUE)
sd_sale_price <- sd(filtered_data$SalePrice, na.rm = TRUE)

mean_original_price <- mean(filtered_data$OriginalPrice, na.rm = TRUE)
median_original_price <- median(filtered_data$OriginalPrice, na.rm = TRUE)
sd_original_price <- sd(filtered_data$OriginalPrice, na.rm = TRUE)

mean_rating <- mean(filtered_data$Rating, na.rm = TRUE)
median_rating <- median(filtered_data$Rating, na.rm = TRUE)
sd_rating <- sd(filtered_data$Rating, na.rm = TRUE)

mean_sale_price

## [1] 24749

median_sale_price

## [1] 12499

sd_sale_price

## [1] 25513.07

mean_original_price

## [1] 32474.25

median_original_price

## [1] 13999

sd_original_price

## [1] 38304.14

mean_rating

## [1] 4.5

median_rating

## [1] 4.45

sd_rating

## [1] 0.1414214

LONG Code I used -> calculated the mean, median, and standard deviation for the key variables: SalePrice, OriginalPrice, and Rating. These statistics provide insights into the central tendency and variability of the smartphone prices and ratings.

The mean gives the average value. The median shows the middle point, helping to understand the central value without being affected by outliers. The standard deviation indicates the spread of the data from the mean, showing the variability in the dataset.

Mean Sale Price: The average sale price is 15213. This tells us that, on average, the smartphones in this dataset were sold for 15213 units of currency.
Median Sale Price: The middle value of the sale price is 13999, meaning that half of the smartphones were sold for more than 13999 and half were sold for less. This value is less sensitive to extreme sale prices (outliers).
Standard Deviation of Sale Price: The standard deviation is 4943,69, showing how much the sale prices vary from the mean. A larger standard deviation indicates a greater variability in sale prices.

library(ggplot2)
library(readxl)

# 1. Histogram for Rating
ggplot(mydata_sample, aes(x = Rating)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white", alpha = 0.7) +
  ggtitle("Distribution of Ratings") +
  xlab("Rating") +
  ylab("Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold"))

# 2. Scatterplot for SalePrice vs Rating
ggplot(mydata_sample, aes(x = Rating, y = SalePrice)) +
  geom_point() +
  ggtitle("Scatterplot of Sale Price vs Rating") +
  xlab("Rating") +
  ylab("Sale Price") +
  theme_minimal()

# 3. Boxplot for SalePrice by Memory
ggplot(mydata_sample, aes(x = Rating, y = Rating)) +
  geom_boxplot() +
  ggtitle("Boxplot of Sale Price by Memory") +
  xlab("Memory") +
  ylab("Sale Price") +
  theme_minimal()

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

Histogram for Rating: This histogram shows that the ratings is concentrated around main peak at 4,25, suggesting this rating is the most common. Lower and higher ratings, like 3,5 and 4.5, occur less frequently.

Scatterplot of Sale Price vs Rating: The scatterplot suggests a positive relationship between rating and sale price. Lower ratings (around 4.0) correspond to a wider range of sale prices, while higher ratings (around 4.5) are associated with higher sale prices, with fewer low-price points.

This boxplot shows that the majority of sale prices are clustered around 4.25, with a small range of variability. There are a few outliers, with some prices dropping below 3.75 and one outlier above 4.5, indicating a few unusually high and low sale prices compared to the general trend. The data within the main range appears consistent with minimal spread.

TASK2

#install.packages("readr")
library(readxl)
mydataTask2 <- read_xlsx("C:/Users/danie/OneDrive/Desktop/IMB/Bootcamp/Statistics/Bootcamp R/R data/R Take Home Exam 2024/R Take Home Exam 2024/Task 2/Business School.xlsx")


head(mydataTask2)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

mean_salary <- mean(mydataTask2$`Annual Salary`, na.rm = TRUE)
median_salary <- median(mydataTask2$`Annual Salary`, na.rm = TRUE)
sd_salary <- sd(mydataTask2$`Annual Salary`, na.rm = TRUE)

mean_salary

## [1] 109058

median_salary

## [1] 103500

sd_salary

## [1] 41501.49

mean_salary: The average salary is 109,058. median_salary: The middle value of the salary distribution is 103,500, which means half of the salaries are below this value, and half are above. sd_salary: The standard deviation of the salaries is 41,501.49, indicating the amount of variation or dispersion from the mean. A higher standard deviation suggests that salaries vary widely from the average.

library(ggplot2)

ggplot(mydataTask2, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "purple") +
  ggtitle("Distribution of Undergraduate Degrees") +
  xlab("Undergrad Degree") +
  ylab("Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart displays the count of students for each degree type (e.g., Business, Finance, etc.). The most common degree has the tallest bar in the chart.

summary(mydataTask2$`Annual Salary`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

ggplot(mydataTask2, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 10000, fill = "orange", color = "black") +
  ggtitle("Distribution of Annual Salary") +
  xlab("Annual Salary") +
  ylab("Frequency") +
  theme_minimal() +
  geom_density(aes(y = ..count.. * 10000), color = "green", size = 1)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Descriptive statistics give the following key values for salary: Mean, Median: These indicate the central value of salaries. Standard deviation: Shows how much variation exists in the salaries. Histogram: The histogram visualizes the distribution of salaries. A density curve (in green) is added to see if the distribution is skewed or normal. From the graph we can se skewness to the right

t_test_result <- t.test(mydataTask2$`MBA Grade`, mu = 74)

t_test_result

## 
##  One Sample t-test
## 
## data:  mydataTask2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

mean_mba <- mean(mydataTask2$`MBA Grade`)
sd_mba <- sd(mydataTask2$`MBA Grade`)
cohen_d <- (mean_mba - 74) / sd_mba

cohen_d

## [1] 0.2658658

The t-statistic is 2.66, and the p-value is 0.009, which is less than 0.05. Therefore, we reject the null hypothesis, meaning the average MBA grade is significantly different from 74. The 95% confidence interval for the true mean MBA grade is [74.52, 77.56], which does not include 74. The sample mean is 76.04, indicating the current average is higher than 74. The effect size (Cohen’s d) is 0.27, suggesting a small to medium effect, meaning the difference is statistically significant but not very large in practice.

TASK3

library(readxl)
Apartments <- read_excel("C:/Users/danie/OneDrive/Desktop/IMB/Bootcamp/Statistics/Bootcamp R/R data/R Take Home Exam 2024/R Take Home Exam 2024/Task 3/Apartments.xlsx")

View(Apartments)

head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

the dataset “Apartments.xlsx” is imported into R using the read_excel function from the readxl package. The head(Apartments) function shows the first few rows of the dataset, providing an initial look at the structure of the data. - Age: Age of an apartment in years - Distance: The distance from city center in km - Price: Price per m2 - Parking: 0-No, 1-Yes - Balcony: 0-No, 1-Yes

Apartments$Parking <- factor(Apartments$Parking, levels = c(0, 1), labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony, levels = c(0, 1), labels = c("No", "Yes"))

str(Apartments)

## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

converting the variables Parking and Balcony, originally coded as 0 and 1, into factors with meaningful labels (“No” for 0 and “Yes” for 1). The str() function is used to confirm the data structure and to ensure the variables were correctly transformed into categorical variables.

t_test <- t.test(Apartments$Price, mu = 1900)

t_test

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

A one-sample t-test is conducted to test the hypothesis that the mean apartment price is 1900 EUR. The test provides a p-value, and since it is less than 0.05 (p = 0.004731), we reject the null hypothesis, concluding that the average apartment price significantly differs from 1900 EUR.

fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor(Apartments$Price, Apartments$Age)

## [1] -0.230255

This section estimates a simple linear regression model where Price is the dependent variable and Age is the independent variable. The model output includes the intercept and the slope, indicating that, on average, the price per square meter decreases by approximately 8.98 EUR for each additional year of apartment age. The correlation between price and age is weak and negative (-0.230255), meaning there is a slight tendency for older apartments to have lower prices, but the relationship is not strong.

library(car)

## Loading required package: carData

scatterplotMatrix(Apartments[c("Price", "Age", "Distance")], smooth = FALSE)

A scatterplot matrix is created to visualize the relationships between Price, Age, and Distance. This helps in assessing whether multicollinearity could be a concern, especially if any of the predictor variables are highly correlated with each other. Scatterplots betweeen Distance and Age don’t show any relationship between the two variables (horizontal line). Showing that we don’t have problem with multicolinearity

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

a multiple regression model is estimated with Price as the dependent variable and Age and Distance as predictors. This model allows us to see how both apartment age and distance from the city center jointly affect the price.

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Variance inflation factors (VIF) are used to assess multicollinearity between the predictors. A VIF value close to 1 (around 1.001845) suggests that there is no significant multicollinearity between Age and distance

Apartments$StdRsd <- round(rstandard(fit2), 3)

calculate the standardized residuals, which measure how far each observation’s residual is from the mean, standardized by the standard deviation of the residuals. The standardized residuals are rounded to three decimal places. Since the standardized residuals fall within the range of -3 to +3, I conclude there are no extreme outliers.

Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

calculate Cook’s distance to assess the influence of each data point on the overall regression model. A histogram is created to display the distribution of Cook’s distances. A large gap between values around 0.15 to 0.30 suggests that a few points may have a disproportionate effect on the regression model.

head(Apartments[order(-Apartments$CooksD), "CooksD"], 8)

## # A tibble: 8 × 1
##   CooksD
##    <dbl>
## 1  0.32 
## 2  0.104
## 3  0.069
## 4  0.066
## 5  0.061
## 6  0.038
## 7  0.037
## 8  0.034

display the top 8 highest Cook’s distance values, which helps identify the most influential observations. The first two points (with Cook’s distances of 0.32 and 0.104) stand out as particularly influential, and the next three values are relatively large. I decide to remove these values to improve the reliability of the model.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Apartments <- Apartments %>%
  filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.61))
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
Apartments$StdFtd <- scale(fit2$fitted.values)

re-estimate the model after removing the influential observations, ensuring a cleaner dataset with more reliable results. I also calculate and standardize the fitted values, which represent the predicted prices based on the regression model.

library(car)
scatterplot(y = Apartments$StdRsd, x = Apartments$StdFtd,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

create a scatterplot to visualize the relationship between standardized residuals and fitted values. The residuals appear randomly scattered, indicating no visible pattern. This randomness suggests that the model does not suffer from heteroskedasticity, which occurs when the variability of residuals depends on the fitted values.

hist(Apartments$StdRsd,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

plot a histogram of standardized residuals, which shows a slightly right-skewed distribution. However, since all residuals lie between -3 and +3, I conclude there are no extreme outliers in the data.

shapiro.test(Apartments$StdRsd)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdRsd
## W = 0.93191, p-value = 0.0003328

use the Shapiro-Wilk test to check for the normality of residuals. Since the p-value is below 0.05 (p = 0.0003328), I reject the null hypothesis, meaning the standardized residuals are not normally distributed. However, given the sample size of over 30, this violation of normality is unlikely to severely impact the regression results.

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -408.72 -217.93  -38.86  220.60  506.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2492.715     75.544  32.997  < 2e-16 ***
## Age           -7.496      3.169  -2.365   0.0205 *  
## Distance     -24.574      2.700  -9.100  6.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 259.4 on 78 degrees of freedom
## Multiple R-squared:  0.5324, Adjusted R-squared:  0.5204 
## F-statistic:  44.4 on 2 and 78 DF,  p-value: 1.335e-13

The regression summary shows that 53% of the variability in apartment prices is explained by the combined effects of age and distance from the city center (R-squared = 0.5324). The coefficient for Age indicates that for every additional year in an apartment’s age, its price per square meter decreases by 7.496 EUR. Similarly, for every kilometer increase in distance from the city center, the price decreases by 24.574 EUR.

sqrt(summary(fit2)$r.squared)

## [1] 0.7296492

The square root of R-squared is 0.73, indicating a moderately strong linear relationship between the apartment price, age, and distance.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)

create a new multiple regression model, incorporating two additional variables: Parking and Balcony. This model helps assess the impact of these categorical features on the apartment prices.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     78 5248925                              
## 2     76 4915944  2    332981 2.5739 0.08287 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

use the ANOVA test to compare model fit2 (with only age and distance) and model fit3 (which includes parking and balcony). With a p-value of 0.08287, the difference between the models is not statistically significant, indicating that adding Parking and Balcony does not significantly improve the model fit.

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403.16 -204.08  -41.24  239.52  528.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2372.693     93.344  25.419  < 2e-16 ***
## Age           -6.906      3.118  -2.215   0.0298 *  
## Distance     -22.254      2.839  -7.837 2.25e-11 ***
## ParkingYes   137.479     60.854   2.259   0.0267 *  
## BalconyYes    17.234     57.099   0.302   0.7636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254.3 on 76 degrees of freedom
## Multiple R-squared:  0.5621, Adjusted R-squared:  0.539 
## F-statistic: 24.38 on 4 and 76 DF,  p-value: 5.29e-13

The regression summary for model fit3 shows that apartments with parking tend to have prices higher by 137.479 EUR per square meter on average compared to those without parking. However, the presence of a balcony does not appear to significantly influence the price, as indicated by its high p-value (0.7636).

Apartments$StdFtd <- fitted.values(fit3)
Apartments$StdRsd <- residuals(fit3)
head(Apartments[, colnames(Apartments) %in% c("ID", "Price", "StdFtd", "StdRsd")])

## # A tibble: 6 × 3
##   Price StdRsd StdFtd
##   <dbl>  <dbl>  <dbl>
## 1  1640  -78.5  1718.
## 2  2800  436.   2364.
## 3  1660  -41.2  1701.
## 4  1850  299.   1551.
## 5  1640 -363.   2003.
## 6  1770 -160.   1930.

extract the fitted and residual values for each apartment. For example, the predicted price for Apartment ID2 is 2363.611 EUR, while the actual price is 2800 EUR. The difference between these values (residual) is 436.39 EUR, indicating how much the actual price deviates from the predicted value based on the model.

HOMEWORK

Daniel Maćešić Šulterer

2024-09-18