#install.packages("readr")
library(readr)
mydata <- read.table("./Sales.csv",
header = TRUE,
sep = ",",
dec = ".")
head(mydata)
## Brands Models Colors Memory Storage Camera Rating Selling.Price
## 1 SAMSUNG GALAXY M31S Mirage Black 8 GB 128 GB Yes 4.3 19330
## 2 Nokia 3.2 Steel 2 GB 16 GB Yes 3.8 10199
## 3 realme C2 Diamond Black 2 GB Yes 4.4 6999
## 4 Infinix Note 5 Ice Blue 4 GB 64 GB Yes 4.2 12999
## 5 Apple iPhone 11 Black 4GB 64 GB Yes 4.6 49900
## 6 GIONEE L800 Black 8 MB 16 MB Yes 4.0 2199
## Original.Price Mobile Discount discount.percentage
## 1 20999 SAMSUNG GALAXY M31S 1669 7.947998
## 2 10199 Nokia 3.2 0 0.000000
## 3 7999 realme C2 1000 12.501563
## 4 12999 Infinix Note 5 0 0.000000
## 5 49900 Apple iPhone 11 0 0.000000
## 6 2199 GIONEE L800 0 0.000000
str(mydata)
## 'data.frame': 3114 obs. of 12 variables:
## $ Brands : chr "SAMSUNG" "Nokia" "realme" "Infinix" ...
## $ Models : chr "GALAXY M31S " "3.2" "C2 " "Note 5" ...
## $ Colors : chr "Mirage Black" "Steel" "Diamond Black" "Ice Blue" ...
## $ Memory : chr "8 GB" "2 GB" "2 GB" "4 GB" ...
## $ Storage : chr "128 GB" "16 GB" "" "64 GB" ...
## $ Camera : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Rating : num 4.3 3.8 4.4 4.2 4.6 4 NA 4.6 4.2 4.3 ...
## $ Selling.Price : int 19330 10199 6999 12999 49900 2199 99900 42999 20400 21736 ...
## $ Original.Price : int 20999 10199 7999 12999 49900 2199 99900 47900 20400 22999 ...
## $ Mobile : chr "SAMSUNG GALAXY M31S " "Nokia 3.2" "realme C2 " "Infinix Note 5" ...
## $ Discount : int 1669 0 1000 0 0 0 0 4901 0 1263 ...
## $ discount.percentage: num 7.95 0 12.5 0 0 ...
set.seed(123)
mydata_sample <- mydata[sample(nrow(mydata), 20), ]
head(mydata_sample)
## Brands Models Colors Memory Storage Camera
## 2463 realme C15 Power Blue 3 GB 32 GB Yes
## 2511 HTC Desire 820 Saffron Grey 2 GB 16 GB Yes
## 2227 Lenovo A7 Blue 2 GB 32 GB Yes
## 526 SAMSUNG Guru Music 2 Black Expandable Upto 16 GB Yes
## 195 Nokia 5 Silver 3 GB 16 GB Yes
## 2986 POCO X2 Matrix Purple 8 GB 256 GB Yes
## Rating Selling.Price Original.Price Mobile Discount
## 2463 4.3 9999 11999 realme C15 2000
## 2511 4.1 25999 25999 HTC Desire 820 0
## 2227 4.0 6950 8490 Lenovo A7 1540
## 526 4.3 1625 1625 SAMSUNG Guru Music 2 0
## 195 4.1 7139 7139 Nokia 5 0
## 2986 4.5 21499 22999 POCO X2 1500
## discount.percentage
## 2463 16.668056
## 2511 0.000000
## 2227 18.138987
## 526 0.000000
## 195 0.000000
## 2986 6.522023
Random sample of 20 rows from my original dataset.
head(mydata_sample)
## Brands Models Colors Memory Storage Camera
## 2463 realme C15 Power Blue 3 GB 32 GB Yes
## 2511 HTC Desire 820 Saffron Grey 2 GB 16 GB Yes
## 2227 Lenovo A7 Blue 2 GB 32 GB Yes
## 526 SAMSUNG Guru Music 2 Black Expandable Upto 16 GB Yes
## 195 Nokia 5 Silver 3 GB 16 GB Yes
## 2986 POCO X2 Matrix Purple 8 GB 256 GB Yes
## Rating Selling.Price Original.Price Mobile Discount
## 2463 4.3 9999 11999 realme C15 2000
## 2511 4.1 25999 25999 HTC Desire 820 0
## 2227 4.0 6950 8490 Lenovo A7 1540
## 526 4.3 1625 1625 SAMSUNG Guru Music 2 0
## 195 4.1 7139 7139 Nokia 5 0
## 2986 4.5 21499 22999 POCO X2 1500
## discount.percentage
## 2463 16.668056
## 2511 0.000000
## 2227 18.138987
## 526 0.000000
## 195 0.000000
## 2986 6.522023
colnames(mydata_sample)[which(names(mydata_sample) == "Selling.Price")] <- "SalePrice"
colnames(mydata_sample)[which(names(mydata_sample) == "Original.Price")] <- "OriginalPrice"
colnames(mydata_sample)
## [1] "Brands" "Models" "Colors"
## [4] "Memory" "Storage" "Camera"
## [7] "Rating" "SalePrice" "OriginalPrice"
## [10] "Mobile" "Discount" "discount.percentage"
filtered_data <- mydata_sample[mydata_sample$Rating > 4 & mydata_sample$Memory == "4 GB", ]
print(filtered_data)
## Brands Models Colors Memory Storage Camera Rating SalePrice
## 1268 Apple iPhone XS Space Grey 4 GB 64 GB Yes 4.7 62999
## 1115 realme C15 Power Blue 4 GB 64 GB Yes 4.4 10999
## 953 realme C35 Glowing Black 4 GB 128 GB Yes 4.5 12999
## 348 Xiaomi Redmi 6 Pro Lake Blue 4 GB 64 GB Yes 4.4 11999
## OriginalPrice Mobile Discount discount.percentage
## 1268 89900 Apple iPhone XS 26901 29.92325
## 1115 12999 realme C15 2000 15.38580
## 953 14999 realme C35 2000 13.33422
## 348 11999 Xiaomi Redmi 6 Pro 0 0.00000
mydata_sample$DiscountedPrice <- (mydata_sample$OriginalPrice - mydata_sample$SalePrice) / mydata_sample$OriginalPrice * 100
head(filtered_data)
## Brands Models Colors Memory Storage Camera Rating SalePrice
## 1268 Apple iPhone XS Space Grey 4 GB 64 GB Yes 4.7 62999
## 1115 realme C15 Power Blue 4 GB 64 GB Yes 4.4 10999
## 953 realme C35 Glowing Black 4 GB 128 GB Yes 4.5 12999
## 348 Xiaomi Redmi 6 Pro Lake Blue 4 GB 64 GB Yes 4.4 11999
## OriginalPrice Mobile Discount discount.percentage
## 1268 89900 Apple iPhone XS 26901 29.92325
## 1115 12999 realme C15 2000 15.38580
## 953 14999 realme C35 2000 13.33422
## 348 11999 Xiaomi Redmi 6 Pro 0 0.00000
head(mydata_sample)
## Brands Models Colors Memory Storage Camera
## 2463 realme C15 Power Blue 3 GB 32 GB Yes
## 2511 HTC Desire 820 Saffron Grey 2 GB 16 GB Yes
## 2227 Lenovo A7 Blue 2 GB 32 GB Yes
## 526 SAMSUNG Guru Music 2 Black Expandable Upto 16 GB Yes
## 195 Nokia 5 Silver 3 GB 16 GB Yes
## 2986 POCO X2 Matrix Purple 8 GB 256 GB Yes
## Rating SalePrice OriginalPrice Mobile Discount
## 2463 4.3 9999 11999 realme C15 2000
## 2511 4.1 25999 25999 HTC Desire 820 0
## 2227 4.0 6950 8490 Lenovo A7 1540
## 526 4.3 1625 1625 SAMSUNG Guru Music 2 0
## 195 4.1 7139 7139 Nokia 5 0
## 2986 4.5 21499 22999 POCO X2 1500
## discount.percentage DiscountedPrice
## 2463 16.668056 16.668056
## 2511 0.000000 0.000000
## 2227 18.138987 18.138987
## 526 0.000000 0.000000
## 195 0.000000 0.000000
## 2986 6.522023 6.522023
Created New Variable: A new variable DiscountedPrice was created to show the discount percentage based on the difference between the original and selling prices.
Renamed Variables: The columns Selling.Price and Original.Price were renamed to SalePrice and OriginalPrice for better clarity.
Filtered Data: A new dataset was created containing only smartphones with a rating above 4 and memory of at least 4 GB.
summary(mydata_sample[, c("Rating", "SalePrice", "OriginalPrice")])
## Rating SalePrice OriginalPrice
## Min. :3.900 Min. : 1625 Min. : 1625
## 1st Qu.:4.100 1st Qu.: 9499 1st Qu.: 10747
## Median :4.300 Median : 15494 Median : 16495
## Mean :4.300 Mean : 27114 Mean : 30556
## 3rd Qu.:4.425 3rd Qu.: 26999 3rd Qu.: 27749
## Max. :4.700 Max. :157999 Max. :179999
Short Code we used
mean_sale_price <- mean(filtered_data$SalePrice, na.rm = TRUE)
median_sale_price <- median(filtered_data$SalePrice, na.rm = TRUE)
sd_sale_price <- sd(filtered_data$SalePrice, na.rm = TRUE)
mean_original_price <- mean(filtered_data$OriginalPrice, na.rm = TRUE)
median_original_price <- median(filtered_data$OriginalPrice, na.rm = TRUE)
sd_original_price <- sd(filtered_data$OriginalPrice, na.rm = TRUE)
mean_rating <- mean(filtered_data$Rating, na.rm = TRUE)
median_rating <- median(filtered_data$Rating, na.rm = TRUE)
sd_rating <- sd(filtered_data$Rating, na.rm = TRUE)
mean_sale_price
## [1] 24749
median_sale_price
## [1] 12499
sd_sale_price
## [1] 25513.07
mean_original_price
## [1] 32474.25
median_original_price
## [1] 13999
sd_original_price
## [1] 38304.14
mean_rating
## [1] 4.5
median_rating
## [1] 4.45
sd_rating
## [1] 0.1414214
LONG Code I used -> calculated the mean, median, and standard deviation for the key variables: SalePrice, OriginalPrice, and Rating. These statistics provide insights into the central tendency and variability of the smartphone prices and ratings.
The mean gives the average value. The median shows the middle point, helping to understand the central value without being affected by outliers. The standard deviation indicates the spread of the data from the mean, showing the variability in the dataset.
library(ggplot2)
library(readxl)
# 1. Histogram for Rating
ggplot(mydata_sample, aes(x = Rating)) +
geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white", alpha = 0.7) +
ggtitle("Distribution of Ratings") +
xlab("Rating") +
ylab("Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"))
# 2. Scatterplot for SalePrice vs Rating
ggplot(mydata_sample, aes(x = Rating, y = SalePrice)) +
geom_point() +
ggtitle("Scatterplot of Sale Price vs Rating") +
xlab("Rating") +
ylab("Sale Price") +
theme_minimal()
# 3. Boxplot for SalePrice by Memory
ggplot(mydata_sample, aes(x = Rating, y = Rating)) +
geom_boxplot() +
ggtitle("Boxplot of Sale Price by Memory") +
xlab("Memory") +
ylab("Sale Price") +
theme_minimal()
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
Histogram for Rating: This histogram shows that the ratings is
concentrated around main peak at 4,25, suggesting this rating is the
most common. Lower and higher ratings, like 3,5 and 4.5, occur less
frequently.
Scatterplot of Sale Price vs Rating: The scatterplot suggests a positive relationship between rating and sale price. Lower ratings (around 4.0) correspond to a wider range of sale prices, while higher ratings (around 4.5) are associated with higher sale prices, with fewer low-price points.
This boxplot shows that the majority of sale prices are clustered around 4.25, with a small range of variability. There are a few outliers, with some prices dropping below 3.75 and one outlier above 4.5, indicating a few unusually high and low sale prices compared to the general trend. The data within the main range appears consistent with minimal spread.
#install.packages("readr")
library(readxl)
mydataTask2 <- read_xlsx("C:/Users/danie/OneDrive/Desktop/IMB/Bootcamp/Statistics/Bootcamp R/R data/R Take Home Exam 2024/R Take Home Exam 2024/Task 2/Business School.xlsx")
head(mydataTask2)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
mean_salary <- mean(mydataTask2$`Annual Salary`, na.rm = TRUE)
median_salary <- median(mydataTask2$`Annual Salary`, na.rm = TRUE)
sd_salary <- sd(mydataTask2$`Annual Salary`, na.rm = TRUE)
mean_salary
## [1] 109058
median_salary
## [1] 103500
sd_salary
## [1] 41501.49
mean_salary: The average salary is 109,058. median_salary: The middle value of the salary distribution is 103,500, which means half of the salaries are below this value, and half are above. sd_salary: The standard deviation of the salaries is 41,501.49, indicating the amount of variation or dispersion from the mean. A higher standard deviation suggests that salaries vary widely from the average.
library(ggplot2)
ggplot(mydataTask2, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "purple") +
ggtitle("Distribution of Undergraduate Degrees") +
xlab("Undergrad Degree") +
ylab("Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart displays the count of students for each degree type (e.g., Business, Finance, etc.). The most common degree has the tallest bar in the chart.
summary(mydataTask2$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
ggplot(mydataTask2, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 10000, fill = "orange", color = "black") +
ggtitle("Distribution of Annual Salary") +
xlab("Annual Salary") +
ylab("Frequency") +
theme_minimal() +
geom_density(aes(y = ..count.. * 10000), color = "green", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Descriptive statistics give the following key values for salary: Mean, Median: These indicate the central value of salaries. Standard deviation: Shows how much variation exists in the salaries. Histogram: The histogram visualizes the distribution of salaries. A density curve (in green) is added to see if the distribution is skewed or normal. From the graph we can se skewness to the right
t_test_result <- t.test(mydataTask2$`MBA Grade`, mu = 74)
t_test_result
##
## One Sample t-test
##
## data: mydataTask2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
mean_mba <- mean(mydataTask2$`MBA Grade`)
sd_mba <- sd(mydataTask2$`MBA Grade`)
cohen_d <- (mean_mba - 74) / sd_mba
cohen_d
## [1] 0.2658658
The t-statistic is 2.66, and the p-value is 0.009, which is less than 0.05. Therefore, we reject the null hypothesis, meaning the average MBA grade is significantly different from 74. The 95% confidence interval for the true mean MBA grade is [74.52, 77.56], which does not include 74. The sample mean is 76.04, indicating the current average is higher than 74. The effect size (Cohen’s d) is 0.27, suggesting a small to medium effect, meaning the difference is statistically significant but not very large in practice.
library(readxl)
Apartments <- read_excel("C:/Users/danie/OneDrive/Desktop/IMB/Bootcamp/Statistics/Bootcamp R/R data/R Take Home Exam 2024/R Take Home Exam 2024/Task 3/Apartments.xlsx")
View(Apartments)
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
the dataset “Apartments.xlsx” is imported into R using the read_excel function from the readxl package. The head(Apartments) function shows the first few rows of the dataset, providing an initial look at the structure of the data. - Age: Age of an apartment in years - Distance: The distance from city center in km - Price: Price per m2 - Parking: 0-No, 1-Yes - Balcony: 0-No, 1-Yes
Apartments$Parking <- factor(Apartments$Parking, levels = c(0, 1), labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
str(Apartments)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
converting the variables Parking and Balcony, originally coded as 0 and 1, into factors with meaningful labels (“No” for 0 and “Yes” for 1). The str() function is used to confirm the data structure and to ensure the variables were correctly transformed into categorical variables.
t_test <- t.test(Apartments$Price, mu = 1900)
t_test
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
A one-sample t-test is conducted to test the hypothesis that the mean apartment price is 1900 EUR. The test provides a p-value, and since it is less than 0.05 (p = 0.004731), we reject the null hypothesis, concluding that the average apartment price significantly differs from 1900 EUR.
fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(Apartments$Price, Apartments$Age)
## [1] -0.230255
This section estimates a simple linear regression model where Price is the dependent variable and Age is the independent variable. The model output includes the intercept and the slope, indicating that, on average, the price per square meter decreases by approximately 8.98 EUR for each additional year of apartment age. The correlation between price and age is weak and negative (-0.230255), meaning there is a slight tendency for older apartments to have lower prices, but the relationship is not strong.
library(car)
## Loading required package: carData
scatterplotMatrix(Apartments[c("Price", "Age", "Distance")], smooth = FALSE)
A scatterplot matrix is created to visualize the relationships between Price, Age, and Distance. This helps in assessing whether multicollinearity could be a concern, especially if any of the predictor variables are highly correlated with each other. Scatterplots betweeen Distance and Age don’t show any relationship between the two variables (horizontal line). Showing that we don’t have problem with multicolinearity
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
a multiple regression model is estimated with Price as the dependent variable and Age and Distance as predictors. This model allows us to see how both apartment age and distance from the city center jointly affect the price.
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
Variance inflation factors (VIF) are used to assess multicollinearity between the predictors. A VIF value close to 1 (around 1.001845) suggests that there is no significant multicollinearity between Age and distance
Apartments$StdRsd <- round(rstandard(fit2), 3)
calculate the standardized residuals, which measure how far each observation’s residual is from the mean, standardized by the standard deviation of the residuals. The standardized residuals are rounded to three decimal places. Since the standardized residuals fall within the range of -3 to +3, I conclude there are no extreme outliers.
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
calculate Cook’s distance to assess the influence of each data point on
the overall regression model. A histogram is created to display the
distribution of Cook’s distances. A large gap between values around 0.15
to 0.30 suggests that a few points may have a disproportionate effect on
the regression model.
head(Apartments[order(-Apartments$CooksD), "CooksD"], 8)
## # A tibble: 8 × 1
## CooksD
## <dbl>
## 1 0.32
## 2 0.104
## 3 0.069
## 4 0.066
## 5 0.061
## 6 0.038
## 7 0.037
## 8 0.034
display the top 8 highest Cook’s distance values, which helps identify the most influential observations. The first two points (with Cook’s distances of 0.32 and 0.104) stand out as particularly influential, and the next three values are relatively large. I decide to remove these values to improve the reliability of the model.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Apartments <- Apartments %>%
filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.61))
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
Apartments$StdFtd <- scale(fit2$fitted.values)
re-estimate the model after removing the influential observations, ensuring a cleaner dataset with more reliable results. I also calculate and standardize the fitted values, which represent the predicted prices based on the regression model.
library(car)
scatterplot(y = Apartments$StdRsd, x = Apartments$StdFtd,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
create a scatterplot to visualize the relationship between standardized
residuals and fitted values. The residuals appear randomly scattered,
indicating no visible pattern. This randomness suggests that the model
does not suffer from heteroskedasticity, which occurs when the
variability of residuals depends on the fitted values.
hist(Apartments$StdRsd,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
plot a histogram of standardized residuals, which shows a slightly
right-skewed distribution. However, since all residuals lie between -3
and +3, I conclude there are no extreme outliers in the data.
shapiro.test(Apartments$StdRsd)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdRsd
## W = 0.93191, p-value = 0.0003328
use the Shapiro-Wilk test to check for the normality of residuals. Since the p-value is below 0.05 (p = 0.0003328), I reject the null hypothesis, meaning the standardized residuals are not normally distributed. However, given the sample size of over 30, this violation of normality is unlikely to severely impact the regression results.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -408.72 -217.93 -38.86 220.60 506.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2492.715 75.544 32.997 < 2e-16 ***
## Age -7.496 3.169 -2.365 0.0205 *
## Distance -24.574 2.700 -9.100 6.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.4 on 78 degrees of freedom
## Multiple R-squared: 0.5324, Adjusted R-squared: 0.5204
## F-statistic: 44.4 on 2 and 78 DF, p-value: 1.335e-13
The regression summary shows that 53% of the variability in apartment prices is explained by the combined effects of age and distance from the city center (R-squared = 0.5324). The coefficient for Age indicates that for every additional year in an apartment’s age, its price per square meter decreases by 7.496 EUR. Similarly, for every kilometer increase in distance from the city center, the price decreases by 24.574 EUR.
sqrt(summary(fit2)$r.squared)
## [1] 0.7296492
The square root of R-squared is 0.73, indicating a moderately strong linear relationship between the apartment price, age, and distance.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
create a new multiple regression model, incorporating two additional variables: Parking and Balcony. This model helps assess the impact of these categorical features on the apartment prices.
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 5248925
## 2 76 4915944 2 332981 2.5739 0.08287 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
use the ANOVA test to compare model fit2 (with only age and distance) and model fit3 (which includes parking and balcony). With a p-value of 0.08287, the difference between the models is not statistically significant, indicating that adding Parking and Balcony does not significantly improve the model fit.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -403.16 -204.08 -41.24 239.52 528.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2372.693 93.344 25.419 < 2e-16 ***
## Age -6.906 3.118 -2.215 0.0298 *
## Distance -22.254 2.839 -7.837 2.25e-11 ***
## ParkingYes 137.479 60.854 2.259 0.0267 *
## BalconyYes 17.234 57.099 0.302 0.7636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254.3 on 76 degrees of freedom
## Multiple R-squared: 0.5621, Adjusted R-squared: 0.539
## F-statistic: 24.38 on 4 and 76 DF, p-value: 5.29e-13
The regression summary for model fit3 shows that apartments with parking tend to have prices higher by 137.479 EUR per square meter on average compared to those without parking. However, the presence of a balcony does not appear to significantly influence the price, as indicated by its high p-value (0.7636).
Apartments$StdFtd <- fitted.values(fit3)
Apartments$StdRsd <- residuals(fit3)
head(Apartments[, colnames(Apartments) %in% c("ID", "Price", "StdFtd", "StdRsd")])
## # A tibble: 6 × 3
## Price StdRsd StdFtd
## <dbl> <dbl> <dbl>
## 1 1640 -78.5 1718.
## 2 2800 436. 2364.
## 3 1660 -41.2 1701.
## 4 1850 299. 1551.
## 5 1640 -363. 2003.
## 6 1770 -160. 1930.
extract the fitted and residual values for each apartment. For example, the predicted price for Apartment ID2 is 2363.611 EUR, while the actual price is 2800 EUR. The difference between these values (residual) is 436.39 EUR, indicating how much the actual price deviates from the predicted value based on the model.