data()
mydata <- force(mtcars)
head(mydata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
A data frame with 32 observations on 11 variables.
summary(mydata)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
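# Caution: %in% returns matches in column order, and wt comes before am in mtcars,
# so this line labels wt as "Transmission" and am as "weight"
colnames(mydata)[colnames(mydata) %in% c("am", "wt")] <- c("Transmission", "weight")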
colnames(mydata)[colnames(mydata) %in% c("hp", "vs", "carb")] <- c("Horsepower", "Engine", "Carburetors")
head(mydata)
## mpg cyl disp Horsepower drat Transmission qsec Engine
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1
## weight gear Carburetors
## Mazda RX4 1 4 4
## Mazda RX4 Wag 1 4 4
## Datsun 710 1 4 1
## Hornet 4 Drive 0 3 1
## Hornet Sportabout 0 3 2
## Valiant 0 3 1
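In the output above, the Transmission column holds values such as 2.620 and 2.875, while weight holds 0s and 1s, which suggests the two labels ended up swapped. A minimal sketch of a rename that does not depend on column order, assuming the intended mapping is the one written in the code above; `cars_renamed`, `new_names` and `idx` are hypothetical names used only for illustration:

```r
# Work on a copy of the built-in data so the original object is untouched
cars_renamed <- mtcars

# Lookup table: old name -> new name (labels taken from the text above)
new_names <- c(am = "Transmission", wt = "weight",
               hp = "Horsepower", vs = "Engine", carb = "Carburetors")

# match() returns the position of each old name, whatever the column order is
idx <- match(names(new_names), colnames(cars_renamed))
colnames(cars_renamed)[idx] <- unname(new_names)

# Check that the values followed their labels
head(cars_renamed[, c("Transmission", "weight")], 3)
```

Because each new label is paired with its old name explicitly, reordering the lookup table does not change the result.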
mydata$`Car Transmission` <- ifelse(mydata$Transmission == 0, "automatic", "manual")
head(mydata)
## mpg cyl disp Horsepower drat Transmission qsec Engine
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1
## weight gear Carburetors Car Transmission
## Mazda RX4 1 4 4 manual
## Mazda RX4 Wag 1 4 4 manual
## Datsun 710 1 4 1 manual
## Hornet 4 Drive 0 3 1 manual
## Hornet Sportabout 0 3 2 manual
## Valiant 0 3 1 manual
manual_cars <- mydata[mydata$`Car Transmission` == "manual", ]
head(manual_cars)
## mpg cyl disp Horsepower drat Transmission qsec Engine
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1
## weight gear Carburetors Car Transmission
## Mazda RX4 1 4 4 manual
## Mazda RX4 Wag 1 4 4 manual
## Datsun 710 1 4 1 manual
## Hornet 4 Drive 0 3 1 manual
## Hornet Sportabout 0 3 2 manual
## Valiant 0 3 1 manual
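In the output above every car is labelled manual, which is what happens when the column being tested never equals 0. A minimal sketch of the labelling and filtering step, assuming the label should come from the original `am` column of `mtcars` (documented as 0 = automatic, 1 = manual); `cars_tx`, `tx_label` and `manual_only` are hypothetical names:

```r
# Copy of the built-in data with a readable transmission label
cars_tx <- mtcars
cars_tx$tx_label <- ifelse(cars_tx$am == 0, "automatic", "manual")

# Both categories should appear if the right column was tested
table(cars_tx$tx_label)

# Keep only the manual cars
manual_only <- cars_tx[cars_tx$tx_label == "manual", ]
nrow(manual_only)
```

A quick `table()` call right after creating a label column is a cheap way to catch a category that is unexpectedly empty.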
select_variables <- mydata[, c("mpg", "Horsepower", "weight")]
summary(select_variables)
## mpg Horsepower weight
## Min. :10.40 Min. : 52.0 Min. :0.0000
## 1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:0.0000
## Median :19.20 Median :123.0 Median :0.0000
## Mean :20.09 Mean :146.7 Mean :0.4062
## 3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:1.0000
## Max. :33.90 Max. :335.0 Max. :1.0000
The mean fuel economy in this dataset is 20.09 miles per gallon: on average, a car travels 20.09 miles on one gallon of fuel.
The mean engine output is 146.7 horsepower, above the median of 123.0, so a few high-powered cars pull the average up.
If the cars were sorted by fuel economy, the middle car would get 19.20 miles per gallon (the median): half of the cars are more fuel-efficient than this and half are less.
sd(mydata$Horsepower) # Standard deviation for horsepower (hp)
## [1] 68.56287
sd(mydata$weight) # Standard deviation for weight (wt)
## [1] 0.4989909
With a standard deviation of 68.56 horsepower against a mean of 146.7, engine power varies widely: some cars have much stronger engines than others.
The value 0.50 does not describe the spread of car weights. After the renaming step, the column labelled weight holds the 0/1 transmission indicator (its summary runs from 0 to 1 with a mean of 0.4062), so this is the standard deviation of a binary variable. The weights themselves sit in the column labelled Transmission (2.620, 2.875, 2.320, ... in the head output).
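To see which original column produces which standard deviation, the calculation can be repeated on the untouched `mtcars` data. This is a small sketch; `spread` and `cv` are hypothetical names, and the coefficient of variation is added here only as one possible way to compare spread across different units:

```r
# Standard deviations of horsepower, weight and the 0/1 transmission flag
spread <- sapply(mtcars[, c("hp", "wt", "am")], sd)
round(spread, 4)

# Coefficient of variation (sd / mean) puts hp and wt on a comparable scale
cv <- spread[c("hp", "wt")] / colMeans(mtcars[, c("hp", "wt")])
round(cv, 3)
```

The value for `am` should reproduce the 0.4989909 printed above, which identifies the column that was actually measured.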
hist(mydata$mpg, main="Histogram of MPG", xlab="Miles Per Gallon", col="darkolivegreen", border="gray20")
The histogram shows that the most common MPG values are around 20, and the distribution appears skewed to the right: a few cars reach very high miles-per-gallon values, that is, very low fuel consumption.
The histogram also shows a few cars with very low fuel efficiency (low miles-per-gallon values).
plot(mydata$Horsepower, mydata$mpg, main="Scatterplot of MPG vs. Horsepower", xlab="Horsepower", ylab="Miles per Gallon", col="darkolivegreen")
Explanation:
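One way to put a number on the pattern in the scatterplot is the Pearson correlation between horsepower and MPG. A minimal sketch on the built-in `mtcars` columns (`hp` and `mpg`, before any renaming); `r_hp_mpg` and `ct` are hypothetical names:

```r
# Pearson correlation between horsepower and miles per gallon
r_hp_mpg <- cor(mtcars$hp, mtcars$mpg)
round(r_hp_mpg, 3)

# cor.test() adds a confidence interval and a p-value for the same coefficient
ct <- cor.test(mtcars$hp, mtcars$mpg)
round(ct$conf.int, 3)
ct$p.value
```

A negative coefficient means that cars with more horsepower tend to get fewer miles per gallon.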
boxplot(mydata$mpg, main="Miles per Gallon", ylab="MPG", col="darkolivegreen")
The box plot suggests that the middle half of the cars in the dataset have an MPG value between about 15 and 23, with a median MPG of approximately 19.2.
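The numbers a box plot draws can be read off directly rather than estimated by eye. A short sketch on `mtcars$mpg`; `mpg_quartiles` and `bp` are hypothetical names:

```r
# Quartiles as reported by summary()
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
mpg_quartiles

# Values used by boxplot(): lower whisker, lower hinge, median,
# upper hinge, upper whisker, plus any points drawn as outliers
bp <- boxplot.stats(mtcars$mpg)
bp$stats
bp$out
```

The hinges used by `boxplot()` can differ slightly from the quartiles printed by `summary()`, because the two use different interpolation rules.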
#install.packages("readxl")
library(readxl)
MBA <- read_excel("/Users/jethro/Downloads/R Take Home Exam 2024/Task 2/Business School.xlsx")
head(MBA)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
#install.packages("ggplot2")
library(ggplot2)
ggplot(MBA, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
geom_bar(color = "gray20") +
labs(title = "Undergraduate Degree Distribution", x = "Undergraduate Degree", y = "Total")
undergrad_cnt <- table(MBA$`Undergrad Degree`)
most_listed_degree <- names(undergrad_cnt)[which.max(undergrad_cnt)]
highest_count <- max(undergrad_cnt)
print(most_listed_degree)
## [1] "Business"
print(highest_count)
## [1] 35
The most common undergraduate degree among the MBA students is Business.
statistics <- summary(MBA$`Annual Salary`)
# Use distinct names so the base functions mean() and median() are not masked
mean_salary <- mean(MBA$`Annual Salary`, na.rm = TRUE)
median_salary <- median(MBA$`Annual Salary`, na.rm = TRUE)
paste("Mean Salary:", mean_salary)
## [1] "Mean Salary: 109058"
paste("Median Salary:", median_salary)
## [1] "Median Salary: 103500"
# Based on the results above, we create a histogram for 'Annual Salary'
library(ggplot2)
ggplot(MBA, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 5000, fill = "darkolivegreen", color = "gray20") +
labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count") +
theme_minimal()
The histogram appears roughly bell-shaped: salaries are concentrated around the center, and the frequency falls as we move away in either direction, so fewer graduates earn much lower or much higher salaries. The distribution is not perfectly symmetric, however. The mean (109,058) exceeds the median (103,500), which points to some right skew.
Grade_test <- t.test(MBA$`MBA Grade`, mu = 74)
# Calculate effect size (Cohen's d)
mean_grade <- mean(MBA$`MBA Grade`)
sd_grade <- sd(MBA$`MBA Grade`)
cohens_d <- (mean_grade - 74) / sd_grade
# Print Cohen's d
cohens_d
## [1] 0.2658658
Based on the one-sample t-test, the average MBA grade differs significantly from the hypothesized mean of 74. The effect size, Cohen's d = 0.27, indicates a small effect: the sample mean lies only about 0.27 standard deviations above 74.
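For a one-sample test, Cohen's d and the t statistic are linked by t = d × √n, which makes it easy to cross-check the two. A minimal sketch of that relationship; the helper `one_sample_d`, the use of `mtcars$mpg` as stand-in data, and the reference value 18 are all assumptions made for illustration, not part of the MBA analysis:

```r
# Cohen's d for a one-sample comparison against a reference value mu
one_sample_d <- function(x, mu) (mean(x) - mu) / sd(x)

x <- mtcars$mpg   # stand-in data
mu0 <- 18         # hypothetical reference value

tt <- t.test(x, mu = mu0)
d <- one_sample_d(x, mu0)

# The t statistic equals d multiplied by the square root of the sample size
c(t = unname(tt$statistic), d_times_sqrt_n = d * sqrt(length(x)))
```

This is why a small effect can still be statistically significant: with enough observations, even a modest d yields a large t.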
Apartments <- read_excel("/Users/jethro/Downloads/R Take Home Exam 2024/Task 3/Apartments.xlsx")
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
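Description:
Age: Age of the apartment in years.
Distance: Distance from the city center in kilometers.
Price: Price per m² in EUR.
Parking: Whether the apartment has parking (0 = No, 1 = Yes).
Balcony: Whether the apartment has a balcony (0 = No, 1 = Yes).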
# Understanding the structure of the dataset
str(Apartments)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
Apartments$Parking <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
# Display the new data input
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
HP_test <- t.test(Apartments$Price, mu = 1900)
# View the results
HP_test
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The p-value of 0.004731 is well below the common significance level of 0.05, so we reject the null hypothesis and conclude that the mean price per m² is not equal to 1,900 EUR. The sample mean of 2,018.94 EUR and the 95% confidence interval (1,937.44 to 2,100.44 EUR) both lie above 1,900 EUR.
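The confidence interval and p-value in the test output can be rebuilt from the reported mean, t statistic and degrees of freedom, which is a useful sanity check. A minimal sketch using only numbers printed above; the variable names are hypothetical:

```r
# Values copied from the t-test output above
xbar   <- 2018.941   # sample mean
mu0    <- 1900       # hypothesized mean
t_stat <- 2.9022     # reported t statistic
df     <- 84         # degrees of freedom (n - 1 with n = 85)

# Standard error implied by t = (xbar - mu0) / se
se <- (xbar - mu0) / t_stat

# Two-sided 95% confidence interval and p-value
ci <- xbar + c(-1, 1) * qt(0.975, df) * se
p_value <- 2 * pt(-abs(t_stat), df)

round(ci, 2)
signif(p_value, 4)
```

Small differences in the last digit are expected, because the inputs are rounded values from the printed output.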
fit1 <- lm(Price ~ Age, data = Apartments)
# Summary
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
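From the above, three conclusions can be made in answering the question:
Regression coefficient: Each additional year of age is associated with a price about 8.98 EUR per m² lower (p = 0.034), which is significant at the 5% level.
Coefficient of determination: R² = 0.053, so age alone explains about 5.3% of the variation in price.
Correlation coefficient: Because this is a simple regression, r = −√0.05302 ≈ −0.23, a weak negative linear relationship.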
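In a regression with a single predictor, the correlation coefficient and the F statistic follow directly from numbers already in the summary. A short sketch using the values printed above; the variable names are hypothetical:

```r
# Values copied from summary(fit1)
r_squared <- 0.05302   # Multiple R-squared
slope     <- -8.975    # estimate for Age
t_age     <- -2.156    # t value for Age

# With one predictor, r has the sign of the slope and magnitude sqrt(R^2)
r <- sign(slope) * sqrt(r_squared)
round(r, 2)

# With one predictor, the overall F statistic is the squared t value
f_from_t <- t_age^2
round(f_from_t, 2)
```

Both identities hold only for simple regression; with several predictors, R² no longer maps to a single pairwise correlation.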
# Install and load library GGally
# install.packages("GGally")
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Apartments, columns = c("Price", "Age", "Distance"))
The correlation between Price and Age is −0.23: as the age of an apartment increases, the price tends to decrease slightly, but the relationship is weak.
The correlation between Price and Distance is −0.63, indicating that apartments farther from the city center tend to have lower prices.
The correlation between Age and Distance is 0.04, so there is essentially no linear relationship between the age of an apartment and its distance from the city center. Multicollinearity between Age and Distance is therefore not a concern.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
#install and load package car
# install.packages("car")
library(car)
## Loading required package: carData
vif(fit2)
## Age Distance
## 1.001845 1.001845
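With exactly two predictors, the VIF is 1 / (1 − r²), where r is the correlation between them, so the reported value can be inverted to recover how strongly Age and Distance are related. A minimal sketch from the printed VIF; the variable names are hypothetical, and only the magnitude of r can be recovered this way, not its sign:

```r
# Value copied from vif(fit2)
vif_reported <- 1.001845

# Invert VIF = 1 / (1 - r^2) to get the squared correlation between predictors
r2_between <- 1 - 1 / vif_reported
r_between  <- sqrt(r2_between)

round(r_between, 3)
```

A VIF this close to 1 means the standard errors are essentially unaffected by the relationship between the two predictors.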
# Standardized residuals
standardized_residuals <- rstandard(fit2)
# Cook's distances
cooks_distance <- cooks.distance(fit2)
# Histograms
hist(standardized_residuals, main = "Histogram for Standardized Residuals",
xlab = "Standardized Residuals", col = "darkolivegreen", border = "gray20")
hist(cooks_distance, main = "Histogram for Cook's Distances",
xlab = "Cook's Distance", col = "darkolivegreen", border = "gray20")
# outliers
outliers <- which(abs(standardized_residuals) > 3)
# influential and problem points
n <- nrow(Apartments)
high_influence <- which(cooks_distance > (4/n))
problem_data_units <- unique(c(outliers, high_influence))
cat("Potentially problematic units:\n", problem_data_units, "\n")
## Potentially problematic units:
## 22 33 38 53 55
# New dataset without problematic data units
New_Apartments_set <- Apartments[-problem_data_units, ]
fit2_clean <- lm(Price ~ Age + Distance, data = New_Apartments_set)
coef(fit2_clean)
## (Intercept) Age Distance
## 2502.466701 -8.673848 -24.063024
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = New_Apartments_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
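The flag-and-refit workflow used above can be sketched end to end on a built-in dataset. This example assumes `mtcars` and the model `mpg ~ hp + wt` as stand-ins for the apartment data; `fit_demo`, `cd`, `flagged` and `fit_trim` are hypothetical names, and the 4/n cutoff is the same rule of thumb used in the text:

```r
# Stand-in model on built-in data
fit_demo <- lm(mpg ~ hp + wt, data = mtcars)

# Cook's distance for every observation, flagged with the 4/n rule of thumb
cd <- cooks.distance(fit_demo)
flagged <- which(cd > 4 / nrow(mtcars))

# Guard against an empty index: x[-integer(0), ] would drop every row
fit_trim <- if (length(flagged) > 0) {
  lm(mpg ~ hp + wt, data = mtcars[-flagged, ])
} else {
  fit_demo
}

# Compare the coefficients before and after removing flagged rows
round(cbind(all_rows = coef(fit_demo), trimmed = coef(fit_trim)), 3)
```

Placing the two coefficient vectors side by side shows how much the flagged observations were driving the estimates.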
# Standardize the fitted values (mean 0, standard deviation 1)
standardized_fitted <- as.numeric(scale(fitted(fit2)))
library(ggplot2)
ggplot(data = data.frame(standardized_fitted, standardized_residuals),
aes(x = standardized_fitted, y = standardized_residuals)) +
geom_point(color = "gray") +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray20") +
labs(title = "Residuals vs Fitted Values",
x = "Standardized Fitted Values",
y = "Standardized Residuals") +
theme_minimal()
The standardized residuals are mostly clustered around zero, with most of them falling between -1 and 1. As long as that spread stays roughly constant across the range of fitted values (no funnel or fan shape), the model likely satisfies the assumption of homoskedasticity.
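A residuals-versus-fitted plot of this kind can be built with base R alone. The sketch below assumes `mtcars` and the model `mpg ~ hp + wt` as stand-ins; `fit_demo`, `z_fitted` and `z_resid` are hypothetical names:

```r
# Stand-in model on built-in data
fit_demo <- lm(mpg ~ hp + wt, data = mtcars)

# Fitted values rescaled to mean 0 and standard deviation 1
z_fitted <- as.numeric(scale(fitted(fit_demo)))

# Standardized residuals from the model
z_resid <- rstandard(fit_demo)

# Look for a spread that stays roughly constant from left to right
plot(z_fitted, z_resid,
     main = "Residuals vs Fitted Values",
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals")
abline(h = 0, lty = 2)
```

A funnel shape, with residuals fanning out as the fitted values grow, would be the usual visual sign of heteroskedasticity.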
standardized_residuals <- rstandard(fit2)
qqnorm(standardized_residuals)
qqline(standardized_residuals, col = "gray20", lwd = 2)
Because the points on the Q-Q plot lie close to the diagonal line, the residuals appear to be approximately normally distributed.
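A visual Q-Q check can be complemented with a formal test such as Shapiro-Wilk. The sketch below is an illustration on a stand-in model (`mpg ~ hp + wt` on `mtcars`), not a result for the apartment data; `fit_demo` and `sw` are hypothetical names:

```r
# Stand-in model on built-in data
fit_demo <- lm(mpg ~ hp + wt, data = mtcars)

# Shapiro-Wilk test on the standardized residuals;
# the null hypothesis is that they come from a normal distribution
sw <- shapiro.test(rstandard(fit_demo))
sw$statistic
sw$p.value
```

A p-value above the chosen significance level means normality is not rejected, which is weaker than proving the residuals are normal.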
cooks_distance <- cooks.distance(fit2)
n <- nrow(Apartments)
high_influence_data <- which(cooks_distance > (4/n)) # Identify influential points
print(high_influence_data)
## 22 33 38 53 55
## 22 33 38 53 55
# delete high influence data points
new_apartments <- Apartments[-high_influence_data, ]
fit2_prune <- lm(Price ~ Age + Distance, data = new_apartments)
summary(fit2_prune)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = new_apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Explanations
Intercept: The estimated average price per m² when both age and distance are zero, about 2,502.47 EUR.
Age: Holding distance constant, each additional year of age lowers the price per m² by approximately 8.67 EUR. The p-value of 0.00869 is below 0.01, so the effect of age on price is statistically significant.
Distance: Holding age constant, each additional kilometer from the city center lowers the price per m² by approximately 24.06 EUR. The p-value of 1.57e-13 is far below 0.01, so the effect of distance on price is statistically significant.
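The coefficients can be turned into a worked prediction. The sketch below uses the estimates printed above and a hypothetical apartment (10 years old, 5 km from the center) chosen purely for illustration; `predict_price` is a hypothetical helper:

```r
# Coefficients copied from the summary above
b0     <- 2502.467   # intercept
b_age  <- -8.674     # change in price per additional year of age
b_dist <- -24.063    # change in price per additional kilometer

# Predicted price per m² for a given age (years) and distance (km)
predict_price <- function(age, distance) b0 + b_age * age + b_dist * distance

# Hypothetical apartment: 10 years old, 5 km from the city center
predict_price(age = 10, distance = 5)
# 2502.467 - 86.74 - 120.315 = 2295.412
```

Changing one argument while holding the other fixed reproduces the "holding constant" reading of each coefficient.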
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding Parking and Balcony to Model 1 lowers the residual sum of squares from 6,720,983 to 5,991,088. Because the RSS can never increase when predictors are added, the F-test is what shows whether the improvement is more than chance.
The p-value of 0.01007 is below the 0.05 significance level, so we conclude that Model 2 fits the data significantly better than Model 1: the two added variables jointly improve the prediction of apartment prices.
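The F statistic in the ANOVA table can be rebuilt from the two residual sums of squares and their degrees of freedom. A minimal sketch using only numbers printed in the table above; the variable names are hypothetical:

```r
# Values copied from the anova() table
rss_small <- 6720983; df_small <- 82   # Model 1: Age + Distance
rss_large <- 5991088; df_large <- 80   # Model 2: adds Parking and Balcony

# Partial F: drop in RSS per added parameter,
# relative to the residual variance of the larger model
q <- df_small - df_large
f_stat <- ((rss_small - rss_large) / q) / (rss_large / df_large)
p_value <- pf(f_stat, q, df_large, lower.tail = FALSE)

round(f_stat, 4)
signif(p_value, 4)
```

The numerator degrees of freedom equal the number of added coefficients, here two.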
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
The F-statistic tests whether the model's predictors (Age, Distance, Parking, and Balcony) jointly explain a significant amount of the variation in Price. With F = 20.03 on 4 and 80 degrees of freedom and a p-value of 1.849e-11, we reject the null hypothesis that all four slope coefficients are zero in favor of the alternative that at least one predictor contributes to explaining the variation in Price.
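The overall F statistic is tied to R² by F = (R² / k) / ((1 − R²) / (n − k − 1)). A short sketch that recomputes it from the printed summary; the variable names are hypothetical, and the result is approximate because R² is rounded in the output:

```r
# Values copied from summary(fit3)
r_squared <- 0.5004   # Multiple R-squared
k         <- 4        # number of slope coefficients
df_resid  <- 80       # residual degrees of freedom (n - k - 1)

# Overall F statistic implied by R-squared
f_overall <- (r_squared / k) / ((1 - r_squared) / df_resid)
p_overall <- pf(f_overall, k, df_resid, lower.tail = FALSE)

round(f_overall, 2)
p_overall
```

The formula makes the null hypothesis concrete: if all slopes were zero, R² would be near zero and so would F.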
Fitted_values <- fitted(fit3)
Residuals_Values <- residuals(fit3)
Fitted_Value_ID2 <- Fitted_values[2]
Residual_ID2 <- Residuals_Values[2]
cat("Fitted value for apartment ID 2:", Fitted_Value_ID2, "\n")
## Fitted value for apartment ID 2: 2357.411
cat("Residual for apartment ID 2:", Residual_ID2, "\n")
## Residual for apartment ID 2: 442.5889
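The fitted value and residual for apartment ID 2 can be checked by hand from the coefficients and the row shown earlier (Age 18, Distance 1, Price 2800, Parking Yes, Balcony No). A minimal sketch using the rounded estimates from the summary; the variable names are hypothetical:

```r
# Coefficients copied from summary(fit3)
b <- c(intercept = 2301.667, age = -6.799, distance = -18.045,
       parking_yes = 196.168, balcony_yes = 1.935)

# Apartment ID 2 from the head() output: Parking = Yes (1), Balcony = No (0)
x <- c(1, 18, 1, 1, 0)

fitted_id2   <- sum(b * x)          # predicted price per m²
residual_id2 <- 2800 - fitted_id2   # actual price minus fitted value

round(c(fitted = fitted_id2, residual = residual_id2), 2)
```

The positive residual means the apartment sells for more than the model predicts; the small gap to the printed values comes from rounding in the coefficients.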
# Set CRAN mirror first
options(repos = c(CRAN = "https://cran.rstudio.com/"))
# Then install the required packages
if (!require(readxl)) {
install.packages("readxl")
}