mydata <- read.csv("~/Downloads/R data/IMB Bootcamp/Medical Insurance Dataset.csv",
header = TRUE,
sep = ",",
dec = ".")
head(mydata)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
str(mydata)
## 'data.frame': 3038 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Variables in the dataset:
AGE Numeric variable.
Ranges from 18 to 64 years old.
Mean is 39.09.
Median also at 39.
SEX Categorical variable: “male” or “female”.
BMI (Body Mass Index). Numeric variable.
Range: 15.96 (very underweight) to 53.13 (obese).
Average BMI ≈ 30.7 (borderline obese).
CHILDREN - also covered in the price of the insurance -> cost of
which seen in CHARGES. Numeric variable.
Range: 0 to 5 children.
Average ≈ 1.099.
SMOKER Categorical (yes/no).
REGION Categorical (character) variable: “northwest”, “northeast”, “southeast”, “southwest”.
CHARGES Numeric variable.
Range: 1,122 to 63,770 USD.
Average ≈ 13,336 USD.
This dataset contains 3,038 individuals with variables describing their demographics (age, sex, number of children, region), health status (BMI, smoker or not), and the resulting insurance charges.
mydata$cost_per_person <- mydata$charges / (mydata$children + 1)
head(mydata)
## age sex bmi children smoker region charges cost_per_person
## 1 19 female 27.900 0 yes southwest 16884.924 16884.9240
## 2 18 male 33.770 1 no southeast 1725.552 862.7762
## 3 28 male 33.000 3 no southeast 4449.462 1112.3655
## 4 33 male 22.705 0 no northwest 21984.471 21984.4706
## 5 32 male 28.880 0 no northwest 3866.855 3866.8552
## 6 31 female 25.740 0 no southeast 3756.622 3756.6216
I will not be deleting any units, because I have a full dataset.
anyNA(mydata)
## [1] FALSE
I will be deleting the column of REGION as I have no interest in this data.
mydata<-mydata[ , -6]
head(mydata)
## age sex bmi children smoker charges cost_per_person
## 1 19 female 27.900 0 yes 16884.924 16884.9240
## 2 18 male 33.770 1 no 1725.552 862.7762
## 3 28 male 33.000 3 no 4449.462 1112.3655
## 4 33 male 22.705 0 no 21984.471 21984.4706
## 5 32 male 28.880 0 no 3866.855 3866.8552
## 6 31 female 25.740 0 no 3756.622 3756.6216
I will be renaming “sex” into “gender”.
colnames(mydata)[2] <- "gender"
head(mydata)
## age gender bmi children smoker charges cost_per_person
## 1 19 female 27.900 0 yes 16884.924 16884.9240
## 2 18 male 33.770 1 no 1725.552 862.7762
## 3 28 male 33.000 3 no 4449.462 1112.3655
## 4 33 male 22.705 0 no 21984.471 21984.4706
## 5 32 male 28.880 0 no 3866.855 3866.8552
## 6 31 female 25.740 0 no 3756.622 3756.6216
mydata_order <- mydata[order(mydata$"age"), ]
head(mydata_order)
## age gender bmi children smoker charges cost_per_person
## 2 18 male 33.770 1 no 1725.552 862.7762
## 23 18 male 34.100 0 no 1137.011 1137.0110
## 32 18 female 26.315 0 no 2198.190 2198.1899
## 47 18 female 38.665 2 no 3393.356 1131.1188
## 51 18 female 35.625 0 no 2211.131 2211.1307
## 58 18 male 31.680 2 yes 34303.167 11434.3891
mydata$gender <- factor(mydata$gender,
levels = c("male", "female"),
labels = c("M", "F"))
head(mydata)
## age gender bmi children smoker charges cost_per_person
## 1 19 F 27.900 0 yes 16884.924 16884.9240
## 2 18 M 33.770 1 no 1725.552 862.7762
## 3 28 M 33.000 3 no 4449.462 1112.3655
## 4 33 M 22.705 0 no 21984.471 21984.4706
## 5 32 M 28.880 0 no 3866.855 3866.8552
## 6 31 F 25.740 0 no 3756.622 3756.6216
#install.packages("psych")
library(psych)
describe(mydata[ ,c("age", "bmi", "children")])
## vars n mean sd median trimmed mad min max range skew
## age 1 3038 39.09 14.12 39.0 38.86 19.27 18.00 64.00 46.00 0.07
## bmi 2 3038 30.70 6.09 30.4 30.55 6.26 15.96 53.13 37.17 0.27
## children 3 3038 1.10 1.22 1.0 0.94 1.48 0.00 5.00 5.00 0.96
## kurtosis se
## age -1.26 0.26
## bmi -0.09 0.11
## children 0.26 0.02
1. Age Mean age = 39.09 years, close to median (39.0) →
distribution is fairly symmetric. Standard deviation = 14.12, so ages
are spread around ±14 years from the mean. Minimum age = 18 (no
minors).
BMI (Body Mass Index) Mean BMI = 30.70, median = 30.4 → slightly right-skewed but fairly symmetric. Standard deviation = 6.09, moderate variability. Minimum BMI = 15.96 (underweight case).
Children Mean = 1.10, median = 1 → most people have 1 child. Standard deviation = 1.22 → some variation, but not huge. Minimum = 0 → some individuals have no children.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gender_count <- mydata %>% count(gender)
barplot(gender_count$n,
names.arg = gender_count$gender,
col = "darkorange1",
main = "Gender Distribution",
xlab = "Gender",
ylab = "Frequency")
The number of females and males are almost the same.
library(dplyr)
smoker_count <- mydata %>% count(smoker)
barplot(smoker_count$n,
names.arg = smoker_count$smoker,
col = "olivedrab",
main = "NUMBER OF SMOKERS VS. NON-SMOKERS",
xlab = "Smoking",
ylab = "Frequency")
There are a lot more non-smokers than smokers.
It would be logical the cost per person would be lower with
every additional child.
summary(mydata[c("cost_per_person", "children")])
## cost_per_person children
## Min. : 768 Min. :0.000
## 1st Qu.: 2198 1st Qu.:0.000
## Median : 4320 Median :1.000
## Mean : 8300 Mean :1.099
## 3rd Qu.:11455 3rd Qu.:2.000
## Max. :63770 Max. :5.000
The statistics show that the average insurance cost per person
is $8,300, with a median of $4,320. The mean being much higher than the
median indicates that the distribution is skewed by a small number of
very expensive cases, reaching a maximum of about $63,770. Most
individuals have relatively few dependents: the median is 1 child, the
mean is about 1.1 children, and the maximum observed is 5 children.
#install.packages("ggplot")
library(ggplot2)
ggplot(mydata, aes(x = children, y = cost_per_person)) +
geom_jitter(width = 0.2, alpha = 0.5, color = "darkslategray") +
geom_smooth(method = "lm", se = FALSE, color = "darkgoldenrod4") +
labs(x = "Number of Children", y = "Cost per Person",
title = "CORRELATION BETWEEN COST PER PERSON & NUMBER OF CHILDREN") +
theme_minimal()
The scatterplot suggests a weak negative association between the
number of children and insurance cost per person: on average,
individuals with more children tend to have slightly lower costs.
However, the relationship is not strong, as costs vary widely across all
groups, especially among those with no children.
model <- lm(cost_per_person ~ age + gender + bmi + children + smoker, data = mydata)
summary(model)
##
## Call:
## lm(formula = cost_per_person ~ age + gender + bmi + children +
## smoker, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10941 -2828 -1305 1225 38161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4381.195 595.384 -7.359 2.38e-13 ***
## age 183.292 7.316 25.054 < 2e-16 ***
## genderF 27.195 206.140 0.132 0.895
## bmi 179.749 16.965 10.595 < 2e-16 ***
## children -2878.691 84.454 -34.086 < 2e-16 ***
## smokeryes 15108.062 253.497 59.599 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5650 on 3032 degrees of freedom
## Multiple R-squared: 0.6408, Adjusted R-squared: 0.6402
## F-statistic: 1082 on 5 and 3032 DF, p-value: < 2.2e-16
Each additional year of age increases predicted cost by €183
(holding other factors constant). Being female increases cost by €27
compared to male, but not statistically significant (p ≈ 0.9). Each unit
increase in BMI raises predicted cost by €180. Each additional child is
associated with €2879 lower costs. Being a smoker increases predicted
cost by €15,108 compared to non-smokers. Smoking has the largest effect
in the model and is highly significant so I will explore its effect on
price further.
library(ggplot2)
mydata$smoker_fact <- factor(mydata$smoker, levels = c("no", "yes"))
ggplot(mydata, aes(x = smoker, y = cost_per_person, fill = smoker)) +
stat_summary(fun = "mean", geom = "bar", alpha = 0.5, width = 0.5, color = "black") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.5, color = "darkred") +
scale_fill_manual(values = c("no" = "chartreuse4", "yes" = "darkslateblue")) +
scale_x_discrete(labels = c("no" = "No", "yes" = "Yes")) +
labs(title = "Effect of Smoking on Insurance Cost per Person",
x = "Smoker", y = "Mean Cost per Person") +
theme_minimal()
The graph shows a large difference in insurance costs between
smokers and non-smokers. On average, smokers incur costs around $20,000,
which is about four times higher than non-smokers (around $5,000).
Smoking is therefore a very strong predictor of insurance cost in this
dataset.
#install.packages("readxl")
library(readxl)
Business_School <- read_excel("~/Downloads/R data/IMB Bootcamp/Business School.xlsx")
head(Business_School)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`)) +
geom_bar(colour = "white", fill = "chocolate") +
ylab("Frequency") +
theme_minimal() +
geom_text(stat = "count",
aes(label = ..count..),
vjust = -0.3)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The graph shows that the most common degree is the Undergrad degree
of Business.
library(psych)
describe(Business_School$`Annual Salary`)
## vars n mean sd median trimmed mad min max range skew
## X1 1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
## kurtosis se
## X1 9.41 4150.15
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
ggplot(Business_School, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 5000, boundary = 10, closed = "left",
color = "darkred", fill = "antiquewhite2") +
labs(title = "Annual Salary - Histogram",
x = "Annual Salary", y = "Count") +
scale_x_continuous(
labels = label_comma()) +
theme_minimal()
In this histogram graph we can see the average salary is around 100
000. We can also notice some outliers who have significantly higher
salaries, but I decide not to remove them since they are not there
because of a mistake in the dataset - they are just extraordinary.
shapiro.test(Business_School$`MBA Grade`)
##
## Shapiro-Wilk normality test
##
## data: Business_School$`MBA Grade`
## W = 0.98799, p-value = 0.5078
I did this test to check whether the MBA grades are normally
distributed. The null hypothesis is: MBA grades are normally distributed
and the alternative hypothesis is: MBA grades are not normally
distributed. The p-value is above the typical threshold of 0.05 so I
conclude to not reject the null hypothesis - the MBA grades are normally
distributed.
mba_grades <- Business_School$`MBA Grade`
t_result <- t.test(mba_grades,
mu = 74,
alternative = "two.sided")
print(t_result)
##
## One Sample t-test
##
## data: mba_grades
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Here I preformed the t-test, where I specified the numerical
value (MBA grade) and used the argument mu = 74, which corresponds to
the null hypothesis. The results shows the average grade of the students
is 76.0. The p-value is calculated as 0.0092 which is less than 0.05 so
I can reject the hypothesis.
library(readxl)
Apartments <- read_excel("~/Downloads/R Take Home Exam 2025/Task 3/Apartments.xlsx")
View(Apartments)
Description:
library(dplyr)
Apartments <- Apartments %>%
mutate(
ParkingF = factor(Parking, levels = c(0,1), labels = c("No","Yes")),
BalconyF = factor(Balcony, levels = c(0,1), labels = c("No","Yes")))
str(Apartments)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingF: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ BalconyF: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
tt <- t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
print(tt)
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
I am rejecting the null hypothesis at the 5% level. The mean
apartment price is significantly different from 1900 - higher by 119 on
average.
fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Each additional year is on average associated with 8,98 lower
apartment price - a weak negative linear association. The correlation is
not highly significant. About 5.3% of the variation in price is
explained by age alone; the other ~94.7% is due to other factors.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(Apartments[c("Price", "Age", "Distance")],
smooth = FALSE)
The scatterplot matrix shows no multicollinearity.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Holding distance fixed, each extra year of age is associated
with about 7.9 lower price on average (statistically significant).
Holding age fixed, each extra unit of distance lowers expected price by
about 20.7 (highly significant). About 44% of the variation in prices is
explained by age and distance. I conclude there is strong evidence that
at least one - age or sistance helps explain the price (overall model is
significant).
library(car)
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
VIFs ~ 1 and mean VIF ~ 1 tell me there is no evidence of
multicollinearity between age and distance. Coefficient estimates should
be stable and their standard errors are unaffected by correlation among
the regressors.
Apartments$StdResid <- round(rstandard(fit1),3)
hist(Apartments$StdResid,
col = "coral",
border = "brown",
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
Upon visual inspection, we can see the residuals are slightly
asymmetrically distributed to the right - most standardized residuals
sit within about ±2, with a few a bit above 2.
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
col = "lightpink",
border = "hotpink",
xlab = "Cooks Distance",
ylab = "Frequency",
main = "Histogram of Cooks Distances")
There is one clear outlier around 0.33, which is highly influential
and worth inspecting if there are more.
head(Apartments[order(-Apartments$CooksD),])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 0.108 0.32
## 2 43 37 1740 0 0 No No -0.168 0.104
## 3 2 11 2790 1 0 Yes No 1.72 0.069
## 4 7 2 1760 0 1 No Yes -0.995 0.066
## 5 37 3 2540 1 1 Yes Yes 1.91 0.061
## 6 40 2 2400 0 1 No Yes 1.61 0.038
It looks like number 38 is the far right outlier.
Apartments$ID <- seq_len(nrow(Apartments))
Apartments$CooksD <- cooks.distance(fit2)
head(Apartments[order(-Apartments$CooksD),
c("Age","Distance","Price","StdResid","CooksD")], 10)
## # A tibble: 10 × 5
## Age Distance Price StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 45 2180 0.108 0.320
## 2 43 37 1740 -0.168 0.104
## 3 2 11 2790 1.72 0.0691
## 4 7 2 1760 -0.995 0.0663
## 5 37 3 2540 1.91 0.0609
## 6 40 2 2400 1.61 0.0375
## 7 8 2 2820 1.94 0.0365
## 8 8 26 2300 0.51 0.0341
## 9 10 1 2810 1.95 0.0320
## 10 18 1 2800 2.11 0.0304
Here we have the possible outliers. I will presume the rule of D
> 4/n. For my dataset that means I will remove all the cases with D
> 0,04.
library(dplyr)
drop_ids <- c(38, 55, 33, 53, 22)
Apartments <- Apartments %>%
filter(!ID %in% drop_ids)
View(Apartments)
Previously mentioned cases with D > 0,04 filtered out.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$StdFitValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = Apartments$StdResid,
x = Apartments$StdFitValues,
ylab = "Standardised residuals",
xlab = "Standardised fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The scatter suggests heteroskedasticity increasing with fitted
values. I will have to use robust SEs and check that results are robust
after removing the few influential points I identified.
Apartments$StdResid <- round(rstandard(fit2),3)
hist(Apartments$StdResid,
col = "coral",
border = "brown",
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
Removing those IDs reduced the influence of a handful of extreme
points, yielding residuals more consistent with model assumptions
(normality and constant variance). Parameter estimates and standard
errors should be less distorted.
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.94154, p-value = 0.001166
Even after dropping those IDs, residuals aren’t normal enough
for a clean Shapiro pass. I would have to do the procedure of finding
and eliminating the ouliers again.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Each additional year is on average associated with 8,67 lower
apartment price - a weak negative linear association. The correlation is
significant. About 5.3% of the variation in price is explained by age
alone; the other ~94.7% is due to other factors.
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
This is R, the multiple correlation between the observed
response and the model’s fitted values. My result 0.732 means the
model’s predictions correlate about 0.73 with actual prices. Squaring it
gives back R² ≈ 0.536, i.e., the model explains ~53.6% of the
variance.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
print(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Coefficients:
## (Intercept) Age Distance Parking Balcony
## 2393.316 -7.970 -21.961 128.700 6.032
Age (−7.97): each 1-unit increase in Age is associated with ~8
units lower Price, other variables fixed.. Distance (−21.96): each
1-unit increase in Distance lowers Price by ~22 units, other variables
fixed. Parking (+128.7): apartments with parking are about 129 units
more expensive than those without, other variables fixed. Balcony
(+6.03): a balcony adds ~6 units on average, other variables fixed.
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
Since p = 0.12 > 0.05, I do not have evidence that adding
Parking and Balcony improves fit beyond Age and Distance.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## Parking 128.700 60.801 2.117 0.0376 *
## Balcony 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Apartments with parking are estimated to cost about 129 units
more than those without parking, other variables fixed (statistically
significant at 5%). Having a balcony is associated with ~6 units higher
price, other variables fixed, but this effect is not statistically
important.
Fitted_ID2 <- fitted(fit3)[Apartments$ID == 2]
Residual_ID2 <- resid(fit3)[Apartments$ID == 2]
round(c(Fitted = Fitted_ID2, Residual = Residual_ID2), 3)
## Fitted.2 Residual.2
## 2356.597 443.403
A positive residual means the apartment’s actual price is higher
than the model predicts. This apartment was sold for ≈ 2356.597 +
443.403 = 2800 (same units as Price), i.e., about 443 above the model’s
expectation given its Age, Distance, Parking, and Balcony.