data("USArrests")
Source of data: World Almanac and Book of Facts 1975 (crime rates); Statistical Abstracts of the United States 1975, p. 20 (urban rates), possibly available at https://books.google.ch/books?id=zl9qAAAAMAAJ&pg=PA20.
This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
The data set is a data frame with 50 observations on 4 variables.
To make it easier to read, I am going to create a new data frame with reordered columns.
mydata <- USArrests[ , c(1, 2, 4, 3)]
I would also like the state names (the row names of USArrests) to be shown next to the data, so I will create a new column, Country, containing them.
mydata$Country <- rownames(mydata)
rownames(mydata) <- NULL
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
I will create a new data frame with the additional variable (Country) and the columns in my preferred order.
mydata2 <- mydata %>%
select(Country, UrbanPop, Murder, Assault, Rape)
Now, we can create a summary of the data we are analysing.
summary(mydata2)
## Country UrbanPop Murder Assault Rape
## Length:50 Min. :32.00 Min. : 0.800 Min. : 45.0 Min. : 7.30
## Class :character 1st Qu.:54.50 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:15.07
## Mode :character Median :66.00 Median : 7.250 Median :159.0 Median :20.10
## Mean :65.54 Mean : 7.788 Mean :170.8 Mean :21.23
## 3rd Qu.:77.75 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:26.18
## Max. :91.00 Max. :17.400 Max. :337.0 Max. :46.00
Explanation of the variables used in the analysis:

- Country = name of the US state (the data set covers the 50 US states)
- UrbanPop = percent of the population living in urban areas
- Murder = number of murder arrests per 100,000 residents
- Assault = number of assault arrests per 100,000 residents
- Rape = number of rape arrests per 100,000 residents
Descriptive statistics for each numeric column of the data frame:
#install.packages("psych")
library(psych)
describe(mydata2[ , -1])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## UrbanPop 1 50 65.54 14.47 66.00 65.88 17.79 32.0 91.0 59.0 -0.21 -0.87 2.05
## Murder 2 50 7.79 4.36 7.25 7.53 5.41 0.8 17.4 16.6 0.37 -0.95 0.62
## Assault 3 50 170.76 83.34 159.00 168.48 110.45 45.0 337.0 292.0 0.22 -1.15 11.79
## Rape 4 50 21.23 9.37 20.10 20.36 8.60 7.3 46.0 38.7 0.75 0.08 1.32
Explanation of different values:
Murder:

- Mean: on average, there are 7.79 murder arrests per 100,000 residents across US states.
- Min: the lowest murder arrest rate among US states is 0.8 per 100,000 residents.

Assault:

- Median: 50% of states have 159 or fewer assault arrests per 100,000 residents.
- SD (standard deviation): on average, the assault arrest rates deviate from their mean by approximately 83.34 arrests per 100,000 residents.

Rape:

- Skewness: a skewness of 0.75 suggests a moderate positive skew: the distribution has a longer right tail, meaning a few states have considerably higher rape arrest rates than the majority. The distribution is not symmetric and may be influenced by relatively high values or outliers on the right side (a quick computational check follows below).
- Max: the highest rape arrest rate among US states is 46 per 100,000 residents.
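As a rough cross-check of the reported skewness, here is a minimal sketch using the simple moment-based definition (the value may differ slightly from the one reported by describe(), which supports several skewness formulas):

x <- mydata2$Rape
skew_manual <- mean((x - mean(x))^3) / sd(x)^3   # third standardized moment
skew_manual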
subset_row <- mydata2[mydata2$Rape == 46.0, ]
subset_row
## Country UrbanPop Murder Assault Rape
## 28 Nevada 81 12.2 252 46
We can see that the state with the highest number of rape arrests per 100,000 residents is Nevada (46).
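The same row can also be found without hard-coding the maximum value, for example:

mydata2[which.max(mydata2$Rape), ]   # row with the largest Rape value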
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
#Creating histograms for each variable
hist(mydata2$Murder,
main = "Histogram of Murder Arrest Rate",
xlab = "Murder Arrests per 100,000 residents",
ylab = "Frequency",
breaks = seq(from = 0, to = 18, by = 2))
hist(mydata2$Assault,
main = "Histogram of Assault Arrest Rate",
xlab = "Assault Arrests per 100,000 residents",
ylab = "Frequency",
breaks = seq(from = 0, to = 400, by = 50))
hist(mydata2$Rape,
main = "Histogram of Rape Arrest Rate",
xlab = "Rape Arrests per 100,000 residents",
ylab = "Frequency",
breaks = seq(from = 0, to = 50, by = 5))
Interpretation of the histogram of the murder arrest rate: the values do not look normally distributed (the distribution appears bimodal) and are slightly asymmetric to the right (positive skew).
Interpretation of the histogram of the assault arrest rate: the values do not look normally distributed (the distribution appears bimodal).
Interpretation of the histogram of the rape arrest rate: the values look roughly normally distributed, but slightly asymmetric to the right (positive skew).
Let's check the distribution of the Murder variable with the Shapiro-Wilk normality test:
shapiro_test_result <- shapiro.test(mydata2$Murder)
print(shapiro_test_result)
##
## Shapiro-Wilk normality test
##
## data: mydata2$Murder
## W = 0.95703, p-value = 0.06674
H0: the variable Murder is normally distributed
H1: the variable Murder is NOT normally distributed
Because the p-value (0.067) is greater than 0.05 (assuming a 0.05 significance level), we cannot reject the null hypothesis.
We also have a sample larger than 30 (n = 50), so by the central limit theorem the sampling distribution of the mean is approximately normal and moderate departures from normality are less of a concern; this does not, however, guarantee that the variable itself is normally distributed.
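For completeness, the same normality check could also be run on the other two rate variables, for example:

lapply(mydata2[, c("Assault", "Rape")], shapiro.test)   # Shapiro-Wilk test for each variable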
Let's create a simple regression model for each crime variable, with UrbanPop as the explanatory variable:
lm_model_Murder <- lm(Murder ~ UrbanPop, data = mydata2)
lm_model_Assault <- lm(Assault ~ UrbanPop, data = mydata2)
lm_model_Rape <- lm(Rape ~ UrbanPop, data = mydata2)
It would be interesting to compare the number of murder, assault, and rape arrests per 100,000 residents in US states with the percentage of urban population. Therefore, we will create scatterplots and add a regression line.
ggplot(mydata2, aes(x = Murder, y = UrbanPop)) +
geom_point(color = "cyan3") +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Number of Murder Arrests based on the percentage of Urban Population",
x = "Number of Murder Arrests per 100,000 residents",
y = "Urban Population in %")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(mydata2, aes(x = Assault, y = UrbanPop)) +
geom_point(color = "deeppink1") +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Number of Assault Arrests based on the percentage of Urban Population",
x = "Number of Assault Arrests per 100,000 residents",
y = "Urban Population in %")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(mydata2, aes(x = Rape, y = UrbanPop)) +
geom_point(color = "orchid4") +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Number of Rape Arrests based on the percentage of Urban Population",
x = "Number of Rape Arrests per 100,000 residents",
y = "Urban Population in %")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation of the scatterplot and regression line for murder arrests versus the percentage of urban population in US states: there is a slightly positive linear correlation between the variables, meaning that with a higher percentage of urban population, the number of murder arrests per 100,000 residents increases only slightly. There are potential outliers.
Interpretation of the scatterplot and regression line for assault arrests versus the percentage of urban population in US states: there is a positive linear correlation between the variables, meaning that with a higher percentage of urban population, the number of assault arrests per 100,000 residents increases.
Interpretation of the scatterplot and regression line for rape arrests versus the percentage of urban population in US states: there is a potential violation of homoskedasticity (the spread of the points is not constant); we check this with the Breusch-Pagan test below. There are potential outliers.
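A minimal sketch of the Breusch-Pagan check mentioned above, applied to the simple model for Rape (this assumes the lmtest package is available):

#install.packages("lmtest")
library(lmtest)
bptest(lm_model_Rape)   # H0: the error variance is constant (homoskedasticity)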
To get better results, we should deal with outliers and with units that have a large impact on the estimated regression function. First, we calculate the standardized residuals and add them to the data frame mydata2:
std_residuals_Assault <- rstandard(lm_model_Assault)
mydata2$std_res_Assault <- std_residuals_Assault
std_residuals_Murder <- rstandard(lm_model_Murder)
mydata2$std_res_Murder <- std_residuals_Murder
std_residuals_Rape <- rstandard(lm_model_Rape)
mydata2$std_res_Rape <- std_residuals_Rape
We would now like to winsorize the outliers:
mean_Murder <- mean(mydata2$Murder)
upper_bound <- mean_Murder + 3 * std_residuals_Murder
lower_bound <- mean_Murder -3 * std_residuals_Murder
mydata2$std_res_Murder <- ifelse(mydata2$std_res_Murder > +3, +3, mydata2$std_res_Murder)
mydata2$std_res_Murder <- ifelse(mydata2$std_res_Murder < -3, -3, mydata2$std_res_Murder)
mean_Assault <- mean(mydata2$Assault)
upper_bound <- mean_Assault + 3 * std_residuals_Assault
lower_bound <- mean_Assault -3 * std_residuals_Assault
mydata2$std_res_Assault <- ifelse(mydata2$std_res_Assault > +3, +3, mydata2$std_res_Assault)
mydata2$std_res_Assault <- ifelse(mydata2$std_res_Assault < -3, -3, mydata2$std_res_Assault)
mean_Rape <- mean(mydata2$Rape)
upper_bound <- mean_Rape + 3 * std_residuals_Rape
lower_bound <- mean_Rape -3 * std_residuals_Rape
mydata2$std_res_Rape <- ifelse(mydata2$std_res_Rape > +3, +3, mydata2$std_res_Rape)
mydata2$std_res_Rape <- ifelse(mydata2$std_res_Rape < -3, -3, mydata2$std_res_Rape)
We will round the values of standardized residuals:
mydata2$std_res_Assault <- round(mydata2$std_res_Assault, digits = 2)
mydata2$std_res_Murder <- round(mydata2$std_res_Murder, digits = 2)
mydata2$std_res_Rape <- round(mydata2$std_res_Rape, digits = 2)
Let’s also calculate Cook’s distances for each variable:
cooks_dis_Murder <- cooks.distance(lm_model_Murder)
cooks_dis_Assault <- cooks.distance(lm_model_Assault)
cooks_dis_Rape <- cooks.distance(lm_model_Rape)
If none of the Cook's distances exceed the usual rules of thumb (values near or above 1, or above 4/n, are often treated as influential), the observations are not overly influential in the regression analysis and there is no need to be concerned about their impact on the models.
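To make that check explicit, a minimal sketch that flags states whose Cook's distance exceeds the common 4/n rule of thumb (the cutoff is a convention, not a strict rule):

n <- nrow(mydata2)
mydata2$Country[cooks_dis_Murder > 4 / n]    # states flagged in the Murder model
mydata2$Country[cooks_dis_Assault > 4 / n]   # states flagged in the Assault model
mydata2$Country[cooks_dis_Rape > 4 / n]      # states flagged in the Rape model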
lm_model_all <- lm(UrbanPop ~ Murder + Assault + Rape, data = mydata2)
summary(lm_model_all)
##
## Call:
## lm(formula = UrbanPop ~ Murder + Assault + Rape, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.456 -6.950 0.077 7.770 25.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.84187 4.82483 10.952 2.09e-14 ***
## Murder -1.41154 0.71954 -1.962 0.0559 .
## Assault 0.05190 0.04161 1.247 0.2186
## Rape 0.69841 0.26776 2.608 0.0122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.08 on 46 degrees of freedom
## Multiple R-squared: 0.2337, Adjusted R-squared: 0.1837
## F-statistic: 4.676 on 3 and 46 DF, p-value: 0.006208
Interpretation of some values:
Estimated coefficient for Murder (-1.412): on average, an increase of one murder arrest per 100,000 residents is associated with a decrease of 1.412 percentage points in urban population, holding the other variables constant (this effect is not significant at the 0.05 level, p = 0.056).
Estimated coefficient for Assault (0.052): on average, an increase of one assault arrest per 100,000 residents is associated with an increase of 0.052 percentage points in urban population, holding the other variables constant (not statistically significant, p = 0.219).
Standard error for Murder (0.720): indicates the amount of uncertainty in the estimate of the Murder coefficient.
H0: the regression coefficient of Murder is equal to 0 (murder arrests are not linearly related to the percentage of urban population)
H1: the regression coefficient of Murder is NOT equal to 0
alpha = 0.05, p = 0.056
Conclusion: because p is greater than alpha, we cannot reject the null hypothesis at p = 0.056.
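The same conclusion can be read off the confidence intervals of the coefficients; an interval that contains zero corresponds to a coefficient that is not significant at the 5% level:

confint(lm_model_all, level = 0.95)   # 95% confidence intervals for all coefficients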
### Task 2
A researcher examined the effects of online schooling on body weight in ninth graders (Body mass.csv in Task 2 folder). Based on data from the 2018/2019 school year, the average weight was 59.5 kg. At the beginning of the 2021/2022 school year, researchers randomly selected 50 ninth graders and measured their weight.
mydata_T2 <- read.table("~/R data/TakeHomeExam/Body mass.csv", header = TRUE, sep = ";", dec = ",")
head(mydata_T2)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
library(psych)
describe(mydata_T2$Mass)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 62.88 6.01 62.8 62.56 3.34 49.7 83.2 33.5 0.85 2.11 0.85
Interpretation of results:

- Mean: the average weight of the 50 randomly selected ninth graders in the 2021/2022 school year was 62.88 kg.
- Standard deviation: on average, the weights of the ninth graders deviate from the mean (62.88 kg) by approximately 6.01 kg.
- Median: 50% of the ninth graders weighed 62.8 kg or less, and 50% weighed more than 62.8 kg.
- Min: the lightest ninth grader in 2021/2022 weighed 49.7 kg.
- Max: the heaviest ninth grader in 2021/2022 weighed 83.2 kg.
- Range: the difference between the heaviest and the lightest ninth grader is 33.5 kg.
- Standard error: a standard error of 0.85 indicates how much the sample mean is expected to vary from sample to sample (it is used below to build a confidence interval for the mean).
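Using the sample mean and standard error, a 95% confidence interval for the mean body mass can be computed as a quick sketch:

m  <- mean(mydata_T2$Mass)
se <- sd(mydata_T2$Mass) / sqrt(nrow(mydata_T2))          # standard error of the mean
m + c(-1, 1) * qt(0.975, df = nrow(mydata_T2) - 1) * se   # 95% CI for the mean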
library(ggplot2)
hist(mydata_T2$Mass,
main = "Histogram of Online Schooling
effect on Body Mass of Ninth graders in 2021/2022",
xlab = "Body Mass",
ylab = "Frequency",
breaks = seq(from = 40, to = 90, by = 4))
The values look roughly symmetric and approximately normal, although the positive skew reported above (0.85) suggests a slight right tail.
hypothesized_mean <- 59.5
result <- t.test(mydata_T2, mu = hypothesized_mean)
print(result)
##
## One Sample t-test
##
## data: mydata_T2
## t = -7.0195, df = 99, p-value = 2.821e-10
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 39.85971 48.51629
## sample estimates:
## mean of x
## 44.188
H0: mu = 59.5
H1: mu != 59.5
alpha = 0.05

Note that the test above was run on the whole data frame (including the ID column), which is why df = 99 and the reported sample mean (44.19) do not match the mean body mass of 62.88 kg; the correct call is t.test(mydata_T2$Mass, mu = 59.5). The conclusion is unchanged: with n = 50, the t statistic is roughly (62.88 - 59.5) / 0.85, i.e. about 4, so p < 0.001 and we reject the null hypothesis at the 0.05 significance level. The mean body mass in 2021/2022 differs from 59.5 kg.
mean_21_22 <- mean(mydata_T2$Mass)
mean_18_19 <- 59.5
sd_21_22 <- sd(mydata_T2$Mass)
cohen_d <- (mean_21_22 - mean_18_19) / sd_21_22
cat("Cohen's d:", cohen_d, "\n")
## Cohen's d: 0.5615994
Explanation: a Cohen's d of about 0.56 is conventionally considered a medium effect size, indicating that the difference between the average body mass of ninth graders in 2021/2022 (62.88 kg) and the 2018/2019 reference value (59.5 kg) is substantial rather than a result of random variability alone. This is consistent with online schooling being associated with higher body mass, although the study design does not allow a strong causal claim.
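A minimal sketch that wraps the same one-sample Cohen's d formula in a reusable helper (the function name is just an illustration):

cohen_d_one_sample <- function(x, mu) {
  (mean(x) - mu) / sd(x)   # difference from the reference mean, in SD units
}
cohen_d_one_sample(mydata_T2$Mass, 59.5)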
### Task 3
You analyze the price per m2 for a sample of apartments in the Ljubljana region (Apartments.xlsx in Task 3 folder). Follow the questions in R Markdown included in the folder.
library(readxl)
Apartments <- read_excel("~/R data/TakeHomeExam/Apartments.xlsx")
View(Apartments)
Description:
Apartments$Parking <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
hypothesized_mean <- 1900
result <- t.test(Apartments$Price, mu = hypothesized_mean)
print(result)
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0: mu = 1900 EUR
H1: mu != 1900 EUR
alpha = 0.05, p = 0.005
At the 0.05 significance level we can reject the null hypothesis (p = 0.005): the average price of apartments is not equal to 1900 EUR per m2.
mean(Apartments$Price)
## [1] 2018.941
The average price of apartments is approximately 2018.94 EUR per m2.
fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Estimate of the regression coefficient: on average, the price of an apartment decreases by approximately 8.975 EUR per m2 for each additional year of age.
Multiple R-squared (coefficient of determination): the value of approximately 0.053 indicates a weak linear relationship between Age and Price; Age explains only about 5.3% of the variability in apartment prices. A higher R-squared would indicate a stronger relationship.
Adjusted R-squared: the adjusted R-squared accounts for the number of predictors in the model, giving a slightly lower value of 0.042. The low R-squared, together with the significant Age coefficient, suggests that although there is a statistically significant relationship between age and price, age alone is not a strong predictor of apartment prices and other factors are likely to matter.
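For a simple regression with a single predictor, R-squared is just the squared Pearson correlation between the two variables, which provides a quick cross-check:

cor(Apartments$Price, Apartments$Age)^2   # should match the Multiple R-squared above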
pairs(Apartments[c("Price", "Age", "Distance")])
There is no obvious problem with multicollinearity between the predictors: in the Age-Distance panel the points are spread out and do not form a line.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
vif_values <- vif(fit2)
print(vif_values)
## Age Distance
## 1.001845 1.001845
Both VIF values are below 5, which indicates that there is essentially no multicollinearity between the predictors in the regression model.
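Since there are only two predictors, the VIF can also be reproduced by hand: regress one predictor on the other and plug the resulting R-squared into 1 / (1 - R^2). A minimal sketch:

r2_age <- summary(lm(Age ~ Distance, data = Apartments))$r.squared   # R^2 of Age regressed on Distance
1 / (1 - r2_age)                                                     # should match vif(fit2) for Age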
Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(Apartments[order(Apartments$StdResid),], 6) # Six units with the lowest standardized residuals
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 No Yes -2.15 0.066
## 2 12 14 1650 No Yes -1.50 0.013
## 3 12 14 1650 No No -1.50 0.013
## 4 13 8 1800 No No -1.38 0.012
## 5 14 16 1660 No Yes -1.26 0.008
## 6 24 5 1830 Yes No -1.19 0.012
head(Apartments[order(-Apartments$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
## 6 40 2 2400 No Yes 1.09 0.038
library(dplyr)
Apartments <- Apartments %>%
  filter(!Age %in% c(7, 5))   # drop every apartment whose Age is 5 or 7 years (this removes all such rows, not only the two most influential units)
head(Apartments[order(Apartments$StdResid),], 6) # Six units with the lowest standardized residuals
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 12 14 1650 No Yes -1.50 0.013
## 2 12 14 1650 No No -1.50 0.013
## 3 13 8 1800 No No -1.38 0.012
## 4 14 16 1660 No Yes -1.26 0.008
## 5 24 5 1830 Yes No -1.19 0.012
## 6 30 17 1560 No No -1.10 0.012
head(Apartments[order(-Apartments$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 43 37 1740 No No 1.44 0.104
## 2 2 11 2790 Yes No 2.05 0.069
## 3 37 3 2540 Yes Yes 1.58 0.061
## 4 40 2 2400 No Yes 1.09 0.038
## 5 8 2 2820 Yes No 1.66 0.037
## 6 8 26 2300 Yes Yes 1.57 0.034
std_residuals <- residuals(fit2) / sqrt(var(residuals(fit2)))   # residuals scaled by their standard deviation
std_fitted_values <- fitted(fit2) / sqrt(var(fitted(fit2)))     # fitted values scaled by their standard deviation
plot(std_fitted_values, std_residuals,
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
main = "Scatterplot of Standardized Residuals vs.
Standardized Fitted Values")
abline(h = 0, col = "red", lty = 2)
The spread of the residuals appears roughly constant across the fitted values, so there is no clear violation of homoskedasticity; a formal check follows below.
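The visual impression can be backed up with a formal test; a minimal sketch using ncvTest() from the already loaded car package (a Breusch-Pagan-type test of constant error variance):

ncvTest(fit2)   # H0: the error variance is constant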
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.93482, p-value = 0.0005609
H0: the standardized residuals are normally distributed
H1: the standardized residuals are not normally distributed
At p < 0.001 (0.05 significance level) we can reject the null hypothesis: we cannot claim that the residuals are normally distributed.
The histogram also suggests that the distribution is bimodal and asymmetric to the right.
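A normal Q-Q plot of the standardized residuals provides an additional visual check of the same assumption:

qqnorm(Apartments$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(Apartments$StdResid, col = "red")   # reference line for a normal distribution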
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -427.09 -227.61 -61.36 212.35 565.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2492.357 76.921 32.401 < 2e-16 ***
## Age -7.836 3.319 -2.361 0.0208 *
## Distance -22.946 2.897 -7.919 1.57e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.3 on 76 degrees of freedom
## Multiple R-squared: 0.4978, Adjusted R-squared: 0.4846
## F-statistic: 37.67 on 2 and 76 DF, p-value: 4.294e-12
Explanation of the coefficients: if the age of an apartment increases by one year, the price decreases on average by 7.836 EUR per m2 (p = 0.021), assuming all other explanatory variables in the model are held constant. If the distance between the apartment and the city centre increases by one kilometre, the price decreases on average by 22.946 EUR per m2 (p < 0.001), assuming all other explanatory variables in the model are held constant.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -416.05 -211.78 -23.77 268.37 523.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2376.705 98.508 24.127 < 2e-16 ***
## Age -6.935 3.290 -2.108 0.0384 *
## Distance -20.836 3.015 -6.912 1.44e-09 ***
## ParkingYes 139.959 65.316 2.143 0.0354 *
## BalconyYes -7.205 61.651 -0.117 0.9073
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.6 on 74 degrees of freedom
## Multiple R-squared: 0.5276, Adjusted R-squared: 0.502
## F-statistic: 20.66 on 4 and 74 DF, p-value: 1.832e-11
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 76 5717496
## 2 74 5378761 2 338735 2.3301 0.1044
H0: the additional variables (Parking and Balcony) do not improve the model (Model 1 is sufficient)
H1: at least one of the additional variables improves the model (Model 2)
alpha = 0.05, p = 0.1044
We cannot reject the null hypothesis at p = 0.1044 (because p > 0.05), so the extra variables do not significantly improve the fit.
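As a complementary check, the two models can also be compared with information criteria, where lower values indicate a better trade-off between fit and complexity:

AIC(fit2, fit3)   # Akaike information criterion for both models
BIC(fit2, fit3)   # Bayesian information criterion for both models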
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -416.05 -211.78 -23.77 268.37 523.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2376.705 98.508 24.127 < 2e-16 ***
## Age -6.935 3.290 -2.108 0.0384 *
## Distance -20.836 3.015 -6.912 1.44e-09 ***
## ParkingYes 139.959 65.316 2.143 0.0354 *
## BalconyYes -7.205 61.651 -0.117 0.9073
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269.6 on 74 degrees of freedom
## Multiple R-squared: 0.5276, Adjusted R-squared: 0.502
## F-statistic: 20.66 on 4 and 74 DF, p-value: 1.832e-11
If an apartment has parking, the estimated average price per m2 is 139.959 EUR higher than for a comparable apartment without parking, holding the other variables constant. If an apartment has a balcony, the estimated average price per m2 is 7.205 EUR lower than for a comparable apartment without a balcony, holding the other variables constant; this balcony effect is not statistically significant (p = 0.907).
H0: all regression coefficients in the model are equal to 0
H1: at least one regression coefficient in the model is not equal to 0
At p < 0.001 we can reject the null hypothesis and conclude that at least one of the factors in the model has a statistically significant effect on the price. The F-statistic helps determine whether the regression model as a whole is useful for predicting the dependent variable. In our case, it indicates that the model is useful.
fitted_values <- predict(fit3)
# The data has no ID column; the first column is Age, so this selects the
# apartment whose Age equals 2 (referred to as apartment ID2 below)
apartment_id2_index <- which(Apartments[ , 1] == 2)
fitted_value_id2 <- fitted_values[apartment_id2_index]
residual_id2 <- Apartments$Price[apartment_id2_index] - fitted_value_id2
cat("Fitted Value for Apartment ID:", fitted_value_id2, "\n")
## Fitted Value for Apartment ID: 2273.602
cat("Residual for Apartment ID2:", residual_id2, "\n")
## Residual for Apartment ID2: 516.3984
The actual price of this apartment is 516.40 EUR per m2 higher than what the regression model predicts for it based on its features (age, distance, parking, balcony). This positive residual suggests that the apartment is more expensive than the model expects, given its characteristics and the relationships captured by the model.
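As a sanity check on the prediction, the fitted value for this apartment (Age = 2, Distance = 11, Parking = Yes, Balcony = No) can be reconstructed directly from the estimated coefficients:

b <- coef(fit3)
b["(Intercept)"] + b["Age"] * 2 + b["Distance"] * 11 +
  b["ParkingYes"] * 1 + b["BalconyYes"] * 0   # should reproduce the fitted value of about 2273.6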