This dataset comprises demographic and academic information of individuals, including gender, age, field of study, year of study, and CGPA. Additionally, it includes marital status and responses to questions related to mental health, such as depression, anxiety, panic attacks, and whether individuals sought specialist treatment. I am going to analyze the interplay between academic performance, demographics, and mental health factors among students or individuals in various fields of study.
healthdata <- read.csv("healthdata.csv")
healthdata2 <- healthdata [1:50, ] #I took the first 50 rows.
healthdata2 <- healthdata2[ , -c(1, 7, 9, 10, 11)] #I lowered the number of columns as some were not important, others were similar (eg. anxiety and panic attack).
colnames(healthdata2) <- c("gender", "age", "course", "study year", "CGPA", "depression") #Renamed the columns.
healthdata2$course[healthdata2$course %in% c("Pendidikan islam", "Irkhs", "KIRKHS")] <- "Islamic education" #I changed this specific data.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
healthdata2 <- healthdata2 %>%
mutate(course = case_when(
course %in% c("Engineering", "Engine", "KOE") ~ "Engineering",
course %in% c("Mathemathics", "ENM") ~ "Maths",
course %in% c("BCS", "BIT") ~ "Computer science",
course %in% c("Business", "Human Resources", "Accounting", "Banking Studies", "Business Administration", "KENMS") ~ "Business/Economics",
course %in% c("Laws", "Law") ~ "Law",
course %in% c("Psychology", "Marine science", "Islamic education") ~ "Social Sciences",
course %in% c("Usuluddin", "Marine", "TAASL") ~ "Humanities",
TRUE ~ "Other")) #I have regrouped the courses so I can get analyse more accurately.
#install.packages("tidyr")
library(tidyr)
healthdata2 <- drop_na(healthdata2, age) #I have removed the missing unit.
library(dplyr)
healthdata2 <- healthdata2 %>%
mutate(CGPA = case_when(
CGPA == '0 - 1.99' ~ 1,
CGPA == '2.00 - 2.49' ~ 2,
CGPA == '2.50 - 2.99' ~ 3,
CGPA == '3.00 - 3.49' ~ 4,
CGPA == '3.50 - 4.00' ~ 5)) #I regrouped the CGPAs.
healthdata2[7, 5] <- 5 #Changed the value manually according to the data set.
library(stringr)
healthdata2$"study year" <- str_to_title(healthdata2$"study year") #I unified the writings.
healthdata2$`study year` <- as.numeric(str_extract(healthdata2$`study year`, "\\d+"))
print(healthdata2)
## gender age course study year CGPA depression
## 1 Female 18 Engineering 1 4 Yes
## 2 Male 21 Social Sciences 2 4 No
## 3 Male 19 Computer science 1 4 Yes
## 4 Female 22 Law 3 4 Yes
## 5 Male 23 Maths 4 4 No
## 6 Male 19 Engineering 2 5 No
## 7 Female 23 Social Sciences 2 5 Yes
## 8 Female 18 Computer science 1 5 No
## 9 Female 19 Business/Economics 2 3 No
## 10 Male 18 Social Sciences 1 5 No
## 11 Female 20 Social Sciences 1 5 No
## 12 Female 24 Engineering 3 5 Yes
## 13 Female 18 Computer science 1 4 Yes
## 14 Male 19 Engineering 1 4 No
## 15 Female 18 Business/Economics 2 5 No
## 16 Male 24 Computer science 3 5 No
## 17 Female 24 Other 3 4 No
## 18 Female 24 Maths 4 4 Yes
## 19 Female 20 Computer science 2 5 No
## 20 Female 18 Social Sciences 2 5 Yes
## 21 Female 19 Engineering 1 4 No
## 22 Female 18 Engineering 2 4 No
## 23 Female 24 Computer science 1 5 No
## 24 Female 24 Engineering 1 4 No
## 25 Female 23 Computer science 3 5 Yes
## 26 Female 18 Business/Economics 1 5 No
## 27 Female 19 Engineering 1 5 No
## 28 Male 18 Engineering 2 4 Yes
## 29 Female 24 Computer science 3 5 Yes
## 30 Female 24 Computer science 4 5 No
## 31 Female 23 Business/Economics 2 4 No
## 32 Male 18 Computer science 2 4 No
## 33 Male 19 Computer science 1 5 No
## 34 Male 18 Computer science 2 5 Yes
## 35 Female 19 Computer science 1 4 Yes
## 36 Female 18 Engineering 1 2 No
## 37 Female 18 Law 3 4 Yes
## 38 Female 19 Computer science 1 3 Yes
## 39 Female 18 Social Sciences 1 5 No
## 40 Female 24 Engineering 2 3 Yes
## 41 Female 24 Computer science 3 4 No
## 42 Female 22 Engineering 4 5 No
## 43 Female 20 Other 2 4 Yes
## 44 Male 23 Humanities 2 5 No
## 45 Male 18 Computer science 1 5 No
## 46 Female 19 Engineering 1 5 No
## 47 Female 18 Engineering 4 5 No
## 48 Male 24 Computer science 2 4 Yes
## 49 Female 24 Computer science 3 5 No
This is now the cleaned data. Now I start with the analysis.
First I will calculate the average age participants and then the median:
mean(healthdata2$age)
## [1] 20.53061
round(mean(healthdata2$age), 0)
## [1] 21
The average age rounded to whole years is 21.
median(healthdata2$age)
## [1] 19
The most common age of participants is 19.
About CGPA:
round(summary(healthdata2$CGPA), 2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 4.00 5.00 4.41 5.00 5.00
This shows us that less than 50 or 50% of participants have a CPGA of 5.00. The result of the 1st Quartile is 4.00, which means 25% or less of the participants have a CGPA of 4.00.
#install.packages("pastecs")
library(pastecs)
## Warning: package 'pastecs' was built under R version 4.2.3
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:dplyr':
##
## first, last
numericaldata <- healthdata2[, c("age", "study year", "CGPA")]
round(stat.desc(numericaldata), 1)
## age study year CGPA
## nbr.val 49.0 49.0 49.0
## nbr.null 0.0 0.0 0.0
## nbr.na 0.0 0.0 0.0
## min 18.0 1.0 2.0
## max 24.0 4.0 5.0
## range 6.0 3.0 3.0
## sum 1006.0 98.0 216.0
## median 19.0 2.0 5.0
## mean 20.5 2.0 4.4
## SE.mean 0.4 0.1 0.1
## CI.mean.0.95 0.7 0.3 0.2
## var 6.4 1.0 0.5
## std.dev 2.5 1.0 0.7
## coef.var 0.1 0.5 0.2
The number of valid observations for each variable was 49. There were no missing values for any of the variable. No NA`s neither. The range between max and min values are seen in the row named range. SE.mean represents the error of the mean. For example if we would repeat sample 49 observations from column Age and calculate the mean, those would be approximately +/- 0.4 units of the mean I have calculated now. The Cl.mean.0.095 shows us that for example CGPA, which has variable +/- 0.2, is with 95% confidence likely to fall within the range 4.2 to 4.6. Std.dev or standard deviation shows the average distance between our values (CGPA is 0.7, which means values in data set typically differ from the mean by 0.7 units).
library(psych)
## Warning: package 'psych' was built under R version 4.2.3
describe (healthdata2)
## vars n mean sd median trimmed mad min max range skew kurtosis
## gender* 1 49 1.29 0.46 1 1.24 0.00 1 2 1 0.92 -1.18
## age 2 49 20.53 2.53 19 20.44 1.48 18 24 6 0.36 -1.67
## course* 3 49 3.47 2.20 3 3.27 1.48 1 8 7 1.09 -0.24
## study year 4 49 2.00 1.00 2 1.90 1.48 1 4 3 0.61 -0.80
## CGPA 5 49 4.41 0.70 5 4.51 0.00 2 5 3 -1.08 1.06
## depression* 6 49 1.37 0.49 1 1.34 0.00 1 2 1 0.53 -1.75
## se
## gender* 0.07
## age 0.36
## course* 0.31
## study year 0.14
## CGPA 0.10
## depression* 0.07
The “gender” variable has a moderate positive skew, indicating that it is right-skewed, with a longer tail on the right side of the distribution. The kurtosis value of -1.18 suggests that the distribution has slightly heavier tails compared to a normal distribution. The “CGPA” variable demonstrates a moderate negative skew, indicating that it is left-skewed with a longer tail on the left side. The kurtosis value of 1.06 suggests that the distribution has slightly heavier tails compared to a normal distribution.
Now I will graph the distribution of the variables:
hist(healthdata2$age,
main = "Age spread",
ylab = "frequency",
xlab = "age",
breaks = seq(from = 18, to = 24, by = 1), col = "blue")
We see that this histogram has two peaks - is bimodal. That means there are two groups that have the most frequency of occurrence - group 18 and 24.
gendercount <- table(healthdata2$gender)
pie(gendercount, labels = paste(names(gendercount), ": ", gendercount), main = "Gender distribution", col = c("pink", "blue"))
We can see that the majority of students in this data set were women.
healthdata2$depression <- ifelse(healthdata2$depression == "Yes", 1, 0)
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
scatter_plot <- ggplot(healthdata2, aes(x = CGPA, y = depression)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "pink") +
labs(x = "CGPA", y = "Depression") +
theme_minimal()
print(scatter_plot)
## `geom_smooth()` using formula 'y ~ x'
Graph shows that bigger the GPA less depressed the students are.
healthdata2$depression <- ifelse(healthdata2$depression == 1, "Yes", "No")
library(ggplot2)
library(dplyr)
summary_data <- healthdata2 %>%
group_by(gender, depression) %>%
summarise(count = n(), .groups = "drop")
ggplot(summary_data, aes(x = gender, y = count, fill = depression)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Gender and Depression",
x = "Gender",
y = "Count") +
scale_fill_manual(values = c("No" = "blue", "Yes" = "red")) +
theme_minimal()
ggplot(healthdata2, aes(x = gender, fill = depression)) +
geom_bar(position = "fill") +
labs(title = "Ratio of Depression by Gender", y = "Proportion", fill = "Depression") +
scale_fill_manual(values = c("Yes" = "red", "No" = "blue")) +
theme_minimal()
From this ratio we can see that females are more inclined to be depressed than male population.
library(ggplot2)
course_depression <- as.data.frame(table(healthdata2$course, healthdata2$depression))
colnames(course_depression) <- c("Course", "Depression", "Count")
course_depression <- course_depression %>%
group_by(Course) %>%
mutate(Proportion = Count / sum(Count))
ggplot(course_depression, aes(x = Course, y = Proportion, fill = Depression)) +
geom_bar(stat = "identity") +
labs(
title = "Proportion of Students with Depression by Course",
x = "Course",
y = "Proportion") +
scale_fill_manual(values = c("Yes" = "red", "No" = "blue")) +
theme_minimal() +
coord_flip()
We can see that among the most depressed are the law students, followed by maths and other.
ggplot(healthdata2, aes(x = depression, y = CGPA, fill = depression)) +
geom_boxplot() +
labs(
title = "Depression vs. CGPA",
x = "Depression",
y = "CGPA") +
scale_fill_manual(values = c("No" = "lightblue", "Yes" = "pink")) +
theme_minimal()
Both depressed and not depressed students have CGPA values from 3-5. But the difference we see is in the central tendency. The line, which shows the median, falls at 5 for those not depressed, and for those depressed at 4. So as we can see, in average the lesser the CGPA, more depressed students are. This also shows to the skewness of the data. Because the median with depressed students is at 1st quartile, the data is left-skewed. Also the whiskers are not that long, which shows we have a more concentrated data distribution. We have one outlier - 2 CGPA.
ggplot(healthdata2, aes(x = depression, y = CGPA, fill = depression)) +
geom_boxplot(outlier.shape = NA) +
labs(
title = "Depression vs. CGPA",
x = "Depression",
y = "CGPA") +
scale_fill_manual(values = c("No" = "lightblue", "Yes" = "pink")) +
theme_minimal()
Here I took out the outlier that extends beyond the whiskers, because it does not carry any valuable information for my data. Identifying the outliers is important for understanding, that there are some extreme values in my dataset. The fact that there was only one outlier and short whisks suggest that this is high quality data.
ggplot(healthdata2, aes(x=age, y= CGPA)) +
geom_point(size = 2, color = "blue") +
labs(
title = "Scatterplot of Age vs. CGPA",
x = "Age",
y = "CGPA")
There does not seem to be a correlation between age and CGPA because the data is not clustered along an obvious line.
massdata <- read.csv("massdata.csv", sep = ";")
massdata$Mass <- as.numeric(gsub(",",".",massdata$Mass))
str(massdata)
## 'data.frame': 50 obs. of 2 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Mass: num 62.1 64.5 56.5 53.4 61.3 62.2 62.7 64.5 59.5 68.9 ...
summary(massdata$Mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
library(ggplot2)
histogram <- ggplot(massdata, aes(x=Mass)) + geom_histogram(binwidth = 2, fill = "red", color = "black") + labs( title = "Body Mass Distribution", x = "Body Mass", y = "Frequency")
print(histogram)
library(ggplot2)
hist(massdata$Mass,
main="Body Mass Distribution",
ylab = "Frequency",
xlab = "Body Mass",)
Mass <- c(62.1, 64.5, 56.5, 53.4, 61.3, 62.2, 62.7, 64.5, 59.5, 68.9,
64.5, 62.8, 63.7, 52.3, 64.1, 64.2, 64.2, 60.2, 59.6, 60.9,
54.2, 72.1, 60.3, 63.1, 59.8, 49.7, 58.6, 58.7, 62, 65.7,
64.2, 62.7, 78.5, 61.6, 60.4, 63.3, 69.3, 63, 60.8, 64.9,
62.8, 57.3, 69.3, 62.8, 62.9, 53.6, 65.2, 75.3, 83.2, 66.4)
t.test(massdata$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: massdata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
In the output of the t-test we got the p-value = 0.000234, which is lower than than 0.05. Because of this we can reject the null hypothesis. This means that the online schooling may have had and impact on the body mass of ninth graders.
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.2.3
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d (massdata$Mass, mu = 59.5)
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.52)
## [1] "medium"
## (Rules: cohen1988)
This value indicates that there is moderate impact of online schooling on the weight of ninth graders. This is because the medium effect size is from 0.5 to 0.8.
#install.packages("openxlsx")
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.2.3
Apartments <- read.xlsx ("Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
Description:
Apartments$Parking <- factor(Apartments$Parking)
Apartments$Balcony <- factor(Apartments$Balcony)
t.test(Apartments$Price,
mu = 1900,
alternative ="two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
With the p - value lower than 0.05 we can reject the H0 hypothesis.
(fit1 <- lm(data = Apartments, Price ~ Age))
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Coefficients:
## (Intercept) Age
## 2185.455 -8.975
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
save(fit1, file = "fit1.RData")
Regression coefficient (Age): For each additional year of age of an apartment the estimated price is expected to decrease by approximately $8.975.
Coefficient of correlation: This cannot be determined we need an additional test.
Coefficient of determination (R-squared): Multiple R-squared value 0.05302 indicates that 5.30% of the variability in apartment prices can be explained by the age of the apartments.
#install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(~ Price + Age + Distance, data = Apartments)
(fit2 <- lm(data = Apartments, Price ~ Age + Distance))
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Coefficients:
## (Intercept) Age Distance
## 2460.101 -7.934 -20.667
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
save(fit2, file = "fit2.RData")
vif(fit2)
## Age Distance
## 1.001845 1.001845
Here we have overall VIF values close to 1 and this indicates low multicollinearity, these two variables are not highly correlated with each other.
#another option
library(mctest)
mctest(fit2)
##
## Call:
## omcdiag(mod = mod, Inter = TRUE, detr = detr, red = red, conf = conf,
## theil = theil, cn = cn)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.9982 0
## Farrar Chi-Square: 0.1520 0
## Red Indicator: 0.0429 0
## Sum of Lambda Inverse: 2.0037 0
## Theil's Method: -0.4359 0
## Condition Number: 5.0320 0
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
imcdiag(fit2)
##
## Call:
## imcdiag(mod = fit2)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1 IND2
## Age 1.0018 0.9982 0.1531 Inf 0.9991 1.0231 0 0.012 1
## Distance 1.0018 0.9982 0.1531 Inf 0.9991 1.0231 0 0.012 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## * all coefficients have significant t-ratios
##
## R-square of y on all x: 0.4396
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
Here we can also see that both variables are close to 1 which suggests low multicollinearity.
library(car)
standardized_residuals <- resid(fit2) / sqrt(1-hatvalues(fit2))
cooks_distance <- cooks.distance(fit2)
residuals_cooks <- data.frame(Standardized_residuals = standardized_residuals, Cooks_Distances = cooks_distance)
head(residuals_cooks)
## Standardized_residuals Cooks_Distances
## 1 -190.4841 0.007386569
## 2 510.5413 0.030365432
## 3 -169.9897 0.005882612
## 4 215.9729 0.008299153
## 5 -307.3056 0.005112584
## 6 -222.5976 0.004900891
cook_threshold <- 4 / length(cooks_distance)
outliers <- which(cooks_distance > cook_threshold)
Aparments_clean <- Apartments[-outliers,]
standardized_residuals <- resid(fit2) /sqrt(1 - hatvalues(fit2))
standardized_fitted_values <- fitted(fit2) / sqrt(1 - hatvalues(fit2))
plot(
x = standardized_fitted_values,
y = standardized_residuals,
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
main = "Scatterplot of Standardized Residuals vs. Standardized Fitted Values")
Here we can see heteroskedasticity, because ass fitted values increase
the spread of residuals widens.
hist(standardized_residuals, main = "Histogram of Standardized Residuals", xlab = "Standardized Residuals")
qqnorm(standardized_residuals)
qqline(standardized_residuals)
shapiro.test(standardized_residuals)
##
## Shapiro-Wilk normality test
##
## data: standardized_residuals
## W = 0.95306, p-value = 0.00366
Here we see that the p- value is lower than 0.05 so we can say that the residuals are not normally distributed. This can also be seen from the Q-Q plot and the histogram.
fit2_clean <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Regression coefficient (Age): For each additional year of age of an apartment, the estimated price of the apartment is expected to drop by $7.934.
Regression coefficient (Distance): For each additional unit of distance the estimated price of the apartment is expected to drop by $20.667.
Coefficient of determination (R-squared): Multiple R-squared = Indicates that 43.96% of the variability in apartment prices is explained by the age and the distance of the apartments. The adjusted R-Squared = It accounts for the number of predictors in the model.
Apartments$Parking <- factor(Apartments$Parking, levels = c(0, 1))
Apartments$Balcony <- factor(Apartments$Balcony, levels = c(0, 1))
(fit3 <- lm(data = Apartments, Price ~ Age + Distance + Parking + Balcony))
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Coefficients:
## (Intercept) Age Distance Parking1 Balcony1
## 2301.667 -6.799 -18.045 196.168 1.935
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1 196.168 62.868 3.120 0.00251 **
## Balcony1 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
save(fit3, file = "fit3.RData")
anova <- anova(fit2,fit3)
print(anova)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A p-value lower than 0.05 in the model fit3 suggests that this model fits the data significantly better than fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1 196.168 62.868 3.120 0.00251 **
## Balcony1 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Regression coefficient (Parking1): It represents the difference in the estimated mean price of apartments between those with parking (Parking1 = 1) and those without parking (Parking1 = 0). In this case the estimated mean price of apartments with parking is expected to be $196.168 higher than the estimated mean price of apartments without parking.
Regression coefficient (Balcony1): represents the difference in the estimated mean price of apartments between those with a balcony (Balcony1 = 1) and those without a balcony (Balcony1 = 0), in this case, the estimated mean price of apartments with a balcony is expected to be $1.935 higher than the estimated mean price of apartments without a balcony.
H0: the model with these predictors is not a better fit for the data than a model with no predictors.
H1: The model with these predictors is a better fit for the data than a model with no predictors.
In this case where the p-value is significant we would reject the null hypothesis.
save_fitted_values <- fitted(fit3)
apartment_id <- 2
observed_value <- Apartments$Price[Apartments$ApartmentID == apartment_id]
residual <- observed_value - save_fitted_values[Apartments$ApartmentID == apartment_id]
residual
## numeric(0)