data(package = .packages(all.available = TRUE))
library(carData)
mydata <- Salaries
summary(mydata) #to show the table that is in my data
## rank discipline yrs.since.phd yrs.service
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## sex salary
## Female: 39 Min. : 57800
## Male :358 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
head(mydata) #show only first 6 rows
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
Explanations for variables:
Rank; The academic rank of a faculty member
Discipline; In which discipline each professor works
Years since PhD; Number of years since the faculty member received their PhD
Years of service; Experiance in years
Gender
Salary; Annual salary in US dollars
names(mydata) [names(mydata) == "sex"] <- "gender"
summary(mydata)
## rank discipline yrs.since.phd yrs.service
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## gender salary
## Female: 39 Min. : 57800
## Male :358 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
We renamed the column sex to gender.
mydata$discipline <- factor(mydata$discipline,
levels = c("A", "B"),
labels = c("Mathematics", "Engineering")
)
summary(mydata)
## rank discipline yrs.since.phd yrs.service
## AsstProf : 67 Mathematics:181 Min. : 1.00 Min. : 0.00
## AssocProf: 64 Engineering:216 1st Qu.:12.00 1st Qu.: 7.00
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## gender salary
## Female: 39 Min. : 57800
## Male :358 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
Because there no official data on whats discipline A and discipline B, just for an example, we labeled discipline A as Mathematics and discipline B as Engineering.
mydata$gap <- mydata$yrs.since.phd - mydata$yrs.service
head(mydata$gap)
## [1] 1 4 1 6 -1 0
We’ve made a new variable that measures the years between obtaining a PhD and starting academic service, helping us analyze career paths and their possible effect on salaries.
mydata_new <- mydata [ ,-2] #deleting discipline because there's no official data what is A and what is B
summary(mydata_new) #summary because we have categorical data such as rank and gender, so we can get frequency of those
## rank yrs.since.phd yrs.service gender
## AsstProf : 67 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## salary gap
## Min. : 57800 Min. :-11.0
## 1st Qu.: 91000 1st Qu.: 1.0
## Median :107300 Median : 3.0
## Mean :113706 Mean : 4.7
## 3rd Qu.:134185 3rd Qu.: 7.0
## Max. :231545 Max. : 30.0
professors <- data.frame(mydata_new[mydata_new$rank == "Prof", ])
summary(professors)
## rank yrs.since.phd yrs.service gender
## AsstProf : 0 Min. :11.00 Min. : 0.00 Female: 18
## AssocProf: 0 1st Qu.:20.00 1st Qu.:15.00 Male :248
## Prof :266 Median :28.00 Median :21.00
## Mean :28.30 Mean :22.82
## 3rd Qu.:36.75 3rd Qu.:30.00
## Max. :56.00 Max. :60.00
## salary gap
## Min. : 57800 Min. :-11.000
## 1st Qu.:105975 1st Qu.: 1.000
## Median :123322 Median : 4.000
## Mean :126772 Mean : 5.485
## 3rd Qu.:145080 3rd Qu.: 9.750
## Max. :231545 Max. : 30.000
We’ve made a new data.frame that only contains units that are ranked as professors.
library(pastecs)
round(stat.desc(mydata_new [,-c(1,4)]),2)
## yrs.since.phd yrs.service salary gap
## nbr.val 397.00 397.00 397.00 397.00
## nbr.null 0.00 11.00 0.00 76.00
## nbr.na 0.00 0.00 0.00 0.00
## min 1.00 0.00 57800.00 -11.00
## max 56.00 60.00 231545.00 30.00
## range 55.00 60.00 173745.00 41.00
## sum 8859.00 6993.00 45141464.00 1866.00
## median 21.00 16.00 107300.00 3.00
## mean 22.31 17.61 113706.46 4.70
## SE.mean 0.65 0.65 1520.16 0.28
## CI.mean.0.95 1.27 1.28 2988.60 0.54
## var 166.07 169.16 917425865.05 30.30
## std.dev 12.89 13.01 30289.04 5.50
## coef.var 0.58 0.74 0.27 1.17
Here we’re presenting descriptive statistics of mydata_new and rounding them to 2 decimals and also deleting second and fourth column, because they’re categorical variables so there’s no need of having them presented also.
What we can decipher from the table with descriptive statistics is:
the average salary is: 113.706,46 US$,
range gives us the difference between the maximum and minimum value, so for example the difference between a professor that has had their PhD for the longest and a professor who was the last to get their Phd is 55 years,
the median tells us for example 50 % of professors has 16 or less years of service, while the other 50 % have more than 16 years of service
hist(mydata_new$salary,
main = "Distribution of salaries",
xlab = "Salary",
ylab = "Frequency",
breaks=seq(from = 57800 , to = 300000, by = 10000),
right = FALSE)
What we can see from the histogram is that the distribution of salaries has positive skewness which means that is asymmetrical to the right. So as expected there’s more faculty members with somewhat lower salary and there is less faculty members with high salary.
If we want to know if there are any outliers we can check that with boxploth.
library(ggplot2)
# Salary by rank
ggplot(mydata, aes(x = rank, y = salary)) +
geom_boxplot(fill = "tan") +
labs(x = "Rank", y = "Salary", title = "Salary by Rank")
# Salary by sex
ggplot(mydata, aes(x = gender, y = salary)) +
geom_boxplot(fill = "lightgreen") +
labs(x = "Gender", y = "Salary", title = "Salary by Gender")
So we can see the outliers are professors more specifically male professors.
library(readxl)
mydata1 <- read_xlsx("./Business_School.xlsx")
mydata1 <- as.data.frame(mydata1) # to convert an object into data (rows/columns)
summary(mydata1)
## Student ID Undergrad Degree Undergrad Grade MBA Grade
## Min. : 1.00 Length:100 Min. : 61.20 Min. :58.14
## 1st Qu.: 25.75 Class :character 1st Qu.: 71.47 1st Qu.:71.14
## Median : 50.50 Mode :character Median : 76.65 Median :76.38
## Mean : 50.50 Mean : 76.90 Mean :76.04
## 3rd Qu.: 75.25 3rd Qu.: 81.70 3rd Qu.:82.15
## Max. :100.00 Max. :100.00 Max. :95.00
## Work Experience Employability (Before) Employability (After)
## Length:100 Min. :101.0 Min. :119.0
## Class :character 1st Qu.:245.8 1st Qu.:312.0
## Mode :character Median :256.8 Median :435.6
## Mean :257.9 Mean :422.7
## 3rd Qu.:261.0 3rd Qu.:529.0
## Max. :421.0 Max. :631.0
## Status Annual Salary
## Length:100 Min. : 20000
## Class :character 1st Qu.: 87125
## Mode :character Median :103500
## Mean :109058
## 3rd Qu.:124000
## Max. :340000
head(mydata1)
## Student ID Undergrad Degree Undergrad Grade MBA Grade
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## Work Experience Employability (Before) Employability (After) Status
## 1 No 252 276 Placed
## 2 Yes 101 119 Placed
## 3 No 401 462 Placed
## 4 No 287 342 Placed
## 5 No 275 347 Placed
## 6 No 254 313 Placed
## Annual Salary
## 1 111000
## 2 107000
## 3 109000
## 4 148000
## 5 255500
## 6 103500
ggplot(mydata1, aes (x= `Undergrad Degree`))+
geom_bar(fill = "lightyellow")+
labs(title = "Distribution of Undergraduate Degrees",
x= "Undergraduate Degree",
y = "Number of Students")+
theme_minimal()
So from the distribution we can see that the most common undergraduate degree is business.
library(pastecs)
round(stat.desc(mydata1$`Annual Salary`, basic=TRUE, desc=TRUE,2))
## nbr.val nbr.null nbr.na min max
## 100 0 0 20000 340000
## range sum median mean SE.mean
## 320000 10905800 103500 109058 4150
## CI.mean.0.95 var std.dev coef.var
## 8235 1722373475 41501 0
library(ggplot2)
ggplot(mydata1, aes (x= `Annual Salary`))+
geom_histogram(binwidth = 10000, fill= "pink1")+
labs(title= "Distribution of Annual Salaries", x = "Annual salary", y= "Count")+
theme_minimal()
The distribution is right-skewed, so positive skewness. Most salaries cluster in the middle range, with fewer observation at very high salary levels. Most students earn between 100.000 and 130.000.
We check if data is normally distributed with Shapiro-Wilk test, where null hypothesis is that the MBA grade is normally distributed and hypothesis one that it’s not.
shapiro.test(mydata1$`MBA Grade`)
##
## Shapiro-Wilk normality test
##
## data: mydata1$`MBA Grade`
## W = 0.98799, p-value = 0.5078
Based on the p-value that is above threshold (alfa=0,05), we cannot reject the null hypothesis, so we have no evidence to believe that the distribution of price deviates from normal distribution, so we can preform t-test.
t.test(mydata1$`MBA Grade`, mu= 74)
##
## One Sample t-test
##
## data: mydata1$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
P-value is less than 0.01 so we reject the null hypothesis, the average MBA grade is significantly different from 74.
The effect size can be computed as Cohen’s distance
x <- mydata1$`MBA Grade`
sample_mean <- mean (x)
sample_sd <- sd(x )
d <- (sample_mean - 74)/ sample_sd
d
## [1] 0.2658658
If is approximately 0.2, there is a small effect.
library(readxl)
mydata2 <- read_xlsx("./Apartments.xlsx")
mydata2 <- as.data.frame(mydata2) # to convert an object into data (rows/columns)
head(mydata2)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata2$Parking <- factor(mydata2$Parking,
levels = c(0,1),
labels = c("No", "Yes")
)
mydata2$Balcony <- factor(mydata2$Balcony,
levels = c(0, 1),
labels = c("No", "Yes")
)
summary(mydata2)
## Age Distance Price Parking Balcony
## Min. : 1.00 Min. : 1.00 Min. :1400 No :42 No :48
## 1st Qu.:12.00 1st Qu.: 4.00 1st Qu.:1710 Yes:43 Yes:37
## Median :18.00 Median :12.00 Median :1950
## Mean :18.55 Mean :14.22 Mean :2019
## 3rd Qu.:24.00 3rd Qu.:20.00 3rd Qu.:2290
## Max. :45.00 Max. :45.00 Max. :2820
t.test(mydata2$Price, mu= 1900)
##
## One Sample t-test
##
## data: mydata2$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The mean isn’t included in the confidence interval and p-value is less than alfa (= 0.05) so we can reject the null hypothesis with 5 % risk. The degrees of freedom is 84 and the estimated mean of price is 2018.9.
fit1 <- lm(Price~Age,
data=mydata2)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The estimated coefficient of age is -8,98, which means that when the age of an apartment goes up by 1 year the price decreases by 8.98 euros and if the estimated coefficient is zero means that a starting price of a new apartment is 2185.46 euros. The coefficient of determination is 0.053 so it means that 5.3 % of variability of price is explained by lineary fact of age.
cor(mydata2$Price, mydata2$Age)
## [1] -0.230255
The relationship apartment price and age in negative and the strenght of the relationshhip is weak (it’s between 0.1 and 0.3)
library(car)
scatterplotMatrix(~ Price + Age + Distance,
data=mydata2,
smooth = FALSE
)
So the expected realtionship between price and age is negative as we can also see from the scatterplot so here we don’t detect no multicolinearity. Also with distance. Bigger the distance from the city center less is the price of an apartment. And by the looks of the scatterplot between age and distance age doesn’t affect distance. So I would say there is no problem with multicolinearity because regression coefficients aren’t contrary to out expectance.
fit2 <- lm(Price ~ Age + Distance,
data=mydata2)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
The values of both VIF statistics are less than 5 so there’s no multicolinearity.
mydata2$StdResid <- round(rstandard(fit2),3)
mydata2$CooksD <- round(cooks.distance(fit2),3)
hist(mydata2$StdResid,
xlab = "Standardized residuals",
ylab="Frequency",
main= "Histogram of Standardized Residuals")
There are no residuals that are outside of the interval [-3,3] so there isn’t any outliers.
hist(mydata2$CooksD,
xlab = "Cooks distance",
ylab= "Frequency",
main = "Histogram of Cooks distances")
Looks like the values on the right are units with high influence, so we try to drop them.
head(mydata2[order(mydata2$StdResid),],3)
## Age Distance Price Parking Balcony StdResid CooksD
## 53 7 2 1760 No Yes -2.152 0.066
## 13 12 14 1650 No Yes -1.499 0.013
## 72 12 14 1650 No No -1.499 0.013
head(mydata2[order(-mydata2$CooksD),],6)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 39 40 2 2400 No Yes 1.091 0.038
We’ll drop the 38th apartment
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata2 <- mydata2 %>%
filter(!(Age == 5 & Distance == 45 ))#we choose the exact unique values of the 38th unit
hist(mydata2$CooksD,
xlab = "Cooks distance",
ylab="Frequency",
main= "Histogram of Cook's distance")
Even after droppint 38th unit we still see units that have high influence so we have to drop them too.
head(mydata2[order(-mydata2$CooksD),],6)
## Age Distance Price Parking Balcony StdResid CooksD
## 54 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 52 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 38 40 2 2400 No Yes 1.091 0.038
## 57 8 2 2820 Yes No 1.655 0.037
We still need to drop 54, 33, 52 and 22th apartments.
mydata2$ID <- 1:nrow(mydata2)
head(mydata2) #here we need to drop more units so it's easier to determine them by ID
## Age Distance Price Parking Balcony StdResid CooksD ID
## 1 7 28 1640 No Yes -0.665 0.007 1
## 2 18 1 2800 Yes No 1.783 0.030 2
## 3 7 28 1660 No No -0.594 0.006 3
## 4 28 29 1850 No Yes 0.754 0.008 4
## 5 18 18 1640 Yes Yes -1.073 0.005 5
## 6 28 12 1770 No Yes -0.778 0.005 6
library(dplyr)
mydata2 <- mydata2 %>%
filter(!ID %in% c(54, 33, 52, 22)) #dropping 54th apartment
hist(mydata2$CooksD,
xlab = "Cooks distance",
ylab="Frequency",
main= "Histogram of Cook's distance")
hist(mydata2$StdResid,
xlab = "Standardized residuals",
ylab="Frequency",
main= "Histogram of Standardized Residuals")
Histogram is now continious so we dropped all outliers.
fit3 <- lm(Price ~ Age + Distance,
data=mydata2)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
#had to make a new fit because if i used fit 2 for scatterplot it wouldn't work because there were 85 units and in the data2 because we deleted 38th unit only 84 units
mydata2$StdFitted <- scale(fit3$fitted.values)
library(car)
scatterplot(y= mydata2$StdResid, x= mydata2$StdFitted,
ylab="Standardized residuals",
xlab="Standardized fitted values",
boxplots =FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit3)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
We cannot reject null hypothesis so there no proof of heteroskedasticity, and the model appears to have roughly constant variance of errors.
hist(mydata2$StdResid,
xlab = "Standardized residuals",
ylab="Frequency",
main= "Histogram of Standardized Residuals")
We test if the distribution of standardized residuals is normally distributed with Shapiro-Wilk test.
shapiro.test(mydata2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata2$StdResid
## W = 0.93418, p-value = 0.0004761
Based on the p-value that is less than threshold value (alfa= 0,05), we can reject the null hypothesis, that standard residuals are normally distributed.
mydata2_clean <- mydata2
fit2_clean <- lm(Price ~ Age + Distance,#making a new fit2 without the eliminated units
data=mydata2_clean)
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Intercept means that if age and distance would be 0 the starting price of a new apartment right in the center of the city would be 2502.5. The relationship between age and price is negative so given the distance, each additional year of apartment age the price decreases by 8.67 euros? and the relationship between distance and price is also negative, so given the age, each additional km from the city center decreases the price of an apartment by approximately 24.1 euros?.
mydata2$Parking <- as.factor(mydata2$Parking)
mydata2$Balcony <- as.factor(mydata2$Balcony)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata2)
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
Both models fit the data equally well (p=0.1135 is more than alfa and we cannot reject the null hypothesis)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Given to age, distance and balcony the apartments with parking are on average priced 128.7 times more than the apartments that don’t. Given to the age, distance and parking the apartment with balcony are on average priced 6 times more than the apartments that don’t. The null hypothesis is that all coefficients are equal to 0, so they have no effect on price.
mydata2$Fitted <- fitted(fit3)
mydata2$Residuals <- resid(fit3)
mydata2[mydata2$ID == 2, c("Price", "Fitted", "Residuals")]
## Price Fitted Residuals
## 2 2800 2356.597 443.4026