Olivia Samsa

Task 1

Installing data from R
data(package = .packages(all.available = TRUE))
library(carData)
mydata <- Salaries

summary(mydata) #to show the table that is in my data
##         rank     discipline yrs.since.phd    yrs.service   
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00  
##  Prof     :266              Median :21.00   Median :16.00  
##                             Mean   :22.31   Mean   :17.61  
##                             3rd Qu.:32.00   3rd Qu.:27.00  
##                             Max.   :56.00   Max.   :60.00  
##      sex          salary      
##  Female: 39   Min.   : 57800  
##  Male  :358   1st Qu.: 91000  
##               Median :107300  
##               Mean   :113706  
##               3rd Qu.:134185  
##               Max.   :231545
head(mydata) #show only first 6 rows
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Explanations for variables:

  • Rank; The academic rank of a faculty member

  • Discipline; In which discipline each professor works

  • Years since PhD; Number of years since the faculty member received their PhD

  • Years of service; Experiance in years

  • Gender

  • Salary; Annual salary in US dollars

Renaming variables
names(mydata) [names(mydata) == "sex"] <- "gender"

summary(mydata)
##         rank     discipline yrs.since.phd    yrs.service   
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00  
##  Prof     :266              Median :21.00   Median :16.00  
##                             Mean   :22.31   Mean   :17.61  
##                             3rd Qu.:32.00   3rd Qu.:27.00  
##                             Max.   :56.00   Max.   :60.00  
##     gender        salary      
##  Female: 39   Min.   : 57800  
##  Male  :358   1st Qu.: 91000  
##               Median :107300  
##               Mean   :113706  
##               3rd Qu.:134185  
##               Max.   :231545

We renamed the column sex to gender.

Making category variables more descriptive by changing labels
mydata$discipline <- factor(mydata$discipline,
                            levels = c("A", "B"),
                            labels = c("Mathematics", "Engineering")
  
)
summary(mydata)
##         rank           discipline  yrs.since.phd    yrs.service   
##  AsstProf : 67   Mathematics:181   Min.   : 1.00   Min.   : 0.00  
##  AssocProf: 64   Engineering:216   1st Qu.:12.00   1st Qu.: 7.00  
##  Prof     :266                     Median :21.00   Median :16.00  
##                                    Mean   :22.31   Mean   :17.61  
##                                    3rd Qu.:32.00   3rd Qu.:27.00  
##                                    Max.   :56.00   Max.   :60.00  
##     gender        salary      
##  Female: 39   Min.   : 57800  
##  Male  :358   1st Qu.: 91000  
##               Median :107300  
##               Mean   :113706  
##               3rd Qu.:134185  
##               Max.   :231545

Because there no official data on whats discipline A and discipline B, just for an example, we labeled discipline A as Mathematics and discipline B as Engineering.

Creating a new variable
mydata$gap <- mydata$yrs.since.phd - mydata$yrs.service
head(mydata$gap)
## [1]  1  4  1  6 -1  0

We’ve made a new variable that measures the years between obtaining a PhD and starting academic service, helping us analyze career paths and their possible effect on salaries.

Deleting second column (discipline)
mydata_new <- mydata [ ,-2] #deleting discipline because there's no official data what is A and what is B

summary(mydata_new) #summary because we have categorical data such as rank and gender, so we can get frequency of those
##         rank     yrs.since.phd    yrs.service       gender   
##  AsstProf : 67   Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266   Median :21.00   Median :16.00               
##                  Mean   :22.31   Mean   :17.61               
##                  3rd Qu.:32.00   3rd Qu.:27.00               
##                  Max.   :56.00   Max.   :60.00               
##      salary            gap       
##  Min.   : 57800   Min.   :-11.0  
##  1st Qu.: 91000   1st Qu.:  1.0  
##  Median :107300   Median :  3.0  
##  Mean   :113706   Mean   :  4.7  
##  3rd Qu.:134185   3rd Qu.:  7.0  
##  Max.   :231545   Max.   : 30.0
Creating new data.frame based on conditions
professors <- data.frame(mydata_new[mydata_new$rank == "Prof", ])

summary(professors)
##         rank     yrs.since.phd    yrs.service       gender   
##  AsstProf :  0   Min.   :11.00   Min.   : 0.00   Female: 18  
##  AssocProf:  0   1st Qu.:20.00   1st Qu.:15.00   Male  :248  
##  Prof     :266   Median :28.00   Median :21.00               
##                  Mean   :28.30   Mean   :22.82               
##                  3rd Qu.:36.75   3rd Qu.:30.00               
##                  Max.   :56.00   Max.   :60.00               
##      salary            gap         
##  Min.   : 57800   Min.   :-11.000  
##  1st Qu.:105975   1st Qu.:  1.000  
##  Median :123322   Median :  4.000  
##  Mean   :126772   Mean   :  5.485  
##  3rd Qu.:145080   3rd Qu.:  9.750  
##  Max.   :231545   Max.   : 30.000

We’ve made a new data.frame that only contains units that are ranked as professors.

Presenting descriptive statistics
library(pastecs)
round(stat.desc(mydata_new [,-c(1,4)]),2)
##              yrs.since.phd yrs.service       salary     gap
## nbr.val             397.00      397.00       397.00  397.00
## nbr.null              0.00       11.00         0.00   76.00
## nbr.na                0.00        0.00         0.00    0.00
## min                   1.00        0.00     57800.00  -11.00
## max                  56.00       60.00    231545.00   30.00
## range                55.00       60.00    173745.00   41.00
## sum                8859.00     6993.00  45141464.00 1866.00
## median               21.00       16.00    107300.00    3.00
## mean                 22.31       17.61    113706.46    4.70
## SE.mean               0.65        0.65      1520.16    0.28
## CI.mean.0.95          1.27        1.28      2988.60    0.54
## var                 166.07      169.16 917425865.05   30.30
## std.dev              12.89       13.01     30289.04    5.50
## coef.var              0.58        0.74         0.27    1.17

Here we’re presenting descriptive statistics of mydata_new and rounding them to 2 decimals and also deleting second and fourth column, because they’re categorical variables so there’s no need of having them presented also.

What we can decipher from the table with descriptive statistics is:

  • the average salary is: 113.706,46 US$,

  • range gives us the difference between the maximum and minimum value, so for example the difference between a professor that has had their PhD for the longest and a professor who was the last to get their Phd is 55 years,

  • the median tells us for example 50 % of professors has 16 or less years of service, while the other 50 % have more than 16 years of service

Graphing the distribution
hist(mydata_new$salary,
     main = "Distribution of salaries",
     xlab = "Salary",
     ylab = "Frequency",
     breaks=seq(from = 57800 , to = 300000, by = 10000),
     right = FALSE)

What we can see from the histogram is that the distribution of salaries has positive skewness which means that is asymmetrical to the right. So as expected there’s more faculty members with somewhat lower salary and there is less faculty members with high salary.

If we want to know if there are any outliers we can check that with boxploth.

library(ggplot2)
# Salary by rank
ggplot(mydata, aes(x = rank, y = salary)) +
  geom_boxplot(fill = "tan") +
  labs(x = "Rank", y = "Salary", title = "Salary by Rank")

# Salary by sex
ggplot(mydata, aes(x = gender, y = salary)) +
  geom_boxplot(fill = "lightgreen") +
  labs(x = "Gender", y = "Salary", title = "Salary by Gender")

So we can see the outliers are professors more specifically male professors.

Task 2

Graphing the distribution of unergrad degrees

library(readxl)
mydata1 <- read_xlsx("./Business_School.xlsx")
mydata1 <- as.data.frame(mydata1) # to convert an object into data (rows/columns)

summary(mydata1)
##    Student ID     Undergrad Degree   Undergrad Grade    MBA Grade    
##  Min.   :  1.00   Length:100         Min.   : 61.20   Min.   :58.14  
##  1st Qu.: 25.75   Class :character   1st Qu.: 71.47   1st Qu.:71.14  
##  Median : 50.50   Mode  :character   Median : 76.65   Median :76.38  
##  Mean   : 50.50                      Mean   : 76.90   Mean   :76.04  
##  3rd Qu.: 75.25                      3rd Qu.: 81.70   3rd Qu.:82.15  
##  Max.   :100.00                      Max.   :100.00   Max.   :95.00  
##  Work Experience    Employability (Before) Employability (After)
##  Length:100         Min.   :101.0          Min.   :119.0        
##  Class :character   1st Qu.:245.8          1st Qu.:312.0        
##  Mode  :character   Median :256.8          Median :435.6        
##                     Mean   :257.9          Mean   :422.7        
##                     3rd Qu.:261.0          3rd Qu.:529.0        
##                     Max.   :421.0          Max.   :631.0        
##     Status          Annual Salary   
##  Length:100         Min.   : 20000  
##  Class :character   1st Qu.: 87125  
##  Mode  :character   Median :103500  
##                     Mean   :109058  
##                     3rd Qu.:124000  
##                     Max.   :340000
head(mydata1)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade
## 1          1         Business            68.4      90.2
## 2          2 Computer Science            70.2      68.7
## 3          3          Finance            76.4      83.3
## 4          4         Business            82.6      88.7
## 5          5          Finance            76.9      75.4
## 6          6 Computer Science            83.3      82.1
##   Work Experience Employability (Before) Employability (After) Status
## 1              No                    252                   276 Placed
## 2             Yes                    101                   119 Placed
## 3              No                    401                   462 Placed
## 4              No                    287                   342 Placed
## 5              No                    275                   347 Placed
## 6              No                    254                   313 Placed
##   Annual Salary
## 1        111000
## 2        107000
## 3        109000
## 4        148000
## 5        255500
## 6        103500
ggplot(mydata1, aes (x= `Undergrad Degree`))+
  geom_bar(fill = "lightyellow")+
  labs(title = "Distribution of Undergraduate Degrees",
       x= "Undergraduate Degree",
       y = "Number of Students")+
  theme_minimal()

So from the distribution we can see that the most common undergraduate degree is business.

Showing the discriptive statistics of the Annual salaries adn its distribution

library(pastecs)
round(stat.desc(mydata1$`Annual Salary`, basic=TRUE, desc=TRUE,2))
##      nbr.val     nbr.null       nbr.na          min          max 
##          100            0            0        20000       340000 
##        range          sum       median         mean      SE.mean 
##       320000     10905800       103500       109058         4150 
## CI.mean.0.95          var      std.dev     coef.var 
##         8235   1722373475        41501            0
library(ggplot2)
ggplot(mydata1, aes (x= `Annual Salary`))+
  geom_histogram(binwidth = 10000, fill= "pink1")+
  labs(title= "Distribution of Annual Salaries", x = "Annual salary", y= "Count")+
  theme_minimal()

The distribution is right-skewed, so positive skewness. Most salaries cluster in the middle range, with fewer observation at very high salary levels. Most students earn between 100.000 and 130.000.

Testing the hypothesis

We check if data is normally distributed with Shapiro-Wilk test, where null hypothesis is that the MBA grade is normally distributed and hypothesis one that it’s not.

shapiro.test(mydata1$`MBA Grade`)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$`MBA Grade`
## W = 0.98799, p-value = 0.5078

Based on the p-value that is above threshold (alfa=0,05), we cannot reject the null hypothesis, so we have no evidence to believe that the distribution of price deviates from normal distribution, so we can preform t-test.

t.test(mydata1$`MBA Grade`, mu= 74)
## 
##  One Sample t-test
## 
## data:  mydata1$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

P-value is less than 0.01 so we reject the null hypothesis, the average MBA grade is significantly different from 74.

The effect size can be computed as Cohen’s distance

x <- mydata1$`MBA Grade`

sample_mean <- mean (x)
sample_sd <- sd(x )

d <- (sample_mean - 74)/ sample_sd
d
## [1] 0.2658658

If is approximately 0.2, there is a small effect.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydata2 <- read_xlsx("./Apartments.xlsx")
mydata2 <- as.data.frame(mydata2) # to convert an object into data (rows/columns)

head(mydata2)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata2$Parking <- factor(mydata2$Parking,
                            levels = c(0,1),
                            labels = c("No", "Yes")
  
)
mydata2$Balcony <- factor(mydata2$Balcony,
                            levels = c(0, 1),
                            labels = c("No", "Yes")
  
)

summary(mydata2)
##       Age           Distance         Price      Parking  Balcony 
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400   No :42   No :48  
##  1st Qu.:12.00   1st Qu.: 4.00   1st Qu.:1710   Yes:43   Yes:37  
##  Median :18.00   Median :12.00   Median :1950                    
##  Mean   :18.55   Mean   :14.22   Mean   :2019                    
##  3rd Qu.:24.00   3rd Qu.:20.00   3rd Qu.:2290                    
##  Max.   :45.00   Max.   :45.00   Max.   :2820

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata2$Price, mu= 1900)
## 
##  One Sample t-test
## 
## data:  mydata2$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The mean isn’t included in the confidence interval and p-value is less than alfa (= 0.05) so we can reject the null hypothesis with 5 % risk. The degrees of freedom is 84 and the estimated mean of price is 2018.9.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price~Age,
          data=mydata2)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The estimated coefficient of age is -8,98, which means that when the age of an apartment goes up by 1 year the price decreases by 8.98 euros and if the estimated coefficient is zero means that a starting price of a new apartment is 2185.46 euros. The coefficient of determination is 0.053 so it means that 5.3 % of variability of price is explained by lineary fact of age.

cor(mydata2$Price, mydata2$Age)
## [1] -0.230255

The relationship apartment price and age in negative and the strenght of the relationshhip is weak (it’s between 0.1 and 0.3)

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(~ Price + Age + Distance,
                  data=mydata2,
                  smooth = FALSE
            )

So the expected realtionship between price and age is negative as we can also see from the scatterplot so here we don’t detect no multicolinearity. Also with distance. Bigger the distance from the city center less is the price of an apartment. And by the looks of the scatterplot between age and distance age doesn’t affect distance. So I would say there is no problem with multicolinearity because regression coefficients aren’t contrary to out expectance.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data=mydata2)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

The values of both VIF statistics are less than 5 so there’s no multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata2$StdResid <- round(rstandard(fit2),3)
mydata2$CooksD <- round(cooks.distance(fit2),3)

hist(mydata2$StdResid,
     xlab = "Standardized residuals",
     ylab="Frequency",
     main= "Histogram of Standardized Residuals")

There are no residuals that are outside of the interval [-3,3] so there isn’t any outliers.

hist(mydata2$CooksD,
     xlab = "Cooks distance",
     ylab= "Frequency",
     main = "Histogram of Cooks distances")

Looks like the values on the right are units with high influence, so we try to drop them.

head(mydata2[order(mydata2$StdResid),],3)
##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013
head(mydata2[order(-mydata2$CooksD),],6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 39  40        2  2400      No     Yes    1.091  0.038

We’ll drop the 38th apartment

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata2 <- mydata2 %>%
  filter(!(Age == 5 & Distance == 45 ))#we choose the exact unique values of the 38th unit
hist(mydata2$CooksD,
     xlab = "Cooks distance",
     ylab="Frequency",
     main= "Histogram of Cook's distance")

Even after droppint 38th unit we still see units that have high influence so we have to drop them too.

head(mydata2[order(-mydata2$CooksD),],6)
##    Age Distance Price Parking Balcony StdResid CooksD
## 54  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 52   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061
## 38  40        2  2400      No     Yes    1.091  0.038
## 57   8        2  2820     Yes      No    1.655  0.037

We still need to drop 54, 33, 52 and 22th apartments.

mydata2$ID <- 1:nrow(mydata2)
head(mydata2) #here we need to drop more units so it's easier to determine them by ID
##   Age Distance Price Parking Balcony StdResid CooksD ID
## 1   7       28  1640      No     Yes   -0.665  0.007  1
## 2  18        1  2800     Yes      No    1.783  0.030  2
## 3   7       28  1660      No      No   -0.594  0.006  3
## 4  28       29  1850      No     Yes    0.754  0.008  4
## 5  18       18  1640     Yes     Yes   -1.073  0.005  5
## 6  28       12  1770      No     Yes   -0.778  0.005  6
library(dplyr)
mydata2 <- mydata2 %>%
  filter(!ID %in% c(54, 33, 52, 22)) #dropping 54th apartment
hist(mydata2$CooksD,
     xlab = "Cooks distance",
     ylab="Frequency",
     main= "Histogram of Cook's distance")

hist(mydata2$StdResid,
     xlab = "Standardized residuals",
     ylab="Frequency",
     main= "Histogram of Standardized Residuals")

Histogram is now continious so we dropped all outliers.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standardized fitted values. Explain the findings.

fit3 <- lm(Price ~ Age + Distance,
           data=mydata2)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13
#had to make a new fit because if i used fit 2 for scatterplot it wouldn't work because there were 85 units and in the data2 because we deleted 38th unit only 84 units

mydata2$StdFitted <- scale(fit3$fitted.values)
library(car)
scatterplot(y= mydata2$StdResid, x= mydata2$StdFitted,
            ylab="Standardized residuals",
            xlab="Standardized fitted values",
            boxplots =FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit3)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

We cannot reject null hypothesis so there no proof of heteroskedasticity, and the model appears to have roughly constant variance of errors.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata2$StdResid,
     xlab = "Standardized residuals",
     ylab="Frequency",
     main= "Histogram of Standardized Residuals")

We test if the distribution of standardized residuals is normally distributed with Shapiro-Wilk test.

shapiro.test(mydata2$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$StdResid
## W = 0.93418, p-value = 0.0004761

Based on the p-value that is less than threshold value (alfa= 0,05), we can reject the null hypothesis, that standard residuals are normally distributed.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

mydata2_clean <- mydata2

fit2_clean <- lm(Price ~ Age + Distance,#making a new fit2 without the eliminated units
           data=mydata2_clean)
summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Intercept means that if age and distance would be 0 the starting price of a new apartment right in the center of the city would be 2502.5. The relationship between age and price is negative so given the distance, each additional year of apartment age the price decreases by 8.67 euros? and the relationship between distance and price is also negative, so given the age, each additional km from the city center decreases the price of an apartment by approximately 24.1 euros?.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

mydata2$Parking <- as.factor(mydata2$Parking)
mydata2$Balcony <- as.factor(mydata2$Balcony)


fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata2)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2_clean, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Both models fit the data equally well (p=0.1135 is more than alfa and we cannot reject the null hypothesis)

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Given to age, distance and balcony the apartments with parking are on average priced 128.7 times more than the apartments that don’t. Given to the age, distance and parking the apartment with balcony are on average priced 6 times more than the apartments that don’t. The null hypothesis is that all coefficients are equal to 0, so they have no effect on price.

Save fitted values and calculate the residual for apartment ID2.

mydata2$Fitted <- fitted(fit3)
mydata2$Residuals <- resid(fit3)
mydata2[mydata2$ID == 2, c("Price", "Fitted", "Residuals")]
##   Price   Fitted Residuals
## 2  2800 2356.597  443.4026