TASK 1

Explain the data set.

library(readxl)
mydata2 <- read_xlsx("~/Desktop/r task 1.xlsx")
print (mydata2)
## # A tibble: 111 × 5
##    gender `test preparation course` `math score` `reading score` `writing score`
##    <chr>                      <dbl>        <dbl>           <dbl>           <dbl>
##  1 female                         1           72              72              74
##  2 female                         3           69              90              88
##  3 female                        16           90              95              93
##  4 male                           0           47              57              44
##  5 male                           3           76              78              75
##  6 female                         6           71              83              78
##  7 female                        10           88              95              92
##  8 male                           7           40              43              39
##  9 male                          12           64              64              67
## 10 female                         0           38              60              50
## # … with 101 more rows

The data set I will be referring to in the Task 1 was taken from the github webpage and it consists of the sample data regarding Student performance of an unknown highschool in 2020. The sample consists of 111 observations and 5 variables.

Gender is a categorical variable that indicates weather the student is a female or a male.

Test preparation course is a numerical variable that tells us how many days has the student been preparing for all of the following tests.

Math score, reading score and writing score are all numerical variables that tell us how many points (between 0 and 100) has each student achieved on the specific exam.

Perform some data manipulations.

Let’s say that I want to add a new variable called ID of the sudents.

mydata2$ID <- (1:111)

Let’s say I want to correct a value that was mistakenly written for the 36. person. The score obtained for the math test was actually 71 points:

mydata2[36,3] <-71

print(mydata2)
## # A tibble: 111 × 6
##    gender `test preparation course` `math score` `reading score` writing…¹    ID
##    <chr>                      <dbl>        <dbl>           <dbl>     <dbl> <int>
##  1 female                         1           72              72        74     1
##  2 female                         3           69              90        88     2
##  3 female                        16           90              95        93     3
##  4 male                           0           47              57        44     4
##  5 male                           3           76              78        75     5
##  6 female                         6           71              83        78     6
##  7 female                        10           88              95        92     7
##  8 male                           7           40              43        39     8
##  9 male                          12           64              64        67     9
## 10 female                         0           38              60        50    10
## # … with 101 more rows, and abbreviated variable name ¹​`writing score`

Let’s say that the math professor decided that due to a mistake in scoring, each person gets additional 2 points on the exam.

mydata2$`corrected math score`<-mydata2$`math score` + 2

Let’s say I want to remove the incorrect math score variable.

mydata3 <- mydata2[ , -3]

Let’s say I want to change the name of the column that says test preparation course.

colnames(mydata3)[2] ="total studying days"

Let’s say I want to put the variables in such order, that the ID is at the start. (change the order of the columns).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata3 = mydata3 %>% select(ID, gender, `total studying days`,`reading score`,`writing score`,`corrected math score`)

Let’s say I want to show only the first 6 observations

head(mydata3)
## # A tibble: 6 × 6
##      ID gender `total studying days` `reading score` `writing score` corrected…¹
##   <int> <chr>                  <dbl>           <dbl>           <dbl>       <dbl>
## 1     1 female                     1              72              74          74
## 2     2 female                     3              90              88          71
## 3     3 female                    16              95              93          92
## 4     4 male                       0              57              44          49
## 5     5 male                       3              78              75          78
## 6     6 female                     6              83              78          73
## # … with abbreviated variable name ¹​`corrected math score`

Let’s say I wanna show the last 6 observations in my sample.

tail(mydata3)
## # A tibble: 6 × 6
##      ID gender `total studying days` `reading score` `writing score` corrected…¹
##   <int> <chr>                  <dbl>           <dbl>           <dbl>       <dbl>
## 1   106 female                     1              67              72          60
## 2   107 female                    25             100             100          89
## 3   108 male                       0              63              64          68
## 4   109 female                     1              76              70          54
## 5   110 female                     0              64              72          72
## 6   111 female                    10              89              98          79
## # … with abbreviated variable name ¹​`corrected math score`
mydata3$gender <-factor(mydata3$gender)

Let us say I want to create the data.frame arranged by total studying days in a descending order.

mydata5 <- mydata3[order(-mydata3$`total studying days`), ]
head(mydata5)
## # A tibble: 6 × 6
##      ID gender `total studying days` `reading score` `writing score` corrected…¹
##   <int> <fct>                  <dbl>           <dbl>           <dbl>       <dbl>
## 1   107 female                    25             100             100          89
## 2   105 male                      21              86              90         100
## 3    35 male                      20              87              82          99
## 4    30 female                    19              70              75          64
## 5    50 male                      18              84              82          84
## 6    16 female                    17              75              78          71
## # … with abbreviated variable name ¹​`corrected math score`

Let’s say I want to show which students got the positive grade at the reading exam. My assumption is that the positive grade was given to all the students who achieved at least 60 % or more. Therefore my new data.frame will consist of reading score values between 60 and 100.

mydata6 <-mydata3 [mydata3$`reading score` >= 60 & mydata3$`reading score`<= 100, ]
head(mydata6)
## # A tibble: 6 × 6
##      ID gender `total studying days` `reading score` `writing score` corrected…¹
##   <int> <fct>                  <dbl>           <dbl>           <dbl>       <dbl>
## 1     1 female                     1              72              74          74
## 2     2 female                     3              90              88          71
## 3     3 female                    16              95              93          92
## 4     5 male                       3              78              75          78
## 5     6 female                     6              83              78          73
## 6     7 female                    10              95              92          90
## # … with abbreviated variable name ¹​`corrected math score`

Let’s say I want to see what is the structure of my dataframe.

str(mydata3)
## tibble [111 × 6] (S3: tbl_df/tbl/data.frame)
##  $ ID                  : int [1:111] 1 2 3 4 5 6 7 8 9 10 ...
##  $ gender              : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ total studying days : num [1:111] 1 3 16 0 3 6 10 7 12 0 ...
##  $ reading score       : num [1:111] 72 90 95 57 78 83 95 43 64 60 ...
##  $ writing score       : num [1:111] 74 88 93 44 75 78 92 39 67 50 ...
##  $ corrected math score: num [1:111] 74 71 92 49 78 73 90 42 66 40 ...

There are now 111 observations and 6 variables in my dataframe, from which 4 are numerical and one is categorical, converted into factors. I also have an ID variable which I inserted for easier reading of a specific observation.

Present the descriptive statistics.

summary(mydata3)
##        ID           gender   total studying days reading score   
##  Min.   :  1.0   female:56   Min.   : 0.00       Min.   : 17.00  
##  1st Qu.: 28.5   male  :55   1st Qu.: 2.00       1st Qu.: 55.00  
##  Median : 56.0               Median : 5.00       Median : 67.00  
##  Mean   : 56.0               Mean   : 6.45       Mean   : 65.56  
##  3rd Qu.: 83.5               3rd Qu.: 9.00       3rd Qu.: 75.50  
##  Max.   :111.0               Max.   :25.00       Max.   :100.00  
##  writing score    corrected math score
##  Min.   : 10.00   Min.   :  2.00      
##  1st Qu.: 54.00   1st Qu.: 54.50      
##  Median : 67.00   Median : 65.00      
##  Mean   : 64.59   Mean   : 63.74      
##  3rd Qu.: 75.00   3rd Qu.: 73.50      
##  Max.   :100.00   Max.   :100.00

From the summary above, I can conclude that there is 56 female and 55 men in my sample. Since gender is a categorical variable, it will not be needed in the following analysis.

mydata4 <- mydata3[, -2]
library(psych)
describe(mydata4)
##                      vars   n  mean    sd median trimmed   mad min max range
## ID                      1 111 56.00 32.19     56   56.00 41.51   1 111   110
## total studying days     2 111  6.45  5.44      5    5.81  5.93   0  25    25
## reading score           3 111 65.56 16.07     67   66.17 16.31  17 100    83
## writing score           4 111 64.59 17.00     67   65.25 16.31  10 100    90
## corrected math score    5 111 63.74 16.07     65   64.29 14.83   2 100    98
##                       skew kurtosis   se
## ID                    0.00    -1.23 3.06
## total studying days   0.97     0.51 0.52
## reading score        -0.39    -0.13 1.53
## writing score        -0.44     0.06 1.61
## corrected math score -0.59     1.22 1.53
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
round(stat.desc(mydata4)
      , 2)
##                   ID total studying days reading score writing score
## nbr.val       111.00              111.00        111.00        111.00
## nbr.null        0.00               10.00          0.00          0.00
## nbr.na          0.00                0.00          0.00          0.00
## min             1.00                0.00         17.00         10.00
## max           111.00               25.00        100.00        100.00
## range         110.00               25.00         83.00         90.00
## sum          6216.00              716.00       7277.00       7169.00
## median         56.00                5.00         67.00         67.00
## mean           56.00                6.45         65.56         64.59
## SE.mean         3.06                0.52          1.53          1.61
## CI.mean.0.95    6.05                1.02          3.02          3.20
## var          1036.00               29.61        258.34        288.92
## std.dev        32.19                5.44         16.07         17.00
## coef.var        0.57                0.84          0.25          0.26
##              corrected math score
## nbr.val                    111.00
## nbr.null                     0.00
## nbr.na                       0.00
## min                          2.00
## max                        100.00
## range                       98.00
## sum                       7075.00
## median                      65.00
## mean                        63.74
## SE.mean                      1.53
## CI.mean.0.95                 3.02
## var                        258.30
## std.dev                     16.07
## coef.var                     0.25
  • MEAN of the reading scores is 65.56, meaning that on average, the students have scored 65.56 points for the reading exam.
  • MEDIAN of the writing score equals 67, which means that 50% of students have all together achieved less than 67 points and the other half have achieved more than 67 points.
  • The RANGE of the corrected math score is 98. This means that the difference between the maximum value (= 100) and minimum value (=2) is 98. The best performing student regarding the math test has achieved full points, however the worst performing student has achieved 2 points. 100-2=98.
  • Based on the coefficients of variation, total studying days has the highest dispersion around the mean (variability).

Graph the distribution of the variables.

boxplot(mydata4[ , c(-1, -2)])

From the boxplots above, I can conclude that the medians of the reading score variable as well as the writing score variable are the same (67), whereas the mean of the corrected math score is lower than the mean reading and writing scores. The highest overall obtained score was 100, the lowest obtained score overall lays in the outlier of the math test, with the value of approximately 2.

READING SCORE VARIABLE: The minimum value excluding outliers is approximately 25, the maximum value excluding outliers is 100.

WRITING SCORE: The minimum value excluding outliers is approximately 28, the maximum value excluding outliers is 100.

CORRECTED MATH SCORE: The minimum value excluding outliers is approximately 30, the maximum value excluding outliers is 100.

Overall, I can conclude that the dispursion of scores was not that high, meaning that all together, the test scores were similar. The students showed approximately the same amount of knowledge for all three tests and more than 50% of the students have passed ( score more than 60) each subject.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplot(y=mydata4$`total studying days`,
            x=mydata4$`corrected math score`,
            ylab = "Total studying days",
            xlab= "Math score",
            smooth=FALSE,
            regLine= TRUE)

With this scatter plot I wanted to show the relationship between two of the numeric variables, total studying days and corrected math score. Based on this graph, I can conclude that if the math score increases, studying days increased too. There is a positive correlation between these two variables, with some outliers present. Based on that, I can say that more days the students were studying, higher was their math score on the exam, excluding some exceptions.

scatterplotMatrix(mydata4[ , -1],
                  smooth = FALSE)

Using this graphical representation I can see the relationship between the four numerical variables in my data set. The pictures with a curve depict the distribution of variables (ex. total studying days is positively skewed- to the right). In addition, I can see how positively correlated are most score variables with the total studying days (total studying days increased, the score increased as well).

TASK 2

Show the descriptive statistics of body mass and its distribution with the histogram.

#mydata <- Body_mass
mydata <-read.table("~/Desktop/Task 2/Body mass.csv", header= TRUE, 
                    sep= ";", dec= ",")
head(mydata)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
str(mydata[-1])
## 'data.frame':    50 obs. of  1 variable:
##  $ Mass: num  62.1 64.5 56.5 53.4 61.3 62.2 62.7 64.5 59.5 68.9 ...

There is 50 units of observation in the data set. Mass is a numerical variable.

summary(mydata[-1])
##       Mass      
##  Min.   :49.70  
##  1st Qu.:60.23  
##  Median :62.80  
##  Mean   :62.88  
##  3rd Qu.:64.50  
##  Max.   :83.20

The mass data, summarized.

#install.packages("psych")
library(psych)
describe(mydata[-1])
##      vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## Mass    1 50 62.88 6.01   62.8   62.56 3.34 49.7 83.2  33.5 0.85     2.11 0.85

Descriptive statistics of mass. An example of the interpretation: MEAN: The mean of the mass variable is 62.88, meaning that on average, the mass of all the units of observation from the sample is 62.88.

hist(mydata$Mass, main= "Histogram of body mass distribution",
     xlab= "Mass",
     ylab="Frequency", 
     breaks= seq(from= 40, to= 90, by= 5))

The distribution of the mass with the histogram.

Test the H0 hypothesis.

t.test(mydata$Mass, mu= 59.5 , alternative = "two.sided" )
## 
##  One Sample t-test
## 
## data:  mydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

Explain the results and interpret the effect size.

As the p value (which is equal to 0.000234) is lower than alpha of 5%, we reject the null hypothesis at the p= 0.000234. The mean (=62,88) falls out of the interval for confirmation of H0.

r <- sqrt (3.9711^2/(3.9711^2 + 49))
print(r)
## [1] 0.4934295

The effect size is 0.4934295, which means that there is a significant change in the weight of the students in 2021/22 compared to the year 2018/19. H0 can therefore be rejected.

TASK 3

Import the dataset Apartments.xlsx

#install.packages("readxl")
library(readxl)
mydata1 <- read_xlsx("~/Desktop/Task 3/Apartments.xlsx")


head(mydata1)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

There are 85 observations and 5 variables in this data set. - Age refers to the age of an apartment, measured in years. It is a numerical variable. - Distance means the distance from the center.It is a numerical variable. - Price is measured per m2. It is a numerical variable. - Parking is a categorical variable, later factored, where 0-No, 1-Yes. - Balcony is a categorical variable, later factored, where 0-No, 1-Yes.

Change categorical variables into factors

mydata1$ParkingFactor <- factor(mydata1$Parking,
                               levels = c(1,0),
                               labels = c("yes", "no"))

mydata1$BalconyFactor <- factor(mydata1$Balcony,
                               levels = c(1,0),
                               labels = c("yes", "no"))

str(mydata1)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Age          : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance     : num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price        : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking      : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony      : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
##  $ ParkingFactor: Factor w/ 2 levels "yes","no": 2 1 2 2 1 2 2 1 1 1 ...
##  $ BalconyFactor: Factor w/ 2 levels "yes","no": 1 2 2 1 1 1 1 1 2 2 ...

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata1$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata1$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Based on the sample data we can conclude that we can reject H0 with 5% level of significance, with p-value= 0.004731. (p is smaller than alpha) H0 = 1900.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = mydata1)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
sqrt(0.05302)
## [1] 0.2302607

Estimate of regression coefficient (8.975) means that if the apartment age increases for 1 year, then the price per m2 decreases by 8.975 EUR on average, ceteris paribus.

Coefficient of correlation is 0.23, meaning that there is not a strong linear relationship between apartment age and price per m2.

Coefficient of determination (0.05302) means that 5.302 of variability of price per m2 is explained by the linear effect of the apartment age variable.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

scatterplotMatrix(mydata1[ ,c(1,2,3)],
                  smooth = TRUE)

Based on the Scatter plot Matrix, there is no potential problem with multicolinearity detected.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata1)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

VIF average should be as close to one as possible (which it is in our case= 1.001845), all VIFs should be less than 5. Our data applies to both of these conditions so we should leave it in the dataset. These variables are therefore not strongly, but just moderately correlated.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

mydata1$stdresid <- round(rstandard(fit2), 2) 
mydata1$cooksD <- round(cooks.distance(fit2), 2)


hist(mydata1$stdresid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

hist(mydata1$cooksD, 
     xlab = "Cooks distance",
     ylab =  "Frequency",
     main = "Histogram of Cooks distance")

head(mydata1[order(-mydata1$stdresid), ])
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor stdresid cooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     5       45  2180       1       1 yes           yes               2.58   0.32
## 2     2       11  2790       1       0 yes           no                2.05   0.07
## 3    18        1  2800       1       0 yes           no                1.78   0.03
## 4    18        1  2800       1       1 yes           yes               1.78   0.03
## 5     8        2  2820       1       0 yes           no                1.66   0.04
## 6    10        1  2810       0       0 no            no                1.6    0.03
head(mydata1 [order(-mydata1$cooksD) , ])
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor stdresid cooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     5       45  2180       1       1 yes           yes               2.58   0.32
## 2    43       37  1740       0       0 no            no                1.44   0.1 
## 3     2       11  2790       1       0 yes           no                2.05   0.07
## 4     7        2  1760       0       1 no            yes              -2.15   0.07
## 5    37        3  2540       1       1 yes           yes               1.58   0.06
## 6    40        2  2400       0       1 no            yes               1.09   0.04

Since the cooks distance is significantly higher at the first above observation, I will get rid of this outlier.

mydata7 <- mydata1[-38, ]
head(mydata7 [order(-mydata7$cooksD) , ])
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor stdresid cooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1    43       37  1740       0       0 no            no                1.44   0.1 
## 2     2       11  2790       1       0 yes           no                2.05   0.07
## 3     7        2  1760       0       1 no            yes              -2.15   0.07
## 4    37        3  2540       1       1 yes           yes               1.58   0.06
## 5    40        2  2400       0       1 no            yes               1.09   0.04
## 6     8        2  2820       1       0 yes           no                1.66   0.04
hist(mydata7$cooksD, 
     xlab = "Cooks distance",
     ylab =  "Frequency",
     main = "Histogram of Cooks distance")

I removed a variable where the cooks distance was 0.32, since it was very different from the rest of them and was an outlier to this case.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

mydata1$stdfittedvalues <- scale(fit2$fitted.values)
scatterplot(y=mydata1$stdresid, x = mydata1$stdfittedvalues,
            ylab = "Standardized Residuals",
            xlab = "Standardized Fitted values",
            boxplots = FALSE, 
            regLine = FALSE,
            smooth = FALSE)

The heteroskedasticity is not detected, therefore there is no valuation of the homoskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata1$stdresid,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata1$stdresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$stdresid
## W = 0.95307, p-value = 0.003667

The distribution seen from the graph has two peaks. Therefore it is not a normal distribution (but bimodal).

Estimate the fit2 again without potentially excluded cases and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata1)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11
sqrt(summary(fit2)$r.squared)
## [1] 0.6629887

If the age of an apartment increased by 1 year, the price per m2 decreased by 7.934 EUR on average, ceteris paribus. When the distance between the apartment and the center increases by 1 kilometer,price per m2 decreases by 20 667 EUR on average, ceteris paribus.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = mydata1)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because p is less than 5%, it means that fit3 explains more and is therefore better.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2499.770     80.645  30.997  < 2e-16 ***
## Age               -6.799      3.110  -2.186  0.03172 *  
## Distance         -18.045      2.758  -6.543 5.28e-09 ***
## ParkingFactorno -196.168     62.868  -3.120  0.00251 ** 
## BalconyFactorno   -1.935     60.014  -0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11
  • Parking factor, given the other 3 variables (Age, Distance and Balcony Factor), we can conclude that those apartments with a parking place have a higher price for 196 EURon average, when p is 0.00251.
  • Balcony factor, we cannot say whether having a Balcony affects the price or not at our p level, which is 0.97436.

Save fitted values and calculate the residual for apartment ID2.

mydata1$fittedvalues <- fitted.values(fit3)
mydata1$residuals <- residuals(fit3)
mydata1[2,9]
## # A tibble: 1 × 1
##   cooksD
##    <dbl>
## 1   0.03