# Task 1
I found the data among the datasets on the Kaggle website.
mydata1 <- read.csv('~/Desktop/bootcamp R/Exam.csv',
                    header = TRUE,
                    sep = ';',
                    dec = ',')
print(mydata1)
## gender age Residence_type avg_glucose_level bmi
## 1 0 67 Urban 229 37
## 2 1 80 Rural 106 33
## 3 1 49 Urban 171 34
## 4 1 79 Rural 174 24
## 5 0 81 Urban 186 29
## 6 0 74 Rural 70 27
## 7 0 56 Rural 73 31
## 8 1 88 Rural 88 16
## 9 0 25 Urban 138 27
## 10 1 14 Rural 72 21
## 11 1 53 Rural 69 36
## 12 0 4 Rural 106 17
## 13 1 39 Urban 102 42
## 14 0 53 Rural 110 42
## 15 0 5 Rural 86 16
## 16 0 4 Rural 110 17
## 17 1 37 Urban 92 29
## 18 0 70 Urban 252 27
## 19 0 24 Urban 123 38
## 20 0 55 Rural 97 45
## 21 1 70 Rural 68 23
## 22 0 34 Urban 81 33
## 23 0 26 Rural 120 22
## 24 1 39 Rural 84 26
## 25 0 38 Urban 74 40
## 26 1 20 Urban 92 28
## 27 1 17 Urban 88 39
## 28 0 60 Rural 213 36
## 29 1 75 Rural 75 38
## 30 1 31 Rural 122 40
## 31 0 58 Rural 223 42
## 32 1 5 Rural 85 18
## 33 1 13 Urban 70 21
## 34 1 22 Rural 108 42
## 35 0 71 Urban 95 32
## 36 1 31 Urban 125 24
## 37 0 41 Urban 112 39
## 38 0 65 Rural 95 29
## 39 1 20 Urban 113 28
## 40 1 51 Urban 105 44
## 41 1 37 Urban 84 27
## 42 0 49 Rural 103 29
## 43 1 62 Rural 77 35
## 44 1 28 Rural 87 40
## 45 0 80 Rural 57 27
## 46 0 45 Rural 73 25
## 47 1 37 Urban 106 30
## 48 1 3 Rural 81 16
## 49 0 26 Urban 104 31
## 50 0 58 Urban 194 28
## 51 1 59 Rural 112 36
## 52 0 74 Rural 58 32
## 53 0 17 Rural 69 33
## 54 1 62 Urban 60 28
## 55 0 37 Urban 160 32
## 56 1 59 Rural 237 28
## 57 1 59 Rural 99 23
## 58 0 62 Urban 56 24
## 59 0 80 Urban 84 30
## 60 0 55 Urban 168 24
## 61 1 54 Urban 76 29
## 62 0 73 Rural 82 32
## 63 0 26 Urban 200 32
## 64 0 73 Rural 62 25
## 65 1 38 Rural 84 26
## 66 0 75 Rural 221 33
## 67 0 3 Rural 63 20
## 68 0 60 Rural 71 27
## 69 1 45 Urban 93 44
## 70 0 23 Rural 82 32
## 71 0 21 Rural 88 37
## 72 1 67 Rural 80 23
## 73 0 10 Rural 84 19
## 74 1 24 Rural 70 30
## 75 0 50 Urban 192 43
## 76 0 14 Urban 102 20
## 77 1 13 Rural 78 21
## 78 0 71 Urban 216 39
## 79 0 32 Rural 92 30
## 80 1 38 Rural 94 22
## 81 1 11 Urban 77 19
## 82 0 63 Urban 84 21
## 83 1 18 Urban 96 23
## 84 1 46 Urban 181 23
## 85 1 8 Urban 95 20
## 86 1 53 Rural 61 29
## 87 0 38 Urban 103 22
## 88 0 74 Urban 130 26
## 89 0 24 Urban 65 23
## 90 0 78 Rural 80 26
## 91 1 60 Rural 63 31
## 92 0 12 Urban 70 25
## 93 1 32 Rural 109 24
## 94 0 5 Urban 112 20
## 95 1 40 Rural 92 38
## 96 1 19 Urban 77 27
## 97 1 28 Urban 70 25
## 98 1 44 Rural 70 41
## 99 0 50 Urban 89 35
## 100 1 45 Rural 111 26
Description of the selected variables:

- Gender: gender of the individual (female/male)
- Age: age of the individual (years)
- Residence_type: place where the individual lives (urban/rural area)
- Avg_glucose_level: average glucose level in the blood measured over 60 to 90 days (units: mg/dL)
- BMI: body mass index of the individual (from 0 to 50 and above, units: kg/m2)
Next, the column names are standardized:
colnames(mydata1) <- c('Gender','Age','Residence_Type','Avg_Glucose_Level','BMI')
head(mydata1)
## Gender Age Residence_Type Avg_Glucose_Level BMI
## 1 0 67 Urban 229 37
## 2 1 80 Rural 106 33
## 3 1 49 Urban 171 34
## 4 1 79 Rural 174 24
## 5 0 81 Urban 186 29
## 6 0 74 Rural 70 27
Next, the Residence_Type column is removed (the result is stored in mydata1a):
mydata1a <- mydata1[,-3]
print(mydata1a)
## Gender Age Avg_Glucose_Level BMI
## 1 0 67 229 37
## 2 1 80 106 33
## 3 1 49 171 34
## 4 1 79 174 24
## 5 0 81 186 29
## 6 0 74 70 27
## 7 0 56 73 31
## 8 1 88 88 16
## 9 0 25 138 27
## 10 1 14 72 21
## 11 1 53 69 36
## 12 0 4 106 17
## 13 1 39 102 42
## 14 0 53 110 42
## 15 0 5 86 16
## 16 0 4 110 17
## 17 1 37 92 29
## 18 0 70 252 27
## 19 0 24 123 38
## 20 0 55 97 45
## 21 1 70 68 23
## 22 0 34 81 33
## 23 0 26 120 22
## 24 1 39 84 26
## 25 0 38 74 40
## 26 1 20 92 28
## 27 1 17 88 39
## 28 0 60 213 36
## 29 1 75 75 38
## 30 1 31 122 40
## 31 0 58 223 42
## 32 1 5 85 18
## 33 1 13 70 21
## 34 1 22 108 42
## 35 0 71 95 32
## 36 1 31 125 24
## 37 0 41 112 39
## 38 0 65 95 29
## 39 1 20 113 28
## 40 1 51 105 44
## 41 1 37 84 27
## 42 0 49 103 29
## 43 1 62 77 35
## 44 1 28 87 40
## 45 0 80 57 27
## 46 0 45 73 25
## 47 1 37 106 30
## 48 1 3 81 16
## 49 0 26 104 31
## 50 0 58 194 28
## 51 1 59 112 36
## 52 0 74 58 32
## 53 0 17 69 33
## 54 1 62 60 28
## 55 0 37 160 32
## 56 1 59 237 28
## 57 1 59 99 23
## 58 0 62 56 24
## 59 0 80 84 30
## 60 0 55 168 24
## 61 1 54 76 29
## 62 0 73 82 32
## 63 0 26 200 32
## 64 0 73 62 25
## 65 1 38 84 26
## 66 0 75 221 33
## 67 0 3 63 20
## 68 0 60 71 27
## 69 1 45 93 44
## 70 0 23 82 32
## 71 0 21 88 37
## 72 1 67 80 23
## 73 0 10 84 19
## 74 1 24 70 30
## 75 0 50 192 43
## 76 0 14 102 20
## 77 1 13 78 21
## 78 0 71 216 39
## 79 0 32 92 30
## 80 1 38 94 22
## 81 1 11 77 19
## 82 0 63 84 21
## 83 1 18 96 23
## 84 1 46 181 23
## 85 1 8 95 20
## 86 1 53 61 29
## 87 0 38 103 22
## 88 0 74 130 26
## 89 0 24 65 23
## 90 0 78 80 26
## 91 1 60 63 31
## 92 0 12 70 25
## 93 1 32 109 24
## 94 0 5 112 20
## 95 1 40 92 38
## 96 1 19 77 27
## 97 1 28 70 25
## 98 1 44 70 41
## 99 0 50 89 35
## 100 1 45 111 26
A new factor variable, GenderF, is created from Gender:
mydata1a$GenderF <- factor(mydata1a$Gender,
                           levels = c(0,1),
                           labels = c('M','F'))
head(mydata1a)
## Gender Age Avg_Glucose_Level BMI GenderF
## 1 0 67 229 37 M
## 2 1 80 106 33 F
## 3 1 49 171 34 F
## 4 1 79 174 24 F
## 5 0 81 186 29 M
## 6 0 74 70 27 M
Below, I introduce the condition (BMI >= 25) to see how many individuals in the sample have an elevated BMI. Only the first six rows (sorted by BMI) are displayed, but 70 of the 100 individuals meet this condition.
mydata1aa <- mydata1[mydata1$BMI >= 25, ]
head(mydata1aa[order(mydata1aa$BMI), ])
## Gender Age Residence_Type Avg_Glucose_Level BMI
## 46 0 45 Rural 73 25
## 64 0 73 Rural 62 25
## 92 0 12 Urban 70 25
## 97 1 28 Urban 70 25
## 24 1 39 Rural 84 26
## 65 1 38 Rural 84 26
Here is a summary of the three numeric variables (Gender and Residence_Type are excluded):
summary(mydata1[ ,c(-1,-3)])
## Age Avg_Glucose_Level BMI
## Min. : 3.00 Min. : 56.00 Min. :16.00
## 1st Qu.:24.00 1st Qu.: 76.75 1st Qu.:23.75
## Median :42.50 Median : 92.00 Median :28.00
## Mean :42.91 Mean :106.36 Mean :29.18
## 3rd Qu.:60.50 3rd Qu.:112.00 3rd Qu.:34.25
## Max. :88.00 Max. :252.00 Max. :45.00
Age: The youngest individual in the sample is 3 years old and the oldest is 88. If the total age were spread equally across the sample, each individual would be 42.91 years old (the mean). Half of the individuals are at most 42.5 years old (the median) and the other half are older.
Avg_Glucose_Level: The minimum average glucose level in the sample is 56 mg/dL and the maximum is 252 mg/dL. The mean average glucose level is 106.36 mg/dL. Half of the individuals have an average glucose level of at most 92 mg/dL (the median) and the other half lie above that.
BMI: The minimum BMI in the sample is 16 and the maximum is 45. The mean BMI is 29.18. Half of the individuals have a BMI of at most 28 (the median) and the other half lie above that.
library(pastecs)
round(stat.desc(mydata1[ ,c(-1,-3)]),2)
## Age Avg_Glucose_Level BMI
## nbr.val 100.00 100.00 100.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 3.00 56.00 16.00
## max 88.00 252.00 45.00
## range 85.00 196.00 29.00
## sum 4291.00 10636.00 2918.00
## median 42.50 92.00 28.00
## mean 42.91 106.36 29.18
## SE.mean 2.32 4.61 0.74
## CI.mean.0.95 4.60 9.16 1.47
## var 536.43 2129.61 55.12
## std.dev 23.16 46.15 7.42
## coef.var 0.54 0.43 0.25
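As a sanity check on the table above, the reported CI.mean.0.95 is just SE.mean scaled by the appropriate t quantile; a small verification sketch (not in the original code), shown for BMI:

qt(0.975, df = 99) * sd(mydata1$BMI) / sqrt(100)  # half-width of the 95% CI; about 1.47, matching the table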
Here is the graphical representation of the average glucose level data.
hist(mydata1a$Avg_Glucose_Level,
     main = 'Histogram of frequency distribution of average glucose level',
     xlab = 'Average glucose level',
     ylab = 'Frequency',
     breaks = seq(55,255,10),
     right = FALSE)
From the histogram we can see that the distribution is asymmetrical and skewed to the right (positive skew). We can also conclude that the majority of individuals in the sample have a roughly normal average blood glucose level (72 to 99 mg/dL).
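This claim can also be verified numerically; a minimal sketch (not part of the original code) counting the individuals in the normal range:

sum(mydata1a$Avg_Glucose_Level >= 72 & mydata1a$Avg_Glucose_Level <= 99)  # individuals with normal glucose (72-99 mg/dL)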
# Task 2
mydata2 <- read.table('~/Desktop/bootcamp R/R Take Home Exam/Task 2/Body mass.csv',
                      header = TRUE,
                      sep = ';',
                      dec = ',')
print(mydata2)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
## 7 7 62.7
## 8 8 64.5
## 9 9 59.5
## 10 10 68.9
## 11 11 64.5
## 12 12 62.8
## 13 13 63.7
## 14 14 52.3
## 15 15 64.1
## 16 16 64.2
## 17 17 64.2
## 18 18 60.2
## 19 19 59.6
## 20 20 60.9
## 21 21 54.2
## 22 22 72.1
## 23 23 60.3
## 24 24 63.1
## 25 25 59.8
## 26 26 49.7
## 27 27 58.6
## 28 28 58.7
## 29 29 62.0
## 30 30 65.7
## 31 31 64.2
## 32 32 62.7
## 33 33 78.5
## 34 34 61.6
## 35 35 60.4
## 36 36 63.3
## 37 37 69.3
## 38 38 63.0
## 39 39 60.8
## 40 40 64.9
## 41 41 62.8
## 42 42 57.3
## 43 43 69.3
## 44 44 62.8
## 45 45 62.9
## 46 46 53.6
## 47 47 65.2
## 48 48 75.3
## 49 49 83.2
## 50 50 66.4
Description of the given variables:

- ID: identification number of each ninth grader
- Mass: body mass of the individual (units: kg)
summary(mydata2[ ,c(-1)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
Only the summary statistics for Mass are displayed, because ID is merely an identifier. The minimum body mass in the sample is 49.70 kg and the maximum is 83.20 kg; the mean is 62.88 kg. Half of the individuals weigh at most 62.80 kg (the median) and the other half weigh more.
library(psych)
describe(mydata2)
## vars n mean sd median trimmed mad min max range skew kurtosis
## ID 1 50 25.50 14.58 25.5 25.50 18.53 1.0 50.0 49.0 0.00 -1.27
## Mass 2 50 62.88 6.01 62.8 62.56 3.34 49.7 83.2 33.5 0.85 2.11
## se
## ID 2.06
## Mass 0.85
The ID variable should be excluded from the table because it is only an identifier, not a measured variable (see below). With a single column left, describe() then labels the Mass variable X1.
describe(mydata2[ ,c(-1)])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 62.88 6.01 62.8 62.56 3.34 49.7 83.2 33.5 0.85 2.11 0.85
We can see from the output above that we have 50 observations with a standard deviation of 6.01 kg, i.e. the body mass of a ninth grader deviates from the mean by about 6 kg on average. Relative to the mean of 62.88 kg this is a fairly modest dispersion (a coefficient of variation of roughly 9.6%), so the sample is reasonably homogeneous.
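To quantify this relative dispersion directly, the coefficient of variation can be computed; a small sketch (not part of the original code):

sd(mydata2$Mass) / mean(mydata2$Mass)  # approximately 6.01 / 62.88 = 0.096, i.e. about 9.6% of the mean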
Here is a graphical representation of the ninth graders' body mass as a histogram.
hist(mydata2$Mass,
     main = 'Histogram of body mass of 50 ninth graders',
     xlab = 'Body mass',
     ylab = 'Frequency',
     breaks = seq(45,85,5))
The histogram shows that the sample distribution is slightly asymmetric to the right, which is why the skewness is positive (skew = 0.85); the kurtosis (2.11) is also positive, indicating a distribution more peaked than the normal.
Here is the one-sample t-test for the body mass variable, testing H0: mu = 59.5 kg:
t.test(mydata2$Mass,
       mu = 59.5,
       alternative = 'two.sided')
##
## One Sample t-test
##
## data: mydata2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
We used a two-sided one-sample t-test because we compare a single sample mean with a hypothesized value and the population standard deviation is unknown. The degrees of freedom are 49, since one is subtracted from the sample size. The p-value is 0.000234, which is less than 5%, so we reject H0 and accept the alternative/research hypothesis. We can be 95% confident that the true population mean lies between 61.16758 kg and 64.58442 kg. The sample mean is 62.876 kg, which matches the calculations above.
For the effect size we use the correlation coefficient, r = sqrt(t^2 / (t^2 + df)) = sqrt(3.9711^2 / (3.9711^2 + 49)) = 0.493429485, which represents a medium effect for the sample data.
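A minimal sketch of this computation in R, using the t statistic and degrees of freedom from the output above:

t_value <- 3.9711
df <- 49
sqrt(t_value^2 / (t_value^2 + df))  # approximately 0.4934, a medium effect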
# Task 3
knitr::opts_chunk$set(echo = TRUE)
options(width=120)
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
#install.packages("reshape2")
library(reshape2)
mydata3 <- read.csv('~/Desktop/bootcamp R/R Take Home Exam/Task 3/Apartments.csv',
                    header = TRUE,
                    sep = ';',
                    dec = ',')
print(mydata3)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
## 7 14 20 1850 0 1
## 8 18 6 1970 1 1
## 9 22 7 2270 1 0
## 10 25 2 2570 1 0
## 11 12 5 2700 1 0
## 12 18 26 1680 1 0
## 13 12 14 1650 0 1
## 14 10 12 2000 0 1
## 15 24 3 2290 0 0
## 16 11 14 1800 0 0
## 17 26 3 2010 1 1
## 18 7 11 2340 1 1
## 19 19 33 1400 0 0
## 20 13 8 1800 0 0
## 21 6 12 2060 1 1
## 22 37 3 2540 1 1
## 23 22 23 1730 0 1
## 24 18 4 2110 1 1
## 25 8 26 2300 1 1
## 26 12 3 2480 1 0
## 27 16 1 2750 1 0
## 28 18 19 1620 1 0
## 29 13 2 2170 1 0
## 30 1 5 2420 1 1
## 31 45 21 1910 0 1
## 32 12 5 2390 1 0
## 33 2 11 2790 1 0
## 34 9 3 2140 0 1
## 35 14 16 1660 0 1
## 36 24 5 1830 1 0
## 37 18 4 1950 1 0
## 38 5 45 2180 1 1
## 39 40 2 2400 0 1
## 40 33 37 1510 1 1
## 41 18 12 1770 1 0
## 42 24 1 2220 1 1
## 43 1 3 2460 1 1
## 44 28 1 2250 0 1
## 45 18 40 1400 0 1
## 46 14 25 2250 1 0
## 47 32 17 1840 1 0
## 48 38 13 1610 0 1
## 49 9 15 2290 0 1
## 50 14 8 1990 1 0
## 51 18 39 1620 0 0
## 52 22 16 1710 0 1
## 53 7 2 1760 0 1
## 54 30 17 1560 0 0
## 55 43 37 1740 0 0
## 56 17 18 1860 0 0
## 57 10 1 2810 0 0
## 58 8 2 2820 1 0
## 59 18 16 1830 0 0
## 60 30 9 2130 0 0
## 61 18 1 2800 1 1
## 62 7 28 1660 0 0
## 63 28 29 1850 0 0
## 64 18 18 1640 1 1
## 65 28 12 1770 0 0
## 66 14 20 1850 0 0
## 67 18 6 1970 1 1
## 68 22 7 2270 1 0
## 69 25 2 2570 1 0
## 70 12 5 2700 1 1
## 71 18 26 1680 1 0
## 72 12 14 1650 0 0
## 73 10 12 2000 0 0
## 74 24 3 2290 0 0
## 75 11 14 1800 0 0
## 76 26 3 2010 1 0
## 77 18 40 1400 0 0
## 78 14 25 2250 1 1
## 79 32 17 1840 1 0
## 80 38 13 1610 0 0
## 81 9 15 2290 0 1
## 82 14 8 1990 1 0
## 83 18 39 1620 0 0
## 84 22 16 1710 0 1
## 85 32 17 1840 1 0
Description of the variables (inferred from the analysis below):

- Age: age of the apartment (years)
- Distance: distance from the city center
- Price: price of the apartment (EUR)
- Parking: whether the apartment has parking (0 = no, 1 = yes)
- Balcony: whether the apartment has a balcony (0 = no, 1 = yes)

The categorical variables are converted to factors, first Parking and then Balcony.
mydata3$ParkingF <- factor(mydata3$Parking,
                           levels = c(0,1),
                           labels = c('No','Yes'))
head(mydata3)
## Age Distance Price Parking Balcony ParkingF
## 1 7 28 1640 0 1 No
## 2 18 1 2800 1 0 Yes
## 3 7 28 1660 0 0 No
## 4 28 29 1850 0 1 No
## 5 18 18 1640 1 1 Yes
## 6 28 12 1770 0 1 No
mydata3$BalconyF <- factor(mydata3$Balcony,
                           levels = c(0,1),
                           labels = c('No','Yes'))
head(mydata3)
## Age Distance Price Parking Balcony ParkingF BalconyF
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
Testing the hypothesis H0: mu_Price = 1900 EUR.
t.test(mydata3$Price,
       mu = 1900,
       alternative = 'two.sided')
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We used a two-sided one-sample t-test. The degrees of freedom are 84, since one is subtracted from the sample size. The p-value is 0.004731, which is less than 5%, so we reject H0 and accept the alternative/research hypothesis: the mean price differs from 1900 EUR. We can be 95% confident that the true population mean lies between 1937.443 EUR and 2100.440 EUR.
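The same effect-size formula as in Task 2 can be applied here (an addition, not part of the original analysis):

t_value <- 2.9022
df <- 84
sqrt(t_value^2 / (t_value^2 + df))  # approximately 0.30, a small-to-medium effect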
Here is the simple regression of Price on Age, Price = f(Age):
fit1 <- lm(Price ~ Age,
           data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
From this summary we can see that the estimated regression function is Price = 2185.455 - 8.975*Age, so if Age increases by one year, Price on average decreases by 8.975 EUR. For the t-test the null hypothesis is B1 = 0 and the alternative hypothesis is that B1 is not equal to 0. Based on the p-value of 0.034, which is less than 5%, we reject the null hypothesis and accept the alternative. Furthermore, the coefficient of determination is 0.05302, meaning that 5.302% of the variability of Price is explained by the linear effect of Age.
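A minimal sketch of how the fitted line could be visualized (not part of the original code):

plot(Price ~ Age, data = mydata3,
     xlab = 'Age (years)', ylab = 'Price (EUR)',
     main = 'Price vs. Age with fitted regression line')
abline(fit1)  # draws the estimated line Price = 2185.455 - 8.975*Age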
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata3[,c(-2,-4,-5,-6,-7)]))
## Age Price
## Age 1.00 -0.23
## Price -0.23 1.00
##
## n= 85
##
##
## P
## Age Price
## Age 0.034
## Price 0.034
cor(mydata3$Age,mydata3$Price)
## [1] -0.230255
From the correlation coefficient (-0.23) we can see that the relationship between Age and Price is negative and weak.
Here the columns are reordered so that Price comes first, followed by Age and Distance. Below is the scatterplot matrix of these three variables.
mydata3 <- mydata3[c(3,1,2,4,5,6,7)]
head(mydata3)
## Price Age Distance Parking Balcony ParkingF BalconyF
## 1 1640 7 28 0 1 No Yes
## 2 2800 18 1 1 0 Yes No
## 3 1660 7 28 0 0 No No
## 4 1850 28 29 0 1 No Yes
## 5 1640 18 18 1 1 Yes Yes
## 6 1770 28 12 0 1 No Yes
library(car)
scatterplotMatrix(mydata3[ ,c(-4,-5,-6,-7)],
                  smooth = FALSE)
From this scatterplot matrix we can see that the distributions of Price, Age and Distance are all skewed to the right, and that the relationships of Price with Age and of Price with Distance are negative. This is plausible: the older the apartment, the lower the price, and the farther the apartment is from the city center, the lower the price.
library(Hmisc)
rcorr(as.matrix(mydata3[ ,c(-4,-5,-6,-7)]))
## Price Age Distance
## Price 1.00 -0.23 -0.63
## Age -0.23 1.00 0.04
## Distance -0.63 0.04 1.00
##
## n= 85
##
##
## P
## Price Age Distance
## Price 0.0340 0.0000
## Age 0.0340 0.6966
## Distance 0.0000 0.6966
From the correlation coefficients we can see that the relationship between Price and Age is negative and weak (-0.23), while the relationship between Price and Distance is negative and moderate (-0.63).
Estimate the multiple regression function: Price = f(Age, Distance).
fit2 <- lm(Price ~ Age + Distance,
           data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
From this summary we can see that the estimated multiple regression function is Price = 2460.101 - 7.934*Age - 20.667*Distance: if Age increases by one year, Price on average decreases by 7.934 EUR, holding Distance constant; and if Distance increases by one unit, Price on average decreases by 20.667 EUR, holding Age constant. For each t-test, the null hypothesis is that the coefficient equals 0 and the alternative is that it does not. Based on the p-values, which are both less than 5%, we reject the null hypothesis for both predictors. Furthermore, the coefficient of determination is 0.4396, meaning that 43.96% of the variability of Price is explained by the linear effects of Age and Distance.
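As a usage example, the fitted function can be used to predict the price of a hypothetical apartment (illustrative values, not part of the original analysis):

predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# by the coefficients above: 2460.101 - 7.934*10 - 20.667*5, approximately 2277 EUR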
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Both VIFs are 1.002, essentially 1, so the explanatory variables Age and Distance are practically uncorrelated and multicollinearity is not a problem.
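This agrees with the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing one explanatory variable on the other; a small verification sketch using the Age-Distance correlation reported above:

1 / (1 - cor(mydata3$Age, mydata3$Distance)^2)  # about 1.0018, matching vif(fit2)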
Here we calculate the standardized residuals and Cook's distances for model fit2.
mydata3$StdResid <- round(rstandard(fit2),3)
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.95303, p-value = 0.003645
H0: the variable is normally distributed; H1: the variable is not normally distributed. The p-value (0.003645) is less than 5%, so we reject the null hypothesis and accept the alternative: the standardized residuals are not normally distributed, which can also be seen in the histogram below.
hist(mydata3$StdResid,
     xlab = 'Standardized residuals',
     ylab = 'Frequency',
     main = 'Histogram of standardized residuals')
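A normal Q-Q plot is another common way to inspect the normality of the residuals; a small addition, not part of the original code:

qqnorm(mydata3$StdResid)  # points should follow the line if the residuals are normal
qqline(mydata3$StdResid)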
mydata3$CooksD <- round(cooks.distance(fit2),3)
hist(mydata3$CooksD,
     xlab = "Cook's distance",
     ylab = 'Frequency',
     main = "Histogram of Cook's distances")
The histogram of Cook's distances is, as expected, skewed to the right: the majority of units have small Cook's distances and thus little influence on the fitted function.
head(mydata3[order(-mydata3$StdResid),],3)
## Price Age Distance Parking Balcony ParkingF BalconyF StdResid CooksD
## 38 2180 5 45 1 1 Yes Yes 2.577 0.320
## 33 2790 2 11 1 0 Yes No 2.051 0.069
## 2 2800 18 1 1 0 Yes No 1.783 0.030
head(mydata3[order(-mydata3$CooksD),],6)
## Price Age Distance Parking Balcony ParkingF BalconyF StdResid CooksD
## 38 2180 5 45 1 1 Yes Yes 2.577 0.320
## 55 1740 43 37 0 0 No No 1.445 0.104
## 33 2790 2 11 1 0 Yes No 2.051 0.069
## 53 1760 7 2 0 1 No Yes -2.152 0.066
## 22 2540 37 3 1 1 Yes Yes 1.576 0.061
## 39 2400 40 2 0 1 No Yes 1.091 0.038
The outlier (row 38) will be removed below, and fit2 will then be estimated again.
mydata3$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFittedValues,
            ylab = 'Standardized residuals',
            xlab = 'Standardized fitted values',
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
We can see that the linearity assumption is met. Below, heteroskedasticity is examined.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
H0: the variance is constant (homoskedasticity). The p-value is 0.3252, which is well above 5%, so we cannot reject the null hypothesis: the variance is constant and heteroskedasticity is not present.
mydata3 <- mydata3[-38, ]
fit2 <- lm(Price ~ Age + Distance,
           data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
If Age increases by one year, Price decreases on average by 6.464 EUR, ceteris paribus. If Distance increases by one unit, Price decreases on average by 22.955 EUR, assuming all other variables remain unchanged.
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
The multiple correlation coefficient is 0.6955609, which means that the linear relationship between Price and the two explanatory variables together is strong.
library(lm.beta)
lm.beta(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Standardized Coefficients::
## (Intercept) Age Distance
## NA -0.1640478 -0.6607329
Distance has the largest effect, because its standardized coefficient is the largest in absolute value (|-0.661| vs. |-0.164|).
Here is the extended regression function fit3, which adds the two factor variables:
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
           data = mydata3)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFYes 167.531 62.864 2.665 0.00933 **
## BalconyFYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
From the regression coefficients we can see that apartments with parking are on average 167.531 EUR more expensive than comparable apartments without parking, assuming all other variables remain unchanged. Apartments with a balcony are on average 15.207 EUR cheaper, ceteris paribus, but this effect is not statistically significant (p = 0.798). R2 is 0.5275, so 52.75% of the variability of Price is explained by the linear effects of Age, Distance, ParkingF and BalconyF.
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: fit2 is appropriate (the added variables contribute nothing); H1: fit3 is more appropriate. Because the p-value (0.03051) is less than 5%, we reject H0 and conclude that fit3 fits significantly better than fit2.
mydata3$FittedValues <- fitted.values(fit3)
mydata3$Residuals <- residuals(fit3)
head(mydata3[,colnames(mydata3) %in% c('Price', 'FittedValues', 'Residuals')])
## Price FittedValues Residuals
## 1 1640 1705.952 -65.95206
## 2 2800 2372.197 427.80292
## 3 1660 1721.159 -61.15894
## 4 1850 1563.431 286.56890
## 5 1640 2012.244 -372.24396
## 6 1770 1908.177 -138.17733
The residual for the apartment with ID 2 is 427.80292.
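A minimal sketch of extracting this value directly (row names are kept from the original data, so the second apartment is still labelled '2' even after row 38 was removed):

mydata3['2', 'Residuals']  # residual for the ID 2 apartment; 427.80292 per the output above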