# Task 1
I found the data among the datasets on the Kaggle website.
mydata1 <- read.csv('~/Desktop/bootcamp R/Exam.csv',
                    header = TRUE,
                    sep = ';',
                    dec = ',')
print(mydata1)
## gender age Residence_type avg_glucose_level bmi
## 1 0 67 Urban 229 37
## 2 1 80 Rural 106 33
## 3 1 49 Urban 171 34
## 4 1 79 Rural 174 24
## 5 0 81 Urban 186 29
## 6 0 74 Rural 70 27
## 7 0 56 Rural 73 31
## 8 1 88 Rural 88 16
## 9 0 25 Urban 138 27
## 10 1 14 Rural 72 21
## 11 1 53 Rural 69 36
## 12 0 4 Rural 106 17
## 13 1 39 Urban 102 42
## 14 0 53 Rural 110 42
## 15 0 5 Rural 86 16
## 16 0 4 Rural 110 17
## 17 1 37 Urban 92 29
## 18 0 70 Urban 252 27
## 19 0 24 Urban 123 38
## 20 0 55 Rural 97 45
## 21 1 70 Rural 68 23
## 22 0 34 Urban 81 33
## 23 0 26 Rural 120 22
## 24 1 39 Rural 84 26
## 25 0 38 Urban 74 40
## 26 1 20 Urban 92 28
## 27 1 17 Urban 88 39
## 28 0 60 Rural 213 36
## 29 1 75 Rural 75 38
## 30 1 31 Rural 122 40
## 31 0 58 Rural 223 42
## 32 1 5 Rural 85 18
## 33 1 13 Urban 70 21
## 34 1 22 Rural 108 42
## 35 0 71 Urban 95 32
## 36 1 31 Urban 125 24
## 37 0 41 Urban 112 39
## 38 0 65 Rural 95 29
## 39 1 20 Urban 113 28
## 40 1 51 Urban 105 44
## 41 1 37 Urban 84 27
## 42 0 49 Rural 103 29
## 43 1 62 Rural 77 35
## 44 1 28 Rural 87 40
## 45 0 80 Rural 57 27
## 46 0 45 Rural 73 25
## 47 1 37 Urban 106 30
## 48 1 3 Rural 81 16
## 49 0 26 Urban 104 31
## 50 0 58 Urban 194 28
## 51 1 59 Rural 112 36
## 52 0 74 Rural 58 32
## 53 0 17 Rural 69 33
## 54 1 62 Urban 60 28
## 55 0 37 Urban 160 32
## 56 1 59 Rural 237 28
## 57 1 59 Rural 99 23
## 58 0 62 Urban 56 24
## 59 0 80 Urban 84 30
## 60 0 55 Urban 168 24
## 61 1 54 Urban 76 29
## 62 0 73 Rural 82 32
## 63 0 26 Urban 200 32
## 64 0 73 Rural 62 25
## 65 1 38 Rural 84 26
## 66 0 75 Rural 221 33
## 67 0 3 Rural 63 20
## 68 0 60 Rural 71 27
## 69 1 45 Urban 93 44
## 70 0 23 Rural 82 32
## 71 0 21 Rural 88 37
## 72 1 67 Rural 80 23
## 73 0 10 Rural 84 19
## 74 1 24 Rural 70 30
## 75 0 50 Urban 192 43
## 76 0 14 Urban 102 20
## 77 1 13 Rural 78 21
## 78 0 71 Urban 216 39
## 79 0 32 Rural 92 30
## 80 1 38 Rural 94 22
## 81 1 11 Urban 77 19
## 82 0 63 Urban 84 21
## 83 1 18 Urban 96 23
## 84 1 46 Urban 181 23
## 85 1 8 Urban 95 20
## 86 1 53 Rural 61 29
## 87 0 38 Urban 103 22
## 88 0 74 Urban 130 26
## 89 0 24 Urban 65 23
## 90 0 78 Rural 80 26
## 91 1 60 Rural 63 31
## 92 0 12 Urban 70 25
## 93 1 32 Rural 109 24
## 94 0 5 Urban 112 20
## 95 1 40 Rural 92 38
## 96 1 19 Urban 77 27
## 97 1 28 Urban 70 25
## 98 1 44 Rural 70 41
## 99 0 50 Urban 89 35
## 100 1 45 Rural 111 26
Description of the selected variables:

- Gender: gender of the individual (female/male)
- Age: age of the individual (years)
- Residence_type: place where the individual lives (urban/rural area)
- Avg_glucose_level: average glucose level in the blood measured over 60 to 90 days (units: mg/dL)
- BMI: body mass index of the individual (from 0 to 50 and above, units: kg/m2)
Next, the column names are standardized:
colnames(mydata1) <- c('Gender','Age','Residence_Type','Avg_Glucose_Level','BMI')
head(mydata1)
## Gender Age Residence_Type Avg_Glucose_Level BMI
## 1 0 67 Urban 229 37
## 2 1 80 Rural 106 33
## 3 1 49 Urban 171 34
## 4 1 79 Rural 174 24
## 5 0 81 Urban 186 29
## 6 0 74 Rural 70 27
Next, the Residence_Type column is removed (the result is stored in mydata1a):
mydata1a <- mydata1[,-3]
print(mydata1a)
## Gender Age Avg_Glucose_Level BMI
## 1 0 67 229 37
## 2 1 80 106 33
## 3 1 49 171 34
## 4 1 79 174 24
## 5 0 81 186 29
## 6 0 74 70 27
## 7 0 56 73 31
## 8 1 88 88 16
## 9 0 25 138 27
## 10 1 14 72 21
## 11 1 53 69 36
## 12 0 4 106 17
## 13 1 39 102 42
## 14 0 53 110 42
## 15 0 5 86 16
## 16 0 4 110 17
## 17 1 37 92 29
## 18 0 70 252 27
## 19 0 24 123 38
## 20 0 55 97 45
## 21 1 70 68 23
## 22 0 34 81 33
## 23 0 26 120 22
## 24 1 39 84 26
## 25 0 38 74 40
## 26 1 20 92 28
## 27 1 17 88 39
## 28 0 60 213 36
## 29 1 75 75 38
## 30 1 31 122 40
## 31 0 58 223 42
## 32 1 5 85 18
## 33 1 13 70 21
## 34 1 22 108 42
## 35 0 71 95 32
## 36 1 31 125 24
## 37 0 41 112 39
## 38 0 65 95 29
## 39 1 20 113 28
## 40 1 51 105 44
## 41 1 37 84 27
## 42 0 49 103 29
## 43 1 62 77 35
## 44 1 28 87 40
## 45 0 80 57 27
## 46 0 45 73 25
## 47 1 37 106 30
## 48 1 3 81 16
## 49 0 26 104 31
## 50 0 58 194 28
## 51 1 59 112 36
## 52 0 74 58 32
## 53 0 17 69 33
## 54 1 62 60 28
## 55 0 37 160 32
## 56 1 59 237 28
## 57 1 59 99 23
## 58 0 62 56 24
## 59 0 80 84 30
## 60 0 55 168 24
## 61 1 54 76 29
## 62 0 73 82 32
## 63 0 26 200 32
## 64 0 73 62 25
## 65 1 38 84 26
## 66 0 75 221 33
## 67 0 3 63 20
## 68 0 60 71 27
## 69 1 45 93 44
## 70 0 23 82 32
## 71 0 21 88 37
## 72 1 67 80 23
## 73 0 10 84 19
## 74 1 24 70 30
## 75 0 50 192 43
## 76 0 14 102 20
## 77 1 13 78 21
## 78 0 71 216 39
## 79 0 32 92 30
## 80 1 38 94 22
## 81 1 11 77 19
## 82 0 63 84 21
## 83 1 18 96 23
## 84 1 46 181 23
## 85 1 8 95 20
## 86 1 53 61 29
## 87 0 38 103 22
## 88 0 74 130 26
## 89 0 24 65 23
## 90 0 78 80 26
## 91 1 60 63 31
## 92 0 12 70 25
## 93 1 32 109 24
## 94 0 5 112 20
## 95 1 40 92 38
## 96 1 19 77 27
## 97 1 28 70 25
## 98 1 44 70 41
## 99 0 50 89 35
## 100 1 45 111 26
A new factor variable, GenderF, is created from Gender:
mydata1a$GenderF <- factor(mydata1a$Gender,
                           levels = c(0,1),
                           labels = c('M','F'))
head(mydata1a)
## Gender Age Avg_Glucose_Level BMI GenderF
## 1 0 67 229 37 M
## 2 1 80 106 33 F
## 3 1 49 171 34 F
## 4 1 79 174 24 F
## 5 0 81 186 29 M
## 6 0 74 70 27 M
Below, I introduce the condition (BMI >= 25) to see how many individuals in the sample have an elevated BMI. Only the first six rows (sorted by BMI) are displayed, but 70 of the 100 individuals meet this condition.
mydata1aa <- mydata1[mydata1$BMI >= 25, ]
head(mydata1aa[order(mydata1aa$BMI), ])
## Gender Age Residence_Type Avg_Glucose_Level BMI
## 46 0 45 Rural 73 25
## 64 0 73 Rural 62 25
## 92 0 12 Urban 70 25
## 97 1 28 Urban 70 25
## 24 1 39 Rural 84 26
## 65 1 38 Rural 84 26
Here is a summary of the three numeric variables (Gender and Residence_Type are excluded):
summary(mydata1[ ,c(-1,-3)])
## Age Avg_Glucose_Level BMI
## Min. : 3.00 Min. : 56.00 Min. :16.00
## 1st Qu.:24.00 1st Qu.: 76.75 1st Qu.:23.75
## Median :42.50 Median : 92.00 Median :28.00
## Mean :42.91 Mean :106.36 Mean :29.18
## 3rd Qu.:60.50 3rd Qu.:112.00 3rd Qu.:34.25
## Max. :88.00 Max. :252.00 Max. :45.00
Age: The youngest individual in the sample is 3 years old and the oldest is 88. If the total age were spread equally across the sample, each individual would be 42.91 years old (the mean). Half of the individuals are at most 42.5 years old (the median) and the other half are older.
Avg_Glucose_Level: The minimum average glucose level in the sample is 56 mg/dL and the maximum is 252 mg/dL. The mean average glucose level is 106.36 mg/dL. Half of the individuals have an average glucose level of at most 92 mg/dL (the median) and the other half lie above that.
BMI: The minimum BMI in the sample is 16 and the maximum is 45. The mean BMI is 29.18. Half of the individuals have a BMI of at most 28 (the median) and the other half lie above that.
library(pastecs)
round(stat.desc(mydata1[ ,c(-1,-3)]),2)
## Age Avg_Glucose_Level BMI
## nbr.val 100.00 100.00 100.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 3.00 56.00 16.00
## max 88.00 252.00 45.00
## range 85.00 196.00 29.00
## sum 4291.00 10636.00 2918.00
## median 42.50 92.00 28.00
## mean 42.91 106.36 29.18
## SE.mean 2.32 4.61 0.74
## CI.mean.0.95 4.60 9.16 1.47
## var 536.43 2129.61 55.12
## std.dev 23.16 46.15 7.42
## coef.var 0.54 0.43 0.25
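As a sanity check on the table above, the reported CI.mean.0.95 is just SE.mean scaled by the appropriate t quantile; a small verification sketch (not in the original code), shown for BMI:

qt(0.975, df = 99) * sd(mydata1$BMI) / sqrt(100)  # half-width of the 95% CI; about 1.47, matching the table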
Here is the graphical representation of the average glucose level data.
hist(mydata1a$Avg_Glucose_Level,
     main = 'Histogram of frequency distribution of average glucose level',
     xlab = 'Average glucose level',
     ylab = 'Frequency',
     breaks = seq(55,255,10),
     right = FALSE)
From the histogram we can see that the distribution is asymmetrical and skewed to the right (positive skew). We can also conclude that the majority of individuals in the sample have a roughly normal average blood glucose level (72 to 99 mg/dL).
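This claim can also be verified numerically; a minimal sketch (not part of the original code) counting the individuals in the normal range:

sum(mydata1a$Avg_Glucose_Level >= 72 & mydata1a$Avg_Glucose_Level <= 99)  # individuals with normal glucose (72-99 mg/dL)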
# Task 2
mydata2 <- read.table('~/Desktop/bootcamp R/R Take Home Exam/Task 2/Body mass.csv',
                      header = TRUE,
                      sep = ';',
                      dec = ',')
print(mydata2)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
## 7 7 62.7
## 8 8 64.5
## 9 9 59.5
## 10 10 68.9
## 11 11 64.5
## 12 12 62.8
## 13 13 63.7
## 14 14 52.3
## 15 15 64.1
## 16 16 64.2
## 17 17 64.2
## 18 18 60.2
## 19 19 59.6
## 20 20 60.9
## 21 21 54.2
## 22 22 72.1
## 23 23 60.3
## 24 24 63.1
## 25 25 59.8
## 26 26 49.7
## 27 27 58.6
## 28 28 58.7
## 29 29 62.0
## 30 30 65.7
## 31 31 64.2
## 32 32 62.7
## 33 33 78.5
## 34 34 61.6
## 35 35 60.4
## 36 36 63.3
## 37 37 69.3
## 38 38 63.0
## 39 39 60.8
## 40 40 64.9
## 41 41 62.8
## 42 42 57.3
## 43 43 69.3
## 44 44 62.8
## 45 45 62.9
## 46 46 53.6
## 47 47 65.2
## 48 48 75.3
## 49 49 83.2
## 50 50 66.4
Description of the given variables:

- ID: identification number of each ninth grader
- Mass: body mass of the individual (units: kg)
summary(mydata2[ ,c(-1)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
Only the summary statistics for Mass are displayed, because ID is merely an identifier. The minimum body mass in the sample is 49.70 kg and the maximum is 83.20 kg; the mean is 62.88 kg. Half of the individuals weigh at most 62.80 kg (the median) and the other half weigh more.
library(psych)
describe(mydata2)
## vars n mean sd median trimmed mad min max range skew kurtosis
## ID 1 50 25.50 14.58 25.5 25.50 18.53 1.0 50.0 49.0 0.00 -1.27
## Mass 2 50 62.88 6.01 62.8 62.56 3.34 49.7 83.2 33.5 0.85 2.11
## se
## ID 2.06
## Mass 0.85
The ID variable should be excluded from the table because it is only an identifier, not a measured variable (see below). With a single column left, describe() then labels the Mass variable X1.
describe(mydata2[ ,c(-1)])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 62.88 6.01 62.8 62.56 3.34 49.7 83.2 33.5 0.85 2.11 0.85
We can see from the output above that we have 50 observations with a standard deviation of 6.01 kg, i.e. the body mass of a ninth grader deviates from the mean by about 6 kg on average. Relative to the mean of 62.88 kg this is a fairly modest dispersion (a coefficient of variation of roughly 9.6%), so the sample is reasonably homogeneous.
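To quantify this relative dispersion directly, the coefficient of variation can be computed; a small sketch (not part of the original code):

sd(mydata2$Mass) / mean(mydata2$Mass)  # approximately 6.01 / 62.88 = 0.096, i.e. about 9.6% of the mean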
Here is a graphical representation of the ninth graders' body mass as a histogram.
hist(mydata2$Mass,
     main = 'Histogram of body mass of 50 ninth graders',
     xlab = 'Body mass',
     ylab = 'Frequency',
     breaks = seq(45,85,5))
The histogram shows that the sample distribution is slightly asymmetric to the right, which is why the skewness is positive (skew = 0.85); the kurtosis (2.11) is also positive, indicating a distribution more peaked than the normal.
Here is the one-sample t-test for the body mass variable, testing H0: mu = 59.5 kg:
t.test(mydata2$Mass,
       mu = 59.5,
       alternative = 'two.sided')
##
## One Sample t-test
##
## data: mydata2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
We used a two-sided one-sample t-test because we compare a single sample mean with a hypothesized value and the population standard deviation is unknown. The degrees of freedom are 49, since one is subtracted from the sample size. The p-value is 0.000234, which is less than 5%, so we reject H0 and accept the alternative/research hypothesis. We can be 95% confident that the true population mean lies between 61.16758 kg and 64.58442 kg. The sample mean is 62.876 kg, which matches the calculations above.
For the effect size we use the correlation coefficient, r = sqrt(t^2 / (t^2 + df)) = sqrt(3.9711^2 / (3.9711^2 + 49)) = 0.493429485, which represents a medium effect for the sample data.
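A minimal sketch of this computation in R, using the t statistic and degrees of freedom from the output above:

t_value <- 3.9711
df <- 49
sqrt(t_value^2 / (t_value^2 + df))  # approximately 0.4934, a medium effect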
# Task 3
knitr::opts_chunk$set(echo = TRUE)
options(width=120)
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
#install.packages("reshape2")
library(reshape2)
mydata3 <- read.csv('~/Desktop/bootcamp R/R Take Home Exam/Task 3/Apartments.csv',
                    header = TRUE,
                    sep = ';',
                    dec = ',')
print(mydata3)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
## 7 14 20 1850 0 1
## 8 18 6 1970 1 1
## 9 22 7 2270 1 0
## 10 25 2 2570 1 0
## 11 12 5 2700 1 0
## 12 18 26 1680 1 0
## 13 12 14 1650 0 1
## 14 10 12 2000 0 1
## 15 24 3 2290 0 0
## 16 11 14 1800 0 0
## 17 26 3 2010 1 1
## 18 7 11 2340 1 1
## 19 19 33 1400 0 0
## 20 13 8 1800 0 0
## 21 6 12 2060 1 1
## 22 37 3 2540 1 1
## 23 22 23 1730 0 1
## 24 18 4 2110 1 1
## 25 8 26 2300 1 1
## 26 12 3 2480 1 0
## 27 16 1 2750 1 0
## 28 18 19 1620 1 0
## 29 13 2 2170 1 0
## 30 1 5 2420 1 1
## 31 45 21 1910 0 1
## 32 12 5 2390 1 0
## 33 2 11 2790 1 0
## 34 9 3 2140 0 1
## 35 14 16 1660 0 1
## 36 24 5 1830 1 0
## 37 18 4 1950 1 0
## 38 5 45 2180 1 1
## 39 40 2 2400 0 1
## 40 33 37 1510 1 1
## 41 18 12 1770 1 0
## 42 24 1 2220 1 1
## 43 1 3 2460 1 1
## 44 28 1 2250 0 1
## 45 18 40 1400 0 1
## 46 14 25 2250 1 0
## 47 32 17 1840 1 0
## 48 38 13 1610 0 1
## 49 9 15 2290 0 1
## 50 14 8 1990 1 0
## 51 18 39 1620 0 0
## 52 22 16 1710 0 1
## 53 7 2 1760 0 1
## 54 30 17 1560 0 0
## 55 43 37 1740 0 0
## 56 17 18 1860 0 0
## 57 10 1 2810 0 0
## 58 8 2 2820 1 0
## 59 18 16 1830 0 0
## 60 30 9 2130 0 0
## 61 18 1 2800 1 1
## 62 7 28 1660 0 0
## 63 28 29 1850 0 0
## 64 18 18 1640 1 1
## 65 28 12 1770 0 0
## 66 14 20 1850 0 0
## 67 18 6 1970 1 1
## 68 22 7 2270 1 0
## 69 25 2 2570 1 0
## 70 12 5 2700 1 1
## 71 18 26 1680 1 0
## 72 12 14 1650 0 0
## 73 10 12 2000 0 0
## 74 24 3 2290 0 0
## 75 11 14 1800 0 0
## 76 26 3 2010 1 0
## 77 18 40 1400 0 0
## 78 14 25 2250 1 1
## 79 32 17 1840 1 0
## 80 38 13 1610 0 0
## 81 9 15 2290 0 1
## 82 14 8 1990 1 0
## 83 18 39 1620 0 0
## 84 22 16 1710 0 1
## 85 32 17 1840 1 0
Description of the variables (inferred from the analysis below):

- Age: age of the apartment (years)
- Distance: distance from the city center
- Price: price of the apartment (EUR)
- Parking: whether the apartment has parking (0 = no, 1 = yes)
- Balcony: whether the apartment has a balcony (0 = no, 1 = yes)

The categorical variables are converted to factors, first Parking and then Balcony.
mydata3$ParkingF <- factor(mydata3$Parking,
                           levels = c(0,1),
                           labels = c('No','Yes'))
head(mydata3)
## Age Distance Price Parking Balcony ParkingF
## 1 7 28 1640 0 1 No
## 2 18 1 2800 1 0 Yes
## 3 7 28 1660 0 0 No
## 4 28 29 1850 0 1 No
## 5 18 18 1640 1 1 Yes
## 6 28 12 1770 0 1 No
mydata3$BalconyF <- factor(mydata3$Balcony,
                           levels = c(0,1),
                           labels = c('No','Yes'))
head(mydata3)
## Age Distance Price Parking Balcony ParkingF BalconyF
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
Testing the hypothesis H0: mu_Price = 1900 EUR.
t.test(mydata3$Price,
       mu = 1900,
       alternative = 'two.sided')
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We used a two-sided one-sample t-test. The degrees of freedom are 84, since one is subtracted from the sample size. The p-value is 0.004731, which is less than 5%, so we reject H0 and accept the alternative/research hypothesis: the mean price differs from 1900 EUR. We can be 95% confident that the true population mean lies between 1937.443 EUR and 2100.440 EUR.
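The same effect-size formula as in Task 2 can be applied here (an addition, not part of the original analysis):

t_value <- 2.9022
df <- 84
sqrt(t_value^2 / (t_value^2 + df))  # approximately 0.30, a small-to-medium effect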
Here is the simple regression of Price on Age, Price = f(Age):
fit1 <- lm(Price ~ Age,
           data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
From this summary we can see that the estimated regression function is Price = 2185.455 - 8.975*Age, so if Age increases by one year, Price on average decreases by 8.975 EUR. For the t-test the null hypothesis is B1 = 0 and the alternative hypothesis is that B1 is not equal to 0. Based on the p-value of 0.034, which is less than 5%, we reject the null hypothesis and accept the alternative. Furthermore, the coefficient of determination is 0.05302, meaning that 5.302% of the variability of Price is explained by the linear effect of Age.
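A minimal sketch of how the fitted line could be visualized (not part of the original code):

plot(Price ~ Age, data = mydata3,
     xlab = 'Age (years)', ylab = 'Price (EUR)',
     main = 'Price vs. Age with fitted regression line')
abline(fit1)  # draws the estimated line Price = 2185.455 - 8.975*Age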
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata3[,c(-2,-4,-5,-6,-7)]))
## Age Price
## Age 1.00 -0.23
## Price -0.23 1.00
##
## n= 85
##
##
## P
## Age Price
## Age 0.034
## Price 0.034
cor(mydata3$Age,mydata3$Price)
## [1] -0.230255
From the correlation coefficient (-0.23) we can see that the relationship between Age and Price is negative and weak.
Here the columns are reordered so that Price comes first, followed by Age and Distance. Below is the scatterplot matrix of these three variables.
mydata3 <- mydata3[c(3,1,2,4,5,6,7)]
head(mydata3)
## Price Age Distance Parking Balcony ParkingF BalconyF
## 1 1640 7 28 0 1 No Yes
## 2 2800 18 1 1 0 Yes No
## 3 1660 7 28 0 0 No No
## 4 1850 28 29 0 1 No Yes
## 5 1640 18 18 1 1 Yes Yes
## 6 1770 28 12 0 1 No Yes
library(car)
scatterplotMatrix(mydata3[ ,c(-4,-5,-6,-7)],
                  smooth = FALSE)
From this scatterplot matrix we can see that the distributions of Price, Age and Distance are all skewed to the right, and that the relationships of Price with Age and of Price with Distance are negative. This is plausible: the older the apartment, the lower the price, and the farther the apartment is from the city center, the lower the price.
library(Hmisc)
rcorr(as.matrix(mydata3[ ,c(-4,-5,-6,-7)]))
## Price Age Distance
## Price 1.00 -0.23 -0.63
## Age -0.23 1.00 0.04
## Distance -0.63 0.04 1.00
##
## n= 85
##
##
## P
## Price Age Distance
## Price 0.0340 0.0000
## Age 0.0340 0.6966
## Distance 0.0000 0.6966
From the correlation coefficients we can see that the relationship between Price and Age is negative and weak (-0.23), while the relationship between Price and Distance is negative and moderate (-0.63).
Estimate the multiple regression function: Price = f(Age, Distance).
fit2 <- lm(Price ~ Age + Distance,
           data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
From this summary we can see that the estimated multiple regression function is Price = 2460.101 - 7.934*Age - 20.667*Distance: if Age increases by one year, Price on average decreases by 7.934 EUR, holding Distance constant; and if Distance increases by one unit, Price on average decreases by 20.667 EUR, holding Age constant. For each t-test, the null hypothesis is that the coefficient equals 0 and the alternative is that it does not. Based on the p-values, which are both less than 5%, we reject the null hypothesis for both predictors. Furthermore, the coefficient of determination is 0.4396, meaning that 43.96% of the variability of Price is explained by the linear effects of Age and Distance.
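As a usage example, the fitted function can be used to predict the price of a hypothetical apartment (illustrative values, not part of the original analysis):

predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# by the coefficients above: 2460.101 - 7.934*10 - 20.667*5, approximately 2277 EUR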
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Both VIFs are 1.002, essentially 1, so the explanatory variables Age and Distance are practically uncorrelated and multicollinearity is not a problem.
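This agrees with the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing one explanatory variable on the other; a small verification sketch using the Age-Distance correlation reported above:

1 / (1 - cor(mydata3$Age, mydata3$Distance)^2)  # about 1.0018, matching vif(fit2)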
Here we calculate the standardized residuals and Cook's distances for model fit2.
mydata3$StdResid <- round(rstandard(fit2),3)
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.95303, p-value = 0.003645
H0: the variable is normally distributed; H1: the variable is not normally distributed. The p-value (0.003645) is less than 5%, so we reject the null hypothesis and accept the alternative: the standardized residuals are not normally distributed, which can also be seen in the histogram below.
hist(mydata3$StdResid,
     xlab = 'Standardized residuals',
     ylab = 'Frequency',
     main = 'Histogram of standardized residuals')
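A normal Q-Q plot is another common way to inspect the normality of the residuals; a small addition, not part of the original code:

qqnorm(mydata3$StdResid)  # points should follow the line if the residuals are normal
qqline(mydata3$StdResid)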
mydata3$CooksD <- round(cooks.distance(fit2),3)
hist(mydata3$CooksD,
     xlab = "Cook's distance",
     ylab = 'Frequency',
     main = "Histogram of Cook's distances")
The histogram of Cook's distances is, as expected, skewed to the right: the majority of units have small Cook's distances and thus little influence on the fitted function.
head(mydata3[order(-mydata3$StdResid),],3)
## Price Age Distance Parking Balcony ParkingF BalconyF StdResid CooksD
## 38 2180 5 45 1 1 Yes Yes 2.577 0.320
## 33 2790 2 11 1 0 Yes No 2.051 0.069
## 2 2800 18 1 1 0 Yes No 1.783 0.030
head(mydata3[order(-mydata3$CooksD),],6)
## Price Age Distance Parking Balcony ParkingF BalconyF StdResid CooksD
## 38 2180 5 45 1 1 Yes Yes 2.577 0.320
## 55 1740 43 37 0 0 No No 1.445 0.104
## 33 2790 2 11 1 0 Yes No 2.051 0.069
## 53 1760 7 2 0 1 No Yes -2.152 0.066
## 22 2540 37 3 1 1 Yes Yes 1.576 0.061
## 39 2400 40 2 0 1 No Yes 1.091 0.038
The outlier (row 38) will be removed below, and fit2 will then be estimated again.
mydata3$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFittedValues,
            ylab = 'Standardized residuals',
            xlab = 'Standardized fitted values',
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
We can see that the linearity assumption is met. Below, heteroskedasticity is examined.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
H0: the variance is constant (homoskedasticity). The p-value is 0.3252, which is well above 5%, so we cannot reject the null hypothesis: the variance is constant and heteroskedasticity is not present.
mydata3 <- mydata3[-38, ]
fit2 <- lm(Price ~ Age + Distance,
           data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
If Age increases by one year, Price decreases on average by 6.464 EUR, ceteris paribus. If Distance increases by one unit, Price decreases on average by 22.955 EUR, assuming all other variables remain unchanged.
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
The multiple correlation coefficient is 0.6955609, which means that the linear relationship between Price and the two explanatory variables together is strong.
library(lm.beta)
lm.beta(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Standardized Coefficients::
## (Intercept) Age Distance
## NA -0.1640478 -0.6607329
Distance has the largest effect, because its standardized coefficient is the largest in absolute value (|-0.661| vs. |-0.164|).
Here is the extended regression function fit3, which adds the two factor variables:
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
           data = mydata3)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFYes 167.531 62.864 2.665 0.00933 **
## BalconyFYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
From the regression coefficients we can see that apartments with parking are on average 167.531 EUR more expensive than comparable apartments without parking, assuming all other variables remain unchanged. Apartments with a balcony are on average 15.207 EUR cheaper, ceteris paribus, but this effect is not statistically significant (p = 0.798). R2 is 0.5275, so 52.75% of the variability of Price is explained by the linear effects of Age, Distance, ParkingF and BalconyF.
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: fit2 is appropriate (the added variables contribute nothing); H1: fit3 is more appropriate. Because the p-value (0.03051) is less than 5%, we reject H0 and conclude that fit3 fits significantly better than fit2.
mydata3$FittedValues <- fitted.values(fit3)
mydata3$Residuals <- residuals(fit3)
head(mydata3[,colnames(mydata3) %in% c('Price', 'FittedValues', 'Residuals')])
## Price FittedValues Residuals
## 1 1640 1705.952 -65.95206
## 2 2800 2372.197 427.80292
## 3 1660 1721.159 -61.15894
## 4 1850 1563.431 286.56890
## 5 1640 2012.244 -372.24396
## 6 1770 1908.177 -138.17733
The residual for the apartment with ID 2 is 427.80292.
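A minimal sketch of extracting this value directly (row names are kept from the original data, so the second apartment is still labelled '2' even after row 38 was removed):

mydata3['2', 'Residuals']  # residual for the ID 2 apartment; 427.80292 per the output above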