library(knitr)
library(psych)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:dplyr':
##
## first, last
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(lm.beta)
library(naniar)
library(Hmisc)
library(carData)
library("readxl")
#Tine Dobnik
## Task 1
mydata13 = read_excel("~/Desktop/BOOTCAMP R/Zvezek.xlsx")
mydata13$ID <- seq(1, 748, by=1)
head(mydata13)
## # A tibble: 6 × 6
## `Recency (months)` `Frequency (times)` Monetary (c.c. …¹ Time …² wheth…³ ID
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 50 12500 98 1 1
## 2 0 13 3250 28 1 2
## 3 1 16 4000 35 1 3
## 4 2 20 5000 45 1 4
## 5 1 24 6000 77 0 5
## 6 4 4 1000 4 0 6
## # … with abbreviated variable names ¹​`Monetary (c.c. blood)`, ²​`Time (months)`,
## # ³​`whether he/she donated blood in March 2007`
Because I obtained my data outside of R, I exported it to Excel and then imported it here.
ID - identification number of each input row.
Recency (months) - how many months have passed since the donor last donated blood.
Frequency (times) - total number of times the donor has donated blood.
Monetary (c.c. blood) - total amount of blood donated by the donor, in cubic centimetres.
Time (months) - how many months have passed since the donor's first donation.
Whether he/she donated blood in March 2007 - 1 if the donor donated blood in March 2007, 0 if not.
mydata13 <- subset(mydata13, select=c(6,1,2,3,4,5))
colnames(mydata13) <- c("ID", "Recency", "Frequency", "CC_of_Blood", "Time_from_first_donation", "Donation in 2007")
In this section I reordered the variables so that the ID variable comes first, and then renamed the columns so that the table is easier to read.
mydata14 <- mydata13[ ,c( -6, -1)]
library(pastecs)
round(stat.desc(mydata14))
## Recency Frequency CC_of_Blood Time_from_first_donation
## nbr.val 748 748 748 748
## nbr.null 5 0 0 0
## nbr.na 0 0 0 0
## min 0 1 250 2
## max 74 50 12500 98
## range 74 49 12250 96
## sum 7111 4125 1031250 25643
## median 7 4 1000 28
## mean 10 6 1379 34
## SE.mean 0 0 53 1
## CI.mean.0.95 1 0 105 2
## var 66 34 2131094 594
## std.dev 8 6 1460 24
## coef.var 1 1 1 1
Here I first removed the ID column and then the categorical column that indicates whether the person donated blood in March 2007.
With these commands I get the following results:

Means: the average time since the last donation (Recency) is 9.51 months; the average number of donations (Frequency) is 5.51; the average amount of donated blood (CC_of_Blood) is 1379 cc; the average time since the first donation (Time_from_first_donation) is 34.28 months.

Medians: 50 % of donors last donated blood within the past 7 months, the other 50 % longer ago (Recency); 50 % donated 4 times or fewer (Frequency); 50 % donated 1000 cc of blood or less (CC_of_Blood); 50 % made their first donation within the past 28 months, the other 50 % earlier (Time_from_first_donation).

Minimums: the shortest time since the last donation is 0 months; the fewest donations by one donor is 1; the smallest amount of donated blood is 250 cc; the shortest time since the first donation is 2 months.

Maximums: the longest time since the last donation is 74 months; the most donations by one donor is 50; the largest amount of donated blood is 12,500 cc; the longest time since the first donation is 98 months.
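As a quick cross-check of these statistics, the same means and medians can also be computed directly with base R (a small sketch, not part of the original stat.desc() output):
# Column-wise means and medians of the four numeric variables
sapply(mydata14, mean)
sapply(mydata14, median)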
hist(mydata14$CC_of_Blood,
main = "CC of donated Blood",
xlab = "CC of Blood",
ylab = "Frequency",
breaks = seq(from = 0, to = 12500, by = 100))
The histogram shows that the distribution of donated blood (in cc) is skewed to the right, and above roughly 6000 cc there are quite a few outliers. In the next section I correct for that.
mydata14 <- subset(mydata14, CC_of_Blood < 8000 )
hist(mydata14$CC_of_Blood,
main = "CC of donated Blood",
xlab = "CC of Blood",
ylab = "Frequency",
breaks = seq(from = 0, to = 10000, by = 100))
hist(mydata14$CC_of_Blood,
main = "CC of donated Blood",
xlab = "CC of Blood",
ylab = "Frequency",
breaks = seq(from = 0, to = 8000, by = 100))
I started by creating a subset that removes the observations with 8000 cc of donated blood or more. The two histograms above show that from 8000 cc onwards there are no more outliers.
## Task 2
mydata <- read.table("~/Desktop/BOOTCAMP R/R Take Home Exam/Task 2/Body mass.csv",
header = TRUE,
sep = ";",
dec = ",")
head(mydata)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
ID: the first column is the identification number of each observation. Mass: the body mass of a person in kilograms.
library(psych)
describe(mydata[ , -1])
## mydata[, -1]
## n missing distinct Info Mean Gmd .05 .10
## 50 0 42 0.999 62.88 6.259 53.49 56.27
## .25 .50 .75 .90 .95
## 60.23 62.80 64.50 69.30 73.86
##
## lowest : 49.7 52.3 53.4 53.6 54.2, highest: 69.3 72.1 75.3 78.5 83.2
Here we can see that the mean of all observations is 62.88, i.e. the average body mass in the sample is 62.88 kg. The description also shows that the smallest body mass is 49.7 kg and the largest is 83.2 kg.
sapply(mydata, FUN = mean)
## ID Mass
## 25.500 62.876
This is another way of obtaining the mean of each variable.
sd(mydata$Mass)
## [1] 6.011403
The standard deviation shows how much the observations vary around the mean: 6.01 kg.
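The standard deviation can also be reproduced from its definition, the square root of the average squared deviation from the mean (a small check, equivalent to sd()):
# Sample standard deviation computed by hand (note the division by n - 1)
sqrt(sum((mydata$Mass - mean(mydata$Mass))^2) / (length(mydata$Mass) - 1))  # ≈ 6.01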
hist(mydata$Mass,
main = "Weight of randomly selected 50 ninth graders",
xlab = "Body Mass",
ylab = "Frequency",
breaks = seq(from = 40, to = 85, by = 3),
col = "yellow")
If there is any skewness at all, it is only slightly to the right.
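To back up this visual impression numerically, the skewness can be computed with the skew() function from the already loaded psych package (a small addition; a positive value indicates a right skew):
# Skewness of body mass: positive = right-skewed, negative = left-skewed
psych::skew(mydata$Mass)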
library(ggplot2)
ggplot(NULL, aes(c(-4, 4))) +
geom_line(stat = "function", fun = dt, args = list(df = 49)) +
ylab("Density") +
xlab("t value") +
labs(title = "t distribution under H0 (df = 49)")
qt(p = 0.025, df = 49, lower.tail = FALSE)
## [1] 2.009575
qt(p = 0.025, df = 49, lower.tail = TRUE)
## [1] -2.009575
t.test(mydata$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
The critical t values computed above (about ±2.01 for alpha = 0.05 and df = 49) show that our t statistic of 3.97 lies well outside the acceptance region. The p value is the probability of obtaining a result at least this extreme if H0 were true; here p = 0.000234 (about 0.02 %), so it is extremely unlikely that the true mean equals 59.5, and I reject H0.
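The reported p value can be reproduced directly from the t statistic and the t distribution (a small check using the values printed above):
# Two-sided p value for t = 3.9711 with df = 49
2 * pt(3.9711, df = 49, lower.tail = FALSE)  # ≈ 0.000234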
To express the effect size we calculate the correlation coefficient (r) from the t statistic and the degrees of freedom. The following code gives the needed result.
t <- 3.9711
df <- 49
r <- sqrt(t^2 / (t^2 + df))
The r in our case is 0.493
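An alternative effect-size measure (an addition, not part of the original analysis) is Cohen's d for a one-sample test, the difference between the sample mean and the hypothesized mean divided by the standard deviation:
# Cohen's d: standardized difference from the hypothesized mean of 59.5
(mean(mydata$Mass) - 59.5) / sd(mydata$Mass)  # (62.876 - 59.5) / 6.011 ≈ 0.56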
## Task 3
mydata33 <- read_excel("~/Desktop/BOOTCAMP R/R Take Home Exam/Task 3/Apartments.xlsx")
head(mydata33)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: Age - age of the apartment in years; Distance - distance from the city centre in kilometres; Price - price of the apartment; Parking - whether the apartment has parking (1 = yes, 0 = no); Balcony - whether the apartment has a balcony (1 = yes, 0 = no).
mydata33$ParkingFactor <- factor(mydata33$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata33$BalconyFactor <- factor(mydata33$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
Here I converted the Parking and Balcony variables to factors, with labels Yes and No instead of 1 and 0.
t.test(mydata33$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata33$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The p value here is 0.0047 (about 0.5 %). Because it is below 5 %, it is very unlikely that the true mean price equals 1900, so I reject H0.
fit1 <- lm( Age ~ Price,
data = mydata33)
summary(fit1)
##
## Call:
## lm(formula = Age ~ Price, data = mydata33)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.184 -6.673 -2.555 7.048 25.803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.479211 5.627511 5.416 5.82e-07 ***
## Price -0.005907 0.002740 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.49 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata33$Age, mydata33$Price)
## [1] -0.230255
The coefficient of determination is the R-squared of the model, 0.05302, which means that about 5 % of the variance of the dependent variable is explained by the independent variable. The correlation coefficient is -0.230255.
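Squaring the correlation coefficient reproduces the coefficient of determination, which serves as a quick consistency check:
# r^2 should match the Multiple R-squared of fit1
cor(mydata33$Age, mydata33$Price)^2  # (-0.230255)^2 ≈ 0.053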
scatterplotMatrix(mydata33[ , c(-4, -5, -6, -7)],
smooth = FALSE)
The scatterplot matrix suggests no visible multicollinearity problem: the explanatory variables do not show a strong linear relationship with each other.
fit2 <- lm( Price ~ Age + Distance,
data = mydata33)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
The VIF statistics above confirm that there is no multicollinearity: both values are well below the usual cutoff of 5.
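The VIF can also be reproduced by hand as 1 / (1 - R^2), where R^2 comes from regressing one explanatory variable on the other (a small sketch):
# VIF for Age computed manually; should match vif(fit2)["Age"]
1 / (1 - summary(lm(Age ~ Distance, data = mydata33))$r.squared)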
mydata33$StdResid <- round(rstandard(fit2), 3)
mydata33$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata33$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata33$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata33$StdResid
## W = 0.95303, p-value = 0.003645
The p value of 0.003645 is below 5 %, so the standardized residuals deviate significantly from a normal distribution.
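A normal Q-Q plot is a useful visual complement to the Shapiro-Wilk test (an addition, not part of the original output); points far from the reference line indicate departures from normality:
# Q-Q plot of the standardized residuals against the normal distribution
qqnorm(mydata33$StdResid)
qqline(mydata33$StdResid)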
hist(mydata33$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram - Cooks distance")
head(mydata33[order(mydata33$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 No Yes -2.15 0.066
## 2 12 14 1650 0 1 No Yes -1.50 0.013
## 3 12 14 1650 0 0 No No -1.50 0.013
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid
head(mydata33[order(-mydata33$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid
mydata34 <- mydata33
mydata33 <- mydata33[c(-53),]
# There are no extreme outliers, but I still removed the observation closest to being problematic: row 53.
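# A common rule of thumb (an addition, not part of the original analysis): flag
# observations whose Cook's distance exceeds 4/n. On the full 85-row data kept in mydata34:
which(cooks.distance(lm(Price ~ Age + Distance, data = mydata34)) > 4 / nrow(mydata34))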
fit2 <- lm(formula = Price ~ Age + Distance, data = mydata33)
mydata33$StdResid <- round(rstandard(fit2), 3)
mydata33$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata33$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of Standardized residuals")
shapiro.test(mydata33$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata33$StdResid
## W = 0.93901, p-value = 0.0006104
The p value (0.0006) is still below 5 %, so even after removing observation 53 the standardized residuals are not normally distributed.
hist(mydata33$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram - Cooks distances")
head(mydata33[order(mydata33$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 12 14 1650 0 1 No Yes -1.58 0.015
## 2 12 14 1650 0 0 No No -1.58 0.015
## 3 13 8 1800 0 0 No No -1.47 0.014
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid
head(mydata33[order(-mydata33$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.64 0.336
## 2 43 37 1740 0 0 No No 1.59 0.129
## 3 2 11 2790 1 0 Yes No 2.01 0.069
## 4 37 3 2540 1 1 Yes Yes 1.62 0.064
## 5 40 2 2400 0 1 No Yes 1.13 0.04
## 6 45 21 1910 0 1 No Yes 0.988 0.038
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid
library(car)
scatterplot(y = fit2$residuals, x = fit2$fitted.values,
ylab = "Residuals",
xlab = "Fitted values",
boxplots = FALSE,
regLine = TRUE,
smooth = FALSE)
We checked for potential heteroskedasticity with a scatterplot of the residuals against the fitted values. The plot shows that the variance of the residuals is not growing with the fitted values, so there is no sign of heteroskedasticity.
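As a formal complement to this visual check, the non-constant variance score test from the already loaded car package can be run on the fitted model (an addition; H0 is that the error variance is constant):
# Score test for heteroskedasticity; a small p value would indicate non-constant variance
ncvTest(fit2)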
hist(mydata33$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata33$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata33$StdResid
## W = 0.93901, p-value = 0.0006104
We tested normality with the Shapiro-Wilk test, where H0 is that the variable is normally distributed and H1 that it is not.
The p value is smaller than 5 % (p = 0.0006), so we reject H0 in favour of H1: the standardized residuals are not normally distributed.
fit2 <- lm(formula = Price ~ Age + Distance,
data = mydata34)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata34)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Beta 1 is equal to -7.934: on average, each additional year of age decreases the apartment price by 7.934, holding everything else constant.
Beta 2 is equal to -20.667: on average, each additional kilometre from the city centre decreases the apartment price by 20.667, holding everything else constant.
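To illustrate the fitted equation, the model can be evaluated for a hypothetical apartment (the input values of 10 years and 5 km are made up for this example):
# Predicted price for a 10-year-old apartment 5 km from the centre:
# 2460.101 - 7.934 * 10 - 20.667 * 5 ≈ 2277.4
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))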
fit3 <- lm(formula =Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata34)
Instead of using only the numerical variables (Age and Distance), I also added the categorical ones (ParkingFactor and BalconyFactor).
anova(fit3, fit2)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Model 2: Price ~ Age + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5991088
## 2 82 6720983 -2 -729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata34)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFactorYes 196.168 62.868 3.120 0.00251 **
## BalconyFactorYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
The following hypotheses were tested:
H0: the population R-squared is equal to 0. H1: the population R-squared is not equal to 0.
The p value (1.849e-11) is far below 0.001, so we can reject H0 at the 0.001 level. There is a linear relationship between price and at least one explanatory variable, so the model is adequate.
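The p value of the overall F test can be reproduced from the F statistic and its degrees of freedom (a small check using the values from the model summary):
# p value for F = 20.03 on 4 and 80 degrees of freedom; matches the reported 1.849e-11
pf(20.03, df1 = 4, df2 = 80, lower.tail = FALSE)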
mydata34 <- mydata34 %>% mutate(Fittedvalues = fitted.values(fit3))
myresiduals <- residuals(fit3)
print(myresiduals)
## 1 2 3 4 5 6
## -110.741499 442.588928 -88.806740 260.078971 -412.575790 -126.691071
## 7 8 9 10 11 12
## 2.487853 -299.119349 48.055978 278.225949 373.977207 -226.278658
## 13 14 15 16 17 18
## -319.381563 -19.069791 205.640748 -174.245621 -258.864697 86.320138
## 19 20 21 22 23 24
## -176.994443 -268.919765 -182.433383 345.922299 -8.985715 -195.209942
## 25 26 27 28 29 30
## 323.798405 117.886614 378.991292 -412.595734 -203.359865 17.255452
## 31 32 33 34 35 36
## 291.296500 63.977207 504.260809 -48.276278 -259.693334 -414.436979
## 37 38 39 40 41 42
## -353.275183 526.262587 404.441776 -97.732888 -388.912810 -98.552925
## 43 44 45 46 47 48
## 21.164859 154.810666 -59.410945 298.480775 -133.502878 -200.657597
## 49 50 51 52 53 54
## 318.267281 -268.289267 144.478518 -155.302792 -459.919210 -230.932193
## 55 56 57 58 59 60
## 398.358370 -1.271528 594.366706 412.646046 -60.563303 194.705434
## 61 62 63 64 65 66
## 440.654168 -88.806740 262.013730 -412.575790 -124.756312 4.422612
## 67 68 69 70 71 72
## -299.119349 48.055978 278.225949 372.042448 -226.278658 -317.446803
## 73 74 75 76 77 78
## -17.135032 205.640748 -174.245621 -256.929937 -57.476185 296.546015
## 79 80 81 82 83 84
## -133.502878 -198.722838 318.267281 -268.289267 144.478518 -155.302792
## 85
## -133.502878
For observation ID 2 the residual is 442.59.
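This residual can be verified as the observed price of observation 2 minus its fitted value (a small check):
# Observed minus fitted price for the second apartment; matches the printed residual
mydata34$Price[2] - fitted(fit3)[2]  # ≈ 442.59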