library(knitr)
library(psych)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:plyr':
## 
##     is.discrete, summarize
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(lm.beta)
library(naniar)
library(Hmisc)
library(carData)
library("readxl")

#Tine Dobnik

##Task 1

mydata13 = read_excel("~/Desktop/BOOTCAMP R/Zvezek.xlsx")

mydata13$ID <- seq(1, 748, by=1)


head(mydata13)
## # A tibble: 6 × 6
##   `Recency (months)` `Frequency (times)` Monetary (c.c. …¹ Time …² wheth…³    ID
##                <dbl>               <dbl>             <dbl>   <dbl>   <dbl> <dbl>
## 1                  2                  50             12500      98       1     1
## 2                  0                  13              3250      28       1     2
## 3                  1                  16              4000      35       1     3
## 4                  2                  20              5000      45       1     4
## 5                  1                  24              6000      77       0     5
## 6                  4                   4              1000       4       0     6
## # … with abbreviated variable names ¹​`Monetary (c.c. blood)`, ²​`Time (months)`,
## #   ³​`whether he/she donated blood in March 2007`

Because I found my dataset outside of R, I first saved it as an Excel file and then imported it here.

ID - identification number of each row.
Recency (months) - how many months have passed since the donor last donated blood.
Frequency (times) - total number of times the donor has donated blood.
Monetary (c.c. blood) - total amount of blood donated, in cubic centimetres.
Time (months) - how many months have passed since the donor's first donation.
Whether he/she donated blood in March 2007 - 1 if the donor donated blood in March 2007, 0 otherwise.

mydata13 <- subset(mydata13, select=c(6,1,2,3,4,5))
colnames(mydata13) <- c("ID", "Recency", "Frequency", "CC_of_Blood", "Time_from_first_donation", "Donation in 2007") 

In this section I reordered the variables so that the ID column comes first, and then renamed the columns to make them easier to read.
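An equivalent dplyr-based sketch of the same reordering and renaming (assuming the columns are still in their original order from the Excel file; the dplyr:: prefix is used because plyr, attached after dplyr, masks several dplyr functions):

mydata13 <- mydata13 %>% dplyr::select(6, 1, 2, 3, 4, 5)
names(mydata13) <- c("ID", "Recency", "Frequency", "CC_of_Blood",
                     "Time_from_first_donation", "Donation in 2007")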

mydata14 <- mydata13[ ,c( -6, -1)]
library(pastecs)
round(stat.desc(mydata14))
##              Recency Frequency CC_of_Blood Time_from_first_donation
## nbr.val          748       748         748                      748
## nbr.null           5         0           0                        0
## nbr.na             0         0           0                        0
## min                0         1         250                        2
## max               74        50       12500                       98
## range             74        49       12250                       96
## sum             7111      4125     1031250                    25643
## median             7         4        1000                       28
## mean              10         6        1379                       34
## SE.mean            0         0          53                        1
## CI.mean.0.95       1         0         105                        2
## var               66        34     2131094                      594
## std.dev            8         6        1460                       24
## coef.var           1         1           1                        1

Here I first removed the ID column and then the categorical column that indicates whether the person donated blood in March 2007, leaving only the numeric variables for the descriptive statistics.

These commands give the following results:

The mean time since the last donation is 9.51 months (Recency). The mean number of donations per donor is 5.51 (Frequency). The mean amount of donated blood is 1379 cubic centimetres (CC_of_Blood). The mean time since the first donation is 34.28 months (Time_from_first_donation).

For half of the donors, 7 months or fewer have passed since their last donation; for the other half, more than 7 months have passed (median Recency = 7). Half of the donors have donated 4 times or fewer; the other half more than that (median Frequency = 4). Half of the donors have donated 1000 cc of blood or less (median CC_of_Blood = 1000). For half of the donors, 28 months or fewer have passed since their first donation; for the other half, more than 28 months have passed (median Time_from_first_donation = 28).

The minimum values: the shortest time since the last donation is 0 months; the smallest number of donations is 1; the smallest amount of donated blood is 250 cc; the shortest time since the first donation is 2 months.

The maximum values: the longest time since the last donation is 74 months; the largest number of donations is 50; the largest amount of donated blood is 12500 cc; the longest time since the first donation is 98 months.
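The same key statistics can be cross-checked quickly with base R (a minimal sketch, using the numeric-only data frame mydata14 created above):

sapply(mydata14, mean)    # means of all four numeric variables
sapply(mydata14, median)  # medians
sapply(mydata14, range)   # minimum and maximum of each variable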


hist(mydata14$CC_of_Blood, 
     main = "CC of donated Blood", 
     xlab = "CC of Blood", 
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 12500, by = 100))

The histogram shows that the distribution of donated blood (in cc) is skewed to the right, and beyond about 6000 cc there are quite a few outliers. In the next section I deal with those.

mydata14 <- subset(mydata14, CC_of_Blood < 8000 )

hist(mydata14$CC_of_Blood, 
     main = "CC of donated Blood", 
     xlab = "CC of Blood", 
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 10000, by = 100))

hist(mydata14$CC_of_Blood, 
     main = "CC of donated Blood", 
     xlab = "CC of Blood", 
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 8000, by = 100))

I created a subset that removes all observations with 8000 cc of donated blood or more. The two histograms above show that beyond 8000 cc there are no outliers left.
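For reference, the same trimming can be done with dplyr (a sketch, using the same 8000 cc cut-off as above):

mydata14 <- mydata14 %>% dplyr::filter(CC_of_Blood < 8000)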

##Task 2

mydata <- read.table("~/Desktop/BOOTCAMP R/R Take Home Exam/Task 2/Body mass.csv", 
                     header = TRUE, 
                     sep = ";", 
                    dec = ",")
head(mydata)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2

ID: identification number of each person in the sample.
Mass: body mass of the person in kilograms.

library(psych)
describe(mydata[ , -1])
## mydata[, -1] 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       50        0       42    0.999    62.88    6.259    53.49    56.27 
##      .25      .50      .75      .90      .95 
##    60.23    62.80    64.50    69.30    73.86 
## 
## lowest : 49.7 52.3 53.4 53.6 54.2, highest: 69.3 72.1 75.3 78.5 83.2

The mean of the 50 observations is 62.88 kg, i.e. the average body mass in the sample is 62.88 kg. The description also shows that the smallest body mass is 49.7 kg and the largest is 83.2 kg.
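Note that because Hmisc was attached after psych (see the masking messages at the top), describe() here dispatches to Hmisc::describe. If psych's version is wanted instead, it can be called explicitly (a sketch):

psych::describe(mydata$Mass)  # psych's summary: n, mean, sd, median, skew, kurtosis, se, ...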

sapply(mydata, FUN = mean)
##     ID   Mass 
## 25.500 62.876

sapply() is another way to obtain the mean of each variable at once (the mean of the ID column is, of course, not meaningful).

sd(mydata$Mass)
## [1] 6.011403

The standard deviation of 6.01 kg tells us how much the individual body masses typically deviate from the mean.
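The standard deviation also gives the standard error of the mean, which the t-test below relies on (a minimal sketch; se_mass is just an illustrative name):

se_mass <- sd(mydata$Mass) / sqrt(length(mydata$Mass))  # standard error of the mean
se_mass                                                 # about 0.85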

hist(mydata$Mass,
     main = "Weight of randomly selected 50 ninth graders", 
     xlab = "Body Mass", 
     ylab = "Frequency", 
     breaks = seq(from = 40, to = 85, by = 3),
     col = "yellow")

If there is any skewness in the distribution, it is only a slight skew to the right.

library(ggplot2)
# density of the t distribution with df = 49 (the reference distribution for the one-sample t-test below)
ggplot(NULL, aes(c(-4, 4))) +
  geom_line(stat = "function", fun = dt, args = list(df = 49)) +
  ylab("Density") +
  xlab("t value") +
  labs(title = "t distribution with 49 degrees of freedom")

qt(p = 0.025, df = 49, lower.tail = FALSE)
## [1] 2.009575
qt(p = 0.025, df = 49, lower.tail = TRUE)
## [1] -2.009575
t.test(mydata$Mass,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

The p-value is the probability of obtaining a result at least as extreme as the observed one if H0 were true. Here p = 0.000234 (about 0.02%), far below 0.05, and |t| = 3.97 exceeds the critical value of 2.01 computed above, so I reject H0 that the true mean equals 59.5.
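The same t statistic can be reproduced by hand from the sample mean, the hypothesised mean and the standard error (a sketch; t_manual is just an illustrative name):

t_manual <- (mean(mydata$Mass) - 59.5) / (sd(mydata$Mass) / sqrt(length(mydata$Mass)))
t_manual  # about 3.97, matching the t.test output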


To calculate the effect size we compute the correlation coefficient r from the t statistic and the degrees of freedom:

t <- 3.9711
df <- 49

r <- sqrt(t^2 / (t^2 + df))

In our case r = 0.493, which is close to Cohen's conventional threshold (0.5) for a large effect.
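The same conversion can be wrapped in a small helper function (a sketch; the name t_to_r is only for illustration):

t_to_r <- function(t, df) sqrt(t^2 / (t^2 + df))
t_to_r(3.9711, 49)  # about 0.493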

##Task 3

Import the dataset Apartments.xlsx

mydata33 <- read_excel("~/Desktop/BOOTCAMP R/R Take Home Exam/Task 3/Apartments.xlsx")
head(mydata33)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age - age of the apartment in years.
Distance - distance of the apartment from the city centre in kilometres.
Price - price of the apartment in EUR.
Parking - 1 if the apartment has a parking space, 0 otherwise.
Balcony - 1 if the apartment has a balcony, 0 otherwise.

Change categorical variables into factors

mydata33$ParkingFactor <- factor(mydata33$Parking, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))

mydata33$BalconyFactor <- factor(mydata33$Balcony, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))

Here I converted the categorical variables Parking and Balcony into factors, with the values 0 and 1 relabelled as "No" and "Yes".
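A quick way to verify that the recoding kept every observation is to cross-tabulate each original variable against its new factor (a sketch):

table(mydata33$Parking, mydata33$ParkingFactor)
table(mydata33$Balcony, mydata33$BalconyFactor)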

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata33$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata33$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The p-value is 0.0047 (about 0.5%), which is below the 5% significance level, so it is very unlikely that the true mean price equals 1900 EUR. I reject H0.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm( Age ~  Price,
           data = mydata33)


summary(fit1)
## 
## Call:
## lm(formula = Age ~ Price, data = mydata33)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.184  -6.673  -2.555   7.048  25.803 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.479211   5.627511   5.416 5.82e-07 ***
## Price       -0.005907   0.002740  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.49 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(mydata33$Age, mydata33$Price)
## [1] -0.230255

The estimated regression coefficient for Price is -0.0059; note that the model as written regresses Age on Price, whereas the task asks for Price = f(Age), so this coefficient says that an apartment costing one EUR more is on average about 0.006 years younger. The coefficient of determination (R-squared) is 0.053, so the model explains about 5.3% of the variance in the dependent variable. The coefficient of correlation between Age and Price is -0.23, a weak negative linear relationship.
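The task statement treats Price as the response, so for completeness a sketch of that specification is shown below (fit1_price is only an illustrative name; the output is not re-run here). For a simple regression the R-squared and the correlation are the same in either direction:

fit1_price <- lm(Price ~ Age, data = mydata33)
summary(fit1_price)$r.squared      # equal to r^2, about 0.053
cor(mydata33$Age, mydata33$Price)  # about -0.23, regardless of which variable is the response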

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine whether there is a potential problem with multicollinearity.

scatterplotMatrix(mydata33[ , c(-4, -5, -6, -7)], 
                  smooth = FALSE) 

The scatterplot matrix suggests no obvious problem with multicollinearity: the explanatory variables Age and Distance do not show a strong linear relationship with each other.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm( Price ~  Age + Distance,
           data = mydata33)

Check the multicollinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

The VIF statistics above confirm that there is no multicollinearity: both values are about 1.00, well below the usual threshold of 5.
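With only two explanatory variables, the VIF can also be reproduced by hand as 1 / (1 - R^2) from regressing one predictor on the other (a sketch):

r2_age <- summary(lm(Age ~ Distance, data = mydata33))$r.squared
1 / (1 - r2_age)  # about 1.0018, matching vif(fit2)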

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic case (outlier or unit with large influence).

mydata33$StdResid <- round(rstandard(fit2), 3) 
mydata33$CooksD <- round(cooks.distance(fit2), 3) 

hist(mydata33$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata33$StdResid) 
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata33$StdResid
## W = 0.95303, p-value = 0.003645

The p-value of 0.0037 is below 5%, so the standardized residuals deviate significantly from a normal distribution.

hist(mydata33$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram - Cooks distance")

head(mydata33[order(mydata33$StdResid),], 3)
## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>          <dbl>  <dbl>
## 1     7        2  1760       0       1 No            Yes            -2.15  0.066
## 2    12       14  1650       0       1 No            Yes            -1.50  0.013
## 3    12       14  1650       0       0 No            No             -1.50  0.013
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid
head(mydata33[order(-mydata33$CooksD),], 6) 
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>          <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes           Yes             2.58  0.32 
## 2    43       37  1740       0       0 No            No              1.44  0.104
## 3     2       11  2790       1       0 Yes           No              2.05  0.069
## 4     7        2  1760       0       1 No            Yes            -2.15  0.066
## 5    37        3  2540       1       1 Yes           Yes             1.58  0.061
## 6    40        2  2400       0       1 No            Yes             1.09  0.038
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid
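One common rule of thumb flags cases whose Cook's distance exceeds 4/n; a sketch of that check (which flagged case, if any, to actually remove remains a judgment call):

cutoff <- 4 / nrow(mydata33)     # about 0.047 for 85 cases
which(mydata33$CooksD > cutoff)  # indices of flagged cases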
mydata34 <- mydata33
mydata33 <- mydata33[c(-53),]

# There are no extreme outliers, but I still removed the case that was closest to being problematic: case 53.
fit2 <- lm(formula = Price ~ Age + Distance, data = mydata33)

mydata33$StdResid <- round(rstandard(fit2), 3)
mydata33$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata33$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of Standardized residuals")

shapiro.test(mydata33$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata33$StdResid
## W = 0.93901, p-value = 0.0006104

The p-value is still below 5%, so the standardized residuals remain non-normally distributed even after removing case 53.

hist(mydata33$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram - Cooks distances")

head(mydata33[order(mydata33$StdResid),], 3)
## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>          <dbl>  <dbl>
## 1    12       14  1650       0       1 No            Yes            -1.58  0.015
## 2    12       14  1650       0       0 No            No             -1.58  0.015
## 3    13        8  1800       0       0 No            No             -1.47  0.014
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid
head(mydata33[order(-mydata33$CooksD),], 6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>          <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes           Yes            2.64   0.336
## 2    43       37  1740       0       0 No            No             1.59   0.129
## 3     2       11  2790       1       0 Yes           No             2.01   0.069
## 4    37        3  2540       1       1 Yes           Yes            1.62   0.064
## 5    40        2  2400       0       1 No            Yes            1.13   0.04 
## 6    45       21  1910       0       1 No            Yes            0.988  0.038
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid

Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings.

library(car)
scatterplot(y = rstandard(fit2),                       # standardized residuals
            x = as.numeric(scale(fit2$fitted.values)), # standardized fitted values
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = TRUE,
            smooth = FALSE)

We checked for potential heteroskedasticity with the scatterplot. The spread of the residuals does not grow (or shrink) systematically across the fitted values, so there is no sign of heteroskedasticity.
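A formal complement to the visual check is car's non-constant variance score test, where H0 states that the error variance is constant (a sketch):

ncvTest(fit2)  # a large p-value would support constant variance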

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata33$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata33$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata33$StdResid
## W = 0.93901, p-value = 0.0006104

We formally tested normality with the Shapiro-Wilk test, where H0 states that the variable is normally distributed and H1 that it is not.

The p-value (0.0006) is below 5%, so we reject H0 in favour of H1: the standardized residuals are not normally distributed.
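A normal Q-Q plot is another common way to look at this (a sketch):

qqnorm(mydata33$StdResid, main = "Q-Q plot of standardized residuals")
qqline(mydata33$StdResid)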

Estimate the fit2 again without potentially excluded cases and show the summary of the model. Explain all coefficients.

fit2 <- lm(formula = Price ~ Age + Distance,
           data = mydata34)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata34)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Beta 1 equals -7.934: for every additional year of age, the price of an apartment decreases on average by 7.934 EUR, holding everything else constant.

Beta 2 equals -20.667: for every additional kilometre of distance from the city centre, the price decreases on average by 20.667 EUR, holding everything else constant.
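As a worked example of the fitted equation (the apartment characteristics are chosen only for illustration), a 10-year-old apartment 5 km from the centre would be predicted at roughly 2277:

predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# equivalently: 2460.101 - 7.934 * 10 - 20.667 * 5, which is about 2277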

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(formula =Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = mydata34)

Instead of only including the numerical variables (Age and Distance), I also added the categorical ones (ParkingFactor and BalconyFactor).

With function anova check if model fit3 fits data better than model fit2.

anova(fit3, fit2)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Model 2: Price ~ Age + Distance
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5991088                              
## 2     82 6720983 -2   -729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA comparison gives F = 4.87 with p = 0.010, which is below 0.05, so model fit3 fits the data significantly better than fit2.

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested with the F-statistic shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata34)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2301.667     94.271  24.415  < 2e-16 ***
## Age                -6.799      3.110  -2.186  0.03172 *  
## Distance          -18.045      2.758  -6.543 5.28e-09 ***
## ParkingFactorYes  196.168     62.868   3.120  0.00251 ** 
## BalconyFactorYes    1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

ParkingFactorYes equals 196.17: apartments with a parking space are on average about 196 EUR more expensive than otherwise identical apartments without one, and this effect is statistically significant (p = 0.003). BalconyFactorYes equals 1.94: apartments with a balcony are on average about 2 EUR more expensive, but this effect is not statistically significant (p = 0.974).

The F-statistic at the bottom of the output tests the following hypotheses:

H0: rho squared equals 0 (none of the explanatory variables is linearly related to price)
H1: rho squared is greater than 0 (at least one explanatory variable is linearly related to price)

The p-value (1.8e-11) is far below 0.001, so we reject H0: there is a linear relationship between price and at least one explanatory variable, and the model as a whole is useful.
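A small sketch illustrating the ParkingFactor coefficient: two apartments that are identical except for parking (the values of Age and Distance are chosen only for illustration) differ in predicted price by exactly the ParkingFactorYes estimate:

predict(fit3, newdata = data.frame(Age = 20, Distance = 10,
                                   ParkingFactor = c("No", "Yes"),
                                   BalconyFactor = "No"))
# the difference between the two predictions is about 196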

Save fitted values and calculate the residual for apartment ID2.

mydata34 <- mydata34 %>% mutate(Fittedvalues = fitted.values(fit3))
myresiduals <- residuals(fit3)

print(myresiduals)
##           1           2           3           4           5           6 
## -110.741499  442.588928  -88.806740  260.078971 -412.575790 -126.691071 
##           7           8           9          10          11          12 
##    2.487853 -299.119349   48.055978  278.225949  373.977207 -226.278658 
##          13          14          15          16          17          18 
## -319.381563  -19.069791  205.640748 -174.245621 -258.864697   86.320138 
##          19          20          21          22          23          24 
## -176.994443 -268.919765 -182.433383  345.922299   -8.985715 -195.209942 
##          25          26          27          28          29          30 
##  323.798405  117.886614  378.991292 -412.595734 -203.359865   17.255452 
##          31          32          33          34          35          36 
##  291.296500   63.977207  504.260809  -48.276278 -259.693334 -414.436979 
##          37          38          39          40          41          42 
## -353.275183  526.262587  404.441776  -97.732888 -388.912810  -98.552925 
##          43          44          45          46          47          48 
##   21.164859  154.810666  -59.410945  298.480775 -133.502878 -200.657597 
##          49          50          51          52          53          54 
##  318.267281 -268.289267  144.478518 -155.302792 -459.919210 -230.932193 
##          55          56          57          58          59          60 
##  398.358370   -1.271528  594.366706  412.646046  -60.563303  194.705434 
##          61          62          63          64          65          66 
##  440.654168  -88.806740  262.013730 -412.575790 -124.756312    4.422612 
##          67          68          69          70          71          72 
## -299.119349   48.055978  278.225949  372.042448 -226.278658 -317.446803 
##          73          74          75          76          77          78 
##  -17.135032  205.640748 -174.245621 -256.929937  -57.476185  296.546015 
##          79          80          81          82          83          84 
## -133.502878 -198.722838  318.267281 -268.289267  144.478518 -155.302792 
##          85 
## -133.502878

For apartment ID 2 the residual is 442.59: its actual price is about 442.59 above the value fitted by the model.
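Instead of scanning the full printout, the single residual for apartment ID 2 can be pulled out directly (assuming row 2 corresponds to ID 2; a sketch):

residuals(fit3)[2]                            # about 442.59
mydata34$Price[2] - mydata34$Fittedvalues[2]  # the same value, from the saved fitted values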