Task 1

library(carData)
mydata <- Salaries
rownames(mydata) <- NULL


head(mydata)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000
  1. Explain the data set (the variables used in the analysis)

  • Rank: a factor with levels AssocProf, AsstProf, Prof
  • Discipline: a factor with levels A ("theoretical" departments) or B ("applied" departments)
  • Yrs.since.phd: years since PhD
  • Yrs.service: years of service
  • Sex: a factor with levels Female, Male
  • Salary: nine-month salary, in dollars

  2. Perform some data manipulations
colnames(mydata) <- c("Academic_title", "Discipline", "Years since PhD", "Years of service", "Gender", "Salary")
head(mydata)
##   Academic_title Discipline Years since PhD Years of service Gender Salary
## 1           Prof          B              19               18   Male 139750
## 2           Prof          B              20               16   Male 173200
## 3       AsstProf          B               4                3   Male  79750
## 4           Prof          B              45               39   Male 115000
## 5           Prof          B              40               41   Male 141500
## 6      AssocProf          B               6                6   Male  97000
  3. Present the descriptive statistics for the selected variables and explain at least 3 sample statistics (mean, median, etc.)
library(pastecs)

round(stat.desc(mydata[, -c(1, 2, 5)]), 2)
##              Years since PhD Years of service       Salary
## nbr.val               397.00           397.00       397.00
## nbr.null                0.00            11.00         0.00
## nbr.na                  0.00             0.00         0.00
## min                     1.00             0.00     57800.00
## max                    56.00            60.00    231545.00
## range                  55.00            60.00    173745.00
## sum                  8859.00          6993.00  45141464.00
## median                 21.00            16.00    107300.00
## mean                   22.31            17.61    113706.46
## SE.mean                 0.65             0.65      1520.16
## CI.mean.0.95            1.27             1.28      2988.60
## var                   166.07           169.16 917425865.05
## std.dev                12.89            13.01     30289.04
## coef.var                0.58             0.74         0.27
  1. Mean: From the output, we can infer that the average years since PhD is 22.31 years, the average years of service is 17.61 years, and the average nine-month salary is 113706.46 dollars.

  2. Median: From the output, we can infer that the median (the middle value when the data are ordered) years since PhD is 21 years, the median years of service is 16, and the median nine-month salary is 107300 dollars.

  3. Range: From the output, we can see that the difference between the highest and lowest number of years since PhD is 55, and the difference between the highest and lowest number of years of service is 60. The difference between the highest and lowest nine-month salary is 173745 dollars.
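These figures can be cross-checked with base R functions (a minimal sketch; the results should match the stat.desc table above):

mean(mydata$Salary)                    # average nine-month salary
median(mydata$Salary)                  # median nine-month salary
diff(range(mydata$`Years since PhD`))  # range of years since PhD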

hist(mydata$Salary, 
     main = "Distribution of Salaries for Professors", 
     xlab = "Nine-month salary, in dollars", 
     ylab = "Frequency", 
     breaks = seq(from = 55000, to = 250000, by = 5000))

We can see that the distribution of nine-month salary is skewed to the right, though not strongly; the distribution is close to normal.
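The direction of the skew can be checked numerically with the moment coefficient of skewness, computed by hand so no extra package is needed (a minimal sketch):

x <- mydata$Salary
mean((x - mean(x))^3) / sd(x)^3  # a positive value indicates right skew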

hist(mydata$"Years of service", 
     main = "Distribution of Years of Service", 
     xlab = "Years of service", 
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 60, by = 2))

We can see that the distribution of Years of service is slightly skewed to the right.

library(ggplot2)

ggplot(mydata, aes(x = Academic_title, y = Salary)) +
  geom_boxplot() +
  theme_bw()

As expected, salary increases with academic title.

ggplot(mydata, aes(x = Gender, y = Salary)) +
  geom_boxplot() +
  theme_bw()

There are a few outliers with unusually high salaries, so I removed the three most extreme cases.

head(mydata[order(-mydata$Salary),])
##     Academic_title Discipline Years since PhD Years of service Gender Salary
## 44            Prof          B              38               38   Male 231545
## 365           Prof          A              43               43   Male 205500
## 250           Prof          A              29                7   Male 204000
## 272           Prof          A              42               18   Male 194800
## 78            Prof          B              26               19   Male 193000
## 331           Prof          B              49               60   Male 192253
newdata <- mydata[-c(44, 365, 250), ]
ggplot(newdata, aes(x = Gender, y = Salary)) +
  geom_boxplot() +
  theme_bw()
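Rather than picking outliers by eye, the boxplot's own outliers can also be listed programmatically; a minimal sketch using base R's boxplot.stats(), which applies the usual 1.5 × IQR rule:

boxplot.stats(mydata$Salary)$out  # salaries flagged as outliers by the boxplot rule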

The scatterplot matrix below shows the relationships among Years since PhD, Years of service and Salary. It is evident that Years since PhD and Years of service are strongly correlated with each other, while neither is strongly correlated with Salary.

library(car)
scatterplotMatrix(mydata[, c(-1, -2, -5)],
                  smooth = FALSE)
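The visual impression from the matrix can be quantified with the correlation coefficients (a quick check with cor()):

round(cor(mydata[, c("Years since PhD", "Years of service", "Salary")]), 2)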

Task 2

mydata2 <- read.table("~/Downloads/R Take Home Exam/Task 2/Body mass.csv", header=TRUE, sep=";", dec=",")

head(mydata2)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
  1. Show the descriptive statistics of body mass and its distribution with the histogram.

Description:

#install.packages("pastecs")
library(pastecs)

round(stat.desc(mydata2[-1]), 2)
##                 Mass
## nbr.val        50.00
## nbr.null        0.00
## nbr.na          0.00
## min            49.70
## max            83.20
## range          33.50
## sum          3143.80
## median         62.80
## mean           62.88
## SE.mean         0.85
## CI.mean.0.95    1.71
## var            36.14
## std.dev         6.01
## coef.var        0.10
hist(mydata2$Mass,
     main = "Distribution of Mass", 
     xlab = "Kilograms", 
     ylab = "Frequency", 
     breaks = seq(from = 40, to = 90, by = 1))

  2. Test the following hypothesis with program R: H0: mu = 59.5
#install.packages("psych")
library(ggplot2)

t.test(mydata2$Mass, mu=59.5, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

We can reject the null hypothesis that the average body mass is 59.5 kg (p = 0.000234). The sample mean is 62.876, and the 95% confidence interval (61.168, 64.584) does not contain 59.5.
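The t statistic can be reproduced by hand from the descriptive statistics above (a quick check):

xbar <- 62.876  # sample mean
s    <- 6.01    # standard deviation
n    <- 50      # sample size
(xbar - 59.5) / (s / sqrt(n))  # about 3.97, matching t.test()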

  3. Explain the results and interpret the effect size
t <- 3.9711
r <- sqrt(t^2 / (t^2 + 49))
r
## [1] 0.4934295

r = 0.493, so by Cohen's guidelines the effect size is medium, bordering on large (r = 0.5).

Task 3

Import the dataset Apartments.xlsx

#install.packages("readxl")
library("readxl")
mydata3 <- read_excel("~/Downloads/R Take Home Exam/Task 3/Apartments.xlsx")


head(mydata3)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors

mydata3$Parking <- factor(mydata3$Parking, levels = c(0, 1), labels = c("No", "Yes"))
mydata3$Balcony <- factor(mydata3$Balcony, levels = c(0, 1), labels = c("No", "Yes"))

head(mydata3)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl> <fct>   <fct>  
## 1     7       28  1640 No      Yes    
## 2    18        1  2800 Yes     No     
## 3     7       28  1660 No      No     
## 4    28       29  1850 No      Yes    
## 5    18       18  1640 Yes     Yes    
## 6    28       12  1770 No      Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

We can reject the null hypothesis that the average price is 1900 EUR (p = 0.004731): the sample mean is 2018.94, and the 95% confidence interval (1937.44, 2100.44) does not contain 1900.

#install.packages("psych")
library(ggplot2)

t.test(mydata3$Price, mu=1900, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

  • The estimate of the regression coefficient: b1 = -8.975. With every additional year of age, the average price of the apartment per m2 decreases by 8.975 EUR (p = 0.034), if everything else remains unchanged.
  • The coefficient of correlation is -0.230, indicating a weak negative linear relationship between Age and Price.
  • The coefficient of determination is 0.053: Age explains about 5.3% of the variability in Price.
fit1 <- lm(Price ~ Age, data = mydata3)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(mydata3$Age, mydata3$Price)
## [1] -0.230255
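For a simple regression, the coefficient of determination is just the squared correlation coefficient, which gives a quick consistency check:

cor(mydata3$Age, mydata3$Price)^2  # 0.05302, matching Multiple R-squared above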

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine whether there is a potential problem with multicollinearity.

The scatterplot between the two independent variables (Age and Distance) shows no clear linear relationship, so multicollinearity does not appear to be present. To know for certain, we would have to calculate the VIF.

library(car)
scatterplotMatrix(mydata3[, c(3, 1, 2)], smooth = FALSE)

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydata3)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Both VIF values are about 1.00, far below the usual cutoff of 5, so there is no multicollinearity problem.
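The VIF values can also be reproduced by hand, since VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing one predictor on the other (a minimal sketch):

r2 <- summary(lm(Age ~ Distance, data = mydata3))$r.squared
1 / (1 - r2)  # about 1.0018, matching vif(fit2)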

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

No case is severely problematic as an outlier or a high-influence unit, but I removed the one that stands out the most (the case with the largest Cook's distance).

library(tidyverse)
mydata3$StdResid <- round(rstandard(fit2), 3)
mydata3$CooksD <- round(cooks.distance(fit2), 3)
view(mydata3)

hist(mydata3$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

hist(mydata3$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

head(mydata3[order(mydata3$StdResid),], 3)
## # A tibble: 3 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     7        2  1760 No      Yes        -2.15  0.066
## 2    12       14  1650 No      Yes        -1.50  0.013
## 3    12       14  1650 No      No         -1.50  0.013
head(mydata3[order(-mydata3$CooksD),], 6) 
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32 
## 2    43       37  1740 No      No          1.44  0.104
## 3     2       11  2790 Yes     No          2.05  0.069
## 4     7        2  1760 No      Yes        -2.15  0.066
## 5    37        3  2540 Yes     Yes         1.58  0.061
## 6    40        2  2400 No      Yes         1.09  0.038
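Instead of eyeballing the sorted table, influential cases can be flagged with a rule of thumb; a sketch assuming the common 4/n cutoff for Cook's distance:

which(cooks.distance(fit2) > 4 / nrow(mydata3))  # cases above the 4/n threshold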
newdata1 <- mydata3[-53, ]

fit2 <- lm(Price ~ Age + Distance, data = newdata1)

newdata1$StdResid <- round(rstandard(fit2), 3)
newdata1$CooksD <- round(cooks.distance(fit2), 3)

hist(newdata1$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

hist(newdata1$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

We check the assumption with a scatterplot between standardized residuals and standardized fitted values. The variance does not grow with the fitted values, so the assumption of homoskedasticity does not appear to be violated. The Breusch-Pagan test below agrees: p = 0.671 > 0.05, so we cannot reject constant variance.

newdata1$StdFittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = newdata1$StdResid, x = newdata1$StdFittedValues, 
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.1801242 
##  Prob > Chi2   =    0.6712665

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

Hypotheses:

H0: the variable is normally distributed. H1: the variable is not normally distributed.

The Shapiro-Wilk p-value = 0.0006104 is smaller than alpha = 0.05, so we reject H0 and accept H1: the standardized residuals are not distributed normally.

The assumption of normally distributed errors is violated, meaning the t-based inference may not be reliable.

#install.packages("olsrr")
library(olsrr)
shapiro.test(newdata1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  newdata1$StdResid
## W = 0.93901, p-value = 0.0006104
hist(newdata1$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

Estimate the fit2 again without potentially excluded cases and show the summary of the model. Explain all coefficients.

  • b1 = -7.934: with every additional year of age, the apartment's price per m2 decreases on average by 7.934 EUR (p = 0.016), if everything else remains unchanged.
  • b2 = -20.667: with every additional km from the city centre, the apartment's price per m2 decreases on average by 20.667 EUR (p < 0.001), if everything else remains unchanged.
fit2 <- lm(Price ~ Age + Distance, data = mydata3)
vif(fit2)
##      Age Distance 
## 1.001845 1.001845
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata3)

With the function anova(), check whether model fit3 fits the data better than model fit2.

fit3 uses 2 more degrees of freedom, and the partial F-test gives p = 0.010 < 0.05, so we can say that fit3, with the two added variables, fits the data significantly better than fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
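The F statistic in the table can be reproduced by hand from the two residual sums of squares (a quick check):

((6720983 - 5991088) / 2) / (5991088 / 80)  # about 4.87, matching the ANOVA output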

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis that is being tested with the F-statistic shown at the bottom of the output?

Parking: given Age, Distance and Balcony, apartments with a parking place are on average 196.17 EUR per m2 more expensive (p = 0.0025 < alpha = 0.05). Balcony: p = 0.974 > alpha = 0.05, so we cannot claim that the presence of a balcony affects the price (we cannot reject the null hypothesis).

Hypotheses for the F-statistic: H0: rho squared = 0; H1: rho squared > 0.

The p-value < 0.001, so we reject H0 at p < 0.001. There is a linear relationship between the dependent variable (Price) and at least one explanatory variable; the model is useful (rho squared > 0).

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11
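With fit3 estimated, the model can also be used for prediction; a sketch with hypothetical apartment values (Age = 10 years, Distance = 5 km, parking but no balcony):

predict(fit3, newdata = data.frame(Age = 10, Distance = 5,
                                   Parking = "Yes", Balcony = "No"))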

Save the fitted values and calculate the residual for apartment ID 2.

mydata3$FittedValues <- fitted.values(fit3) 
mydata3$Residuals <- residuals(fit3) 

head(mydata3)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony StdResid CooksD FittedValues Residuals
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>        <dbl>     <dbl>
## 1     7       28  1640 No      Yes       -0.665  0.007        1751.    -111. 
## 2    18        1  2800 Yes     No         1.78   0.03         2357.     443. 
## 3     7       28  1660 No      No        -0.594  0.006        1749.     -88.8
## 4    28       29  1850 No      Yes        0.754  0.008        1590.     260. 
## 5    18       18  1640 Yes     Yes       -1.07   0.005        2053.    -413. 
## 6    28       12  1770 No      Yes       -0.778  0.005        1897.    -127.

Residual for apartment ID 2 = 2800 - 2357.412 = 442.588.
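The same residual can also be pulled directly from the saved column (a quick check):

mydata3$Residuals[2]  # 442.588, i.e. Price minus FittedValues for row 2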