# Task 1

library(car)
## Loading required package: carData
data(package = .packages(all.available = TRUE))
mydata <- Prestige 
summary(Prestige)
##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517
head(Prestige)
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof

Description of variables:

The Prestige data frame has 102 rows and 6 columns. The observations are occupations.

This data frame contains the following columns:

education: Average education of occupational incumbents, years, in 1971.

income: Average income of incumbents, dollars, in 1971.

women: Percentage of incumbents who are women.

prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.

census: Canadian Census occupational code.

type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.
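
The note "out of order" means the factor levels are stored alphabetically (bc, prof, wc) rather than by collar hierarchy; they could be reordered if an analysis depended on level order. A minimal sketch (optional; none of the analyses below rely on it):

levels(Prestige$type)
type_ordered <- factor(Prestige$type, levels = c("bc", "wc", "prof"))
table(type_ordered)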

Prestige1 <- Prestige[, c(-5, -3)]

Deleted variables 5 (census) and 3 (women, the percentage of female incumbents) because they are not relevant to the conclusions below.

library(pastecs)
round(stat.desc(Prestige1[, -4]),2)
##              education      income prestige
## nbr.val         102.00      102.00   102.00
## nbr.null          0.00        0.00     0.00
## nbr.na            0.00        0.00     0.00
## min               6.38      611.00    14.80
## max              15.97    25879.00    87.20
## range             9.59    25268.00    72.40
## sum            1095.28   693386.00  4777.00
## median           10.54     5930.50    43.60
## mean             10.74     6797.90    46.83
## SE.mean           0.27      420.41     1.70
## CI.mean.0.95      0.54      833.98     3.38
## var               7.44 18027855.55   295.99
## std.dev           2.73     4245.92    17.20
## coef.var          0.25        0.62     0.37

The median income (the value below and above which lie 50% of the observations) is 5930.50 dollars, the median years of education is 10.54, and the median occupational prestige score is 43.60.

The arithmetic mean of education is 10.74 years, the mean income is 6797.90 dollars, and the mean prestige score is 46.83.

The coefficient of variation is highest for income (0.62), meaning income has the greatest relative dispersion around its mean. It is followed by prestige (0.37) and education (0.25).
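
The coefficient of variation reported by stat.desc() is simply the standard deviation divided by the mean, which can be verified by hand; for income:

round(sd(Prestige1$income) / mean(Prestige1$income), 2)  # 4245.92 / 6797.90 = 0.62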

library(car)
scatterplot(y = Prestige1$income, 
            x = Prestige1$prestige, 
            ylab = "Average income", 
            xlab = "Prestige score for occupation", 
            smooth = FALSE)

library(car)
scatterplot(y = Prestige1$income, 
            x = Prestige1$education, 
            ylab = "Average income", 
            xlab = "Years of education", 
            smooth = FALSE)

By examining the scatterplots we can conclude the following:

Scatterplot 1 (prestige score for occupation vs. income): income is positively correlated with the prestige score; in other words, the higher the prestige score, the higher the income.

Scatterplot 2 (years of education vs. income): income is also positively correlated with years of education; in other words, the more years of education, the higher the income.
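
To back the visual impression with numbers, the correlation coefficients could also be computed (a quick sketch; output not shown):

cor(Prestige1$prestige, Prestige1$income)
cor(Prestige1$education, Prestige1$income)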


# Task 2

mydata <- read.table("~/IMB/BootcampR/Takehomeexam/R Take Home Exam/Task 2/Body mass.csv", header=TRUE, sep=";", dec=",")

head(mydata)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
library(pastecs)
round(stat.desc(mydata),2)
##                   ID    Mass
## nbr.val        50.00   50.00
## nbr.null        0.00    0.00
## nbr.na          0.00    0.00
## min             1.00   49.70
## max            50.00   83.20
## range          49.00   33.50
## sum          1275.00 3143.80
## median         25.50   62.80
## mean           25.50   62.88
## SE.mean         2.06    0.85
## CI.mean.0.95    4.14    1.71
## var           212.50   36.14
## std.dev        14.58    6.01
## coef.var        0.57    0.10

Description of variables:

ID: ID of a 9th grader.

Mass: body mass of the 9th grader.

hist(mydata$Mass, 
     main = "Distribution of the variable Mass", 
     ylab = "Frequency", 
     xlab = "Mass", 
     breaks = seq(40, 90, by = 5), 
     right = FALSE)

Histogram showing the frequency distribution of body mass across the 50 ninth-graders.

library(ggplot2)
ggplot(NULL, aes(c(-4, 4))) +
  geom_line(stat = "function", fun = dt, args = list (df =49 )) +
  ylab("Density") + 
  xlab("Sample estimates") +
  labs(title="Distribution of sample estimates")

qt(p = 0.025, df = 49, lower.tail = FALSE)
## [1] 2.009575
qt(p = 0.025, df = 49, lower.tail = TRUE)
## [1] -2.009575
t.test(mydata$Mass,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

H0: μ = 59.5
H1: μ ≠ 59.5

We can reject the null hypothesis for two reasons: (1) the t-statistic, 3.9711, is greater than the critical value 2.009575; (2) the hypothesized population mean of 59.5 lies outside the 95% confidence interval (61.17, 64.58), whose centre is the sample mean of 62.876.
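
The t-statistic can also be reproduced by hand from its definition, t = (x̄ − μ0) / (s/√n), as a check (it matches the 3.9711 above):

(mean(mydata$Mass) - 59.5) / (sd(mydata$Mass) / sqrt(length(mydata$Mass)))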


# Task 3


####  Import the dataset Apartments.xlsx

#install.packages("readxl")
library(readxl)
Apartments <- read_excel("~/IMB/BootcampR/Takehomeexam/R Take Home Exam/Task 3/Apartments.xlsx")

Description: the dataset contains, for each of the 85 apartments, the variables Age, Distance (presumably from the city centre), Price (in EUR) and the binary indicators Balcony and Parking (1 = yes, 0 = no).

#### Change categorical variables into factors

Apartments$Balconyf <- factor(Apartments$Balcony, levels = c("0", "1"), labels = c("No", "Yes"))
Apartments$Parkingf <- factor(Apartments$Parking, levels = c("0", "1"), labels = c("No", "Yes"))

Apartments1 <- Apartments

#### Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude?

t.test(Apartments$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can reject the null hypothesis and conclude, at the 5% significance level, that the population mean price is not equal to 1900 EUR: p = 0.0047, and the 95% confidence interval (1937.4, 2100.4) does not contain 1900.

#### Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(formula = Price ~ Age, data = Apartments1)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(Apartments1$Price, Apartments1$Age)
## [1] -0.230255
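
Interpretation: the estimated regression coefficient for Age is -8.975, meaning that each additional unit of Age (presumably a year) is associated with an average price lower by about 9 EUR. The coefficient of correlation between Price and Age is -0.23, a weak negative linear association. The coefficient of determination (Multiple R-squared = 0.053) shows that Age explains only about 5.3% of the variability of Price; in simple regression it is just the squared correlation, which can be checked directly:

cor(Apartments1$Price, Apartments1$Age)^2  # (-0.230255)^2 = 0.05302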

#### Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

scatterplotMatrix(Apartments[,c(1,2,3)], 
                  smooth = FALSE)

Based on the scatterplot matrix, the explanatory variables Age and Distance show no clear linear relationship with each other, so we can assume multicollinearity is not present.
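
A numeric companion to the matrix is the correlation matrix of the three variables (sketch; output not shown):

cor(Apartments[, c("Age", "Distance", "Price")])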

#### Estimate the multiple regression function: Price = f(Age, Distance). Save it in an object named fit2.

fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

#### Check the multicollinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Both VIF values are about 1.002, well below the usual threshold of 5, so we can conclude there is no multicollinearity.
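
By definition, VIF_j = 1/(1 − R²_j), where R²_j is the coefficient of determination from regressing explanatory variable j on the remaining explanatory variables. A sketch that should reproduce the value reported above for Age:

r2_age <- summary(lm(Age ~ Distance, data = Apartments))$r.squared
1 / (1 - r2_age)  # ≈ 1.001845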

#### Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

Apartments$StdResid <- round(rstandard(fit2), 3) 
Apartments$CooksD <- round(cooks.distance(fit2), 3)

hist(Apartments$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95303, p-value = 0.003645
hist(Apartments$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

The Shapiro-Wilk p-value is below 5%, so the standardized residuals are not normally distributed, and the histograms point to a few influential cases that could distort the regression conclusions. We therefore removed the cases with a Cook's distance greater than 4/n (here 4/85).

excluded_cases <- which(Apartments$CooksD > 4/85)  # rows 22, 33, 38, 53, 55
Apartments1 <- Apartments[-excluded_cases, ]
head(Apartments[order(Apartments$StdResid),], 3)
## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     7        2  1760       0       1 Yes      No          -2.15  0.066
## 2    12       14  1650       0       1 Yes      No          -1.50  0.013
## 3    12       14  1650       0       0 No       No          -1.50  0.013
head(Apartments[order(-Apartments$CooksD),], 6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    24        5  1830       1       0 No       Yes        -1.19   0.012
## 2     7        2  1760       0       1 Yes      No         -2.15   0.066
## 3    18        4  2110       1       1 Yes      Yes        -0.44   0.001
## 4    22       16  1710       0       1 Yes      No         -0.861  0.003
## 5    18        1  2800       1       0 No       Yes         1.78   0.03 
## 6     1        5  2420       1       1 Yes      Yes         0.256  0.001

I removed the cases with a Cook's distance greater than 4/n from the data; the regression now has to be run again to obtain estimates that are not distorted by these influential cases.

fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments1)

Apartments1$StdResid <- round(rstandard(fit2), 3)
Apartments1$CooksD <- round(cooks.distance(fit2), 3)

hist(Apartments1$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

hist(Apartments1$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

The histograms show that no extreme outliers are left, so we can run the Shapiro-Wilk test again.

shapiro.test(Apartments1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments1$StdResid
## W = 0.94154, p-value = 0.001166

head(Apartments1[order(Apartments1$StdResid),], 3)
## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    12       14  1650       0       1 Yes      No          -1.62  0.017
## 2    12       14  1650       0       0 No       No          -1.62  0.017
## 3    13        8  1800       0       0 No       No          -1.56  0.017
head(Apartments1[order(-Apartments1$CooksD),], 6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    45       21  1910       0       1 Yes      No           1.26  0.077
## 2     8       26  2300       1       1 Yes      Yes          1.96  0.064
## 3    40        2  2400       0       1 Yes      No           1.20  0.055
## 4     8        2  2820       1       0 No       Yes          1.73  0.046
## 5    10        1  2810       0       0 No       No           1.66  0.04 
## 6    18        1  2800       1       0 No       Yes          1.89  0.038

#### Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

Apartments1$StdfittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = Apartments1$StdResid, x = Apartments1$StdfittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE)

We checked for heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. The variance of the residuals does not appear to change systematically with the fitted values, so we conclude that the homoskedasticity assumption is not violated.
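
The visual check could be complemented with a formal test, e.g. the non-constant variance score test from the already loaded car package (sketch; output not shown):

ncvTest(fit2)  # H0: constant error variance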

#### Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments1$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(Apartments1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments1$StdResid
## W = 0.94154, p-value = 0.001166

H0: The standardized residuals are normally distributed.
H1: The standardized residuals are not normally distributed.

The p-value (0.001166) is less than α = 5%, so we reject H0 and accept H1.

The assumption of normally distributed errors is violated, which means that the t-based tests and confidence intervals may not be reliable.
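
A normal Q-Q plot is another way to see where the residuals deviate from normality (sketch, using the already loaded car package):

qqPlot(Apartments1$StdResid, ylab = "Standardized residuals")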

#### Estimate fit2 again without the potentially excluded cases and show the summary of the model. Explain all coefficients.

fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments)
vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11
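
Interpretation of the coefficients: the intercept (2460.10) is the estimated average price of an apartment with Age = 0 and Distance = 0. Holding Distance constant, each additional unit of Age lowers the price on average by 7.93 EUR (p = 0.016); holding Age constant, each additional unit of Distance lowers the price on average by 20.67 EUR (p < 0.001). Together the two variables explain about 44% of the variability of Price (R² = 0.4396).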

#### Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save it in an object named fit3.

fit3 <- lm(formula = Price ~ Age + Distance + Balconyf + Parkingf,
           data = Apartments)
fit3
## 
## Call:
## lm(formula = Price ~ Age + Distance + Balconyf + Parkingf, data = Apartments)
## 
## Coefficients:
## (Intercept)          Age     Distance  BalconyfYes  ParkingfYes  
##    2301.667       -6.799      -18.045        1.935      196.168

#### With the function anova, check if model fit3 fits the data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Balconyf + Parkingf
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
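
The p-value of the partial F-test (0.01007) is below α = 5%, so we reject the hypothesis that the coefficients of the two added variables are both zero: fit3 fits the data significantly better than fit2.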

#### Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis which is being tested with the F-statistic, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Balconyf + Parkingf, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance    -18.045      2.758  -6.543 5.28e-09 ***
## BalconyfYes    1.935     60.014   0.032  0.97436    
## ParkingfYes  196.168     62.868   3.120  0.00251 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Parking: holding Age, Distance and Balcony constant, apartments with a parking space are on average about 196 EUR more expensive, a statistically significant effect (p = 0.0025 < α = 0.05). Balcony: p = 0.97 > α = 0.05, so we cannot reject the null hypothesis that having a balcony has no effect on the price.

F-statistic: H0: ρ² = 0, H1: ρ² > 0 (where ρ² is the population coefficient of determination).

The p-value (1.849e-11) is far below 0.001, so we reject H0: there is a linear relationship between Price and at least one explanatory variable, i.e. the population coefficient of determination is greater than zero and the model as a whole is useful.

#### Save the fitted values and calculate the residual for apartment ID 2.

Apartments$Fitted <- fitted(fit2)
Apartments$FittedRes <- Apartments$Price - Apartments$Fitted  # residual = observed - fitted
print(Apartments[2, ])
## # A tibble: 1 × 10
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD Stdfi…¹
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>   <dbl>
## 1    18        1  2800       1       0 No       Yes          1.89  0.038    1.16
## # … with abbreviated variable name ¹​StdfittedValues[,1]