Take-Home Exam - Blanka Pucer

Exercise 1

I will first choose the data set.

mydata <- force(swiss)
head(mydata)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Description of the chosen data set:

Fertility: common standardized fertility measure
Agriculture: percentage (%) of males involved in agriculture as occupation
Examination: percentage (%) of draftees receiving highest mark on army examination
Education: percentage (%) of draftees educated beyond primary school
Catholic: percentage (%) of those who are catholic (as opposed to protestant)
Infant.Morality: live births who live less than 1 year

summary(mydata)

##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

I will now round my data to two decimal points.

library(pastecs)
round(stat.desc(mydata), 2)

##              Fertility Agriculture Examination Education Catholic
## nbr.val          47.00       47.00       47.00     47.00    47.00
## nbr.null          0.00        0.00        0.00      0.00     0.00
## nbr.na            0.00        0.00        0.00      0.00     0.00
## min              35.00        1.20        3.00      1.00     2.15
## max              92.50       89.70       37.00     53.00   100.00
## range            57.50       88.50       34.00     52.00    97.85
## sum            3296.70     2381.00      775.00    516.00  1933.76
## median           70.40       54.10       16.00      8.00    15.14
## mean             70.14       50.66       16.49     10.98    41.14
## SE.mean           1.82        3.31        1.16      1.40     6.08
## CI.mean.0.95      3.67        6.67        2.34      2.82    12.25
## var             156.04      515.80       63.65     92.46  1739.29
## std.dev          12.49       22.71        7.98      9.62    41.70
## coef.var          0.18        0.45        0.48      0.88     1.01
##              Infant.Mortality
## nbr.val                 47.00
## nbr.null                 0.00
## nbr.na                   0.00
## min                     10.80
## max                     26.60
## range                   15.80
## sum                    937.30
## median                  20.00
## mean                    19.94
## SE.mean                  0.42
## CI.mean.0.95             0.86
## var                      8.48
## std.dev                  2.91
## coef.var                 0.15

I will now rename the variable “Agriculture” to “Farming”.

colnames(mydata) [2] <- "Farming"
head(mydata)

##              Fertility Farming Examination Education Catholic Infant.Mortality
## Courtelary        80.2    17.0          15        12     9.96             22.2
## Delemont          83.1    45.1           6         9    84.84             22.2
## Franches-Mnt      92.5    39.7           5         5    93.40             20.2
## Moutier           85.8    36.5          12         7    33.77             20.3
## Neuveville        76.9    43.5          17        15     5.16             20.6
## Porrentruy        76.1    35.3           9         7    90.57             26.6

I will now create new data.frame, which includes only those cantons where infant mortality is between 18 and 20.

mydata2 <- mydata[mydata$Infant.Mortality >= 18 & mydata$Infant.Mortality <= 20 , ]

I will now calculate the arithmetic mean for Fertility. As we can see below, it is 70.14.

mean(mydata$Fertility)

## [1] 70.14255

I will now calculate the standard deviation for Infant.Mortality. As we can see below, it is 2.91.

sd(mydata$Infant.Mortality)

## [1] 2.912697

Let’s calculate the median for Infant.Mortality. As we can see below, it is 20.

median(mydata$Infant.Mortality)

## [1] 20

library(ggplot2)

ggplot(mydata, aes(y = Examination, x = Education)) +
  geom_point()

I have chosen to observe the correlation between Education (percentage of draftees educated beyond primary school), which is on x-axis and is my independent variable, and Examination (percentage of draftees receiving highest mark on army examination), which is on y-axis and is my dependent variable. Observing the graph, there seems to be a positive correlation between the two variables.

Exercise 2

I have renamed the data set “Business-School” to “mydata3” so that it is easier to work with.

library(readxl)
Business_School <- read_excel("C:/Users/blapu/OneDrive/Desktop/IMB 2024 - R Program/Bootcamp/R data/Business School.xlsx")

mydata3 <- Business_School
head(mydata3)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

library(ggplot2)
ggplot(mydata3, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "orange", color = "yellow")

Now I will show the descriptive statistics of the Annual Salary.

summary(mydata3$`Annual Salary`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

We can see that the average annual salary is 109,058.

library(ggplot2)
ggplot(mydata3, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 5000, colour = "orange") +
  ylab("Frequency")

options(scipen = 999)

The distribution is skewed to the right. Most graduates have an average annual salary of 100,000 euros.

t.test(mydata3$`MBA Grade`,
       mu = 74,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata3$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Our H0 (null hypothesis) is that the average MBA grade is 74, and our H1 (alternative hypothesis) is that the average MBA grade is not 74.

p-value = 0.01. Because our p-value is smaller than 0.05 (our significance value), we can reject the null hypothesis. Reading the “mean of x”, we can see that the average MBA grade is higher than 74 - it is around 76.

I will now test the effect size.

library(effectsize)

effectsize::cohens_d(mydata3$`MBA Grade`,
                     mu = 74)

## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

effectsize::interpret_cohens_d(0.27, rules = "sawilowsky2009")

## [1] "small"
## (Rules: sawilowsky2009)

The effect size is small.

Exercise 3

Import the dataset Apartments.xlsx

I have renamed “Apartments” to data4 so that it is easier to work with. I have displayed the first six columns.

library(readxl)
Apartments <- read_excel("C:/Users/blapu/OneDrive/Desktop/IMB 2024 - R Program/Bootcamp/R data/Task 3/Apartments.xlsx")

mydata4 <- Apartments
head(mydata4)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata4$Parking <- factor (mydata4$Parking,
                           levels = c(0,1),
                           labels = c("No", "Yes"))

mydata4$Balcony <- factor (mydata4$Balcony,
                           levels = c(0,1),
                           labels = c("No", "Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata4$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata4$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

If H0: Mu_Price = 1900 eur, we can say that H1: Mu-Price =/= 1900 eur. We can see that the p-value = 0.005, which is less than 0.05 (our significance level). Thus, we can reject H0. By “mean of x”, we can see that the average price of an apartment is around 2019 eur.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit <- lm(Price ~ Age,
          data = mydata4)
summary(fit)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108 <0.0000000000000002 ***
## Age           -8.975      4.164  -2.156               0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The regression coefficient for age is -8.975. If the age of the apartment increases by one year, the price of the apartment will be on average 8.98 euros lower, assuming other explanatory variables are constant.

The coefficient of determination is 5.3%. This means that 5.3% of variability in price is explained by linear effect of age, distance, parking, and balcony.

sqrt(summary(fit)$r.squared)

## [1] 0.230255

The coefficient of correlation is 0.230. This indicates a weak positive correlation between price and age.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

scatterplotMatrix(mydata4[ , c(3, 1, 2)],
                  smooth = FALSE)

There is no problem with potential multicolinearity, meaning that there is no correlation of explanatory variables, which are in our case age and distance. The graph (third row, second column) is not that steep, so I think there is no problem of multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata4)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

As said in the lectures, the higher the VIF statistic, the more strongly the explanatory variable is linked to other explanatory variables. VIF is not larger than 5 for either of the two values.Thus, there is no strong multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata4$StdResid <- round(rstandard(fit2), 2)
mydata4$CooksD <- round (cooks.distance(fit2), 2)

hist(mydata4$StdResid,
     xlab = "Standardised residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

hist(mydata4$CooksD,
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances")

There are no outliers as all units are within the interval [-3, +3]. There are, however, some units with high influence, which we can see by the large gap between 0.10 and 0.35. Let’s remove those units.

head(mydata4[order(-mydata4$CooksD), ], 6)

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58   0.32
## 2    43       37  1740 No      No          1.44   0.1 
## 3     2       11  2790 Yes     No          2.05   0.07
## 4     7        2  1760 No      Yes        -2.15   0.07
## 5    37        3  2540 Yes     Yes         1.58   0.06
## 6    40        2  2400 No      Yes         1.09   0.04

Going to mydata4 and clicking the arrow at CooksD, I have realized that the unit with high influence is apartment number 38. I will now remove it.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata4 <- mydata4 %>%
  filter (!Price %in% 2180)

hist(mydata4$CooksD,
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances")

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata4)

mydata4$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot (y = mydata4$StdResid, x = mydata4$StdFitted,
             ylab = "Standardised residuals",
             xlab = "Standardised fitted values",
             boxplots = FALSE, 
             regline = FALSE,
             smooth = FALSE)

## Warning in plot.window(...): "regline" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "regline" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter

## Warning in box(...): "regline" is not a graphical parameter

## Warning in title(...): "regline" is not a graphical parameter

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    2.927455 
##  Prob > Chi2   =    0.08708469

Let’s check heteroskedasticity with Breusch-Pagan heteroskedasticity test. We can say: H0 = the variance is constant, and H1 = the variance is not constant.

We can see that the p-value = 0.33. Thus, we cannot reject H0. p-value is larger than 0.05.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata4$StdResid,
     xlab = "Standardised residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

As said above, there are no units outside of the [-3, +3] interval. Let’s formally test this assumption. We can say that H0 = errors are distributed normally, and H1 = errors are not distributed normally.

shapiro.test(mydata4$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata4$StdResid
## W = 0.9488, p-value = 0.00219

We can see that the p-value = 0.002, which is less than 0.05 (our chosen significance level). This means that we can reject H0.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221 < 0.0000000000000002 ***
## Age           -6.464      3.159  -2.046                0.044 *  
## Distance     -22.955      2.786  -8.240     0.00000000000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 0.000000000002339

If age increases by one year, the price of the apartment per m2 will on average be about 7.93 euros lower (p-value = 0.02), assuming other explanatory variables are constant.

If distance increases by 1 km from the city center, the price of the apartment per m2 will on average be 20.67 euros lower (p-value < 0.01), assuming other explanatory variables are constant.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = mydata4)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033 < 0.0000000000000002 ***
## Age           -5.821      3.074  -1.894              0.06190 .  
## Distance     -20.279      2.886  -7.026       0.000000000666 ***
## ParkingYes   167.531     62.864   2.665              0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257              0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 0.000000000003018

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s write our hypotheses: H0: both models are equally good, and H1: one model is better than the other.

We can see that p = 0.03, meaning that it is smaller than 0.05 (our significance level). Thus, we can reject H0.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = mydata4)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033 < 0.0000000000000002 ***
## Age           -5.821      3.074  -1.894              0.06190 .  
## Distance     -20.279      2.886  -7.026       0.000000000666 ***
## ParkingYes   167.531     62.864   2.665              0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257              0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 0.000000000003018

Explanation for Parking: Given the age, distance, and balcony, apartments with parking are on average 167.531 euros more expensive per m2 compared to those without parking.

Explanation for Balcony: Given the age, distance, and parking, apartments with a balcony are on average 15.207 euros cheaper per m2 compared to those without a balcony.

Our hypotheses for F-statistic are: H0: our model does not explain any variability of the dependent variable, and H1: our model explains a proportion of the variability of the dependent variable.

We check the p-value and see that it is smaller than 0.01. Thus, we can reject H0.

Save fitted values and claculate the residual for apartment ID2.

mydata4$Fitted <- fitted.values(fit3)
mydata4$Residuals <- residuals(fit3)
head(mydata4[colnames(mydata4) %in% c("Price", "Fitted", "Residuals")])

## # A tibble: 6 × 3
##   Price Fitted Residuals
##   <dbl>  <dbl>     <dbl>
## 1  1640  1706.     -66.0
## 2  2800  2372.     428. 
## 3  1660  1721.     -61.2
## 4  1850  1563.     287. 
## 5  1640  2012.    -372. 
## 6  1770  1908.    -138.

According to the estimated regression function, the expected price for the second apartment would be around 2372 euros per m2. The actual price of the second apartment is, however, 2800 euros. The difference between the two prices (428 euros) is a residual.