Task 1

Explain the data set (the variables used in the analysis).

mydata <- force(airquality)
head(mydata)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

This data set measures the daily air quality in New York from May to September 1973.

Explanation of variables:

  • Mean ozone in parts per billion from 13:00 to 15:00 hours at Roosevelt Island
  • Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 08:00 to 12:00 hours at Central Park
  • Average wind speed in miles per hour at 07:00 and 10:00 hours at LaGuardia Airport
  • Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.)
  • Sampling month (1–12)
  • Sampling day (1–31)
summary(mydata)
##      Ozone           Solar.R           Wind             Temp           Month            Day      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0  
##  NA's   :37       NA's   :7

Perform some data manipulations.

mydata$Temp <- (mydata$Temp - 32)/(9/5)
Temp_round <- round(mydata$Temp)
mydata$Temp <- Temp_round

mydata$Wind <- (mydata$Wind * 1.6)
Wind_round <- round(mydata$Wind,1)
mydata$Wind <- Wind_round
head(mydata)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190 11.8   19     5   1
## 2    36     118 12.8   22     5   2
## 3    12     149 20.2   23     5   3
## 4    18     313 18.4   17     5   4
## 5    NA      NA 22.9   13     5   5
## 6    28      NA 23.8   19     5   6

We changed the unit of Temperature to degrees Celsius and Wind to kph for easier understanding.

library(tidyr)

mydata_clean <- drop_na(mydata)
head(mydata_clean)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190 11.8   19     5   1
## 2    36     118 12.8   22     5   2
## 3    12     149 20.2   23     5   3
## 4    18     313 18.4   17     5   4
## 5    23     299 13.8   18     5   7
## 6    19      99 22.1   15     5   8

We removed NA from the sample.

mydata_clean$WindChill <- round((10 * sqrt(mydata_clean$Wind) - mydata_clean$Wind + 10.5) * (33 - mydata_clean$Temp), 2)

We makd a new variable based on sampled data.

mydata_clean$MonthFactor <- factor(mydata_clean$Month, 
                              levels = c(5, 6, 7, 8, 9), 
                              labels = c("May", "June", "July", "August", "September"))

We can change the Month variables into factorial.

mydata2 <- mydata_clean[,c(1,2,3,4, 7, 6, 5, 8)]
head(mydata2)
##   Ozone Solar.R Wind Temp WindChill Day Month MonthFactor
## 1    41     190 11.8   19    462.72   1     5         May
## 2    36     118 12.8   22    368.25   2     5         May
## 3    12     149 20.2   23    352.44   3     5         May
## 4    18     313 18.4   17    559.92   4     5         May
## 5    23     299 13.8   18    507.73   7     5         May
## 6    19      99 22.1   15    637.39   8     5         May

Present the descriptive statistics for the selected variables and explain at least 3 sample statistics (mean, median, etc.).

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
round(stat.desc(mydata2[ , c(-6, -7, -8)]), 2)
##                Ozone  Solar.R    Wind    Temp WindChill
## nbr.val       111.00   111.00  111.00  111.00    111.00
## nbr.null        0.00     0.00    0.00    0.00      4.00
## nbr.na          0.00     0.00    0.00    0.00      0.00
## min             1.00     7.00    3.70   14.00   -103.11
## max           168.00   334.00   33.10   36.00    671.36
## range         167.00   327.00   29.40   22.00    774.47
## sum          4673.00 20513.00 1764.90 2821.00  28989.97
## median         31.00   207.00   15.50   26.00    225.26
## mean           42.10   184.80   15.90   25.41    261.17
## SE.mean         3.16     8.65    0.54    0.50     17.55
## CI.mean.0.95    6.26    17.15    1.07    1.00     34.78
## var          1107.29  8308.74   32.41   28.04  34187.75
## std.dev        33.28    91.15    5.69    5.30    184.90
## coef.var        0.79     0.49    0.36    0.21      0.71

The mean of the ozone level is between 35.84 and 48.36 with a 95% significance level.

The temperature when the sample was taken fluctuated between 14°C and 36°C.

The mean temperature is 25.41°C and the median temperature is 26°C. Since they are not much different we can presume normal distribution of samples.

For the 50% of the time Wind Chill was below 225.26.

The maximum difference between two levels of Wind Chill is 774.47.

Ozone level had the biggest variability of the tested variables.

Graph the distribution of the variables using histograms, scatterplots, and/or boxplots. Explain the results.

library(ggplot2)
ggplot(mydata, aes(y=Temp, x=seq(from = 1, to = 153))) +
  geom_line() +
  ylab("Temperature") +
  xlab("Day of study")

The graph shows Temperature fluctuations during the days of study.

library(ggplot2)
ggplot(mydata2, aes(x=Ozone)) +
  geom_histogram(binwidth = 5, color="grey", fill="orange") +
  ylab("Frequency")

We can see that Ozone level is asymmetrical to the right. The higher Ozone level is less frequent. We can spot potential outliers.

library(car)
## Loading required package: carData
scatterplot(y = mydata2$Temp, 
            x = mydata2$Wind, 
            ylab = "Temperature [°C]", 
            xlab = "Wind speed [kph]", 
            smooth = FALSE)

We can see a negative correlation between Wind speed and Temperature.

The median Temperature is 26°C. The third quartile is °29°C.

We have some potential outliers for Wind speed on the boxplot. If we look at them in the scatterplot they aren’t problematic.

Task 2

library(readxl)
mydata <- read_xlsx("./Take home exam/Task 2/Business School.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before) Employability (After)
## 1          1         Business            68.4      90.2              No                    252                   276
## 2          2 Computer Science            70.2      68.7             Yes                    101                   119
## 3          3          Finance            76.4      83.3              No                    401                   462
## 4          4         Business            82.6      88.7              No                    287                   342
## 5          5          Finance            76.9      75.4              No                    275                   347
## 6          6 Computer Science            83.3      82.1              No                    254                   313
##   Status Annual Salary
## 1 Placed        111000
## 2 Placed        107000
## 3 Placed        109000
## 4 Placed        148000
## 5 Placed        255500
## 6 Placed        103500

Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common?

library(ggplot2)
ggplot(mydata, aes(x=`Undergrad Degree`)) +
  geom_bar(fill = "chocolate") +
  ylab("Frequency")

The most common undergrad degree is Business.

Show the descriptive statistics of the Annual Salary and its distribution with the histogram (use the ggplot function). Describe the distribution.

library(pastecs)
round(stat.desc(mydata$`Annual Salary`), 0)
##      nbr.val     nbr.null       nbr.na          min          max        range          sum       median         mean 
##          100            0            0        20000       340000       320000     10905800       103500       109058 
##      SE.mean CI.mean.0.95          var      std.dev     coef.var 
##         4150         8235   1722373475        41501            0
library(ggplot2)
ggplot(mydata, aes(x=`Annual Salary`)) +
  geom_histogram(binwidth = 10000, color="grey", fill="forestgreen") +
  ylab("Frequency")

In the graph we can see the distribution of Annual Salary. It is slightly asymmetrical to the right, so it has positive skewness. We can spot some outliers.

Test the following hypothesis: 𝐻0: μMBA Grade = 74. Explain the result and interpret the effect size.

t.test(mydata$`MBA Grade`,
       mu = 74,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

We can reject the null hypothesis (p = 0,009). The average grade is between 74.52 and 77.56 with a 95% confidence interval.

library(effectsize)

effectsize::cohens_d(mydata$`MBA Grade`,
                     mu = 74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.
effectsize::interpret_cohens_d(0.27,
                               rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)

With the usage of the Sawilowsky (2009) rule we can state that the difference in the grade is small.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("./Take home exam/Task 3/Apartments.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata$ParkingF <- factor(mydata$Parking,
                         levels = c (0, 1),
                         labels = c ("No", "Yes"))

mydata$BalconyF <- factor(mydata$Balcony,
                         levels = c (0, 1),
                         labels = c ("No", "Yes"))

head(mydata)
##   Age Distance Price Parking Balcony ParkingF BalconyF
## 1   7       28  1640       0       1       No      Yes
## 2  18        1  2800       1       0      Yes       No
## 3   7       28  1660       0       0       No       No
## 4  28       29  1850       0       1       No      Yes
## 5  18       18  1640       1       1      Yes      Yes
## 6  28       12  1770       0       1       No      Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can reject the null hypothesis (p = 0,005). The average apartment price is between 1937€ and 2101€ with a 95% confidence interval.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
          data = mydata)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255

If the age of an apartment increases by 1 year, on average the price decreases by 8.975€ per m2 (p = 0,034).

5.3% of variability in price of apartments is affected by linear effect of age.

Linear relationship between price and age is weak.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
scatterplotMatrix(mydata[ , c(3, 1, 2)],
                  smooth = FALSE) 

There is not a potential problem with multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
          data = mydata)

Check the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

With VIF we check the multicolinearity, how the explanatory variables are connected. VIF ranges from 1 to 5 as the cut-off value. All values in this case are close to 1 so we can continue.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

We don’t have any outliers. All standardised residuals are between -3 and 3.

We have some units with high influence. Their value is substantially greater than others. We need to remove them.

head(mydata[order(-mydata$CooksD), ])
##    Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## 38   5       45  2180       1       1      Yes      Yes    2.577  0.320
## 55  43       37  1740       0       0       No       No    1.445  0.104
## 33   2       11  2790       1       0      Yes       No    2.051  0.069
## 53   7        2  1760       0       1       No      Yes   -2.152  0.066
## 22  37        3  2540       1       1      Yes      Yes    1.576  0.061
## 39  40        2  2400       0       1       No      Yes    1.091  0.038
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!CooksD > 0.07)

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance,
          data = mydata)
mydata$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

It seems that there can be signs of heteroskedacity. The variance isn’t constant.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

We cannot reject the null hypothesis (p = 0,052). The variance is constant so there is no heteroskedaticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

mydata$StdResid <- round(rstandard(fit2), 3)

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95952, p-value = 0.01044

We reject the null hypothesis (p = 0,01). The distribution of standard residuals is not normal. This doesn’t affect our test because we have a sample bigger than 30 (n = 83), because of the standard limit theorem.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance,
          data = mydata)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

If the age of the apartments increases by 1 year, on average the price of the apartment decreases by 7.85€ per m2 (p = 0.018) assuming constant distance from city center.

If the distance from city center increases by 1 km, on average the price of the apartment decreases by 23.945€ per m2 (p < 0.001) assuming age is constant.

49.68% of the variability of price of apartments is affected by linear effect of age and distance from city center.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
          data = mydata)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis (p = 0.028). The model fit3 fits the data better.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2358.900     93.664  25.185  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFYes  168.921     62.166   2.717  0.00811 ** 
## BalconyFYes   -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

Given the age and distance from the city center of apartments, apartments with parking have on average 168.921€ higher price per m2 (p = 0.008) than apartments without parking.

We cannot say that balcony would effect the expected price per m2.

F-statistics is the test of Significance of regression. The hypothesis is:

  • H0: ρ² = 0 Regression coefficients are equal to zero.
  • H1: ρ² > 0 At least one regression coefficient is statistically significant.

Save fitted values and claculate the residual for apartment ID2.

mydata$Fitted <- fitted.values(fit3) 
mydata$Residuals <- residuals(fit3)

head(mydata$Residuals[2])
## [1] 422.9572

The residual for apartment ID2 is 422.96€.