TASK 1

This dataset contains information about football teams from different leagues and their performance statistics. The following variables are included:

Team: Name of the football team.

Tournament: The league the team participates in

Goals: The total number of goals scored

Shots pg: Average number of shots per game

yellow_cards: Total yellow cards received

red_cards: Total red cards received

Possession%: Average percentage of ball possession

Pass%: Passing accuracy percentage

AerialsWon: Average number of aerial duels won per game

Rating: The overall performance rating

Read csv

mydata <- read.csv("Football teams.csv")
library(knitr)
kable(head(mydata))
Team Tournament Goals Shots.pg yellow_cards red_cards Possession. Pass. AerialsWon Rating
Manchester City Premier League 83 15.8 46 2 60.8 89.4 12.8 7.01
Bayern Munich Bundesliga 99 17.1 44 3 58.1 85.5 12.9 6.95
Paris Saint-Germain Ligue 1 86 15.0 73 7 60.1 89.5 9.5 6.88
Barcelona LaLiga 85 15.3 68 2 62.4 89.7 10.6 6.87
Real Madrid LaLiga 67 14.4 57 2 57.7 87.7 11.8 6.86
Manchester United Premier League 73 13.8 64 1 54.5 84.8 14.5 6.85

Rename columns

colnames(mydata) <- c("Team", "League", "Goals", "Shots_pg", "Yellow_Cards", 
                      "Red_Cards", "Possession", "Pass_Accuracy", "Aerials_Won", "Rating")

New column “Discipline”

mydata$Discipline <- mydata$Yellow_Cards + mydata$Red_Cards

Exclude teams with fewer than 60 goals

mydata2 <- mydata[mydata$Goals >= 60, ]
library(knitr)
kable(head(mydata2))
Team League Goals Shots_pg Yellow_Cards Red_Cards Possession Pass_Accuracy Aerials_Won Rating Discipline
Manchester City Premier League 83 15.8 46 2 60.8 89.4 12.8 7.01 48
Bayern Munich Bundesliga 99 17.1 44 3 58.1 85.5 12.9 6.95 47
Paris Saint-Germain Ligue 1 86 15.0 73 7 60.1 89.5 9.5 6.88 80
Barcelona LaLiga 85 15.3 68 2 62.4 89.7 10.6 6.87 70
Real Madrid LaLiga 67 14.4 57 2 57.7 87.7 11.8 6.86 59
Manchester United Premier League 73 13.8 64 1 54.5 84.8 14.5 6.85 65

Exclude Laliga teams

mydata3 <- mydata2[c(-3,-4),]

Exclude Aerials won and Rating

mydata4 <- mydata3[,c(-9,-10)]
library(knitr)
kable(mydata4)
Team League Goals Shots_pg Yellow_Cards Red_Cards Possession Pass_Accuracy Discipline
1 Manchester City Premier League 83 15.8 46 2 60.8 89.4 48
2 Bayern Munich Bundesliga 99 17.1 44 3 58.1 85.5 47
5 Real Madrid LaLiga 67 14.4 57 2 57.7 87.7 59
6 Manchester United Premier League 73 13.8 64 1 54.5 84.8 65
7 Juventus Serie A 77 15.7 76 6 55.4 88.3 82
9 Borussia Dortmund Bundesliga 75 14.6 43 1 57.5 85.5 44
10 Atletico Madrid LaLiga 67 12.1 100 0 51.8 83.1 100
11 Atalanta Serie A 90 16.3 66 3 53.5 83.5 69
13 Liverpool Premier League 68 16.0 40 0 59.0 85.7 40
14 AC Milan Serie A 74 14.7 80 4 51.4 84.0 84
15 Lille Ligue 1 64 12.8 67 2 52.6 83.5 69
16 Tottenham Premier League 68 11.7 53 2 51.3 81.8 55
17 Napoli Serie A 86 17.0 71 3 54.1 87.0 74
18 Leicester Premier League 68 12.8 61 0 53.2 82.1 61
19 Wolfsburg Bundesliga 61 14.1 56 3 51.0 78.0 59
20 Inter Serie A 89 14.5 59 2 52.0 87.0 61
21 Lyon Ligue 1 81 16.1 60 10 53.6 84.7 70
22 RB Leipzig Bundesliga 60 16.0 57 0 57.3 83.2 57
23 Leeds Premier League 62 13.7 61 1 55.1 80.8 62
24 West Ham Premier League 62 12.3 48 3 44.5 77.8 51
27 Eintracht Frankfurt Bundesliga 69 13.2 80 1 52.4 79.6 81
28 Monaco Ligue 1 76 12.8 74 7 54.2 82.7 81
29 Roma Serie A 68 14.3 84 3 51.5 84.5 87
31 Borussia M.Gladbach Bundesliga 64 13.4 61 2 51.5 82.0 63
36 Sassuolo Serie A 64 13.9 74 4 58.2 87.8 78
38 Villarreal LaLiga 60 10.7 65 5 54.3 84.4 70
48 Montpellier Ligue 1 60 12.2 65 7 46.4 78.8 72
53 Lazio Serie A 61 13.8 100 5 52.2 83.8 105

Descriptive statistics

summary(mydata4[,c(-1,-2)])
##      Goals          Shots_pg      Yellow_Cards      Red_Cards     
##  Min.   :60.00   Min.   :10.70   Min.   : 40.00   Min.   : 0.000  
##  1st Qu.:63.50   1st Qu.:12.80   1st Qu.: 56.75   1st Qu.: 1.000  
##  Median :68.00   Median :14.00   Median : 62.50   Median : 2.500  
##  Mean   :71.29   Mean   :14.14   Mean   : 64.71   Mean   : 2.929  
##  3rd Qu.:76.25   3rd Qu.:15.72   3rd Qu.: 74.00   3rd Qu.: 4.000  
##  Max.   :99.00   Max.   :17.10   Max.   :100.00   Max.   :10.000  
##    Possession    Pass_Accuracy     Discipline    
##  Min.   :44.50   Min.   :77.80   Min.   : 40.00  
##  1st Qu.:51.73   1st Qu.:82.08   1st Qu.: 58.50  
##  Median :53.55   Median :83.90   Median : 67.00  
##  Mean   :53.75   Mean   :83.82   Mean   : 67.64  
##  3rd Qu.:55.88   3rd Qu.:85.55   3rd Qu.: 78.75  
##  Max.   :60.80   Max.   :89.40   Max.   :105.00

Goals; Mean = 71.29 On average, the football teams in the sample scored approximately 71.29 goals during the season.

Yellow cards; Median = 62.50 Half of the football teams in the sample received fewer than 62.5 yellow cards, while the other half received more than 62.5 yellow cards.

Red cards; Max = 10.00 The maximum number of red cards that a single club in this sample received in a season is 10.

Possession; 3rd Qu. = 55.88% 75% of the football teams in this sample had an average ball possession of 56.98% or lower, and the top 25% of teams had possession percentages higher than 55.88%.

Pass Accuracy; Min = 77.80% The minimum average pass accuracy by a single club in a season in this sample is equal to 77.80%.

Distribution of goals scored

hist(mydata4$Goals, 
     breaks = 5, 
     col = "green", 
     border = "black", 
     main = "Distribution of Goals Scored", 
     xlab = "Goals Scored", 
     ylab = "Frequency")

This histogram shows the distribution of goals scored by teams with at least 60 goals. Histogram is unimodal, it’s highest peak is between 60 and 70 golas scored, which means frequency of teams scoring between 60-70 goals is the highest. The histogram is skewed to the right.

Scatterplot of Shots per game vs Goals

plot(mydata4$Shots_pg, mydata4$Goals,
     main = "Scatterplot of Shots pg vs Goals",
     xlab = "Shots pg",
     ylab = "Goals Scored",
     col = "blue")     

Scatterplot illustrates the relationship between the team’s number of average shots per game and goals scored. We can see a positive corelation, suggesting higher the average number of shots per game, higher the number of goals scored.

Scatterplot of Possession percentage vs Yellow cards

plot(mydata4$Possession, mydata4$Yellow_Cards,
     main = "Scatterplot of Possession% vs Yellow Cards",
     xlab = "Possession%",
     ylab = "Yellow Cards",
     col = "red") 

Scatterplot illustrates the relationship between the team’s possession percentage and yellow cards received. We can see a negative corelation, suggesting that higher possession percentage leads to less yellow cards received.

TASK 2

Import and read the data

mydata <- readxl::read_excel("Business School.xlsx")
mydata2 <- as.data.frame(mydata)
head(mydata2)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2 Computer Science            70.2      68.7             Yes
## 3          3          Finance            76.4      83.3              No
## 4          4         Business            82.6      88.7              No
## 5          5          Finance            76.9      75.4              No
## 6          6 Computer Science            83.3      82.1              No
##   Employability (Before) Employability (After) Status Annual Salary
## 1                    252                   276 Placed        111000
## 2                    101                   119 Placed        107000
## 3                    401                   462 Placed        109000
## 4                    287                   342 Placed        148000
## 5                    275                   347 Placed        255500
## 6                    254                   313 Placed        103500

Distribution of Undergraduate degrees

library(ggplot2)
ggplot(mydata2, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "blue") +
  ylab("Frequency")

This is a histogram displaying the distribution of undergrad degrees among all 100 MBA students. The most common degree in the sample is the Undergraduate Degree in Business.

Descriptive statistics of the annual salary

summary(mydata2$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

Histogram of annual salary distribution in one hundred thousands

ggplot(mydata2, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 10000, color = "black") +
  ylab("Count")

The distribution is skewed to the right, which shows that there are only a few individuals with very high salaries. The histogram is unimodal, it’s only high is around 100000. The outliers on the right side of the distribution represent individuals with very high salaries that are significantly different from the majority.

T-test

t_test <- t.test(mydata2$`MBA Grade`,
                 mu = 74,
                 alternative = "two.sided")
t_test
## 
##  One Sample t-test
## 
## data:  mydata2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

I tested H0:𝜇MBA Grade=74, H1:𝜇MBA Grade≠74. We reject the null hypothesis (p=0,00915). That means that the true mean MBA grade differ from 74. Sample mean of the MBA grades is 76.04055, greater than 74.

Cohen’s d

library(effectsize)
cohens_d(mydata2$`MBA Grade`, mu = 74)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

A Cohen’s d of 0.27 indicates (with 95% confidence) a small effect size. The MBA grades in this generation deviate only slightly from the grades of last year’s generation.

TASK 3

Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_excel("Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata$Balcony <- factor(mydata$Balcony,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))
mydata$Parking <- factor(mydata$Parking,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))
head(mydata)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl> <fct>   <fct>  
## 1     7       28  1640 No      Yes    
## 2    18        1  2800 Yes     No     
## 3     7       28  1660 No      No     
## 4    28       29  1850 No      Yes    
## 5    18       18  1640 Yes     Yes    
## 6    28       12  1770 No      Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We reject the H0 (p=0.004731).The true mean is significantly different from 1900. With 95% certainty we can say that true mean of apartment prices per m2 lies in the interval from 1937.44 to 2100.44.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = mydata)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Regression coefficient: For every additional year, the price of an apartment per m2 decreases by 8,975€ on average (p=0,034).

Coefficient of determination: 5.3 % of the variability of the price per m2 is explained by the linear effect of age.

sqrt(summary(fit1)$r.squared)
## [1] 0.230255

Correlation coefficient: The linear correlation between price and age of an apartment is weak, because the value of correlation coefficient is equal to 0.23.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
## Loading required package: carData
scatterplotMatrix(mydata[, c("Price", "Age", "Distance")], smooth = FALSE)

The scatter between Age and Distance does not show any kind of relationship between the two variables, there seems to be no problem of multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Since vif < 5 and is close to 1 there is no sign of multicolinearity between Age and Distance.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata$Stdresid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2),3)
hist(mydata$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances")

Printing the 10 highest Cooks distances

head(mydata[order(-mydata$CooksD), "CooksD"], 10)
## # A tibble: 10 × 1
##    CooksD
##     <dbl>
##  1  0.32 
##  2  0.104
##  3  0.069
##  4  0.066
##  5  0.061
##  6  0.038
##  7  0.037
##  8  0.034
##  9  0.032
## 10  0.03

Removing high influence units

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Creating fit2 again without excluded units

fit2 <- lm(Price ~ Age + Distance, data = mydata)
mydata$StdFittedValues <- scale(fit2$fitted.values)
library(car)

Scatterplot between standarized residuals and standardized fitted values

scatterplot(y = mydata$Stdresid, x = mydata$StdFittedValues,
            ylab = "Standardized Residuals",
            xlab = "Standardized Fitted Values",
            boxplots = FALSE,
            regline = FALSE,
            smooth = FALSE)
## Warning in plot.window(...): "regline" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "regline" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter
## Warning in box(...): "regline" is not a graphical parameter
## Warning in title(...): "regline" is not a graphical parameter

The points in the scatterplot are more or less randomly distributed, which would indicate no heteroskedasticity. I cannot say that with utmost certainty.

Breuch-Pagan test

library("olsrr")
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

We cannot reject H0 (p=0.1873174), we assume homoskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Histogram of standardized residuals

hist(mydata$Stdresid,
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

The histogram is slightly right skewed.

Shapiro test

shapiro.test(mydata$Stdresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Stdresid
## W = 0.93418, p-value = 0.0004761

We can reject H0 (p=0.0004761). Standard residuals are not distributed normally.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

The coefficient of determination = 0.5361 Approximately 54% of the variability of the price per m2 is explained by the linear effect of age and distance.

Estimate of regression coefficient = -8.674 For every additional year of an appartment, the price per m2 decreases by 8.674 euros on average (p = 0.00869), assuming that distance is constant.

Estimate of regression coefficient = -24.063 If the distance from the city center is increased by 1 km, the price per m2 decreases on average by 24.063 (p < 0.001), assuming that age is constant.

We can reject H0 (p<0,001). That means that this model is a good model and it makes sense.

sqrt(summary(fit2)$r.squared)
## [1] 0.732187

Correlation coefficient = 0.73 The linear correlation between price, age and distance of an apartment is strong.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

fit3 model does not fit the data better than fit 2, because p=0.1135.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

On average, apartments with parking have a higher price per m2 of €128.70 than apartments without parking, when Age and Distance are constant (p = 0.0376).

We found no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not, because the P-value is greater than 0.05 (p=0.9165).

The F-statistic is used to test the overall significance of the regression model. H0 is all the regression coefficients (except the intercept) are equal to zero.

Save fitted values and claculate the residual for apartment ID2.

mydata$StdFittedValues <- fitted.values(fit3)
mydata$Stdresid <- residuals(fit3)
head(mydata[ , colnames(mydata) %in% c("ID", "Price", "StdFittedValues", "Stdresid")])
## # A tibble: 6 × 3
##   Price Stdresid StdFittedValues
##   <dbl>    <dbl>           <dbl>
## 1  1640    -88.6           1729.
## 2  2800    443.            2357.
## 3  1660    -62.6           1723.
## 4  1850    311.            1539.
## 5  1640   -349.            1989.
## 6  1770   -143.            1913.

Since based on the estimated regression function, we expected the price per m2 for this apartment to be 2356.597 euros, but the actual price was 2800 euros, the residual is 443.4 euros.