This dataset contains information about football teams from different leagues and their performance statistics. The following variables are included:
Team: Name of the football team.
Tournament: The league the team participates in
Goals: The total number of goals scored
Shots pg: Average number of shots per game
yellow_cards: Total yellow cards received
red_cards: Total red cards received
Possession%: Average percentage of ball possession
Pass%: Passing accuracy percentage
AerialsWon: Average number of aerial duels won per game
Rating: The overall performance rating
mydata <- read.csv("Football teams.csv")
library(knitr)
kable(head(mydata))
| Team | Tournament | Goals | Shots.pg | yellow_cards | red_cards | Possession. | Pass. | AerialsWon | Rating |
|---|---|---|---|---|---|---|---|---|---|
| Manchester City | Premier League | 83 | 15.8 | 46 | 2 | 60.8 | 89.4 | 12.8 | 7.01 |
| Bayern Munich | Bundesliga | 99 | 17.1 | 44 | 3 | 58.1 | 85.5 | 12.9 | 6.95 |
| Paris Saint-Germain | Ligue 1 | 86 | 15.0 | 73 | 7 | 60.1 | 89.5 | 9.5 | 6.88 |
| Barcelona | LaLiga | 85 | 15.3 | 68 | 2 | 62.4 | 89.7 | 10.6 | 6.87 |
| Real Madrid | LaLiga | 67 | 14.4 | 57 | 2 | 57.7 | 87.7 | 11.8 | 6.86 |
| Manchester United | Premier League | 73 | 13.8 | 64 | 1 | 54.5 | 84.8 | 14.5 | 6.85 |
colnames(mydata) <- c("Team", "League", "Goals", "Shots_pg", "Yellow_Cards",
"Red_Cards", "Possession", "Pass_Accuracy", "Aerials_Won", "Rating")
mydata$Discipline <- mydata$Yellow_Cards + mydata$Red_Cards
mydata2 <- mydata[mydata$Goals >= 60, ]
library(knitr)
kable(head(mydata2))
| Team | League | Goals | Shots_pg | Yellow_Cards | Red_Cards | Possession | Pass_Accuracy | Aerials_Won | Rating | Discipline |
|---|---|---|---|---|---|---|---|---|---|---|
| Manchester City | Premier League | 83 | 15.8 | 46 | 2 | 60.8 | 89.4 | 12.8 | 7.01 | 48 |
| Bayern Munich | Bundesliga | 99 | 17.1 | 44 | 3 | 58.1 | 85.5 | 12.9 | 6.95 | 47 |
| Paris Saint-Germain | Ligue 1 | 86 | 15.0 | 73 | 7 | 60.1 | 89.5 | 9.5 | 6.88 | 80 |
| Barcelona | LaLiga | 85 | 15.3 | 68 | 2 | 62.4 | 89.7 | 10.6 | 6.87 | 70 |
| Real Madrid | LaLiga | 67 | 14.4 | 57 | 2 | 57.7 | 87.7 | 11.8 | 6.86 | 59 |
| Manchester United | Premier League | 73 | 13.8 | 64 | 1 | 54.5 | 84.8 | 14.5 | 6.85 | 65 |
mydata3 <- mydata2[c(-3,-4),]
mydata4 <- mydata3[,c(-9,-10)]
library(knitr)
kable(mydata4)
| Team | League | Goals | Shots_pg | Yellow_Cards | Red_Cards | Possession | Pass_Accuracy | Discipline | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Manchester City | Premier League | 83 | 15.8 | 46 | 2 | 60.8 | 89.4 | 48 |
| 2 | Bayern Munich | Bundesliga | 99 | 17.1 | 44 | 3 | 58.1 | 85.5 | 47 |
| 5 | Real Madrid | LaLiga | 67 | 14.4 | 57 | 2 | 57.7 | 87.7 | 59 |
| 6 | Manchester United | Premier League | 73 | 13.8 | 64 | 1 | 54.5 | 84.8 | 65 |
| 7 | Juventus | Serie A | 77 | 15.7 | 76 | 6 | 55.4 | 88.3 | 82 |
| 9 | Borussia Dortmund | Bundesliga | 75 | 14.6 | 43 | 1 | 57.5 | 85.5 | 44 |
| 10 | Atletico Madrid | LaLiga | 67 | 12.1 | 100 | 0 | 51.8 | 83.1 | 100 |
| 11 | Atalanta | Serie A | 90 | 16.3 | 66 | 3 | 53.5 | 83.5 | 69 |
| 13 | Liverpool | Premier League | 68 | 16.0 | 40 | 0 | 59.0 | 85.7 | 40 |
| 14 | AC Milan | Serie A | 74 | 14.7 | 80 | 4 | 51.4 | 84.0 | 84 |
| 15 | Lille | Ligue 1 | 64 | 12.8 | 67 | 2 | 52.6 | 83.5 | 69 |
| 16 | Tottenham | Premier League | 68 | 11.7 | 53 | 2 | 51.3 | 81.8 | 55 |
| 17 | Napoli | Serie A | 86 | 17.0 | 71 | 3 | 54.1 | 87.0 | 74 |
| 18 | Leicester | Premier League | 68 | 12.8 | 61 | 0 | 53.2 | 82.1 | 61 |
| 19 | Wolfsburg | Bundesliga | 61 | 14.1 | 56 | 3 | 51.0 | 78.0 | 59 |
| 20 | Inter | Serie A | 89 | 14.5 | 59 | 2 | 52.0 | 87.0 | 61 |
| 21 | Lyon | Ligue 1 | 81 | 16.1 | 60 | 10 | 53.6 | 84.7 | 70 |
| 22 | RB Leipzig | Bundesliga | 60 | 16.0 | 57 | 0 | 57.3 | 83.2 | 57 |
| 23 | Leeds | Premier League | 62 | 13.7 | 61 | 1 | 55.1 | 80.8 | 62 |
| 24 | West Ham | Premier League | 62 | 12.3 | 48 | 3 | 44.5 | 77.8 | 51 |
| 27 | Eintracht Frankfurt | Bundesliga | 69 | 13.2 | 80 | 1 | 52.4 | 79.6 | 81 |
| 28 | Monaco | Ligue 1 | 76 | 12.8 | 74 | 7 | 54.2 | 82.7 | 81 |
| 29 | Roma | Serie A | 68 | 14.3 | 84 | 3 | 51.5 | 84.5 | 87 |
| 31 | Borussia M.Gladbach | Bundesliga | 64 | 13.4 | 61 | 2 | 51.5 | 82.0 | 63 |
| 36 | Sassuolo | Serie A | 64 | 13.9 | 74 | 4 | 58.2 | 87.8 | 78 |
| 38 | Villarreal | LaLiga | 60 | 10.7 | 65 | 5 | 54.3 | 84.4 | 70 |
| 48 | Montpellier | Ligue 1 | 60 | 12.2 | 65 | 7 | 46.4 | 78.8 | 72 |
| 53 | Lazio | Serie A | 61 | 13.8 | 100 | 5 | 52.2 | 83.8 | 105 |
summary(mydata4[,c(-1,-2)])
## Goals Shots_pg Yellow_Cards Red_Cards
## Min. :60.00 Min. :10.70 Min. : 40.00 Min. : 0.000
## 1st Qu.:63.50 1st Qu.:12.80 1st Qu.: 56.75 1st Qu.: 1.000
## Median :68.00 Median :14.00 Median : 62.50 Median : 2.500
## Mean :71.29 Mean :14.14 Mean : 64.71 Mean : 2.929
## 3rd Qu.:76.25 3rd Qu.:15.72 3rd Qu.: 74.00 3rd Qu.: 4.000
## Max. :99.00 Max. :17.10 Max. :100.00 Max. :10.000
## Possession Pass_Accuracy Discipline
## Min. :44.50 Min. :77.80 Min. : 40.00
## 1st Qu.:51.73 1st Qu.:82.08 1st Qu.: 58.50
## Median :53.55 Median :83.90 Median : 67.00
## Mean :53.75 Mean :83.82 Mean : 67.64
## 3rd Qu.:55.88 3rd Qu.:85.55 3rd Qu.: 78.75
## Max. :60.80 Max. :89.40 Max. :105.00
Goals; Mean = 71.29 On average, the football teams in the sample scored approximately 71.29 goals during the season.
Yellow cards; Median = 62.50 Half of the football teams in the sample received fewer than 62.5 yellow cards, while the other half received more than 62.5 yellow cards.
Red cards; Max = 10.00 The maximum number of red cards that a single club in this sample received in a season is 10.
Possession; 3rd Qu. = 55.88% 75% of the football teams in this sample had an average ball possession of 56.98% or lower, and the top 25% of teams had possession percentages higher than 55.88%.
Pass Accuracy; Min = 77.80% The minimum average pass accuracy by a single club in a season in this sample is equal to 77.80%.
hist(mydata4$Goals,
breaks = 5,
col = "green",
border = "black",
main = "Distribution of Goals Scored",
xlab = "Goals Scored",
ylab = "Frequency")
This histogram shows the distribution of goals scored by teams with at least 60 goals. Histogram is unimodal, it’s highest peak is between 60 and 70 golas scored, which means frequency of teams scoring between 60-70 goals is the highest. The histogram is skewed to the right.
plot(mydata4$Shots_pg, mydata4$Goals,
main = "Scatterplot of Shots pg vs Goals",
xlab = "Shots pg",
ylab = "Goals Scored",
col = "blue")
Scatterplot illustrates the relationship between the team’s number of average shots per game and goals scored. We can see a positive corelation, suggesting higher the average number of shots per game, higher the number of goals scored.
plot(mydata4$Possession, mydata4$Yellow_Cards,
main = "Scatterplot of Possession% vs Yellow Cards",
xlab = "Possession%",
ylab = "Yellow Cards",
col = "red")
Scatterplot illustrates the relationship between the team’s possession percentage and yellow cards received. We can see a negative corelation, suggesting that higher possession percentage leads to less yellow cards received.
mydata <- readxl::read_excel("Business School.xlsx")
mydata2 <- as.data.frame(mydata)
head(mydata2)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability (Before) Employability (After) Status Annual Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
library(ggplot2)
ggplot(mydata2, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "blue") +
ylab("Frequency")
This is a histogram displaying the distribution of undergrad degrees among all 100 MBA students. The most common degree in the sample is the Undergraduate Degree in Business.
summary(mydata2$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
ggplot(mydata2, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 10000, color = "black") +
ylab("Count")
The distribution is skewed to the right, which shows that there are only a few individuals with very high salaries. The histogram is unimodal, it’s only high is around 100000. The outliers on the right side of the distribution represent individuals with very high salaries that are significantly different from the majority.
t_test <- t.test(mydata2$`MBA Grade`,
mu = 74,
alternative = "two.sided")
t_test
##
## One Sample t-test
##
## data: mydata2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
I tested H0:𝜇MBA Grade=74, H1:𝜇MBA Grade≠74. We reject the null hypothesis (p=0,00915). That means that the true mean MBA grade differ from 74. Sample mean of the MBA grades is 76.04055, greater than 74.
library(effectsize)
cohens_d(mydata2$`MBA Grade`, mu = 74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
A Cohen’s d of 0.27 indicates (with 95% confidence) a small effect size. The MBA grades in this generation deviate only slightly from the grades of last year’s generation.
library(readxl)
mydata <- read_excel("Apartments.xlsx")
Description:
mydata$Balcony <- factor(mydata$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata$Parking <- factor(mydata$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
head(mydata)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We reject the H0 (p=0.004731).The true mean is significantly different from 1900. With 95% certainty we can say that true mean of apartment prices per m2 lies in the interval from 1937.44 to 2100.44.
fit1 <- lm(Price ~ Age, data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Regression coefficient: For every additional year, the price of an apartment per m2 decreases by 8,975€ on average (p=0,034).
Coefficient of determination: 5.3 % of the variability of the price per m2 is explained by the linear effect of age.
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
Correlation coefficient: The linear correlation between price and age of an apartment is weak, because the value of correlation coefficient is equal to 0.23.
library(car)
## Loading required package: carData
scatterplotMatrix(mydata[, c("Price", "Age", "Distance")], smooth = FALSE)
The scatter between Age and Distance does not show any kind of relationship between the two variables, there seems to be no problem of multicolinearity.
fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Since vif < 5 and is close to 1 there is no sign of multicolinearity between Age and Distance.
mydata$Stdresid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2),3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks Distances")
head(mydata[order(-mydata$CooksD), "CooksD"], 10)
## # A tibble: 10 × 1
## CooksD
## <dbl>
## 1 0.32
## 2 0.104
## 3 0.069
## 4 0.066
## 5 0.061
## 6 0.038
## 7 0.037
## 8 0.034
## 9 0.032
## 10 0.03
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))
fit2 <- lm(Price ~ Age + Distance, data = mydata)
mydata$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$Stdresid, x = mydata$StdFittedValues,
ylab = "Standardized Residuals",
xlab = "Standardized Fitted Values",
boxplots = FALSE,
regline = FALSE,
smooth = FALSE)
## Warning in plot.window(...): "regline" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "regline" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter
## Warning in box(...): "regline" is not a graphical parameter
## Warning in title(...): "regline" is not a graphical parameter
The points in the scatterplot are more or less randomly distributed, which would indicate no heteroskedasticity. I cannot say that with utmost certainty.
library("olsrr")
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
We cannot reject H0 (p=0.1873174), we assume homoskedasticity.
hist(mydata$Stdresid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
The histogram is slightly right skewed.
shapiro.test(mydata$Stdresid)
##
## Shapiro-Wilk normality test
##
## data: mydata$Stdresid
## W = 0.93418, p-value = 0.0004761
We can reject H0 (p=0.0004761). Standard residuals are not distributed normally.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
The coefficient of determination = 0.5361 Approximately 54% of the variability of the price per m2 is explained by the linear effect of age and distance.
Estimate of regression coefficient = -8.674 For every additional year of an appartment, the price per m2 decreases by 8.674 euros on average (p = 0.00869), assuming that distance is constant.
Estimate of regression coefficient = -24.063 If the distance from the city center is increased by 1 km, the price per m2 decreases on average by 24.063 (p < 0.001), assuming that age is constant.
We can reject H0 (p<0,001). That means that this model is a good model and it makes sense.
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
Correlation coefficient = 0.73 The linear correlation between price, age and distance of an apartment is strong.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
fit3 model does not fit the data better than fit 2, because p=0.1135.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
On average, apartments with parking have a higher price per m2 of €128.70 than apartments without parking, when Age and Distance are constant (p = 0.0376).
We found no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not, because the P-value is greater than 0.05 (p=0.9165).
The F-statistic is used to test the overall significance of the regression model. H0 is all the regression coefficients (except the intercept) are equal to zero.
mydata$StdFittedValues <- fitted.values(fit3)
mydata$Stdresid <- residuals(fit3)
head(mydata[ , colnames(mydata) %in% c("ID", "Price", "StdFittedValues", "Stdresid")])
## # A tibble: 6 × 3
## Price Stdresid StdFittedValues
## <dbl> <dbl> <dbl>
## 1 1640 -88.6 1729.
## 2 2800 443. 2357.
## 3 1660 -62.6 1723.
## 4 1850 311. 1539.
## 5 1640 -349. 1989.
## 6 1770 -143. 1913.
Since based on the estimated regression function, we expected the price per m2 for this apartment to be 2356.597 euros, but the actual price was 2800 euros, the residual is 443.4 euros.