Task 1

mydata <- read.table("~/fifa – kopija/fifa_players.csv", skip=1,
                     header= TRUE,
                     sep= "," ,
                     dec= ".")
head(mydata)

##                        full_name age height_cm weight_kgs positions nationality
## 1 Lionel Andrés Messi Cuccittini  31    170.18       72.1  CF,RW,ST   Argentina
## 2   Christian  Dannemann Eriksen  27    154.94       76.2 CAM,RM,CM     Denmark
## 3                     Paul Pogba  25    190.50       83.9    CM,CAM      France
## 4                Lorenzo Insigne  27    162.56       59.0     LW,ST       Italy
## 5              Kalidou Koulibaly  27    187.96       88.9        CB     Senegal
## 6                Virgil van Dijk  27    193.04       92.1        CB Netherlands
##   overall_rating potential value_euro wage_euro
## 1             94        94  110500000    565000
## 2             88        89   69500000    205000
## 3             88        91   73000000    255000
## 4             88        88   62000000    165000
## 5             88        91   60000000    135000
## 6             88        90   59500000    215000

Explanation of variables:

full_name: The name of the football player.
age: The player’s age in years.
height_cm: The player’s height in centimeters.
weight_kgs: The player’s weight in kilograms.
positions: The playing positions of the player (e.g., CF, RW, CAM, etc.).
nationality: The nationality of the player.
overall_rating: The overall rating of the player in the game (on a scale of 0–100).
potential: The player’s potential rating (on a scale of 0–100).
value_euro: The estimated market value of the player in euros.
wage_euro: The player’s weekly wage in euros.

b)

Here I created new variable called Rating_Difference, which is absolute difference between potential of a player and his overall rating.

mydata$Rating_Difference <- mydata$potential - mydata$overall_rating

head(mydata)

##                        full_name age height_cm weight_kgs positions nationality
## 1 Lionel Andrés Messi Cuccittini  31    170.18       72.1  CF,RW,ST   Argentina
## 2   Christian  Dannemann Eriksen  27    154.94       76.2 CAM,RM,CM     Denmark
## 3                     Paul Pogba  25    190.50       83.9    CM,CAM      France
## 4                Lorenzo Insigne  27    162.56       59.0     LW,ST       Italy
## 5              Kalidou Koulibaly  27    187.96       88.9        CB     Senegal
## 6                Virgil van Dijk  27    193.04       92.1        CB Netherlands
##   overall_rating potential value_euro wage_euro Rating_Difference
## 1             94        94  110500000    565000                 0
## 2             88        89   69500000    205000                 1
## 3             88        91   73000000    255000                 3
## 4             88        88   62000000    165000                 0
## 5             88        91   60000000    135000                 3
## 6             88        90   59500000    215000                 2

Then I renamed all of the columns, so they start with a capital letter and it looks better.

colnames(mydata) [1] <- "Player_Name"
colnames(mydata) [2] <- "Age"
colnames(mydata) [3] <- "Height_cm"
colnames(mydata) [4] <- "Weight_kgs"
colnames(mydata) [5] <- "Playing_position"
colnames(mydata) [6] <- "Nationality"
colnames(mydata) [7] <- "Overall_rating"
colnames(mydata) [8] <- "Potential"
colnames(mydata) [9] <- "Market_value_euro"
colnames(mydata) [10] <- "Wage_euro"
colnames(mydata) [11] <- "Rating_difference"

head(mydata)

##                      Player_Name Age Height_cm Weight_kgs Playing_position
## 1 Lionel Andrés Messi Cuccittini  31    170.18       72.1         CF,RW,ST
## 2   Christian  Dannemann Eriksen  27    154.94       76.2        CAM,RM,CM
## 3                     Paul Pogba  25    190.50       83.9           CM,CAM
## 4                Lorenzo Insigne  27    162.56       59.0            LW,ST
## 5              Kalidou Koulibaly  27    187.96       88.9               CB
## 6                Virgil van Dijk  27    193.04       92.1               CB
##   Nationality Overall_rating Potential Market_value_euro Wage_euro
## 1   Argentina             94        94         110500000    565000
## 2     Denmark             88        89          69500000    205000
## 3      France             88        91          73000000    255000
## 4       Italy             88        88          62000000    165000
## 5     Senegal             88        91          60000000    135000
## 6 Netherlands             88        90          59500000    215000
##   Rating_difference
## 1                 0
## 2                 1
## 3                 3
## 4                 0
## 5                 3
## 6                 2

Then I filtered out players older than the age of 30.

older_players <- mydata[mydata$Age > 30, ]

head(older_players)

##                       Player_Name Age Height_cm Weight_kgs Playing_position
## 1  Lionel Andrés Messi Cuccittini  31    170.18       72.1         CF,RW,ST
## 9                    Manuel Neuer  32    193.04       92.1               GK
## 10   Edinson Roberto Cavani Gómez  32    185.42       77.1               ST
## 18             Fernando Luiz Rosa  33    152.40       67.1              CDM
## 19        Gonzalo Gerardo Higuaín  31    185.42       88.9               ST
## 21                 Jan Vertonghen  31    187.96       86.2               CB
##    Nationality Overall_rating Potential Market_value_euro Wage_euro
## 1    Argentina             94        94         110500000    565000
## 9      Germany             89        89          38000000    130000
## 10     Uruguay             89        89          60000000    200000
## 18      Brazil             87        87          20500000    200000
## 19   Argentina             87        87          48500000    205000
## 21     Belgium             87        87          34000000    155000
##    Rating_difference
## 1                  0
## 9                  0
## 10                 0
## 18                 0
## 19                 0
## 21                 0

Then I decreased my database from 17954 to 50 observations

set.seed(123)
majdata <-mydata[1:50, ]

head(majdata)

##                      Player_Name Age Height_cm Weight_kgs Playing_position
## 1 Lionel Andrés Messi Cuccittini  31    170.18       72.1         CF,RW,ST
## 2   Christian  Dannemann Eriksen  27    154.94       76.2        CAM,RM,CM
## 3                     Paul Pogba  25    190.50       83.9           CM,CAM
## 4                Lorenzo Insigne  27    162.56       59.0            LW,ST
## 5              Kalidou Koulibaly  27    187.96       88.9               CB
## 6                Virgil van Dijk  27    193.04       92.1               CB
##   Nationality Overall_rating Potential Market_value_euro Wage_euro
## 1   Argentina             94        94         110500000    565000
## 2     Denmark             88        89          69500000    205000
## 3      France             88        91          73000000    255000
## 4       Italy             88        88          62000000    165000
## 5     Senegal             88        91          60000000    135000
## 6 Netherlands             88        90          59500000    215000
##   Rating_difference
## 1                 0
## 2                 1
## 3                 3
## 4                 0
## 5                 3
## 6                 2

c)

Here I displayed descriptive statistics for selected variables “Age”, “Height_cm” and “Weight_kgs”.

summary(majdata[, c("Age", "Height_cm", "Weight_kgs")])

##       Age          Height_cm       Weight_kgs   
##  Min.   :20.00   Min.   :152.4   Min.   :59.00  
##  1st Qu.:25.00   1st Qu.:164.5   1st Qu.:71.20  
##  Median :27.00   Median :175.3   Median :76.20  
##  Mean   :27.18   Mean   :175.5   Mean   :77.15  
##  3rd Qu.:30.00   3rd Qu.:187.3   3rd Qu.:83.22  
##  Max.   :37.00   Max.   :198.1   Max.   :97.10

Mean:

27.18 ~ explains that the average age of the player is 27.18 years.

175.5 ~ explains that the average height of the player is 175.5 cm.

77.15 ~ explains that the average weight of the player is 77.15 kg.

Median:

27.00 ~ half of the players are younger than 27 years age and half of the players are older than 27 years.

175.3 ~ half of the players are shorter than 175.3 cm and half of the players are taller than 175.3 cm.

76.20 ~ half of the players are lighter than 76.20 kg and half of the players are heavier than 76.20 kg.

Min:

20.00 ~ the lowest age of the player is 20 years.

152.4 ~ the lowest height of the player is 152.4 cm.

59.00 ~ the lowest weight of the player is 59 kg.

d)

hist(majdata$Overall_rating, 
     breaks = 5,
     col = "green",
     border = "black",
     main = "Distribution of Rating of the Players",
     xlab = "Player rating",
     ylab = "Frequency")

Here I displayed a histogram that shows the distribution of the variable Overall_rating. We can sense the bimodal distribution because of two peaks. This suggests that there are are two ranges (82- 84 and 86- 88) of player ratings that occur more frequently than others in the dataset. I also sense some potential outliers on the right side.

boxplot(majdata$Market_value_euro/1000000,
        main = "Boxplot of Players market value", 
        ylab = "Market value", 
        col = "lightgreen", 
        border = "black")

Here i displayed Boxplot of variable “Market_value_euro”. First of all I divided all market values to get values in millions, because it makes more sense for the graph. The boxplot reveals that market values have some outliers, which is indicated by individual points above upper whisker, confirming also the high values in histogram.

plot(majdata$Age, majdata$Overall_rating,
     main = "Scatterplot of Player rating vs Age",
     xlab = "Age",
     ylab = "Player rating",
     col = "blue")

The scatterplot shows that there is no clear linear trend between age and player rating. It helps visualize whether older players have higher raings in general, but from this visualization, it appears that age alone is not a strong predictor of player rating.

Task 2

library(ggplot2)
library(readxl)
mydata <- read_excel("~/R Take Home Exam 2024/Task 2/Business School.xlsx")
head(mydata)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

ggplot(mydata, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "green", color = "black") +
  ylab("Frequency") +
  xlab("Undergrad Degree") +
  ggtitle("Distribution of Undergrad Degrees")

Above I displayed a histogram showing the distribution of undergrad degrees among all 100 MBA students. We can see that the least common one is Art and the most common one is Business degree.

summary(mydata$`Annual Salary`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

Descriptive statistics of the variable Annual Salary are supplied by function summary.

mydata$`Annual Salary1000` <- mydata$`Annual Salary` / 1000
head(mydata)

## # A tibble: 6 × 10
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 6 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>,
## #   `Annual Salary1000` <dbl>

I created a new variable called ” Annual Salary2” by dividing an annual salary by 1000 to get a smaller value for annual salary, so the histogram plays out nicely.

ggplot(mydata, aes(x = `Annual Salary1000`)) +
  geom_histogram(binwidth = 50, fill = "blue", color = "black", alpha = 0.7) +
  xlab("Annual Salary") +
  ylab("Frequency") +
  ggtitle("Distribution of Annual Salary")

The histogram is unimodal, which means it has only one peak, which lies around 100 (Annual salary in thousands). Since most of the distribution is on left side, I can conclude that it is right skewed. We can also see some outliers on the right side. This means that there are not many values that are way higher than the other values.

t_test_result <- t.test(mydata$`MBA Grade`, mu = 74, alternative = "two.sided")
print(t_test_result)

## 
##  One Sample t-test
## 
## data:  mydata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

With the function t.test I tested null hypothesis (H0:𝜇MBA Grade=74). The p-value of 0.00915 is significantly lower than 0.05, which means I can accept H1:𝜇MBA Grade≠74. That means that the true mean MBA grade differs from 74, our sample mean is also greater than 74 (76.04055>74)

library(effectsize)
cohens_d(mydata$`MBA Grade`, mu= 74)

## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

The value of Cohen’s is 0.27, which indicates that there is a small effect size (with 95 % confidence). There is a modest difference between this year’s generation and last year’s generation.

Task 3

library(readxl)
Apartmani <- read_excel("~/R Take Home Exam 2024/Task 3/Apartments.xlsx")
head(Apartmani)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Apartmani$Balcony <- factor(Apartmani$Balcony,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))
Apartmani$Parking <- factor(Apartmani$Parking,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

I changed categorical variables into factors.

t.test(Apartmani$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Apartmani$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The P value that we got with T-Test is significantly lower than 0.05, so we can reject the null hypothesis, therefore we can say that the true mean differs from 1900. With 95 % confidence I can say that true mean of apartment prices per m2 lies in the interval from 1937.443 to 2100.440.

fit1 <- lm(Price ~ Age, data = Apartmani)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartmani)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

sqrt(summary(fit1)$r.squared)

## [1] 0.230255

Estimate of regression coefficient= -8.975

If the age of an apartment increases by 1 year, the price per m2 decreases by 8.975 euros on average (p=0.034).

Corelation coefficent = 0.230355

The linear correlation between age and price of an apartment is WEAK and POSITIVE based on the value of correlation coefficient (0.23).

Coefficient of determination = 0.05302

This coefficient indicates the proportion of the total variability of the dependent variable that can be explained by the linear effect of all explanatory variables. Therefore only 5.3 % of the variability of the price per m2 is explained by the linear effect of age.

library(car)

## Loading required package: carData

scatterplotMatrix(Apartmani[c("Price", "Age", "Distance")], smooth = FALSE)

fit2 <- lm(Price ~ Age + Distance, data = Apartmani)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartmani)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

Both VIF values are close to 1, which indicates no multicollinearity between the two independent variables. They are not correlated.

Apartmani$StdResid <- round(rstandard(fit2), 3)

Here I calculated all standardized residuals and stored them as StdResid. There are no outliers, since no absolute value of standard residual is over 3.

Apartmani$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartmani$CooksD,
     xlab = "Cooks distance",
     ylab = "Freequency",
     main = "Histogram of Cooks distances")

I made a histogram displaying distribution of Cooks distances. From the histogram we can see that we have outliers. We can see that by a gap between a gap between 0.15 to 0.30.

head(Apartmani[order(-Apartmani$CooksD), "CooksD"], 10)

## # A tibble: 10 × 1
##    CooksD
##     <dbl>
##  1  0.32 
##  2  0.104
##  3  0.069
##  4  0.066
##  5  0.061
##  6  0.038
##  7  0.037
##  8  0.034
##  9  0.032
## 10  0.03

First 5 values are significantly greater than the other ones, therefore we can conclude they are the outliers.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Apartmani<- Apartmani %>%
  filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))

fit2 <- lm(Price ~ Age + Distance, data = Apartmani)
Apartmani$StdFittedValues <- scale(fit2$fitted.values)

I created the fit2 without potentially excluded units and show the summary of the model here.

library(car)
scatterplot(y=Apartmani$StdResid, x=Apartmani$StdFittedValues,
            ylab = "Standardized residuals", 
            xlab = "Standardized fitted values", 
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE)

I would say that the points in the scatterplot are randomly distributed, so we can conclude that there is no heteroskedasticity.

library(ggplot2)
ggplot(Apartmani, aes(StdResid))+
  geom_histogram(fill = "blue", color = "black", alpha = 0.7) +
  xlab("Standard Residuals") +
  ylab("Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I created a histogram of standard residulas with ggplot.

shapiro.test(Apartmani$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartmani$StdResid
## W = 0.93418, p-value = 0.0004761

P-value<0.05; Errors are not normally distributed.

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartmani)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

sqrt(summary(fit2)$r.squared)

## [1] 0.732187

The linear correlation is strong between price, age, distance of apartment. (0.73)

54% of variability in price is described by the linear effect of age and distance.

If the age of an apartment increases by 1 year, the price per m2 decreases by 8.67 euros on average, assuming the distance is constant. (p=0.009)

If the distance from the city center is increased by 1 km, the price per m2 decreases on average by 24 euros, assuming that age is constant. (p<0.0001)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartmani)

I estimated the linear regression function Price = f(Age, Distance, Parking and Balcony).

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Anova tells is which regression function is better. P value is greater than 0.05, which means that fit3 is a worse model than fit 3.

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartmani)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Explanation:

Given the values of the other variables, apartments with a parking space have an average price per m2 higher by 129 euros compared to the apartments without a parking space (p = 0.00251).

There is no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not. (p=0.917).

F-statistic is testing if Rsquared is equal to 0.

Apartmani$StdFittedValues <- fitted.values(fit3)
Apartmani$StdResid <- residuals(fit3)
head(Apartmani[ , colnames(Apartmani) %in% c("ID", "Price", "StdFittedValues", "StdResid")])

## # A tibble: 6 × 3
##   Price StdResid StdFittedValues
##   <dbl>    <dbl>           <dbl>
## 1  1640    -88.6           1729.
## 2  2800    443.            2357.
## 3  1660    -62.6           1723.
## 4  1850    311.            1539.
## 5  1640   -349.            1989.
## 6  1770   -143.            1913.

Standard residual of apartment ID2 is 443.

Take home exam

Mihael Maj Batagelj

2024-09-23

Task 1

b)

c)

d)

Task 2

Task 3