mydata <- read.table("~/fifa – kopija/fifa_players.csv", skip=1,
header= TRUE,
sep= "," ,
dec= ".")
head(mydata)
## full_name age height_cm weight_kgs positions nationality
## 1 Lionel Andrés Messi Cuccittini 31 170.18 72.1 CF,RW,ST Argentina
## 2 Christian Dannemann Eriksen 27 154.94 76.2 CAM,RM,CM Denmark
## 3 Paul Pogba 25 190.50 83.9 CM,CAM France
## 4 Lorenzo Insigne 27 162.56 59.0 LW,ST Italy
## 5 Kalidou Koulibaly 27 187.96 88.9 CB Senegal
## 6 Virgil van Dijk 27 193.04 92.1 CB Netherlands
## overall_rating potential value_euro wage_euro
## 1 94 94 110500000 565000
## 2 88 89 69500000 205000
## 3 88 91 73000000 255000
## 4 88 88 62000000 165000
## 5 88 91 60000000 135000
## 6 88 90 59500000 215000
Explanation of variables:
Here I created new variable called Rating_Difference, which is absolute difference between potential of a player and his overall rating.
mydata$Rating_Difference <- mydata$potential - mydata$overall_rating
head(mydata)
## full_name age height_cm weight_kgs positions nationality
## 1 Lionel Andrés Messi Cuccittini 31 170.18 72.1 CF,RW,ST Argentina
## 2 Christian Dannemann Eriksen 27 154.94 76.2 CAM,RM,CM Denmark
## 3 Paul Pogba 25 190.50 83.9 CM,CAM France
## 4 Lorenzo Insigne 27 162.56 59.0 LW,ST Italy
## 5 Kalidou Koulibaly 27 187.96 88.9 CB Senegal
## 6 Virgil van Dijk 27 193.04 92.1 CB Netherlands
## overall_rating potential value_euro wage_euro Rating_Difference
## 1 94 94 110500000 565000 0
## 2 88 89 69500000 205000 1
## 3 88 91 73000000 255000 3
## 4 88 88 62000000 165000 0
## 5 88 91 60000000 135000 3
## 6 88 90 59500000 215000 2
Then I renamed all of the columns, so they start with a capital letter and it looks better.
colnames(mydata) [1] <- "Player_Name"
colnames(mydata) [2] <- "Age"
colnames(mydata) [3] <- "Height_cm"
colnames(mydata) [4] <- "Weight_kgs"
colnames(mydata) [5] <- "Playing_position"
colnames(mydata) [6] <- "Nationality"
colnames(mydata) [7] <- "Overall_rating"
colnames(mydata) [8] <- "Potential"
colnames(mydata) [9] <- "Market_value_euro"
colnames(mydata) [10] <- "Wage_euro"
colnames(mydata) [11] <- "Rating_difference"
head(mydata)
## Player_Name Age Height_cm Weight_kgs Playing_position
## 1 Lionel Andrés Messi Cuccittini 31 170.18 72.1 CF,RW,ST
## 2 Christian Dannemann Eriksen 27 154.94 76.2 CAM,RM,CM
## 3 Paul Pogba 25 190.50 83.9 CM,CAM
## 4 Lorenzo Insigne 27 162.56 59.0 LW,ST
## 5 Kalidou Koulibaly 27 187.96 88.9 CB
## 6 Virgil van Dijk 27 193.04 92.1 CB
## Nationality Overall_rating Potential Market_value_euro Wage_euro
## 1 Argentina 94 94 110500000 565000
## 2 Denmark 88 89 69500000 205000
## 3 France 88 91 73000000 255000
## 4 Italy 88 88 62000000 165000
## 5 Senegal 88 91 60000000 135000
## 6 Netherlands 88 90 59500000 215000
## Rating_difference
## 1 0
## 2 1
## 3 3
## 4 0
## 5 3
## 6 2
Then I filtered out players older than the age of 30.
older_players <- mydata[mydata$Age > 30, ]
head(older_players)
## Player_Name Age Height_cm Weight_kgs Playing_position
## 1 Lionel Andrés Messi Cuccittini 31 170.18 72.1 CF,RW,ST
## 9 Manuel Neuer 32 193.04 92.1 GK
## 10 Edinson Roberto Cavani Gómez 32 185.42 77.1 ST
## 18 Fernando Luiz Rosa 33 152.40 67.1 CDM
## 19 Gonzalo Gerardo Higuaín 31 185.42 88.9 ST
## 21 Jan Vertonghen 31 187.96 86.2 CB
## Nationality Overall_rating Potential Market_value_euro Wage_euro
## 1 Argentina 94 94 110500000 565000
## 9 Germany 89 89 38000000 130000
## 10 Uruguay 89 89 60000000 200000
## 18 Brazil 87 87 20500000 200000
## 19 Argentina 87 87 48500000 205000
## 21 Belgium 87 87 34000000 155000
## Rating_difference
## 1 0
## 9 0
## 10 0
## 18 0
## 19 0
## 21 0
Then I decreased my database from 17954 to 50 observations
set.seed(123)
majdata <-mydata[1:50, ]
head(majdata)
## Player_Name Age Height_cm Weight_kgs Playing_position
## 1 Lionel Andrés Messi Cuccittini 31 170.18 72.1 CF,RW,ST
## 2 Christian Dannemann Eriksen 27 154.94 76.2 CAM,RM,CM
## 3 Paul Pogba 25 190.50 83.9 CM,CAM
## 4 Lorenzo Insigne 27 162.56 59.0 LW,ST
## 5 Kalidou Koulibaly 27 187.96 88.9 CB
## 6 Virgil van Dijk 27 193.04 92.1 CB
## Nationality Overall_rating Potential Market_value_euro Wage_euro
## 1 Argentina 94 94 110500000 565000
## 2 Denmark 88 89 69500000 205000
## 3 France 88 91 73000000 255000
## 4 Italy 88 88 62000000 165000
## 5 Senegal 88 91 60000000 135000
## 6 Netherlands 88 90 59500000 215000
## Rating_difference
## 1 0
## 2 1
## 3 3
## 4 0
## 5 3
## 6 2
Here I displayed descriptive statistics for selected variables “Age”, “Height_cm” and “Weight_kgs”.
summary(majdata[, c("Age", "Height_cm", "Weight_kgs")])
## Age Height_cm Weight_kgs
## Min. :20.00 Min. :152.4 Min. :59.00
## 1st Qu.:25.00 1st Qu.:164.5 1st Qu.:71.20
## Median :27.00 Median :175.3 Median :76.20
## Mean :27.18 Mean :175.5 Mean :77.15
## 3rd Qu.:30.00 3rd Qu.:187.3 3rd Qu.:83.22
## Max. :37.00 Max. :198.1 Max. :97.10
Mean:
27.18 ~ explains that the average age of the player is 27.18 years.
175.5 ~ explains that the average height of the player is 175.5 cm.
77.15 ~ explains that the average weight of the player is 77.15 kg.
Median:
27.00 ~ half of the players are younger than 27 years age and half of the players are older than 27 years.
175.3 ~ half of the players are shorter than 175.3 cm and half of the players are taller than 175.3 cm.
76.20 ~ half of the players are lighter than 76.20 kg and half of the players are heavier than 76.20 kg.
Min:
20.00 ~ the lowest age of the player is 20 years.
152.4 ~ the lowest height of the player is 152.4 cm.
59.00 ~ the lowest weight of the player is 59 kg.
hist(majdata$Overall_rating,
breaks = 5,
col = "green",
border = "black",
main = "Distribution of Rating of the Players",
xlab = "Player rating",
ylab = "Frequency")
Here I displayed a histogram that shows the distribution of the variable Overall_rating. We can sense the bimodal distribution because of two peaks. This suggests that there are are two ranges (82- 84 and 86- 88) of player ratings that occur more frequently than others in the dataset. I also sense some potential outliers on the right side.
boxplot(majdata$Market_value_euro/1000000,
main = "Boxplot of Players market value",
ylab = "Market value",
col = "lightgreen",
border = "black")
Here i displayed Boxplot of variable “Market_value_euro”. First of all I divided all market values to get values in millions, because it makes more sense for the graph. The boxplot reveals that market values have some outliers, which is indicated by individual points above upper whisker, confirming also the high values in histogram.
plot(majdata$Age, majdata$Overall_rating,
main = "Scatterplot of Player rating vs Age",
xlab = "Age",
ylab = "Player rating",
col = "blue")
The scatterplot shows that there is no clear linear trend between age and player rating. It helps visualize whether older players have higher raings in general, but from this visualization, it appears that age alone is not a strong predictor of player rating.
library(ggplot2)
library(readxl)
mydata <- read_excel("~/R Take Home Exam 2024/Task 2/Business School.xlsx")
head(mydata)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
ggplot(mydata, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "green", color = "black") +
ylab("Frequency") +
xlab("Undergrad Degree") +
ggtitle("Distribution of Undergrad Degrees")
Above I displayed a histogram showing the distribution of undergrad degrees among all 100 MBA students. We can see that the least common one is Art and the most common one is Business degree.
summary(mydata$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
Descriptive statistics of the variable Annual Salary are supplied by function summary.
mydata$`Annual Salary1000` <- mydata$`Annual Salary` / 1000
head(mydata)
## # A tibble: 6 × 10
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 6 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>,
## # `Annual Salary1000` <dbl>
I created a new variable called ” Annual Salary2” by dividing an annual salary by 1000 to get a smaller value for annual salary, so the histogram plays out nicely.
ggplot(mydata, aes(x = `Annual Salary1000`)) +
geom_histogram(binwidth = 50, fill = "blue", color = "black", alpha = 0.7) +
xlab("Annual Salary") +
ylab("Frequency") +
ggtitle("Distribution of Annual Salary")
The histogram is unimodal, which means it has only one peak, which lies around 100 (Annual salary in thousands). Since most of the distribution is on left side, I can conclude that it is right skewed. We can also see some outliers on the right side. This means that there are not many values that are way higher than the other values.
t_test_result <- t.test(mydata$`MBA Grade`, mu = 74, alternative = "two.sided")
print(t_test_result)
##
## One Sample t-test
##
## data: mydata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
With the function t.test I tested null hypothesis (H0:𝜇MBA Grade=74). The p-value of 0.00915 is significantly lower than 0.05, which means I can accept H1:𝜇MBA Grade≠74. That means that the true mean MBA grade differs from 74, our sample mean is also greater than 74 (76.04055>74)
library(effectsize)
cohens_d(mydata$`MBA Grade`, mu= 74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
The value of Cohen’s is 0.27, which indicates that there is a small effect size (with 95 % confidence). There is a modest difference between this year’s generation and last year’s generation.
library(readxl)
Apartmani <- read_excel("~/R Take Home Exam 2024/Task 3/Apartments.xlsx")
head(Apartmani)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes
Apartmani$Balcony <- factor(Apartmani$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartmani$Parking <- factor(Apartmani$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
I changed categorical variables into factors.
t.test(Apartmani$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartmani$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The P value that we got with T-Test is significantly lower than 0.05, so we can reject the null hypothesis, therefore we can say that the true mean differs from 1900. With 95 % confidence I can say that true mean of apartment prices per m2 lies in the interval from 1937.443 to 2100.440.
fit1 <- lm(Price ~ Age, data = Apartmani)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartmani)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
Estimate of regression coefficient= -8.975
If the age of an apartment increases by 1 year, the price per m2 decreases by 8.975 euros on average (p=0.034).
Corelation coefficent = 0.230355
The linear correlation between age and price of an apartment is WEAK and POSITIVE based on the value of correlation coefficient (0.23).
Coefficient of determination = 0.05302
This coefficient indicates the proportion of the total variability of the dependent variable that can be explained by the linear effect of all explanatory variables. Therefore only 5.3 % of the variability of the price per m2 is explained by the linear effect of age.
library(car)
## Loading required package: carData
scatterplotMatrix(Apartmani[c("Price", "Age", "Distance")], smooth = FALSE)
fit2 <- lm(Price ~ Age + Distance, data = Apartmani)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartmani)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Both VIF values are close to 1, which indicates no multicollinearity between the two independent variables. They are not correlated.
Apartmani$StdResid <- round(rstandard(fit2), 3)
Here I calculated all standardized residuals and stored them as StdResid. There are no outliers, since no absolute value of standard residual is over 3.
Apartmani$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartmani$CooksD,
xlab = "Cooks distance",
ylab = "Freequency",
main = "Histogram of Cooks distances")
I made a histogram displaying distribution of Cooks distances. From the histogram we can see that we have outliers. We can see that by a gap between a gap between 0.15 to 0.30.
head(Apartmani[order(-Apartmani$CooksD), "CooksD"], 10)
## # A tibble: 10 × 1
## CooksD
## <dbl>
## 1 0.32
## 2 0.104
## 3 0.069
## 4 0.066
## 5 0.061
## 6 0.038
## 7 0.037
## 8 0.034
## 9 0.032
## 10 0.03
First 5 values are significantly greater than the other ones, therefore we can conclude they are the outliers.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Apartmani<- Apartmani %>%
filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))
fit2 <- lm(Price ~ Age + Distance, data = Apartmani)
Apartmani$StdFittedValues <- scale(fit2$fitted.values)
I created the fit2 without potentially excluded units and show the summary of the model here.
library(car)
scatterplot(y=Apartmani$StdResid, x=Apartmani$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
I would say that the points in the scatterplot are randomly distributed, so we can conclude that there is no heteroskedasticity.
library(ggplot2)
ggplot(Apartmani, aes(StdResid))+
geom_histogram(fill = "blue", color = "black", alpha = 0.7) +
xlab("Standard Residuals") +
ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I created a histogram of standard residulas with ggplot.
shapiro.test(Apartmani$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartmani$StdResid
## W = 0.93418, p-value = 0.0004761
P-value<0.05; Errors are not normally distributed.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartmani)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
The linear correlation is strong between price, age, distance of apartment. (0.73)
54% of variability in price is described by the linear effect of age and distance.
If the age of an apartment increases by 1 year, the price per m2 decreases by 8.67 euros on average, assuming the distance is constant. (p=0.009)
If the distance from the city center is increased by 1 km, the price per m2 decreases on average by 24 euros, assuming that age is constant. (p<0.0001)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartmani)
I estimated the linear regression function Price = f(Age, Distance, Parking and Balcony).
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
Anova tells is which regression function is better. P value is greater than 0.05, which means that fit3 is a worse model than fit 3.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartmani)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Explanation:
Given the values of the other variables, apartments with a parking space have an average price per m2 higher by 129 euros compared to the apartments without a parking space (p = 0.00251).
There is no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not. (p=0.917).
F-statistic is testing if Rsquared is equal to 0.
Apartmani$StdFittedValues <- fitted.values(fit3)
Apartmani$StdResid <- residuals(fit3)
head(Apartmani[ , colnames(Apartmani) %in% c("ID", "Price", "StdFittedValues", "StdResid")])
## # A tibble: 6 × 3
## Price StdResid StdFittedValues
## <dbl> <dbl> <dbl>
## 1 1640 -88.6 1729.
## 2 2800 443. 2357.
## 3 1660 -62.6 1723.
## 4 1850 311. 1539.
## 5 1640 -349. 1989.
## 6 1770 -143. 1913.
Standard residual of apartment ID2 is 443.