Homework 1
1: Data manipulations
mydata1 <- read.table("~/Desktop/R December/Euroleague matches.csv",
header=TRUE,
sep=",",
dec=",",
na.strings=c("","NA"))
mydata_fixed <- mydata1[ , c(-1, -2, -3, -5, -11, -14, -17,-18, -19, -21, -22, -23, -24, -25, -27 ) ]
head(mydata_fixed)
## Team Opponent X.1 MP FG FGA X3P X3PA FT FTA
## 1 Real Madrid Khimki Moscow Region L 5:40 0 0 0 0 2 2
## 2 Real Madrid Crvena Zvezda Telekom Belgrade W 18:57 1 3 1 2 0 0
## 3 Real Madrid Fenerbahce Istanbul W 17:22 1 4 0 2 4 4
## 4 Real Madrid FC Bayern Munich W 6:30 1 3 0 2 2 2
## 5 Real Madrid Strasbourg W 13:06 1 2 0 0 1 2
## 6 Real Madrid Brose Baskets Bamberg W 20:16 1 2 0 1 0 0
## TRB PTS
## 1 0 2
## 2 5 3
## 3 3 6
## 4 1 4
## 5 4 3
## 6 2 2
Here I first removed the variables that I will definitely not use in my
analysis.
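Dropping columns by negative position works, but it is hard to read back later. A minimal sketch of the same idea with column names, on a made-up toy data frame (the columns Rk and Date and the object names raw, keep and kept are hypothetical, not taken from the original file):

```r
# Toy data frame standing in for the raw import (values are made up)
raw <- data.frame(
  Rk   = 1:3,
  Date = c("a", "b", "c"),
  Team = "Real Madrid",
  MP   = c("5:40", "18:57", "17:22"),
  PTS  = c(2, 3, 6)
)

# Keep the wanted columns by name instead of dropping others by position
keep <- c("Team", "MP", "PTS")
kept <- raw[, keep]
names(kept)
```

Selecting by name keeps working if the column order of the source file changes.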
library(tidyr) # provides drop_na()
mydata_fixed2 <- drop_na(mydata_fixed)
head(mydata_fixed2)
## Team Opponent X.1 MP FG FGA X3P X3PA FT FTA
## 1 Real Madrid Khimki Moscow Region L 5:40 0 0 0 0 2 2
## 2 Real Madrid Crvena Zvezda Telekom Belgrade W 18:57 1 3 1 2 0 0
## 3 Real Madrid Fenerbahce Istanbul W 17:22 1 4 0 2 4 4
## 4 Real Madrid FC Bayern Munich W 6:30 1 3 0 2 2 2
## 5 Real Madrid Strasbourg W 13:06 1 2 0 0 1 2
## 6 Real Madrid Brose Baskets Bamberg W 20:16 1 2 0 1 0 0
## TRB PTS
## 1 0 2
## 2 5 3
## 3 3 6
## 4 1 4
## 5 4 3
## 6 2 2
In this step I removed every row that contains a missing value, so that
the analysis is possible on complete observations.
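drop_na() removes every row that has a missing value in any column. A small base R sketch of the same row filter, on a made-up toy data frame (the object names toy and complete_rows are hypothetical):

```r
# Toy data with missing values (made up for illustration)
toy <- data.frame(
  MP  = c("5:40", NA, "17:22"),
  PTS = c(2, 3, NA)
)

# complete.cases() is TRUE only for rows without any NA
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

Only the first toy row survives, because the other two each contain one NA.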
mydata_fixed2 <- subset(mydata_fixed2, select=c(1,2,3,4,5,6,7,8,9,10,11,12))
colnames(mydata_fixed2) <- c("Luka's Team", "Opponents", "Win or Lose", "Minutes played", "Field baskets", "Field basket attempts", "Three point basket", "Three point basket attempts", "Free throws", "Free throw attempts", "Total Rebounds", "Points")
head(mydata_fixed2)
## Luka's Team Opponents Win or Lose Minutes played
## 1 Real Madrid Khimki Moscow Region L 5:40
## 2 Real Madrid Crvena Zvezda Telekom Belgrade W 18:57
## 3 Real Madrid Fenerbahce Istanbul W 17:22
## 4 Real Madrid FC Bayern Munich W 6:30
## 5 Real Madrid Strasbourg W 13:06
## 6 Real Madrid Brose Baskets Bamberg W 20:16
## Field baskets Field basket attempts Three point basket
## 1 0 0 0
## 2 1 3 1
## 3 1 4 0
## 4 1 3 0
## 5 1 2 0
## 6 1 2 0
## Three point basket attempts Free throws Free throw attempts Total Rebounds
## 1 0 2 2 0
## 2 2 0 0 5
## 3 2 4 4 3
## 4 2 2 2 1
## 5 0 1 2 4
## 6 1 0 0 2
## Points
## 1 2
## 2 3
## 3 6
## 4 4
## 5 3
## 6 2
To make the data easier to analyse, I renamed the variables so that they
can be better understood. For instance, instead of FT I used Free throws.
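Assigning all twelve names by position relies on the column order being exactly as expected. A hedged alternative sketch that renames through an old-name to new-name lookup, on a made-up toy data frame (toy, new_names and hits are hypothetical names):

```r
# Toy data frame with two of the original abbreviations
toy <- data.frame(Team = "Real Madrid", FT = 2, FTA = 2)

# Lookup table: old column name -> descriptive name
new_names <- c(FT = "Free throws", FTA = "Free throw attempts")

# Rename only the columns that appear in the lookup table
hits <- names(toy) %in% names(new_names)
names(toy)[hits] <- unname(new_names[names(toy)[hits]])
names(toy)
```

Columns that are not in the lookup table, such as Team here, keep their name.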
minutesPlayedToSeconds <- function(minutes) {
  # Minutes: everything before the colon; seconds: the last two characters
  mins <- strtoi(sub(":.*", "", minutes))
  secs <- strtoi(substr(minutes, nchar(minutes) - 1, nchar(minutes)))
  return(mins * 60 + secs)
}
mydata_fixed2$MinInSec <- minutesPlayedToSeconds(mydata_fixed2$`Minutes played`)
head(mydata_fixed2[c(-4,-10)])
## Luka's Team Opponents Win or Lose Field baskets
## 1 Real Madrid Khimki Moscow Region L 0
## 2 Real Madrid Crvena Zvezda Telekom Belgrade W 1
## 3 Real Madrid Fenerbahce Istanbul W 1
## 4 Real Madrid FC Bayern Munich W 1
## 5 Real Madrid Strasbourg W 1
## 6 Real Madrid Brose Baskets Bamberg W 1
## Field basket attempts Three point basket Three point basket attempts
## 1 0 0 0
## 2 3 1 2
## 3 4 0 2
## 4 3 0 2
## 5 2 0 0
## 6 2 0 1
## Free throws Total Rebounds Points MinInSec
## 1 2 0 2 340
## 2 0 5 3 1137
## 3 4 3 6 1042
## 4 2 1 4 390
## 5 1 4 3 786
## 6 0 2 2 1216
Because the Minutes played variable is stored as text in the form
minutes:seconds and could not be properly assessed, I transformed it into
seconds, which are much easier to analyse.
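A possible alternative for the conversion, sketched under the assumption that every valid entry has the form minutes:seconds. It splits on the colon and returns NA for anything else, which makes problem entries easy to count (the function name mmss_to_seconds is hypothetical):

```r
# Convert "MM:SS" strings to seconds; anything without exactly one colon
# separated pair becomes NA
mmss_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_integer_)
    as.integer(p[1]) * 60L + as.integer(p[2])
  }, integer(1))
}

# The first, second and sixth values from the table above
mmss_to_seconds(c("5:40", "18:57", "20:16"))
```

The summary output in section 5 lists 4 NA values for MinInSec, so a check such as sum(is.na(mydata_fixed2$MinInSec)) right after the conversion could help to find the affected rows.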
3: Main goal
During the analysis I will try to prove 3 hypotheses:
Firstly I will try to prove that if the team of Luka Dončič lost, he
scored fewer points than if they won.
I will also try to show that if Luka Dončič spent more time on the
court, he scored more points.
In the end I will try to prove that he scored fewer Field baskets,
Three point baskets and Free throws due to his career still being in the
beginning of the development.
4: Data set explanation
mydata_fixed2$`Win or Lose` <- as.factor(mydata_fixed2$`Win or Lose`)
head(mydata_fixed2[c(-4,-6,-8,-10,-11)])
## Luka's Team Opponents Win or Lose Field baskets
## 1 Real Madrid Khimki Moscow Region L 0
## 2 Real Madrid Crvena Zvezda Telekom Belgrade W 1
## 3 Real Madrid Fenerbahce Istanbul W 1
## 4 Real Madrid FC Bayern Munich W 1
## 5 Real Madrid Strasbourg W 1
## 6 Real Madrid Brose Baskets Bamberg W 1
## Three point basket Free throws Points MinInSec
## 1 0 2 2 340
## 2 1 0 3 1137
## 3 0 4 6 1042
## 4 0 2 4 390
## 5 0 1 3 786
## 6 0 0 2 1216
Here I first transformed the “Win or Lose” variable into a factor, so
that the summary function, which describes the data, reports the number
of losses (L) and wins (W) instead of treating the variable as plain text.
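A short sketch of what the factor conversion changes, using the six Win or Lose values shown in the table above (the object name result is hypothetical):

```r
# The first six match results from the table above
result <- factor(c("L", "W", "W", "W", "W", "W"))

levels(result)   # the categories, in alphabetical order
summary(result)  # number of matches per category
```

For a character vector summary() only reports length, class and mode, while for a factor it reports one count per level.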
My unit of observation is Luka Dončič’s performance in one game. My
sample size is 80.
As primary variables I took the following ones:
Luka’s Team: his team at that time, Real Madrid.
Opponents: the opposing team in the specific match.
Win or Lose: states whether his team won or lost the match.
Field baskets: the number of baskets Luka scored from the field, excluding
free throws. In the rows shown above this count includes three point
baskets (in row 2, 1 field basket, 1 three point basket and 0 free throws
give 3 points).
Three point basket: the number of baskets scored from behind the three
point line.
Free throws: the number of free throws scored after he was fouled; each
one is worth 1 point.
Points: the points scored in a specific match.
MinInSec: the seconds spent on the court in a specific match.
Unit of measurement for Win or Lose: the match result, either “W” or “L”.
Unit of measurement for Field baskets, Three point basket and Free throws:
number of scored baskets or throws.
Unit of measurement for Points: points scored in a whole match.
Unit of measurement for MinInSec: seconds spent on the court.
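How the scoring variables relate to Points can be checked on the first six rows printed above. The sketch below assumes that a field basket is worth 2 points, with one extra point when it is a three point basket, and that a free throw is worth 1 point (the object names first_six and reconstructed are hypothetical):

```r
# First six rows of FG, X3P, FT and PTS as printed above
first_six <- data.frame(
  FG  = c(0, 1, 1, 1, 1, 1),
  X3P = c(0, 1, 0, 0, 0, 0),
  FT  = c(2, 0, 4, 2, 1, 0),
  PTS = c(2, 3, 6, 4, 3, 2)
)

# Points rebuilt from the counts: 2 per field basket,
# 1 extra per three point basket, 1 per free throw
reconstructed <- with(first_six, 2 * FG + X3P + FT)
all(reconstructed == first_six$PTS)
```

The relation holds for all six rows, which suggests that these variables are counts of scored baskets and throws rather than points.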
5: Descriptive statistics
summary(mydata_fixed2[c(-4,-6,-8,-10,-11)])
## Luka's Team Opponents Win or Lose Field baskets
## Length:80 Length:80 L:26 Min. : 0.00
## Class :character Class :character W:54 1st Qu.: 1.00
## Mode :character Mode :character Median : 3.00
## Mean : 3.15
## 3rd Qu.: 5.00
## Max. :12.00
##
## Three point basket Free throws Points MinInSec
## Min. :0.000 Min. : 0.00 Min. : 0.00 Min. : 24
## 1st Qu.:0.000 1st Qu.: 0.75 1st Qu.: 4.75 1st Qu.:1014
## Median :1.000 Median : 2.50 Median :10.00 Median :1322
## Mean :1.262 Mean : 3.00 Mean :10.56 Mean :1266
## 3rd Qu.:2.000 3rd Qu.: 5.00 3rd Qu.:16.00 3rd Qu.:1586
## Max. :4.000 Max. :11.00 Max. :33.00 Max. :2268
## NA's :4
‘Mean’
The mean number of field baskets scored per match was 3.15.
The mean number of three point baskets scored per match was approx. 1.26.
The mean number of free throws scored per match was 3.00.
The mean of scored Points was 10.56.
The mean of MinInSec (seconds played during the match) was 1266 seconds,
which is approximately 21 minutes per match.
‘Median’
In 50% of matches Luka scored 3 field baskets or fewer; in the other 50%
he scored 3 or more.
In 50% of matches Luka scored 1 three point basket or fewer; in the other
50% he scored 1 or more.
In 50% of matches Luka scored 2.5 free throws or fewer; in the other 50%
he scored 2.5 or more.
In 50% of matches Luka scored 10 points or fewer during the whole match;
in the other 50% he scored 10 points or more.
In 50% of matches Luka played 1322 seconds or less; in the other 50% he
played 1322 seconds or more.
‘min’
The minimum number of field baskets, three point baskets and free throws
scored in a match was 0. The minimum number of points Luka scored during
a whole match was also 0. The minimum time Luka played in a match was
24 seconds.
‘max’
The maximum number of field baskets scored in a match was 12.
The maximum number of three point baskets scored in a match was 4.
The maximum number of free throws scored in a match was 11.
The maximum number of points scored during a whole match was 33.
The maximum time played during a match was 2268 seconds.
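The statistics interpreted above can also be computed one by one. A small sketch on the six Points values printed in section 1 (the object names pts and stats are hypothetical):

```r
# Points from the first six rows shown in section 1
pts <- c(2, 3, 6, 4, 3, 2)

stats <- c(mean = mean(pts), median = median(pts),
           min = min(pts), max = max(pts))
stats

# Share of matches at or below the median
mean(pts <= median(pts))
```

Because several matches have exactly the median value, the share at or below the median is larger than one half here, which is why the interpretation reads "or fewer" and "or more".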
6: Visual representation
library(ggplot2) # provides ggplot(), aes() and geom_boxplot()
ggplot(mydata_fixed2, aes(
  x = `Win or Lose`, y = Points, fill = `Win or Lose`)) +
  geom_boxplot()

To assess the first hypothesis, “if the team of Luka Dončič lost, he
scored fewer points than if they won”, I first presented the relevant
data in a boxplot. The boxplot shows that Luka scored more points in the
games his team lost than in the games it won. It also shows that the
range of scored points was smaller in the games his team won. A possible
explanation is that when Luka’s team was losing, he was able to play more
and score more points.
From this we can conclude that he scored more points when his team lost,
so the boxplot does not support the hypothesis and we reject it.
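A boxplot gives a visual impression only. One way to back it up would be a two-sample test; the sketch below uses the Wilcoxon rank-sum test as one possible choice, on made-up numbers that are not taken from the data set (toy, test and group_medians are hypothetical names):

```r
# Made-up points for lost (L) and won (W) matches
toy <- data.frame(
  result = factor(c("L", "L", "L", "L", "W", "W", "W", "W", "W")),
  points = c(12, 15, 9, 20, 4, 8, 10, 6, 11)
)

# Median points per group
group_medians <- tapply(toy$points, toy$result, median)
group_medians

# Rank-based comparison of the two groups (normal approximation)
test <- wilcox.test(points ~ result, data = toy, exact = FALSE)
test$p.value
```

On the real data the same call would use the Points and Win or Lose columns of mydata_fixed2.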
library(car) # provides scatterplot()
scatterplot(x = mydata_fixed2$Points, y = mydata_fixed2$MinInSec,
            ylab = "Seconds played per match",
            xlab = "Scored points",
            smooth = FALSE)

The second hypothesis to be assessed is “if Luka Dončič spent more time
on the court, he scored more points”. To analyze this hypothesis I used
the variables MinInSec (time played in seconds) and Points. To get a
clear picture of the relationship between the 2 variables I used a
scatterplot, which allowed me to assess the connection between the
variable on the x axis and the variable on the y axis. The scatterplot
shows that the more seconds he played per match, the more points he
scored.
With this in mind, we can accept the second hypothesis.
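The strength of the relationship seen in a scatterplot can be expressed as a correlation coefficient. A minimal sketch with cor.test(), using only the first six rows printed in section 1, so the result illustrates the call and says nothing about the full data set (toy and ct are hypothetical names):

```r
# MinInSec and Points from the first six rows shown in section 1
toy <- data.frame(
  seconds = c(340, 1137, 1042, 390, 786, 1216),
  points  = c(2, 3, 6, 4, 3, 2)
)

# Pearson correlation with a test of "no linear relationship"
ct <- cor.test(toy$seconds, toy$points, method = "pearson")
ct$estimate  # correlation coefficient, between -1 and 1
ct$p.value
```

On the real data the two arguments would be mydata_fixed2$MinInSec and mydata_fixed2$Points; rows with a missing value in either variable are left out of the calculation.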
hist(mydata_fixed2$`Field baskets`,
main = "Distribution of Field Baskets",
xlab = "Scored Field Baskets",
ylab = "Frequency",
breaks = seq(from = -1, to = 12, by = 1),
col = "orange")

hist(mydata_fixed2$`Three point basket`,
main = "Distribution of Three point Baskets",
xlab = "Scored three point Baskets",
ylab = "Frequency",
breaks = seq(from = -1, to = 4, by = 1), # start at -1 so that 0 gets its own bin
col = "orange")

hist(mydata_fixed2$`Free throws`,
main = "Distribution of Free Throws",
xlab = "Scored Free Throws",
ylab = "Frequency",
breaks = seq(from = -1, to = 11, by = 1), # start at -1 so that 0 gets its own bin
col = "orange")
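The starting point of the breaks matters for count data. With its defaults, hist() uses bins that are closed on the right and includes the lowest break in the first bin, so breaks starting at 0 put the values 0 and 1 into the same bin. A small sketch on made-up counts (made, merged and separate are hypothetical names):

```r
# Made-up counts of scored baskets
made <- c(0, 1, 1, 2)

# Breaks starting at 0: the first bin is [0, 1] and holds both 0 and 1
merged <- hist(made, breaks = seq(from = 0, to = 2, by = 1), plot = FALSE)$counts

# Breaks starting at -1: the bins are (-1, 0], (0, 1], (1, 2]
separate <- hist(made, breaks = seq(from = -1, to = 2, by = 1), plot = FALSE)$counts

merged
separate
```

Starting the breaks at -1, as in the histogram of field baskets, gives every count its own bar.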

To accept or reject the final hypothesis, “he scored fewer Field baskets,
Three point baskets and Free throws due to his career still being in the
beginning of the development”, I used histograms with the different
variables (field baskets, three point baskets and free throws) on the x
axis and the frequency on the y axis. All the histograms show that the
distributions are not normal and are skewed to the right. The most skewed
is the distribution of free throws.
This shows that he predominantly scored low values in all the analyzed
variables. This is why I accept the hypothesis: he was still at the
beginning of his development, so his statistics were not as high as they
are now.
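Which distribution is the most skewed can also be compared with a number. Base R has no skewness function, so the sketch below defines a simple moment-based one; the function name sample_skewness and the example vectors are hypothetical and made up:

```r
# Moment-based skewness: third central moment divided by the
# second central moment to the power 1.5
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_skewed <- sample_skewness(c(0, 0, 1, 1, 2, 8))  # long tail to the right
symmetric    <- sample_skewness(c(1, 2, 3, 4, 5))     # symmetric values

right_skewed
symmetric
```

A positive value indicates a tail to the right and a value of 0 indicates symmetry, so applying the function to the three scoring variables would allow a direct comparison.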