Homework 1

1: Data manipulations

mydata1 <- read.table("~/Desktop/R December/Euroleague matches.csv", 
                     header=TRUE, 
                     sep=",", 
                     dec=",",
                     na.strings=c("","NA"))

mydata_fixed <- mydata1[ , c(-1, -2, -3, -5, -11, -14, -17,-18, -19, -21, -22, -23, -24, -25, -27 ) ] 

head(mydata_fixed)
##          Team                       Opponent X.1    MP FG FGA X3P X3PA FT FTA
## 1 Real Madrid           Khimki Moscow Region   L  5:40  0   0   0    0  2   2
## 2 Real Madrid Crvena Zvezda Telekom Belgrade   W 18:57  1   3   1    2  0   0
## 3 Real Madrid            Fenerbahce Istanbul   W 17:22  1   4   0    2  4   4
## 4 Real Madrid               FC Bayern Munich   W  6:30  1   3   0    2  2   2
## 5 Real Madrid                     Strasbourg   W 13:06  1   2   0    0  1   2
## 6 Real Madrid          Brose Baskets Bamberg   W 20:16  1   2   0    1  0   0
##   TRB PTS
## 1   0   2
## 2   5   3
## 3   3   6
## 4   1   4
## 5   4   3
## 6   2   2

Here I first removed the variables that I will definitely not use in my analysis.

library(tidyr)  # provides drop_na()
mydata_fixed2 <- drop_na(mydata_fixed)

head(mydata_fixed2)
##          Team                       Opponent X.1    MP FG FGA X3P X3PA FT FTA
## 1 Real Madrid           Khimki Moscow Region   L  5:40  0   0   0    0  2   2
## 2 Real Madrid Crvena Zvezda Telekom Belgrade   W 18:57  1   3   1    2  0   0
## 3 Real Madrid            Fenerbahce Istanbul   W 17:22  1   4   0    2  4   4
## 4 Real Madrid               FC Bayern Munich   W  6:30  1   3   0    2  2   2
## 5 Real Madrid                     Strasbourg   W 13:06  1   2   0    0  1   2
## 6 Real Madrid          Brose Baskets Bamberg   W 20:16  1   2   0    1  0   0
##   TRB PTS
## 1   0   2
## 2   5   3
## 3   3   6
## 4   1   4
## 5   4   3
## 6   2   2

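In this step I removed every row that contains a missing value, so that the analysis is not affected by gaps in the data. Note that drop_na() drops incomplete rows; it does not replace missing values with 0.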
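The difference between dropping incomplete rows and filling them with 0 can be sketched on a small made-up data frame. The `toy` object and its values are hypothetical and are not taken from the Euroleague file:

```r
library(tidyr)

# Hypothetical three-row example; the second row has a missing FT value
toy <- data.frame(PTS = c(2, 3, 6), FT = c(2, NA, 4))

dropped <- drop_na(toy)                    # removes the incomplete row
filled  <- replace_na(toy, list(FT = 0))   # keeps the row, FT becomes 0

nrow(dropped)   # 2 rows remain
filled$FT       # 2 0 4
```

Which of the two is appropriate depends on whether a missing value really means "zero" or means "not recorded".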

mydata_fixed2 <- subset(mydata_fixed2, select=c(1,2,3,4,5,6,7,8,9,10,11,12))
colnames(mydata_fixed2) <- c("Luka's Team", "Opponents", "Win or Lose", "Minutes played", "Field baskets", "Field basket attempts", "Three point basket", "Three point basket attempts", "Free throws", "Free throw attempts", "Total Rebounds", "Points") 

head(mydata_fixed2)
##   Luka's Team                      Opponents Win or Lose Minutes played
## 1 Real Madrid           Khimki Moscow Region           L           5:40
## 2 Real Madrid Crvena Zvezda Telekom Belgrade           W          18:57
## 3 Real Madrid            Fenerbahce Istanbul           W          17:22
## 4 Real Madrid               FC Bayern Munich           W           6:30
## 5 Real Madrid                     Strasbourg           W          13:06
## 6 Real Madrid          Brose Baskets Bamberg           W          20:16
##   Field baskets Field basket attempts Three point basket
## 1             0                     0                  0
## 2             1                     3                  1
## 3             1                     4                  0
## 4             1                     3                  0
## 5             1                     2                  0
## 6             1                     2                  0
##   Three point basket attempts Free throws Free throw attempts Total Rebounds
## 1                           0           2                   2              0
## 2                           2           0                   0              5
## 3                           2           4                   4              3
## 4                           2           2                   2              1
## 5                           0           1                   2              4
## 6                           1           0                   0              2
##   Points
## 1      2
## 2      3
## 3      6
## 4      4
## 5      3
## 6      2

To make the data easier to analyse, I renamed the columns so that they can be better understood. For instance, instead of FT I used Free throws.

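minutesPlayedToSeconds <- function(minutes) {
  # Whole minutes: everything before the colon
  mins <- strtoi(sub(":.*", "", minutes))
  # Seconds: the last two characters of the "MM:SS" text
  secs <- strtoi(substr(minutes, nchar(minutes) - 1, nchar(minutes)))
  return(mins * 60 + secs)
}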

mydata_fixed2$MinInSec <- minutesPlayedToSeconds(mydata_fixed2$`Minutes played`)

head(mydata_fixed2[c(-4,-10)])
##   Luka's Team                      Opponents Win or Lose Field baskets
## 1 Real Madrid           Khimki Moscow Region           L             0
## 2 Real Madrid Crvena Zvezda Telekom Belgrade           W             1
## 3 Real Madrid            Fenerbahce Istanbul           W             1
## 4 Real Madrid               FC Bayern Munich           W             1
## 5 Real Madrid                     Strasbourg           W             1
## 6 Real Madrid          Brose Baskets Bamberg           W             1
##   Field basket attempts Three point basket Three point basket attempts
## 1                     0                  0                           0
## 2                     3                  1                           2
## 3                     4                  0                           2
## 4                     3                  0                           2
## 5                     2                  0                           0
## 6                     2                  0                           1
##   Free throws Total Rebounds Points MinInSec
## 1           2              0      2      340
## 2           0              5      3     1137
## 3           4              3      6     1042
## 4           2              1      4      390
## 5           1              4      3      786
## 6           0              2      2     1216

Because the Minutes played variable is stored as text in the form minutes:seconds and could not be properly assessed, I transformed it into seconds, which are much easier to analyse.
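As a cross-check of the conversion, the same values can be obtained by splitting the text at the colon. This is only a sketch of an alternative; `mmss_to_seconds` is a hypothetical helper, not the function used above. The inputs are the six Minutes played entries printed earlier:

```r
# Hypothetical alternative: split "MM:SS" at the colon and combine the parts
mmss_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    # Anything that is not exactly "minutes:seconds" becomes NA
    if (length(p) != 2 || anyNA(p)) return(NA_integer_)
    suppressWarnings(as.integer(p[1]) * 60L + as.integer(p[2]))
  }, integer(1))
}

played <- c("5:40", "18:57", "17:22", "6:30", "13:06", "20:16")
mmss_to_seconds(played)
# 340 1137 1042 390 786 1216
```

The result matches the MinInSec column shown above for these six matches. Entries that do not follow the minutes:seconds pattern come back as NA, which makes them easy to find with is.na().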

3: Main goal

During the analysis I will try to prove 3 hypotheses:

Firstly, I will try to prove that if the team of Luka Dončić lost, he scored fewer points than if they won.

I will also try to assess whether, if Luka Dončić spent more time on the court, he scored more points.

Finally, I will try to prove that he scored fewer Field baskets, Three point baskets and Free throws due to his career still being at the beginning of its development.

4: Data set explanation

mydata_fixed2$`Win or Lose` <- as.factor(mydata_fixed2$`Win or Lose`)
head(mydata_fixed2[c(-4,-6,-8,-10,-11)])
##   Luka's Team                      Opponents Win or Lose Field baskets
## 1 Real Madrid           Khimki Moscow Region           L             0
## 2 Real Madrid Crvena Zvezda Telekom Belgrade           W             1
## 3 Real Madrid            Fenerbahce Istanbul           W             1
## 4 Real Madrid               FC Bayern Munich           W             1
## 5 Real Madrid                     Strasbourg           W             1
## 6 Real Madrid          Brose Baskets Bamberg           W             1
##   Three point basket Free throws Points MinInSec
## 1                  0           2      2      340
## 2                  1           0      3     1137
## 3                  0           4      6     1042
## 4                  0           2      4      390
## 5                  0           1      3      786
## 6                  0           0      2     1216

Here I first transformed my “Win or Lose” column into a factor, so that the summary function, which describes the data, reports the number of lost (L) and won (W) matches instead of treating the column as plain text.
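The effect of the conversion can be sketched with the Win or Lose values of the six rows printed above (the vector name `result` is only used for this illustration):

```r
# Win or Lose values from the six rows printed above
result <- c("L", "W", "W", "W", "W", "W")

summary(result)              # for text: only length, class and mode
summary(as.factor(result))   # for a factor: counts per level (L: 1, W: 5)
```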

My unit of observation is Luka Dončić’s performance in a single game. My sample size is 80 games.

As primary variables I took the following ones:

Luka’s Team: the team Luka played for, which was at that time Real Madrid.
Opponents: the opposing team in the specific match.
Win or Lose: states whether his team lost or won the match.
Field baskets: the number of baskets Luka scored from the game. In the rows shown above, Points equals 2 * Field baskets + Three point basket + Free throws, so this count includes the three point baskets and excludes only the free throws.
Three point basket: the number of baskets scored from behind the 3 point line.
Free throws: the number of free throws scored after he was fouled and awarded free throws.
Points: the points scored in a specific match.
MinInSec: the seconds spent on the court in a specific match.

Unit of measurement for Win or Lose is the match result, “W” or “L”.
Unit of measurement for Field baskets, Three point basket and Free throws is the number of baskets scored.
Unit of measurement for Points is points scored in a whole match.
Unit of measurement for MinInSec is seconds spent on the court.
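How the scoring variables relate to Points can be checked on the six rows printed by head() earlier. The sketch below assumes that every field basket is worth 2 points, every three point basket adds 1 extra point, and every free throw is worth 1 point; the data frame `first_six` is just a hand-typed copy of those six rows:

```r
# Values copied from the six rows printed by head() above
first_six <- data.frame(
  FG  = c(0, 1, 1, 1, 1, 1),   # Field baskets
  X3P = c(0, 1, 0, 0, 0, 0),   # Three point basket
  FT  = c(2, 0, 4, 2, 1, 0),   # Free throws
  PTS = c(2, 3, 6, 4, 3, 2)    # Points
)

# 2 points per field basket, 1 extra point per three pointer, 1 per free throw
reconstructed <- with(first_six, 2 * FG + X3P + FT)

all(reconstructed == first_six$PTS)   # TRUE
```

The formula reproduces Points in all six rows, which suggests that the first three variables count baskets rather than points. Whether it holds for all 80 matches would have to be checked on the full data.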

5: Descriptive statistics

summary(mydata_fixed2[c(-4,-6,-8,-10,-11)])
##  Luka's Team         Opponents         Win or Lose Field baskets  
##  Length:80          Length:80          L:26        Min.   : 0.00  
##  Class :character   Class :character   W:54        1st Qu.: 1.00  
##  Mode  :character   Mode  :character               Median : 3.00  
##                                                    Mean   : 3.15  
##                                                    3rd Qu.: 5.00  
##                                                    Max.   :12.00  
##                                                                   
##  Three point basket  Free throws        Points         MinInSec   
##  Min.   :0.000      Min.   : 0.00   Min.   : 0.00   Min.   :  24  
##  1st Qu.:0.000      1st Qu.: 0.75   1st Qu.: 4.75   1st Qu.:1014  
##  Median :1.000      Median : 2.50   Median :10.00   Median :1322  
##  Mean   :1.262      Mean   : 3.00   Mean   :10.56   Mean   :1266  
##  3rd Qu.:2.000      3rd Qu.: 5.00   3rd Qu.:16.00   3rd Qu.:1586  
##  Max.   :4.000      Max.   :11.00   Max.   :33.00   Max.   :2268  
##                                                     NA's   :4

‘Mean’

Mean number of field baskets per match was 3.15.
Mean number of three point baskets per match was approx. 1.26.
Mean number of scored free throws per match was 3.00.
Mean of scored Points was 10.56.
Mean of MinInSec (seconds played during the match) was 1266 seconds, which is approximately 21 minutes per match.

‘Median’

In 50% of matches Luka scored 3 field baskets or fewer. In the other 50% he scored 3 field baskets or more.
In 50% of matches Luka scored 1 three point basket or fewer. In the other 50% he scored 1 three point basket or more.
In 50% of matches Luka scored 2.5 free throws or fewer. In the other 50% he scored 2.5 free throws or more.
In 50% of matches Luka scored 10 points or fewer during the whole match. In the other 50% he scored 10 points or more.
In 50% of matches Luka played 1322 seconds or less. In the other 50% he played 1322 seconds or more. The 4 matches with a missing MinInSec value are not included here.

‘min’

The minimum number of field baskets, three point baskets and free throws scored in a match was 0. The minimum points that Luka scored during the whole match was also 0. The minimum seconds Luka played in a match was 24.

‘max’

The maximum number of field baskets scored in a match was 12.
The maximum number of three point baskets scored in a match was 4.
The maximum number of free throws scored in a match was 11.
The maximum points scored during the whole match was 33.
The maximum seconds played during a match was 2268.
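Since seconds are hard to read at a glance, the MinInSec figures from the summary can be converted back to minutes. The vector below is typed in by hand from the summary() output above; the name `min_in_sec` is only used for this sketch:

```r
# MinInSec values copied from the summary() output above
min_in_sec <- c(Min = 24, Q1 = 1014, Median = 1322,
                Mean = 1266, Q3 = 1586, Max = 2268)

# Convert seconds to minutes, rounded to one decimal place
round(min_in_sec / 60, 1)
# Min 0.4, Q1 16.9, Median 22.0, Mean 21.1, Q3 26.4, Max 37.8
```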

6: Visual representation

library(ggplot2)  # provides ggplot(), aes() and geom_boxplot()

ggplot(mydata_fixed2, aes(
  x = `Win or Lose`, y = Points, fill = `Win or Lose`))+
  geom_boxplot()

In order to assess the first hypothesis, which is “if the team of Luka Dončić lost, he scored fewer points than if they won”, I first presented the relevant data in a boxplot. From the visual representation it is visible that when Luka’s team lost a game, he scored more points than when it won. It is also visible that in the matches his team won, his scored points had a smaller range. One possible explanation is that when Luka’s team was losing, he was able to play more and score more points.

From this we can conclude that he scored more points when his team lost, so we can reject the first hypothesis.
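The comparison read from the boxplot can also be expressed as a number, for example as the median of Points per result. The sketch below uses only the six rows printed by head() earlier, which contain a single lost match, so its output says nothing about the hypothesis; it only shows the mechanics:

```r
# Points and Win or Lose for the six rows printed above (not the full sample)
points <- c(2, 3, 6, 4, 3, 2)
result <- factor(c("L", "W", "W", "W", "W", "W"))

# Median points per result; a boxplot draws this value as the middle line
tapply(points, result, median)   # L: 2, W: 3

# On the full data the same call would be (assuming the column names above):
# tapply(mydata_fixed2$Points, mydata_fixed2$`Win or Lose`, median)
```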

library(car)  # provides scatterplot()

scatterplot(x = mydata_fixed2$Points, y = mydata_fixed2$MinInSec, 
            ylab = "Seconds played per match", 
            xlab = "Scored points", 
            smooth = FALSE)

The second hypothesis to be assessed is “if Luka Dončić spent more time on the court, he scored more points”. To analyze this hypothesis I used the time played in seconds (MinInSec) and the scored points. In order to have a clear representation of the relationship between the 2 variables I used a scatterplot, with scored points on the x axis and seconds played on the y axis. From the visual representation it can be clearly seen that the more seconds he played per match, the more points he scored.

With this in mind, we can accept the second hypothesis.
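The strength of the relationship seen in the scatterplot can be summarised with a correlation coefficient. This is a sketch on the six rows printed earlier, not on the full sample, so the resulting value is not a finding of this analysis. The argument use = "complete.obs" matters on the full data, because MinInSec contains 4 missing values and cor() would otherwise return NA:

```r
# Points and MinInSec for the six rows printed above (not the full sample)
points  <- c(2, 3, 6, 4, 3, 2)
seconds <- c(340, 1137, 1042, 390, 786, 1216)

# Pearson correlation; rows with a missing value in either vector are skipped
r <- cor(points, seconds, use = "complete.obs")
r

# On the full data the same call would be (assuming the column names above):
# cor(mydata_fixed2$Points, mydata_fixed2$MinInSec, use = "complete.obs")
```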

hist(mydata_fixed2$`Field baskets`,
     main = "Distribution of Field Baskets", 
     xlab = "Scored Field Baskets", 
     ylab = "Frequency", 
     breaks = seq(from = -1, to = 12, by = 1),
     col = "orange")

hist(mydata_fixed2$`Three point basket`,
     main = "Distribution of Three point Baskets", 
     xlab = "Scored three point Baskets", 
     ylab = "Frequency", 
     # start at -1 so that 0 gets its own bar, as in the field basket histogram
     breaks = seq(from = -1, to = 4, by = 1),
     col = "orange")

hist(mydata_fixed2$`Free throws`,
     main = "Distribution of Free Throws", 
     xlab = "Scored Free Throws", 
     ylab = "Frequency", 
     # start at -1 so that 0 gets its own bar, as in the field basket histogram
     breaks = seq(from = -1, to = 11, by = 1),
     col = "orange")

In order to accept or reject the final hypothesis, which is “he scored fewer Field baskets, Three point baskets and Free throws due to his career still being at the beginning of its development”, I helped myself with histograms where on the x axis were the different variables (field baskets, three point baskets and free throws) and on the y axis was the frequency. From all the histograms it is visible that the distributions are not normal and are skewed to the right. That being said, the most skewed is the distribution of free throws.

From this it can be seen that in most matches he scored a low number of field baskets, three point baskets and free throws. This is why I accept the hypothesis: he was still at the beginning of his development, so he didn’t have statistics as high as he has now.
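A note on the histogram bins used above: by default hist() builds bins that are closed on the right, and the first bin also includes its lower limit. For count data this means that breaks starting at 0 put the values 0 and 1 into the same bar, while breaks starting at -1 give every count its own bar. A minimal sketch with a made-up vector (`x` is hypothetical and not taken from the data):

```r
# Hypothetical counts, only to show how the bins are formed
x <- c(0, 0, 1, 2, 2, 2, 4)

# Breaks from 0: bins are [0,1], (1,2], (2,3], (3,4]
from_zero <- hist(x, breaks = seq(from = 0, to = 4, by = 1), plot = FALSE)
from_zero$counts    # 3 3 0 1  (the two 0s and the 1 share the first bin)

# Breaks from -1: bins are [-1,0], (0,1], (1,2], (2,3], (3,4]
from_minus1 <- hist(x, breaks = seq(from = -1, to = 4, by = 1), plot = FALSE)
from_minus1$counts  # 2 1 3 0 1  (one bin per count)
```

For discrete counts like these, barplot(table(x)) is another option that avoids bin boundaries altogether.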