EPS 700 Lab 1

Author

Ao (Alan) Huang

rm(list = ls())

Task 2: Conduct some General R Practice (1 point)

  1. Perform the following calculations in R
15-4
[1] 11
27+13
[1] 40
5^11
[1] 48828125
(6+5)/(2^(1/2))
[1] 7.778175
(1/2)*((200^(-2))^(1/2))
[1] 0.0025
  2. Create and perform operations with variables
Var1 <- 6
Var2 <- 22
Var1/Var2
[1] 0.2727273
Var1*Var2
[1] 132
Var1^Var2
[1] 1.316217e+17
Var3 <- Var1^Var2
Var3>Var1*Var2
[1] TRUE
sqrt(Var2)>sqrt(Var3-Var1)
[1] FALSE

Task 3: Importing data into R

  1. Install and load the following packages: ggplot2
require(ggplot2) || install.packages("ggplot2")
Loading required package: ggplot2
[1] TRUE
library(ggplot2)
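The `require(pkg) || install.packages(pkg)` idiom above works when the package is already installed. When it is missing, `install.packages()` returns `NULL`, so the `||` expression would most likely stop with an error after the install finishes. A minimal sketch of a more explicit pattern follows; `ensure_package` is a hypothetical helper name, not something the lab provides.

```r
# Hypothetical helper: install a package only when it is missing, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call installs nothing and only attaches it.
ensure_package("stats")
```

The same call with "ggplot2" or "psych" would cover the packages used in this lab.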
  2. Import the Dataset
df <- read.csv(file.choose(), header=TRUE, sep=",")
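`file.choose()` opens an interactive dialog, so the import cannot be repeated automatically when the document is rendered. A minimal sketch of a non-interactive import follows, assuming the CSV sits at a known path. The file name `lab1_players.csv` is hypothetical, and the two data rows are copied from the `head(df)` output in this lab only so that the example runs on its own.

```r
# Write a tiny stand-in file so the sketch is self-contained.
csv_path <- file.path(tempdir(), "lab1_players.csv")  # hypothetical file name
writeLines(c(
  "Name,Country,HeightInches,Age",
  "Saka,Canada,65,26",
  "Lapinski,Canada,70,23"
), csv_path)

# Read from an explicit path instead of file.choose().
players <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(players)
```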
  3. Inspect the headers
head(df)
      Name Country HeightInches Age Goals Assists Points Minutes GamesPlayed
1     Saka  Canada           65  26    17      45     62    1333          45
2 Lapinski  Canada           70  23     8      33     41    1406          49
3   Angard  Canada           69  19     6      23     29    1347          55
4      Fox  Canada           69  33    11      12     23    1445          65
5     Bure  Canada           74  19    13      29     42    1849          72
6     Park  Canada           68  32     7      17     24    2110          79
  FreeAgent PlusMin
1        No     -14
2       Yes       9
3       Yes      33
4       Yes      12
5        No      -2
6       Yes      44

b. I did not see asterisks next to any values, probably because the output is rendered from a Quarto file.

  4. Find the variable names
names(df)
 [1] "Name"         "Country"      "HeightInches" "Age"          "Goals"       
 [6] "Assists"      "Points"       "Minutes"      "GamesPlayed"  "FreeAgent"   
[11] "PlusMin"     
  5. Summarize your data
summary(df)
     Name             Country           HeightInches        Age       
 Length:31          Length:31          Min.   :63.00   Min.   :19.00  
 Class :character   Class :character   1st Qu.:67.00   1st Qu.:23.00  
 Mode  :character   Mode  :character   Median :70.00   Median :25.00  
                                       Mean   :69.84   Mean   :25.81  
                                       3rd Qu.:72.50   3rd Qu.:29.50  
                                       Max.   :77.00   Max.   :33.00  
     Goals          Assists          Points         Minutes      GamesPlayed   
 Min.   : 3.00   Min.   : 4.00   Min.   : 7.00   Min.   :1239   Min.   :45.00  
 1st Qu.: 6.50   1st Qu.:16.50   1st Qu.:25.00   1st Qu.:1494   1st Qu.:62.50  
 Median :11.00   Median :22.00   Median :33.00   Median :1801   Median :72.00  
 Mean   :13.26   Mean   :26.06   Mean   :39.32   Mean   :1729   Mean   :68.94  
 3rd Qu.:17.00   3rd Qu.:33.00   3rd Qu.:48.50   3rd Qu.:1934   3rd Qu.:78.00  
 Max.   :42.00   Max.   :56.00   Max.   :78.00   Max.   :2234   Max.   :82.00  
  FreeAgent            PlusMin      
 Length:31          Min.   :-24.00  
 Class :character   1st Qu.:  0.00  
 Mode  :character   Median :  9.00  
                    Mean   : 11.97  
                    3rd Qu.: 24.50  
                    Max.   : 48.00  
  6. Use the command describe()
require(psych) || install.packages("psych")
Loading required package: psych

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha
[1] TRUE
library(psych)
describe(df)
             vars  n    mean     sd median trimmed    mad  min  max range  skew
Name*           1 31   16.00   9.09     16   16.00  11.86    1   31    30  0.00
Country*        2 31    2.32   1.17      2    2.28   1.48    1    4     3  0.23
HeightInches    3 31   69.84   3.62     70   69.92   4.45   63   77    14 -0.18
Age             4 31   25.81   4.25     25   25.76   4.45   19   33    14  0.15
Goals           5 31   13.26   9.61     11   11.76   8.90    3   42    39  1.21
Assists         6 31   26.06  12.85     22   25.00  10.38    4   56    52  0.65
Points          7 31   39.32  19.51     33   37.92  13.34    7   78    71  0.63
Minutes         8 31 1728.90 291.78   1801 1728.24 366.20 1239 2234   995 -0.03
GamesPlayed     9 31   68.94  10.98     72   70.00  10.38   45   82    37 -0.68
FreeAgent*     10 31    1.61   0.50      2    1.64   0.00    1    2     1 -0.44
PlusMin        11 31   11.97  18.48      9   11.64  17.79  -24   48    72  0.18
             kurtosis    se
Name*           -1.32  1.63
Country*        -1.47  0.21
HeightInches    -0.96  0.65
Age             -1.25  0.76
Goals            1.06  1.73
Assists         -0.44  2.31
Points          -0.79  3.50
Minutes         -1.23 52.41
GamesPlayed     -0.87  1.97
FreeAgent*      -1.86  0.09
PlusMin         -0.74  3.32

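A skew greater than zero indicates a right-skewed distribution: Age, Goals, Assists, Points, PlusMin

A skew less than zero indicates a left-skewed distribution: HeightInches, Minutes, GamesPlayed (Minutes, at -0.03, is essentially symmetric)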
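The sign rule can be illustrated with a small moment-based skewness function. This is a sketch on made-up vectors, not the lab data. I believe it corresponds to the default estimator used by `describe()`, but treat that as an assumption; `sample_skew` is a hypothetical helper name.

```r
# Moment-based sample skewness: third central moment divided by the cubed sample SD.
sample_skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 20)  # made-up values with one large value
symmetric    <- c(1, 2, 3, 4, 5)         # made-up symmetric values

sample_skew(right_tailed)   # positive: the long tail is on the right
sample_skew(symmetric)      # zero: no skew
sample_skew(-right_tailed)  # negative: mirroring the data flips the sign
```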

  7. Use the describeBy function, grouped by Country
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# a. Country with the most players
most_players_country <- df %>%
  group_by(Country) %>%
  summarise(NumberOfPlayers = n()) %>%
  arrange(desc(NumberOfPlayers)) %>%
  slice(1)
# b. Country with the highest average points
highest_avg_points <- df %>%
  group_by(Country) %>%
  summarize(AveragePoints = mean(Points, na.rm = TRUE)) %>%
  arrange(desc(AveragePoints)) %>%
  top_n(1, AveragePoints)

# c. Country with the lowest mean age
lowest_mean_age <- df %>%
  group_by(Country) %>%
  summarize(MeanAge = mean(Age, na.rm = TRUE)) %>%
  arrange(MeanAge) %>%
  top_n(1, -MeanAge)

# d. Country with the highest variation in minutes played
highest_variation_minutes <- df %>%
  group_by(Country) %>%
  summarize(VariationMinutes = sd(Minutes, na.rm = TRUE)) %>%
  arrange(desc(VariationMinutes)) %>%
  top_n(1, VariationMinutes)

print(most_players_country)
# A tibble: 1 × 2
  Country NumberOfPlayers
  <chr>             <int>
1 Canada               10
print(highest_avg_points)
# A tibble: 1 × 2
  Country AveragePoints
  <chr>           <dbl>
1 Finland          41.8
print(lowest_mean_age)
# A tibble: 1 × 2
  Country MeanAge
  <chr>     <dbl>
1 USA        24.4
print(highest_variation_minutes)
# A tibble: 1 × 2
  Country VariationMinutes
  <chr>              <dbl>
1 Canada              328.

Task 4: Descriptive Statistics

1. Fill out the table below for each variable in the data set.

data <- data.frame(
  Name = c("Name", "Country", "HeightInches", "Age", "Goals", "Assists", 
           "Points", "Minutes", "GamesPlayed", "FreeAgent", "PlusMin"),
  Definition = c("Player last name", "Player’s home country", "Height in inches", "Age", 
                 "Goals scored in the season", "Assists registered in the season", 
                 "Points (goals + assists)", "Minutes played over the course of the season", 
                 "Games appeared in", "Is the player a free agent next year?", 
                 "Plus-minus rating for the player"),
  # Range and Median are NA for the nominal variables, where they are not defined
  Range = c(NA, NA, 14, 14, 39, 52, 71, 995, 37, NA, 72),
  Median = c(NA, NA, 70, 25, 11, 22, 33, 1801, 72, NA, 9),
  Measurement_Type = c("Nominal", "Nominal", "Numeric", "Numeric", "Numeric", 
                       "Numeric", "Numeric", "Numeric", "Numeric", "Nominal", 
                       "Numeric")
)

library(knitr)
kable(data)
|Name         |Definition                                   | Range| Median|Measurement_Type |
|:------------|:--------------------------------------------|-----:|------:|:----------------|
|Name         |Player last name                             |    NA|     NA|Nominal          |
|Country      |Player’s home country                        |    NA|     NA|Nominal          |
|HeightInches |Height in inches                             |    14|     70|Numeric          |
|Age          |Age                                          |    14|     25|Numeric          |
|Goals        |Goals scored in the season                   |    39|     11|Numeric          |
|Assists      |Assists registered in the season             |    52|     22|Numeric          |
|Points       |Points (goals + assists)                     |    71|     33|Numeric          |
|Minutes      |Minutes played over the course of the season |   995|   1801|Numeric          |
|GamesPlayed  |Games appeared in                            |    37|     72|Numeric          |
|FreeAgent    |Is the player a free agent next year?        |    NA|     NA|Nominal          |
|PlusMin      |Plus-minus rating for the player             |    72|      9|Numeric          |
  1. What is the overall mean minutes per game of all the players?

    df$Minutes <- as.numeric(df$Minutes)
    df$GamesPlayed <- as.numeric(df$GamesPlayed)
    
    # Create the MinutesPerGame column
    df <- transform(df, MinutesPerGame = Minutes / GamesPlayed)
    
    # Sum the total minutes and total games played across all players
    total_minutes <- sum(df$Minutes, na.rm = TRUE)
    total_games_played <- sum(df$GamesPlayed, na.rm = TRUE)
    
    # Calculate the overall mean minutes per game
    overall_mean_minutes_per_game <- total_minutes / total_games_played
    overall_mean_minutes_per_game
    [1] 25.08002
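The value above is a pooled ratio (total minutes divided by total games), so every game counts equally. Averaging the MinutesPerGame column instead gives every player equal weight, which is why the mean of that column reported later in this task (25.21) differs slightly. A minimal sketch of the difference, using made-up numbers for two players:

```r
# Made-up numbers for two players with very different workloads.
minutes  <- c(100, 2000)
games    <- c(10, 80)
per_game <- minutes / games  # 10 and 25 minutes per game

pooled     <- sum(minutes) / sum(games)       # every game counts equally
unweighted <- mean(per_game)                  # every player counts equally
weighted   <- weighted.mean(per_game, games)  # weights by games, matches pooled

c(pooled = pooled, unweighted = unweighted, weighted = weighted)
```

Which figure is the right answer depends on whether the question is about the average game or the average player.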
  2. How many countries are represented in the data? Which country has the most players? Which country has the highest average height?

    # Number of countries represented
    number_of_countries <- length(unique(df$Country))
    
    # Country with the most players
    country_most_players <- df %>%
      group_by(Country) %>%
      summarize(NumberOfPlayers = n()) %>%
      arrange(desc(NumberOfPlayers)) %>%
      slice(1)
    
    # Country with the highest average height
    country_highest_avg_height <- df %>%
      group_by(Country) %>%
      summarize(AverageHeight = mean(HeightInches, na.rm = TRUE)) %>%
      arrange(desc(AverageHeight)) %>%
      slice(1)
    
    print(number_of_countries)
    [1] 4
    print(country_most_players)
    # A tibble: 1 × 2
      Country NumberOfPlayers
      <chr>             <int>
    1 Canada               10
    print(country_highest_avg_height)
    # A tibble: 1 × 2
      Country AverageHeight
      <chr>           <dbl>
    1 USA              71.1
  3. For each of the following variables, what is the mean, median, and mode?

    # Function to calculate mode
    get_mode <- function(v) {
      uniqv <- unique(v)
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }
    
    # Mean, median, and mode for each variable
    stats_age <- c(mean = mean(df$Age, na.rm = TRUE), 
                   median = median(df$Age, na.rm = TRUE), 
                   mode = get_mode(df$Age))
    
    stats_points <- c(mean = mean(df$Points, na.rm = TRUE), 
                      median = median(df$Points, na.rm = TRUE), 
                      mode = get_mode(df$Points))
    
    stats_minutes_per_game <- c(mean = mean(df$MinutesPerGame, na.rm = TRUE), 
                                median = median(df$MinutesPerGame, na.rm = TRUE), 
                                mode = get_mode(df$MinutesPerGame))
    
    stats_games_played <- c(mean = mean(df$GamesPlayed, na.rm = TRUE), 
                            median = median(df$GamesPlayed, na.rm = TRUE), 
                            mode = get_mode(df$GamesPlayed))
    print(stats_age)
        mean   median     mode 
    25.80645 25.00000 23.00000 
    print(stats_points)
        mean   median     mode 
    39.32258 33.00000 25.00000 
    print(stats_minutes_per_game)
        mean   median     mode 
    25.20837 25.21212 29.62222 
    print(stats_games_played)
        mean   median     mode 
    68.93548 72.00000 72.00000 
  4. For each of the following categories, what is the mean, median, and mode?
    # Use the full column name rather than relying on partial matching by `$`
    stats_heightinc <- c(mean = mean(df$HeightInches, na.rm = TRUE), 
                         median = median(df$HeightInches, na.rm = TRUE), 
                         mode = get_mode(df$HeightInches))
    
    stats_minutes <- c(mean = mean(df$Minutes, na.rm = TRUE), 
                       median = median(df$Minutes, na.rm = TRUE), 
                       mode = get_mode(df$Minutes))
    
    stats_plusmin <- c(mean = mean(df$PlusMin, na.rm = TRUE), 
                       median = median(df$PlusMin, na.rm = TRUE), 
                       mode = get_mode(df$PlusMin))
    print(stats_heightinc)
        mean   median     mode 
    69.83871 70.00000 71.00000 
    print(stats_minutes)
        mean   median     mode 
    1728.903 1801.000 1333.000 
    print(stats_plusmin)
        mean   median     mode 
    11.96774  9.00000 13.00000 
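A caveat on the mode values: `get_mode()` returns the first of the most frequent values, so when every value is unique it simply returns the first element. That appears to be what happens for MinutesPerGame (29.62222 is 1333/45, the first row) and for Minutes (1333). A hedged sketch of an alternative that reports every tied mode and returns NA when nothing repeats; `get_modes` is a hypothetical helper, not part of the lab.

```r
# Hypothetical helper: return every most-frequent value,
# or NA when no value occurs more than once.
get_modes <- function(v) {
  v <- v[!is.na(v)]
  counts <- table(v)
  if (length(counts) == 0 || max(counts) == 1) return(NA)
  top <- names(counts)[counts == max(counts)]
  if (is.numeric(v)) as.numeric(top) else top
}

get_modes(c(23, 23, 25, 25, 30))  # two modes: 23 and 25
get_modes(c(1.5, 2.5, 3.5))       # NA: every value is unique
```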

Task 5: Creating Graphs/Charts

  1. Create a histogram showing the distribution of Goals for all players.

    library(ggplot2)
    
    # Assuming 'Goals' is a column in your data frame 'df'
    ggplot(df, aes(x=Goals)) +
      geom_histogram(binwidth=1, fill="blue", color="black") +
      labs(title="Histogram of Goals", x="Goals", y="Count") +
      theme_minimal()

Most values fall between 3 and 22, with 33 and 42 as high outliers. Because of the small sample size (n = 31), it is hard to characterize the distribution precisely, but it is neither normal nor uniform. The shape is right-skewed: the long tail extends toward the high values, which agrees with the positive skew of 1.21 that describe() reports for Goals.
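The outlier label for 33 and 42 can be checked against the common 1.5 × IQR rule. The rule itself is an assumption here, since the lab does not say which definition of outlier it expects. The quartiles are the ones `summary(df)` reports for Goals in Task 3.

```r
# Quartiles of Goals as reported by summary(df) in Task 3.
q1  <- 6.5
q3  <- 17
iqr <- q3 - q1                 # 10.5

lower_fence <- q1 - 1.5 * iqr  # -9.25
upper_fence <- q3 + 1.5 * iqr  # 32.75

c(33, 42) > upper_fence        # both TRUE: flagged as high outliers
```

Since Goals cannot be negative, no value can fall below the lower fence, so under this rule outliers are only possible on the high side.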

  2. Create a scatterplot of minutes per game

    ggplot(df, aes(x=MinutesPerGame, y=Points)) +
      geom_point(alpha=0.5, color="blue") +
      labs(title="Scatterplot of Minutes per Game vs. Points",
           x="Minutes per Game", y="Points") +
      theme_minimal()

It is hard to say anything concrete about the correlation from the plot alone. The relationship looks positive but not very strong.
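A correlation coefficient would put a number on that visual impression. The sketch below uses made-up values as stand-ins for `df$MinutesPerGame` and `df$Points`, so its result says nothing about the lab data.

```r
# Made-up values standing in for df$MinutesPerGame and df$Points.
minutes_per_game <- c(18, 20, 22, 25, 27, 30)
points           <- c(15, 30, 22, 41, 35, 60)

r    <- cor(minutes_per_game, points)       # Pearson correlation, between -1 and 1
test <- cor.test(minutes_per_game, points)  # adds a p-value and a confidence interval

r
test$p.value
```

On the real data the equivalent call would be `cor(df$MinutesPerGame, df$Points)`. With only 31 players, the confidence interval from `cor.test()` is likely to be wide.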

  3. Create a boxplot showing assists by country
ggplot(df, aes(x=Country, y=Assists)) +
  geom_boxplot(fill="cyan") +
  labs(title="Boxplot of Assists by Country", x="Country", y="Assists") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Three criteria:

  • The median line (inside the box) close to the center of the box.

  • The smallest interquartile range (IQR), which is the distance between the first and third quartiles.

  • Few or no outliers, which are typically indicated by dots beyond the whiskers.

Finland would be the one whose average assists are least likely to be influenced by an outlier because its data is more clustered around the median with less spread.
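The second and third criteria can also be checked numerically rather than by eye. This is a minimal sketch on made-up data for two hypothetical countries, "A" and "B". It uses `IQR()` for the box height and `boxplot.stats()` for the points beyond the whiskers.

```r
# Made-up assists for two hypothetical countries.
toy <- data.frame(
  Country = c(rep("A", 5), rep("B", 5)),
  Assists = c(20, 21, 22, 23, 24,   5, 20, 22, 24, 60)
)

# Spread of the middle 50% of each group (the height of each box).
iqr_by_country <- tapply(toy$Assists, toy$Country, IQR)
iqr_by_country

# Values that fall beyond the whiskers, per group.
outliers <- lapply(split(toy$Assists, toy$Country),
                   function(x) boxplot.stats(x)$out)
outliers
```

In this toy example "A" has the smaller IQR and no outliers, so its mean is the one less likely to be pulled by an extreme value.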

Feedback

1. How long did this lab take you to complete?

4h+

2. What parts, if any, were large departures from the course material?

Maybe the graphing part? Actually, all parts are relevant to the course material.

3. Which questions, if any, were unduly frustrating or challenging?

None, but it is hard to ‘present’ or ‘showcase’ the results we can easily have in the console.

4. Which questions, if any, were especially useful/interesting?

Task 5, question 3

The code is also publicly available at: https://rpubs.com/AlanHuang/EPS700_Lab1