rm(list = ls())

Getting Started

#load packages#
library(psych) #for describe#
library(tidyverse) #for ggplot and dplyr#
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car) 
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## The following object is masked from 'package:psych':
## 
##     logit
library(lsr)
## Warning: package 'lsr' was built under R version 4.3.3
library(lessR)
## Warning: package 'lessR' was built under R version 4.3.3
## 
## lessR 4.3.0                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")   Read text, Excel, SPSS, SAS, or R data file
##   d is default data frame, data= in analysis routines optional
## 
## Learn about reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables
##   Enter:  browseVignettes("lessR")
## 
## View changes in this and recent versions of lessR
##   Enter: news(package="lessR")
## 
## Interactive data analysis
##   Enter: interact()
## 
## 
## Attaching package: 'lessR'
## 
## The following objects are masked from 'package:car':
## 
##     bc, recode, sp
## 
## The following objects are masked from 'package:dplyr':
## 
##     recode, rename
## 
## The following objects are masked from 'package:psych':
## 
##     reflect, rescale, scree, skew
#import data#
hockey2 <- read.csv(file.choose(), header = TRUE, sep = ",")
glimpse(hockey2)
## Rows: 55
## Columns: 11
## $ Name         <chr> "Kluivert", "Lapinski", "Goldberg", "Elias", "Palmer", "G…
## $ Country      <chr> "Sweden", "Canada", "USA", "Sweden", "USA", "Finland", "U…
## $ HeightInches <int> 74, 70, 72, 67, 75, 73, 70, 68, 66, 64, 71, 69, 79, 64, 7…
## $ Age          <int> 26, 23, 21, 24, 21, 25, 30, 35, 30, 25, 23, 19, 30, 27, 1…
## $ Goals        <int> 33, 8, 8, 13, 3, 7, 11, 3, 17, 33, 4, 6, 3, 10, 13, 5, 15…
## $ Assists      <int> 33, 33, 19, 15, 4, 33, 6, 6, 22, 43, 21, 23, 21, 20, 29, …
## $ Points       <int> 66, 41, 27, 28, 7, 40, 17, 9, 39, 73, 25, 29, 24, 30, 42,…
## $ Minutes      <int> 1355, 1306, 1873, 1441, 1249, 1554, 1204, 1105, 1239, 144…
## $ GamesPlayed  <int> 78, 49, 72, 71, 69, 60, 69, 55, 55, 66, 51, 55, 64, 55, 7…
## $ FreeAgent    <chr> "Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"…
## $ PlusMin      <int> -11, -13, -9, -2, -1, -1, 1, 1, 2, 3, 6, 4, 3, 5, 6, 6, 6…

Section 1: Confidence Intervals (1 Point)

Based on your sample you want to gain more insight into league-wide averages for a few values. To do this, you will construct confidence intervals, a form of statistical inference.

Since the R code here is new, I have included what you need highlighted in BLUE. Simply replace any all-caps values with the correct values and run the code (make sure the correct packages above are installed).

Step 1: Use describe() from the psych package and ciMean() function from the lsr package to complete the following table.

ciMean(x, conf = 0.95, na.rm = FALSE)

The variables: Points, Minutes, PlusMin

The statistics: Sample Mean, Sample Standard Deviation, 95% Confidence Interval of Mean, 99% Confidence Interval of Mean

variables <- c("Points", "Minutes", "PlusMin")

results <- data.frame(
  Variable = character(),
  SampleMean = numeric(),
  SampleSD = numeric(),
  CI95 = character(),
  CI99 = character(),
  stringsAsFactors = FALSE
)

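for (var in variables) {
  # Descriptive statistics for the current variable
  des_stats <- describe(hockey2[[var]])
  var_mean <- des_stats$mean  # named var_mean to avoid masking base::mean()
  var_sd <- des_stats$sd      # named var_sd to avoid masking stats::sd()

  # 95% and 99% confidence intervals of the mean (lower bound, upper bound)
  ci_95 <- ciMean(hockey2[[var]], conf = 0.95, na.rm = TRUE)
  ci_99 <- ciMean(hockey2[[var]], conf = 0.99, na.rm = TRUE)

  # Format each interval as "(lower, upper)" for the table
  ci_95_text <- paste0("(", round(ci_95[1], 2), ", ", round(ci_95[2], 2), ")")
  ci_99_text <- paste0("(", round(ci_99[1], 2), ", ", round(ci_99[2], 2), ")")

  # Append one row per variable
  results <- rbind(results, data.frame(
    Variable = var,
    SampleMean = var_mean,
    SampleSD = var_sd,
    CI95 = ci_95_text,
    CI99 = ci_99_text
  ))
}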

markdown_table <- "| Variable | Sample Mean | Sample Standard Deviation | 95% CI of Mean | 99% CI of Mean |\n"
markdown_table <- paste0(markdown_table, "|----------|-----------|----------|----------------|----------------|\n")

for (i in seq_len(nrow(results))) {
  markdown_table <- paste0(markdown_table, "| ", results$Variable[i], 
                           " | ", round(results$SampleMean[i], 2), 
                           " | ", round(results$SampleSD[i], 2), 
                           " | ", results$CI95[i],
                           " | ", results$CI99[i],
                           " |\n")
}

cat(markdown_table)
## | Variable | Sample Mean | Sample Standard Deviation | 95% CI of Mean | 99% CI of Mean |
## |----------|-----------|----------|----------------|----------------|
## | Points | 38.62 | 19.32 | (33.39, 43.84) | (31.66, 45.57) |
## | Minutes | 1573.53 | 298.86 | (1492.73, 1654.32) | (1465.93, 1681.12) |
## | PlusMin | 12.85 | 11.26 | (9.81, 15.9) | (8.8, 16.91) |
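
As a cross-check on the table, the intervals can be approximated by hand with the t-based formula mean ± t* × SD / √n. The sketch below plugs in the rounded Points statistics from the table and n = 55 from the glimpse() output. It assumes ciMean() uses this same t-based interval, and the helper name ci_from_summary is made up for illustration. Because the inputs are rounded, the bounds may differ from the table in the last decimal place.

```r
# Rounded summary statistics for Points, copied from the table above
n <- 55
points_mean <- 38.62
points_sd <- 19.32

# Hypothetical helper: t-based confidence interval from summary statistics
ci_from_summary <- function(m, s, n, conf = 0.95) {
  se <- s / sqrt(n)                             # standard error of the mean
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)  # two-sided critical value
  c(lower = m - t_crit * se, upper = m + t_crit * se)
}

ci95 <- ci_from_summary(points_mean, points_sd, n, conf = 0.95)
ci99 <- ci_from_summary(points_mean, points_sd, n, conf = 0.99)
print(round(ci95, 2))
print(round(ci99, 2))
```

The same helper can be applied to Minutes and PlusMin by swapping in their means and standard deviations.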

- What does the 95% Confidence Interval mean conceptually? What can we say about the league-wide mean? Use Points as an example.

A 95% Confidence Interval (CI) is a range of values, computed from sample statistics, that is constructed so that 95% of intervals built this way from repeated samples would contain the population mean. Conceptually, it reflects how precisely the sample estimates a population parameter: the wider the interval, the more uncertainty there is about the value of the population mean.

Considering the Points variable, the 95% CI is (33.39, 43.84). This means we are 95% confident that the true league-wide average number of points falls between 33.39 and 43.84. It does not imply that 95% of individual players' point totals fall within this range. Rather, if we were to take many samples and calculate the CI for each, approximately 95% of those intervals would contain the true population mean. This distinction is important and is often confused.
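
The repeated-sampling reading can be illustrated with a small simulation. This is only a sketch: it treats the sample mean and SD of Points as if they were the true league values and assumes a normally distributed population, neither of which the lab data can confirm. The seed and the number of simulations are arbitrary.

```r
set.seed(700)  # arbitrary seed so the sketch is reproducible

true_mean <- 38.62  # treated as the "true" league mean for illustration only
true_sd <- 19.32    # likewise an assumed population SD
n <- 55
n_sims <- 2000

# For each simulated sample, check whether its 95% CI covers the true mean
covered <- replicate(n_sims, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  abs(mean(x) - true_mean) <= margin
})

coverage <- mean(covered)
print(coverage)  # share of simulated intervals that contain the true mean
```

The share printed at the end should land close to 0.95, which is the sense in which the procedure is "95% confident".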

- Compare the 95% and 99% confidence intervals across variables. Which is larger or smaller? Conceptually, why is this the case?

When comparing the 95% and 99% confidence intervals for Points, Minutes, and PlusMin, the 99% interval is wider than the 95% interval for every variable: about 13.91 vs. 10.45 for Points, 215.19 vs. 161.59 for Minutes, and 8.11 vs. 6.09 for PlusMin.

This pattern is consistent across variables because a higher confidence level (99% vs. 95%) requires a wider interval. The reason is conceptual: to be more confident that the interval contains the true population mean, we must accept a broader range of values. The sample mean and standard error stay the same; only the critical value that multiplies the standard error grows, which makes the interval more likely to capture the true mean.

The 99% CI offers greater assurance (99% confidence) that it encloses the population mean compared to the 95% CI. However, this comes at the cost of precision: the interval is broader, reflecting greater uncertainty about the exact value of the population mean.
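
To see where the extra width comes from, the sketch below compares the t critical values behind the two confidence levels for n = 55. It assumes the intervals are t-based, as in the ciMean() call above. Since both intervals share the same standard error, the ratio of their widths should equal the ratio of the critical values, which is roughly what the Points intervals in the table show (13.91 / 10.45).

```r
n <- 55  # sample size from glimpse(hockey2)

t95 <- qt(0.975, df = n - 1)  # critical value for a 95% interval
t99 <- qt(0.995, df = n - 1)  # critical value for a 99% interval

# Same standard error in both intervals, so the width ratio is t99 / t95
width_ratio <- t99 / t95
print(round(c(t95 = t95, t99 = t99, ratio = width_ratio), 3))
```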

Step 2: Visualize confidence intervals using ggplot to investigate the relationship between Points and Plus-Minus. At each stage below, paste your code and your visualization.

# Basic Scatterplot
ggplot(hockey2, aes(x=Points, y=PlusMin)) + 
  geom_point()

# Scatterplot with Trend Line
ggplot(hockey2, aes(x=Points, y=PlusMin)) + 
  geom_point() + 
  geom_smooth(method=lm, color="red", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

# Scatterplot with Trend Line and 95% Confidence Interval
ggplot(hockey2, aes(x=Points, y=PlusMin)) + 
  geom_point() + 
  geom_smooth(method=lm, color="red", se=TRUE)
## `geom_smooth()` using formula = 'y ~ x'

# Scatterplot with Trend Line and 99% Confidence Interval
ggplot(hockey2, aes(x=Points, y=PlusMin)) +
  geom_point() +
  geom_smooth(method=lm, color="red", se=TRUE, level=0.99)
## `geom_smooth()` using formula = 'y ~ x'
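
The shaded band drawn by geom_smooth(method = lm, se = TRUE) is, as far as I understand, a confidence band for the fitted mean, which can also be computed with predict() on an lm fit. The sketch below uses simulated stand-in data (not hockey2, whose full values are not reproduced here) to show that the 99% band is wider than the 95% band at every x value. The slope, intercept, and noise level are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed; the data below are simulated, not hockey2

# Hypothetical stand-in data with a positive linear trend
sim <- data.frame(Points = runif(55, min = 5, max = 75))
sim$PlusMin <- 2 + 0.3 * sim$Points + rnorm(55, sd = 10)

fit <- lm(PlusMin ~ Points, data = sim)
grid <- data.frame(Points = seq(10, 70, by = 10))

# Confidence bands for the fitted mean at each grid point
band95 <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
band99 <- predict(fit, newdata = grid, interval = "confidence", level = 0.99)

width95 <- band95[, "upr"] - band95[, "lwr"]
width99 <- band99[, "upr"] - band99[, "lwr"]
print(round(cbind(grid, width95, width99), 2))
```

With the real data, replacing sim with hockey2 in the lm() call would give the numbers behind the plotted ribbons.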

Summarize your findings: based on this exercise, does there appear to be a relationship between Points and Plus-Minus? How confident are you of this (in qualitative terms)?

There appears to be a positive correlation between Points and Plus-Minus. In qualitative terms, we can be fairly confident about this relationship, though not highly confident.

Section 2: Independent Samples T-Test (4 Points)

A team executive is looking to recruit new players to help in the playoffs. One scout suggests signing a Free Agent player, but another argues Free Agents perform worse. If better performance is reflected by higher Plus-Minus, who is right?

To answer this question, we'll replicate each of the steps outlined in the lecture to test whether Free Agent and non-Free Agent players significantly differ in Plus-Minus. For the R code you need, refer to the class 3 slides.

Step 1: Clarify your research question. No data transformations should be needed.

Research Question: Whether Free Agent players perform differently from non-Free Agent players in terms of Plus-Minus

Grouping variable: Free Agent status (indicating whether a player is a Free Agent or not)

Response variable: Plus-Minus (reflecting player performance)

Null Hypothesis (H0): There is no difference in mean Plus-Minus between Free Agent and non-Free Agent players.

Alternate Hypothesis (H1): There is a difference in mean Plus-Minus between Free Agent and non-Free Agent players.

Step 2: Inspect your data. Use ggplot to create boxplots of the variable of interest by two groups.

- What are some observations about the two groups from the boxplots? Which has a higher mean? Are their ranges comparable?

Optionally, at this point you may also want to use describeBy() from the psych package to see statistics for each group. It can be particularly useful to get a sense of n, mean, sd, and skew for each group.

ggplot(hockey2, aes(x=FreeAgent, y=PlusMin)) + 
  geom_boxplot() +
  labs(x="Free Agent Status", y="Plus-Minus")

describeBy(hockey2$PlusMin, group = hockey2$FreeAgent)
## 
##  Descriptive statistics by group 
## group: No
##    vars  n  mean    sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 30 15.53 12.93     15   15.83 12.6 -13  41    54 -0.19    -0.49 2.36
## ------------------------------------------------------------ 
## group: Yes
##    vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 25 9.64 7.97     11    9.81 7.41 -11  26    37 -0.36     0.11 1.59

Non-Free Agent (No): n = 30;

Free Agent (Yes): n = 25;

The non-Free Agent group has a higher mean (15.53 vs. 9.64);

The minimums are similar (-13 vs. -11), but non-Free Agents reach a higher maximum (41 vs. 26), so their range is wider (54 vs. 37).

Step 3: Evaluate your assumptions.

# QQ Plot for Normality
qqnorm(hockey2$PlusMin)
qqline(hockey2$PlusMin)

# Independence
# In the context of this research question, independence means that the
# Plus-Minus performance of Free Agent players is not influenced by or
# correlated with the performance of non-Free Agent players. Each player's
# performance is independent of the others'.
# Homogeneity of Variance - Levene Test
library(car)
leveneTest(PlusMin ~ FreeAgent, data = hockey2)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  5.1575 0.02723 *
##       53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In a perfectly normal distribution, the data points on the QQ plot will fall exactly on a straight line. From the graph, we can conclude that the variable PlusMin is approximately normally distributed.
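
Because the t-test assumes approximate normality within each group, it can also help to check the two groups separately rather than only the pooled variable. The sketch below swaps in a Shapiro-Wilk test per group as one possible check; this is not the method used in the lab, and the data are simulated stand-ins that borrow the group sizes, means, and SDs from the describeBy() output. With the real data, sim would be replaced by hockey2.

```r
set.seed(3)  # arbitrary seed; simulated stand-in for hockey2

# Group sizes, means, and SDs borrowed from the describeBy() output above
sim <- data.frame(
  FreeAgent = rep(c("No", "Yes"), times = c(30, 25)),
  PlusMin = c(rnorm(30, mean = 15.53, sd = 12.93),
              rnorm(25, mean = 9.64, sd = 7.97))
)

# Shapiro-Wilk p-value per group (a large p gives no evidence against normality)
shapiro_p <- tapply(sim$PlusMin, sim$FreeAgent,
                    function(x) shapiro.test(x)$p.value)
print(round(shapiro_p, 3))
```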

A non-significant p-value (usually p > 0.05) would suggest that there is no evidence to reject the null hypothesis of equal variances among groups, indicating homogeneity of variances. For PlusMin, however, Levene's test is significant (F(1, 53) = 5.16, p = 0.027), so we reject the null hypothesis of equal variances at the 5% level. The homogeneity of variance assumption is violated, which is why the t-test in Step 4 uses var.equal = FALSE (Welch's t-test).

Step 4: Run the T-Test.

t.test(PlusMin ~ FreeAgent, data = hockey2, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  PlusMin by FreeAgent
## t = 2.0695, df = 49.133, p-value = 0.04377
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##   0.1711605 11.6155062
## sample estimates:
##  mean in group No mean in group Yes 
##          15.53333           9.64000

Step 5: Write up your results below.

1. Write your results in the "formal" format.

2. In your own words, explain what this test demonstrated about the relationship between Plus-Minus and Free Agent Status. Which scout is right, and how close is the debate?

Formal: A Welch two-sample t-test was conducted to assess the difference in Plus-Minus scores between Free Agent (Yes) and non-Free Agent (No) players. The test revealed a significant difference, t(49.13) = 2.07, p = 0.044, with non-Free Agents (M = 15.53, SD = 12.93) having higher Plus-Minus scores on average than Free Agents (M = 9.64, SD = 7.97). The 95% confidence interval for the difference in means (No minus Yes) ranged from 0.17 to 11.62.

Explanation and Conclusion: The Welch t-test indicates a statistically significant difference in performance (as measured by Plus-Minus) between Free Agent and non-Free Agent players, with non-Free Agents performing better on average. This finding supports the scout who argued that Free Agents perform worse. While the debate might be close given the p-value is just under 0.05, the data lean towards suggesting a difference in performance favoring non-Free Agents.
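
A p-value alone does not say how large the difference is, so a standardized effect size can help judge how close the debate is. The sketch below computes Cohen's d from the rounded group summaries using a pooled standard deviation. This is one common definition and an assumption on my part: the lab does not specify an effect size for this test, and with unequal variances another standardizer may be preferable.

```r
# Group summaries from describeBy(): No = non-Free Agents, Yes = Free Agents
n_no <- 30;  m_no <- 15.53;  s_no <- 12.93
n_yes <- 25; m_yes <- 9.64;  s_yes <- 7.97

# Pooled SD weights each group's variance by its degrees of freedom
pooled_sd <- sqrt(((n_no - 1) * s_no^2 + (n_yes - 1) * s_yes^2) /
                    (n_no + n_yes - 2))

# Cohen's d: mean difference expressed in pooled-SD units
cohens_d <- (m_no - m_yes) / pooled_sd
print(round(cohens_d, 2))
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), a value near 0.5 would count as a medium effect.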

Section 3: Chi-Square Test (2 Points)

A recent news story suggested that the league is unwelcoming to European players and, as a result, they are less likely to have long-term contracts. The league commissions an investigation to determine whether there is any substance to this concern. Specifically, they hope to determine whether European players are as likely to be Free Agents as non-European players. To answer this question, we will use a Chi-Square Test of Independence. For the R code you need, refer to the Class 3 Slides.

Step 1. Clarify your research question and transform your data.

- What is the null hypothesis? What is your alternate hypothesis?

To start your analysis, create a region variable in R where 'Sweden' and 'Finland' are associated with 'European' and any other value is associated with 'Other'.

Next, identify the two variables in your dataset you will be using. Use any technique you have learned previously to get an n count for each variable.

1. Name: ________ (n = ____ )

2. Name: ________ (n = ____ )

hockey2$region <- ifelse(hockey2$Country %in% c("Sweden", "Finland"), "European", "Other")

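Research Question: Is there a difference in the likelihood of being a Free Agent between players from European countries (specifically Sweden and Finland) and players from other regions?

Null Hypothesis (H0): Region and Free Agent status are independent.

Alternate Hypothesis (H1): Region and Free Agent status are associated.

The two variables to be used in the Chi-Square Test of Independence are FreeAgent and region, each with n = 55: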

table(hockey2$FreeAgent)
## 
##  No Yes 
##  30  25
table(hockey2$region)
## 
## European    Other 
##       27       28

Step 2. Review your data. Fill out the contingency tables for count and frequency.

- Paste a bar chart of the proportions below.

- What do you notice in the bar chart? In 1-2 sentences, describe what it says about your data as it relates to your research question.

count_table <- table(hockey2$region, hockey2$FreeAgent)

frequency_table <- prop.table(table(hockey2$region, hockey2$FreeAgent), margin = 1)

print(count_table)
##           
##            No Yes
##   European 11  16
##   Other    19   9
print(frequency_table)
##           
##                   No       Yes
##   European 0.4074074 0.5925926
##   Other    0.6785714 0.3214286
ggplot(hockey2, aes(x = region, fill = FreeAgent)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Region", y = "Proportion", fill = "Free Agent Status")

The contingency table and frequencies show a noticeable difference in Free Agent status between European players and players from other regions. About 59.26% of European players are Free Agents (16 of 27), compared with about 32.14% of players from other regions (9 of 28).

This observation suggests that region may be associated with a player’s likelihood of being a Free Agent within the league, although the formal test in Step 4 is needed to judge whether the difference is statistically significant.

Step 3. Confirm Assumptions

  1. Observations are independent. Each participant or observation should fall into one and only one cell of the contingency table.

  2. Large enough sample size. Each cell in the contingency table should have an expected count of 5 or more.

  3. Random sampling. The data should be collected in a way that is representative of the population, and each participant has an equal chance of being selected.
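
The sample-size assumption can be checked directly from the observed counts. The sketch below rebuilds the count table from Step 2 by hand (so it runs without the data file) and computes the expected count for each cell as row total × column total / grand total.

```r
# Observed counts copied from the count table in Step 2
observed <- matrix(c(11, 19, 16, 9), nrow = 2,
                   dimnames = list(region = c("European", "Other"),
                                   FreeAgent = c("No", "Yes")))

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# The assumption holds if every expected count is at least 5
print(all(expected >= 5))
```

With the data loaded, chisq.test(count_table)$expected should return the same matrix.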

Step 4. Run and interpret the Chi-Square test. Paste your results below.

contingency_table <- table(hockey2$region, hockey2$FreeAgent)
chi_test_result <- chisq.test(contingency_table)
print(chi_test_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 3.0562, df = 1, p-value = 0.08043

Step 5: Write up your results below.

1. Write your results in "formal" format. Include an analysis of your effect size calculation.

2. In your own words, explain what this test demonstrated about the relationship between European Players and Free Agent status.


A Pearson’s Chi-squared test with Yates’ continuity correction was performed to examine the association between player region (European vs. Other) and Free Agent status. The test resulted in a Chi-squared value of 3.06 with 1 degree of freedom and a p-value of 0.080 (N = 55), indicating no statistically significant association between player region and Free Agent status at the 5% significance level.

This outcome suggests that, based on the data analyzed, there’s insufficient evidence to conclude that Free Agent status differs between European players and players from other regions at the 5% level, even though the sample proportions differ (59% vs. 32%). Failing to reject the null hypothesis does not show that no association exists.
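
The prompt also asks for an effect size. One option for a 2x2 table is phi (which equals Cramér's V in this case), computed as the square root of chi-squared divided by N. The sketch below rebuilds the count table by hand and uses the Yates-corrected statistic reported above. Basing phi on the corrected statistic is an assumption; the uncorrected statistic would give a somewhat larger value.

```r
# Observed counts copied from the count table in Step 2
observed <- matrix(c(11, 19, 16, 9), nrow = 2,
                   dimnames = list(region = c("European", "Other"),
                                   FreeAgent = c("No", "Yes")))

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
test <- chisq.test(observed)
chi_sq <- unname(test$statistic)
n_total <- sum(observed)

# Phi (equal to Cramer's V for a 2x2 table): sqrt(chi-squared / N)
phi <- sqrt(chi_sq / n_total)
print(round(c(chi_sq = chi_sq, phi = phi), 3))
```

A phi of roughly 0.24 sits between Cohen's conventional benchmarks for a small (0.1) and a medium (0.3) effect, which fits a difference that is visible in the proportions but not statistically significant with 55 players.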

Feedback

1. How long did this lab take?

3h+

2. Do you feel you could follow the instructions in the slide and lab?

Yes

3. What topics related to these statistical tests are still confusing or unclear?

None

4. Compared to prior labs, was this approach to R more or less accessible/useful for your learning?

Yes

The code is also publicly available at: https://rpubs.com/AlanHuang/EPS700_Lab3