Setup

Load data (Kobe Bryant shots)

# Robust loader for the OpenIntro Kobe dataset
suppressWarnings(rm(kobe))
ok <- try(load(url("https://www.openintro.org/stat/data/kobe.RData")), silent = TRUE)
if (inherits(ok, "try-error") || !exists("kobe")) stop("kobe did not load; re-run or upload kobe.RData and use load('kobe.RData').")
# Normalize to an H/M character vector regardless of column name
if ("shot" %in% names(kobe)) {
  x <- kobe$shot
} else if ("basket" %in% names(kobe)) {
  x <- kobe$basket
} else {
  stop("Neither 'shot' nor 'basket' found.")
}
x <- toupper(as.character(x))
x <- ifelse(x %in% c("H","HIT","1","TRUE"), "H",
            ifelse(x %in% c("M","MISS","0","FALSE"), "M", NA))
x <- x[!is.na(x)]  # drop any weird values
cat("Rows:", length(x), " | H:", sum(x=="H"), " | M:", sum(x=="M"), "\n")

## Rows: 133  | H: 58  | M: 75

Helper: streak length calculator

# calc_streak counts consecutive H's; a streak ends when an M occurs or the data end.
# Zero-length streaks (a miss with no hit before it) are dropped from the result.
calc_streak <- function(vec) {
  streaks <- integer(0); run <- 0L
  for (i in seq_along(vec)) {
    if (vec[i] == "H") {
      run <- run + 1L
    } else {
      streaks <- c(streaks, run); run <- 0L
    }
  }
  streaks <- c(streaks, run)   # close last run if it ends with H
  streaks[streaks > 0]
}
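
As a cross-check on the loop above, the same streak lengths can be sketched with base R's rle(), which encodes a vector as runs of equal values. The name calc_streak_rle is hypothetical (not part of the lab), and the sketch assumes the input contains only "H" and "M" with no NA values.

```r
# Hypothetical alternative to calc_streak, built on base R's rle().
# Assumes vec holds only "H"/"M" (no NA values).
calc_streak_rle <- function(vec) {
  runs <- rle(vec == "H")       # runs of TRUE (hits) and FALSE (misses)
  runs$lengths[runs$values]     # keep only the lengths of the hit runs
}

calc_streak_rle(c("H", "M", "H", "H", "M", "M", "H"))
## [1] 1 2 1
```

Like calc_streak, this version reports only streaks of length 1 or more.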

Q1. What is a “streak” of length 1? What about 0?

example_vec <- c("H","M","H","H","M","M","H")
example_vec

## [1] "H" "M" "H" "H" "M" "M" "H"

calc_streak(example_vec)

## [1] 1 2 1

Answer Q1 (in words):
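A streak of length 1 is a single made shot that is immediately followed by a miss (the miss ends the run) or by the end of the data. A streak of length 0 is a miss with no made shots before it, that is, a miss that directly follows another miss or opens the sequence. Note that calc_streak as defined above drops zero-length streaks (it returns streaks[streaks > 0]), so the example output shows only 1, 2, 1.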
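
To see where zero-length streaks would appear, here is a minimal sketch of a variant that keeps them. The name calc_streak_with_zeros is hypothetical; the idea is to pad the sequence with a miss at each end and count the hits between consecutive misses.

```r
# Hypothetical variant that keeps zero-length streaks.
# Assumes vec holds only "H"/"M" (no NA values).
calc_streak_with_zeros <- function(vec) {
  hit <- as.integer(vec == "H")
  # Pad with a miss on both ends so every streak is bracketed by misses
  miss_pos <- which(c(0L, hit, 0L) == 0L)
  diff(miss_pos) - 1L           # number of hits between consecutive misses
}

calc_streak_with_zeros(c("H", "M", "H", "H", "M", "M", "H"))
## [1] 1 2 0 1
```

The 0 comes from the second of the two back-to-back misses: no hit occurred before it ended the (empty) run.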

Q2. Compute Kobe’s streak distribution and describe its shape

kobe_streaks <- calc_streak(x)   # use normalized H/M vector from the Load step
if (length(kobe_streaks) == 0) stop("No streaks found — check that x contains H/M values.")
tbl_kobe <- table(kobe_streaks)
barplot(tbl_kobe,
        main = "Kobe Bryant: Streak length distribution",
        xlab = "Streak length (consecutive hits)", ylab = "Frequency")

summary(kobe_streaks)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.568   2.000   4.000

Answer Q2 (description):
The distribution is right-skewed: most streaks are short (median 1, mean about 1.57), the frequency drops quickly as streak length increases, and the longest streak is 4. Long streaks are rare, which matches what we expect even under independent shots.
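
To make "what we expect under independent shots" concrete: if every shot is a hit with probability p independently of the others, a streak that has started continues with probability p and ends with probability 1 - p, so P(length = k) = p^(k-1) (1 - p) for k = 1, 2, .... The sketch below evaluates this for the hit proportion reported in the load step (58 of 133). The helper name streak_pmf is hypothetical, and the formula ignores the truncation of the last streak at the end of the data.

```r
# Hypothetical helper: P(streak length = k) for positive-length streaks
# under independent shots with hit probability p.
streak_pmf <- function(k, p) p^(k - 1) * (1 - p)

p <- 58 / 133                     # hit proportion from the load step
probs <- streak_pmf(1:4, p)
names(probs) <- paste0("len", 1:4)
round(probs, 3)
```

Each additional hit multiplies the probability by p, which is the rapid drop-off described above.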

Q3. Estimate Kobe’s overall hit probability and simulate independent shooting

# Q3 — Estimate hit probability and simulate independent shooting
p_hat <- mean(x == "H")   # x is the H/M vector from the Load chunk
p_hat

## [1] 0.4360902

n <- length(x) 
set.seed(60603)
sim_shots <- ifelse(runif(n) < p_hat, "H", "M")
sim_streaks <- calc_streak(sim_shots)
table(sim_streaks)

## sim_streaks
##  1  2  3  4 
## 24  6  3  2

barplot(table(sim_streaks),
        main = sprintf("IID Bernoulli simulation (p = %.3f): streaks", p_hat),
        xlab = "Streak length", ylab = "Frequency")

Answer Q3 (compare Kobe vs IID):
Kobe’s observed distribution and the IID simulation are qualitatively similar: in both, short streaks dominate and long streaks appear only occasionally. The small differences between them are plausibly due to sampling variability, though a single simulated run cannot quantify that; the repeated simulations in Q4 do.
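
An equivalent way to draw independent shots is sample() with a prob argument, instead of thresholding runif(). The sketch below is illustrative only: simulate_shots is a hypothetical helper, the seed is arbitrary, and the inputs (133 shots, hit probability 58/133) are taken from the load step.

```r
# Hypothetical helper: draw n independent shots, each a hit with probability p
simulate_shots <- function(n, p) {
  sample(c("H", "M"), size = n, replace = TRUE, prob = c(p, 1 - p))
}

set.seed(2025)                          # arbitrary seed, for reproducibility only
shots <- simulate_shots(133, 58 / 133)
table(shots)
mean(shots == "H")                      # near 58/133, but not exactly equal to it
```

Both approaches generate IID Bernoulli shots; they simply consume the random number stream differently, so the same seed will not give identical sequences.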

Q4. Multiple simulations to assess variability of streak frequencies

# Repeat many times; collect frequencies for lengths 1, 2, 3, and 4+ (collapsed)
set.seed(60603)
many <- 500
freq_mat <- replicate(many, {
  sim_x <- ifelse(runif(n) < p_hat, "H", "M")  # local name; avoids shadowing the observed vector x
  cs <- calc_streak(sim_x)
  # Tabulate 1,2,3,4+:
  freqs <- tabulate(pmin(cs, 4), nbins = 4)
  freqs
})
rownames(freq_mat) <- c("len1","len2","len3","len4plus")
sim_means <- rowMeans(freq_mat)
sim_sds   <- apply(freq_mat, 1, sd)
sim_means

##     len1     len2     len3 len4plus 
##   19.014    8.114    3.444    2.632

sim_sds

##     len1     len2     len3 len4plus 
## 3.820340 2.523835 1.696059 1.495677

# Kobe's observed frequencies for the same bins:
kobe_bins <- tabulate(pmin(kobe_streaks, 4), nbins = 4)
names(kobe_bins) <- c("len1","len2","len3","len4plus")
kobe_bins

##     len1     len2     len3 len4plus 
##       24        6        6        1

Answer Q4 (interpretation): Kobe’s observed bin counts (24, 6, 6, 1) are reasonably close to the simulation averages for IID shooting (about 19.0, 8.1, 3.4, 2.6); each lies within roughly 1.5 simulation standard deviations of its mean. This does not provide strong evidence against independence (i.e., no strong “hot-hand” signal here).
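
One way to put numbers on "within a plausible range" is to standardize each observed count against the simulation mean and standard deviation. The sketch below re-enters the values printed above so it runs on its own. Treating |z| < 2 as "plausible" is a rough rule of thumb assumed here, not a formal test.

```r
# Values copied from the printed output above
sim_means <- c(len1 = 19.014, len2 = 8.114, len3 = 3.444, len4plus = 2.632)
sim_sds   <- c(len1 = 3.820340, len2 = 2.523835, len3 = 1.696059, len4plus = 1.495677)
kobe_bins <- c(len1 = 24, len2 = 6, len3 = 6, len4plus = 1)

# Standardized distance of each observed count from its simulation mean
z <- (kobe_bins - sim_means) / sim_sds
round(z, 2)
##     len1     len2     len3 len4plus 
##     1.31    -0.84     1.51    -1.09
```

None of the bins is more than about 1.5 standard deviations from its simulated mean.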

(Optional) Q5. Simple probability refresher: coin example

set.seed(60603)
coin <- sample(c("H","T"), size = 100, replace = TRUE)
mean(coin == "H")   # ~0.5 in the long run, but variable in small samples

## [1] 0.54

Answer Q5 (law of large numbers):
Over many trials, the proportion of heads tends to 0.5, but short runs can deviate quite a bit (here, 0.54 in 100 flips), mirroring how apparent streaks can arise under independence.
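
The convergence can be watched directly by tracking the running proportion of heads as the number of flips grows. This is a minimal sketch; the seed and the 10,000-flip sample size are arbitrary choices, not part of the lab.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
flips <- sample(c("H", "T"), size = 10000, replace = TRUE)

# Proportion of heads after each flip
running_prop <- cumsum(flips == "H") / seq_along(flips)

# Compare the proportion at a few sample sizes
running_prop[c(10, 100, 1000, 10000)]
```

Early values can sit well away from 0.5, while the later ones should settle close to it.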

Summary & conclusion

What the IID model predicts: Short streaks are most common; long streaks occur occasionally.
Observed vs IID: Kobe’s streak distribution is largely consistent with independence given this sample.
Limitations: We analyze one dataset, the streak definition matters, and context (defense, shot selection) is ignored, so results are suggestive rather than definitive.

Appendix: Reusable function test (sanity check)

# Quick check on a custom vector to confirm calc_streak behavior:
calc_streak(c("M","H","H","H","M","H","M","M","H","H"))

## [1] 3 1 2

Citation (formula/source)

The independence framework here uses an IID Bernoulli model for makes/misses; streaks are computed as runs of consecutive H’s, each ended by an M or by the end of the data. This setup follows common Hot Hand lab variants (OpenIntro/STA labs). Dataset: kobe.RData from OpenIntro (downloaded at knit time).

DATA606 Lab 3 — Probability (Hot Hand)

Sachi Kapoor

2025-09-28