Writing Functions in R and Bayes' Theorem
In R, a class describes how an object should be interpreted and how methods should behave when applied to it. It controls printing, plotting, modeling, and many other operations through R's object-oriented system. For example, a numeric vector has class "numeric", while a categorical variable has class "factor". Two objects can share the same underlying storage type but have different classes, which changes the way R treats them and how we interact with them.
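As a quick illustration of the class/storage distinction, here is a minimal sketch with made-up values:

x <- c(1, 2, 3)                        # numeric vector
f <- factor(c("low", "high", "low"))   # categorical variable

class(x)    # "numeric": how R treats the object
class(f)    # "factor"
typeof(f)   # "integer": underneath, a factor is stored as integer codes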
Data structures in R define how data are organized and stored. The most common ones are vectors (one-dimensional, homogeneous, like c(1, 2, 3)), matrices (two-dimensional, homogeneous, like matrix(1:6, nrow = 2)), and data frames (two-dimensional and column-heterogeneous, like data.frame(age = c(15, 7), name = c("Charlie", "Alex")) (my brothers)). Factors are important for categorical data, storing integer codes with interpretable labels. These structures determine how data can be indexed, manipulated, and analyzed, forming the foundation of data engineering.
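To see the integer-codes-plus-labels idea in action, a small sketch with assumed example data:

f <- factor(c("small", "large", "small", "medium"))
levels(f)       # "large" "medium" "small": the interpretable labels
as.integer(f)   # 3 1 3 2: the underlying integer codes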
data(mtcars)
class(mtcars)
## [1] "data.frame"
typeof(mtcars)
## [1] "list"
These results make sense because in R, a data frame is implemented as a list of equal-length vectors, where each vector represents a column. The class() function tells us how R conceptually treats the object (a table with rows and columns), while typeof() reveals the underlying storage mode (a list). So even though we interact with mtcars as a rectangular dataset, internally it is a structured list, which explains why each column can have its own data type.
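We can confirm the list-of-columns view directly; a quick sketch using mtcars:

is.list(mtcars)                          # TRUE: a data frame is a list underneath
sapply(mtcars, class)                    # one class per column (all numeric in mtcars)
identical(mtcars$mpg, mtcars[["mpg"]])   # TRUE: columns extract like list elements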
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x0000029df9e6d318>
## <environment: namespace:stats>
This is the standard deviation function. As the source shows, it simply takes the square root of var(): subtract the mean from each observation, square these deviations, average them (with an n - 1 denominator for the sample estimate), and take the square root. It requires a numeric vector as input.
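A hand-rolled version, using an assumed example vector, should agree with sd():

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))   # note the n - 1 denominator
all.equal(manual_sd, sd(x))   # TRUE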
IQR
## function (x, na.rm = FALSE, type = 7)
## diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE,
## type = type))
## <bytecode: 0x0000029df7c2eba0>
## <environment: namespace:stats>
This is the function for the interquartile range. Inside, it calls quantile() to compute the 25th and 75th percentiles and then subtracts them with diff(). It requires a numeric vector and is robust to outliers.
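The same quantity computed by hand, plus a quick check of the robustness claim, with assumed example data containing one extreme value:

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)          # one extreme outlier
unname(quantile(x, 0.75) - quantile(x, 0.25))   # matches IQR(x)
IQR(x)   # barely moved by the 100
sd(x)    # badly inflated by the single outlier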
mad
## function (x, center = median(x), constant = 1.4826, na.rm = FALSE,
## low = FALSE, high = FALSE)
## {
## if (na.rm)
## x <- x[!is.na(x)]
## n <- length(x)
## constant * if ((low || high) && n%%2 == 0) {
## if (low && high)
## stop("'low' and 'high' cannot be both TRUE")
## n2 <- n%/%2 + as.integer(high)
## sort(abs(x - center), partial = n2)[n2]
## }
## else median(abs(x - center))
## }
## <bytecode: 0x0000029df905f4d8>
## <environment: namespace:stats>
This is the function for the median absolute deviation, a robust measure of statistical dispersion defined as a scaled median of the absolute deviations from the median. Unlike sd, it is built from medians rather than means, and the default constant 1.4826 scales it to be comparable to the standard deviation for normally distributed data. Like sd and IQR, it is an estimator of variability.
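Replicating the core of the function by hand, with the same assumed example vector as above:

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
manual_mad <- 1.4826 * median(abs(x - median(x)))   # scaled median of absolute deviations
all.equal(manual_mad, mad(x))   # TRUE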
ev <- function(winprob, odds, stake = 1) {
  # Convert American odds to the profit on a winning bet
  if (odds > 0) {
    profit <- odds / 100 * stake
  } else {
    profit <- 100 / abs(odds) * stake
  }
  loss <- stake
  # Expected value: probability-weighted profit minus probability-weighted loss
  ev <- winprob * profit - (1 - winprob) * loss
  return(ev)
}
ev(winprob = .55, odds = -110, stake = 10)
## [1] 0.5
One of my fun projects this year was creating a sports betting model for field goal kickers, using game context, environmental context, and variables like distance to predict the probability of a field goal going in, and then comparing those probabilities to the lines posted by Vegas. Expected value is built directly on probability, and both serve as the foundation for Bayes' Theorem. The inputs are winprob, the win probability, which must be between 0 and 1; odds, in American format; and stake, the amount bet. A positive EV means the bet is profitable in the long run, while a negative EV means the sportsbook has the edge.
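As a sanity check on ev(), the break-even win probability at -110 odds is 110 / (110 + 100), about 0.524, so plugging that in should return an EV of roughly zero (a sketch reusing the function above):

ev(winprob = 110 / 210, odds = -110, stake = 10)   # ~0: exactly the break-even point
ev(winprob = 0.50, odds = -110, stake = 10)        # negative: a coin flip loses to the vig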
\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \]
Let A1 = sporting event, A2 = academic event, A3 = no event, and B = the garage is full.

Given:

P(A1) = 0.20, P(B|A1) = 0.70
P(A2) = 0.35, P(B|A2) = 0.25
P(A3) = 0.45, P(B|A3) = 0.05
\[ P(A_1 \mid B) = \frac{P(B \mid A_1)\, P(A_1)}{P(B)} \]
\[ P(B)=P(B\mid A_1)P(A_1)+P(B\mid A_2)P(A_2)+P(B\mid A_3)P(A_3) \]
\[ P(A_1 \mid B)=\frac{(0.70)(0.20)}{(0.70)(0.20)+(0.25)(0.35)+(0.05)(0.45)} \]

\[ P(A_1 \mid B)=\frac{0.14}{0.25}=0.56 \]

If Jose finds the garage full, there is a 56% chance there is a sporting event that evening. This makes sense because sporting events tend to fill the garage (70% of the time), so "full" is a strong predictor of a sporting event, even though sporting events themselves are relatively rare (a 20% prior).
# Priors
p_sport <- 0.20
p_acad <- 0.35
p_none <- 0.45

# Likelihoods of a full garage under each event
p_full_given_sport <- 0.70
p_full_given_acad <- 0.25
p_full_given_none <- 0.05

# Law of total probability
p_full <- (p_full_given_sport * p_sport) +
  (p_full_given_acad * p_acad) +
  (p_full_given_none * p_none)

# Bayes' Theorem
p_sport_given_full <- (p_full_given_sport * p_sport) / p_full
p_sport_given_full
## [1] 0.56
library(DiagrammeR)
grViz("
digraph tree {
graph [rankdir = LR]
node [shape = box]
Thursday -> Sporting [label = '0.20']
Thursday -> Academic [label = '0.35']
Thursday -> No_Event [label = '0.45']
Sporting -> Full [label = '0.70']
Sporting -> NotFull [label = '0.30']
Academic -> Full2 [label = '0.25']
Academic -> NotFull2 [label = '0.75']
No_Event -> Full3 [label = '0.05']
No_Event -> NotFull3 [label = '0.95']
Thursday [label = 'Thursday']
Sporting [label = 'Sporting event']
Academic [label = 'Academic event']
No_Event [label = 'No event']
Full [label = 'Full']
NotFull [label = 'Not full']
Full2 [label = 'Full']
NotFull2 [label = 'Not full']
Full3 [label = 'Full']
NotFull3 [label = 'Not full']
}
")
bayes_update <- function(prior, likelihood) {
  # Normalizing constant: total probability of the observed evidence
  evidence <- sum(prior * likelihood)
  # Posterior is prior times likelihood, rescaled to sum to 1
  posterior <- (prior * likelihood) / evidence
  return(posterior)
}
prior <- c(Sporting = 0.20, Academic = 0.35, None = 0.45)
likelihood <- c(0.70, 0.25, 0.05)  # P(Full | Sporting), P(Full | Academic), P(Full | None)
bayes_update(prior, likelihood)
## Sporting Academic None
## 0.56 0.35 0.09
Given that the garage is full, there is a 56% probability of a sporting event, a 35% probability of an academic event, and a 9% probability of no event.
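One property worth checking: because bayes_update() divides by the evidence, the posterior always sums to 1 over the partition. A minimal sketch reusing the objects above:

posterior <- bayes_update(prior, likelihood)
sum(posterior)   # 1: the posterior is a proper probability distribution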