Part I R Nuts and Bolts/Programming Readings

Classes are the basic way R identifies the kind of data a value holds. The basic classes are logical (TRUE or FALSE), integer (positive and negative whole numbers and zero), numeric (real numbers) and character (examples: “a”, “this is a character”, etc.). A value can be coerced from one class to another when there is a reasonable link between them; for example, the character “23” can be coerced into the numeric value 23 with the as.numeric() function. R has two built-in functions for inspecting a value: class() reports its class, and typeof() reports its internal storage type (for example, a numeric value is stored as type "double").
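As a minimal sketch of the coercion and inspection functions mentioned above (the variable names x_chr, x_num, and bad are illustrative and not part of the reading):

```r
# Illustrative values; the names are assumptions made for this sketch.
x_chr <- "23"
x_num <- as.numeric(x_chr)   # character -> numeric

typeof(x_chr)   # "character"
typeof(x_num)   # "double"
class(x_num)    # "numeric"
typeof(5L)      # "integer"; the L suffix requests an integer
typeof(TRUE)    # "logical"

# Coercion without a reasonable link yields NA (normally with a warning).
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)      # TRUE
```

Comparing typeof() and class() on the same value shows that the two functions answer slightly different questions.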

Data structures are what R uses to store data values. The most basic is a vector, a one-dimensional array of values that all share one class. A vector can hold a single value (a scalar) or multiple values, as in (1, 2, 3), and it can hold character or logical values, not just numeric or integer ones. A matrix is a two-dimensional array in which every element has the same class. A data frame is also two-dimensional, but its columns are named and each column can have a different class. A spreadsheet is one of the easiest examples of a data frame to picture, though it is important to note that every column vector in a data frame must have the same length. There are also factors, which allow the elements of a vector to be grouped or ordered by assigning each one to a level (category).
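The structures described above can be built with base R alone. This is only a sketch, and the object names (v, m, df, f) and values are made up for illustration:

```r
v <- c(1, 2, 3)                           # numeric vector
m <- matrix(1:6, nrow = 2)                # 2 x 3 matrix; every cell is an integer
df <- data.frame(id = 1:3,                # columns may differ in class...
                 name = c("a", "b", "c"),
                 passed = c(TRUE, FALSE, TRUE))   # ...but share one length
f <- factor(c("low", "high", "low"), levels = c("low", "high"))

length(v)           # 3
dim(m)              # 2 3
sapply(df, class)   # one class per column
levels(f)           # "low" "high"
as.integer(f)       # 1 2 1, the level index of each element
```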

Example dataset

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
class(iris)
## [1] "data.frame"
typeof(iris)
## [1] "list"

Using the str(), class(), and typeof() commands on the iris data, we can see that iris has class data.frame and is stored internally as a list, with one list element per column. The Species column is a factor with three levels, which can be used to group, sort, and filter the numeric data.
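One possible way to use the Species factor for filtering and grouping is sketched below; the names setosa and species_means are chosen here for illustration:

```r
# iris ships with base R (datasets package), so nothing needs to be loaded.
sapply(iris, class)    # class of each column: four numeric, one factor
levels(iris$Species)   # "setosa" "versicolor" "virginica"

# Filter rows by factor level, then summarise a numeric column per level.
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)           # 50
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means          # one mean sepal length per species
```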

Part II Reading and Writing Functions in R

  sd(iris$Sepal.Length)
## [1] 0.8280661
  mad(iris$Sepal.Length)
## [1] 1.03782

Opening the documentation for sd() and mad(), we find that sd() calculates the standard deviation of the values in the vector or R object provided. It uses the denominator (n-1), which treats the data as a sample rather than the whole population. mad() calculates the median absolute deviation. It takes a numeric vector as input, along with optional arguments (center, constant, na.rm, low, and high), and computes constant * cMedian(abs(x - center)), where center defaults to median(x) and cMedian is the usual median unless low or high is set.
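To check the two formulas against the built-in functions, they can be recomputed by hand. This sketch assumes the default arguments of mad() (center = median(x), constant = 1.4826); sd_manual and mad_manual are names introduced here:

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample standard deviation with the (n - 1) denominator.
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# MAD with the default arguments: center = median(x), constant = 1.4826.
mad_manual <- 1.4826 * median(abs(x - median(x)))

all.equal(sd_manual, sd(x))     # TRUE
all.equal(mad_manual, mad(x))   # TRUE
```

The default constant of 1.4826 scales the result so that it is comparable to the standard deviation for normally distributed data.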

  trapezoid_area <- function(l, b1, b2){
    return(0.5*(b1 + b2)*l)
  }
  trapezoid_area(2, 1, 4)
## [1] 5

Part III Bayes Theorem

Bayes Theorem is a method to calculate the probability of an event occurring given that some other event has already occurred. It is given by the formula \(P(A|B) = \frac{P(B|A)*P(A)}{P(B)}\), where P(X) is the probability of event X occurring and P(X|Y) is the probability of X given that Y has already occurred. This theorem is useful when we know the individual probabilities of events A and B, as well as the conditional probability P(B|A).

As an example, suppose we roll two fair six-sided dice, one blue and one red. Let B be the event that the sum is 3, so \(P(B) = \frac{2}{36}\), and let A be the event that the blue die shows 2, so \(P(A) = \frac{1}{6}\). Given that the blue die shows 2, the sum is 3 only if the red die shows 1, so \(P(B|A) = \frac{1}{6}\). The probability that the blue die shows 2 given that the sum is 3 is then \(P(A|B) = \frac{\frac{1}{6}*\frac{1}{6}}{\frac{2}{36}} = \frac{1}{2}\).
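The dice example can be verified by enumerating all 36 outcomes. This is a sketch using base R only; the object names (rolls, bayes, direct, and so on) are assumptions made for illustration:

```r
# Enumerate all 36 equally likely (blue, red) outcomes.
rolls <- expand.grid(blue = 1:6, red = 1:6)
A <- rolls$blue == 2                 # blue die shows 2
B <- rolls$blue + rolls$red == 3     # the sum is 3

p_A <- mean(A)                       # 1/6
p_B <- mean(B)                       # 2/36
p_B_given_A <- mean(B[A])            # 1/6

bayes  <- p_B_given_A * p_A / p_B    # P(A|B) from Bayes Theorem
direct <- mean(A[B])                 # P(A|B) counted directly
all.equal(bayes, direct)             # TRUE; both equal 1/2
```

Counting directly works here because the sample space is small; Bayes Theorem gives the same answer without listing every outcome.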

Part IV

Jose visits campus every Thursday evening. However, some days the parking garage is full, often due to college events. There are academic events on 35% of evenings, sporting events on 20% of evenings, and no events on 45% of evenings. When there is an academic event, the garage fills up about 25% of the time, and it fills up 70% of evenings with sporting events. On evenings when there are no events, it only fills up about 5% of the time. If Jose comes to campus and finds the garage full, what is the probability that there is a sporting event? Use a tree diagram to solve this.

Event A: Academic event in evening, \(P(A) = 0.35\)

Event S: Sporting event in evening, \(P(S) = 0.2\)

Event N: No event in evening, \(P(N) = 0.45\)

Event G: Garage fills up, with \(P(G|A) = 0.25\), \(P(G|S) = 0.7\), and \(P(G|N) = 0.05\)

To find the probability of G, we look at each event separately. The probability that an event occurs and the garage is full is the probability of that event times the probability of the garage being full given that event. Summing over the three events gives:

\(P(G) = P(G|A)*P(A) + P(G|S)*P(S) + P(G|N)*P(N)\)

\(P(G) = 0.25*0.35 + 0.7*0.2 + 0.05*0.45 = 0.0875 + 0.14 + 0.0225 = 0.25\)

Putting this into Bayes Theorem, we get:

\(P(S|G) = \frac{P(G|S)*P(S)}{P(G)}\)

\(P(S|G) = \frac{0.14}{0.25}\)

\(P(S|G) = 0.56\)

So there is a 56% chance that there is a sporting event given the garage is full.
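The same calculation can be sketched in R with named vectors, one entry per first-level branch of the tree. The names prior, full_given, joint, p_full, and posterior are introduced here for illustration:

```r
prior      <- c(academic = 0.35, sport = 0.20, none = 0.45)   # P(event)
full_given <- c(academic = 0.25, sport = 0.70, none = 0.05)   # P(G | event)

joint     <- prior * full_given   # P(event and G), one value per "full" branch
p_full    <- sum(joint)           # total probability of a full garage, P(G)
posterior <- joint / p_full       # P(event | G) for every event

p_full               # 0.25
posterior["sport"]   # 0.56
```

Computing the posterior for all three events at once also gives a quick sanity check, since the three values must sum to 1.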

library(BiocManager)
BiocManager::install("Rgraphviz")
## Bioconductor version 3.22 (BiocManager 1.30.27), R 4.5.1 (2025-06-13)
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'Rgraphviz'
## Old packages: 'BH', 'boot', 'broom', 'clock', 'colorspace', 'cpp11',
##   'data.table', 'digest', 'distributional', 'e1071', 'extraDistr', 'fable',
##   'forecast', 'future'

To build the tree, we will use the bnlearn library.

  library(bnlearn)
  tree = model2network("[Initial][Academic (35%)|Initial][Sport (20%)|Initial][No Event (45%)|Initial][Garage Full (25%)|Academic (35%)][Garage Not Full (75%)|Academic (35%)][Garage Full (70%)|Sport (20%)][Garage Not Full (30%)|Sport (20%)][Garage Full (5%)|No Event (45%)][Garage Not Full (95%)|No Event (45%)]")
  graphviz.plot(tree, layout = "dot")
## Loading required namespace: Rgraphviz