Classes are how R identifies the kind of data a value holds. The basic classes are logical (TRUE or FALSE), integer (positive and negative whole numbers and zero), numeric (real numbers, stored as doubles) and character (examples: “a”, “this is a character”, etc.). Values can be coerced from one class to another when there is a reasonable link between them; for example, R can coerce the character “23” into the numeric value 23 using the as.numeric() function. The built-in class() function reports the class of an object, while typeof() reports its internal storage type (for example, “double” for a numeric value).
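As a small illustration of the coercion described above, the following sketch converts a character string to a number and compares class() with typeof(). The variable names (x, y, z, bad) are illustrative only.

```r
x <- "23"             # a character value
class(x)              # class of the original value
y <- as.numeric(x)    # coerce the character to numeric
class(y)              # "numeric"
typeof(y)             # "double", the storage type behind the numeric class

z <- 5L               # the L suffix creates an integer
class(z)
typeof(z)

# When there is no reasonable link, coercion yields NA (with a warning,
# suppressed here to keep the output clean)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)
```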
Data structures are what R uses to store data values. The most basic is a vector, which is a one-dimensional collection of values of a single class. A vector can hold a single value (a scalar) or multiple values, as in (1, 2, 3), and it can hold character or logical values, not just numeric or integer ones. A matrix is a two-dimensional arrangement of values that, like a vector, has one uniform class. A data frame is also two-dimensional, but each of its named columns can have a different class. A spreadsheet is one of the easiest examples of a data frame to understand, though it is important to note that every column of a data frame must have the same length. There are also factors, which allow the elements of a vector to be grouped or ordered by assigning them to levels (categories).
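To make these structures concrete, here is a minimal sketch that builds one of each. All object names and values are made up for illustration.

```r
# Vector: one dimension, one class
v <- c(1, 2, 3)

# Matrix: two dimensions, one class for every cell
m <- matrix(1:6, nrow = 2)

# Data frame: named columns of equal length, each with its own class
df <- data.frame(
  id     = 1:3,
  name   = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
sapply(df, class)   # one class per column

# Factor: values mapped onto a fixed, optionally ordered, set of levels
f <- factor(c("low", "high", "medium", "low"),
            levels = c("low", "medium", "high"),
            ordered = TRUE)
levels(f)
```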
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
class(iris)
## [1] "data.frame"
typeof(iris)
## [1] "list"
Using the str(), class(), and typeof() commands on the iris data, we can see that iris has class data.frame and that its underlying type is list, because a data frame is stored as a list of equal-length column vectors. The Species variable is a factor with three levels, which can then be used to group, sort, and filter the numeric data.
sd(iris$Sepal.Length)
## [1] 0.8280661
mad(iris$Sepal.Length)
## [1] 1.03782
According to the documentation, sd() calculates the standard deviation of the values in the numeric vector (or R object coercible to one) provided. The documentation notes that it uses the denominator (n-1), which gives the sample standard deviation, appropriate when the data are a sample and not the whole population. mad() computes the median absolute deviation. It takes a numeric vector as input along with optional arguments (center, constant, na.rm, low, and high) and uses the formula constant * cMedian(abs(x - center)), where center defaults to median(x), constant defaults to 1.4826, and cMedian is the usual median unless low or high is set to TRUE.
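As a check on the two formulas, the sketch below recomputes both statistics by hand with the default arguments and compares them to the built-in functions. The names x, n, sd_manual, and mad_manual are illustrative.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample standard deviation: squared deviations from the mean, divided by n - 1
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Median absolute deviation with the defaults:
# center = median(x), constant = 1.4826
mad_manual <- 1.4826 * median(abs(x - median(x)))

all.equal(sd_manual, sd(x))
all.equal(mad_manual, mad(x))
```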
# Area of a trapezoid with parallel sides b1 and b2 and height l
trapezoid_area <- function(l, b1, b2){
  return(0.5*(b1 + b2)*l)
}
trapezoid_area(2, 1, 4)
## [1] 5
Bayes' Theorem is a method for calculating the probability of an event given that some other event has already occurred. It is given by the formula \(P(A|B) = \frac{P(B|A)*P(A)}{P(B)}\), where P(X) is the probability of event X occurring and P(X|Y) is the probability of X given that Y has already occurred. The theorem is useful when we know the individual probabilities of events A and B, as well as the conditional probability P(B|A), and want to find P(A|B).
As an example, suppose we roll two fair six-sided dice, one blue and one red. Let B be the event that the sum is 3, so P(B) = \(\frac{2}{36}\), and let A be the event that the blue die shows 2, so P(A) = \(\frac{1}{6}\). Given that we rolled a 2 on the blue die, the sum is 3 only if the red die shows 1, so P(B|A) = \(\frac{1}{6}\). The probability of having rolled a 2 on the blue die given that the sum of the two dice was 3 is then \(P(A|B) = \frac{\frac{1}{6}*\frac{1}{6}}{\frac{2}{36}} = \frac{1}{2}\).
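The dice example can also be checked by enumerating all 36 equally likely outcomes. This is a minimal sketch; the object names (rolls, A, B, and the p_ variables) are illustrative.

```r
# All 36 outcomes of rolling a blue and a red die
rolls <- expand.grid(blue = 1:6, red = 1:6)

A <- rolls$blue == 2                 # blue die shows 2
B <- rolls$blue + rolls$red == 3     # sum of the dice is 3

p_A         <- mean(A)       # P(A)
p_B         <- mean(B)       # P(B)
p_B_given_A <- mean(B[A])    # P(B|A): share of sum-3 outcomes among blue == 2

# Bayes' Theorem
p_A_given_B <- p_B_given_A * p_A / p_B
p_A_given_B

# Direct count for comparison: share of blue == 2 among sum-3 outcomes
mean(A[B])
```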
Jose visits campus every Thursday evening. However, some days the parking garage is full, often due to college events. There are academic events on 35% of evenings, sporting events on 20% of evenings, and no events on 45% of evenings. When there is an academic event, the garage fills up about 25% of the time, and it fills up on 70% of evenings with sporting events. On evenings when there are no events, it only fills up about 5% of the time. If Jose comes to campus and finds the garage full, what is the probability that there is a sporting event? Use a tree diagram to solve this.
Event A: Academic event in evening, \(P(A) = 0.35\)
Event S: Sporting event in evening, \(P(S) = 0.2\)
Event N: No event in evening, \(P(N) = 0.45\)
Event G: Garage fills up, with \(P(G|A) = 0.25\), \(P(G|S) = 0.7\), and \(P(G|N) = 0.05\)
To find the probability of G, we look at each branch of the tree: the probability of each type of evening multiplied by the probability of the garage being full on that type of evening, summed over the three types.
\(P(G) = P(G|A)*P(A) + P(G|S)*P(S) + P(G|N)*P(N)\)
\(P(G) = 0.25*0.35 + 0.7*0.2 + 0.05*0.45 = 0.0875 + 0.14 + 0.0225 = 0.25\)
Putting this into Bayes' Theorem, we get:
\(P(S|G) = \frac{P(G|S)*P(S)}{P(G)}\)
\(P(S|G) = \frac{0.7*0.2}{0.25} = \frac{0.14}{0.25}\)
\(P(S|G) = 0.56\)
So there is a 56% chance that there is a sporting event given the garage is full.
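The same calculation can be reproduced in a few lines of R, with each element of the vectors corresponding to one branch of the tree. This is a sketch; the vector names (prior, p_full, joint, posterior) are illustrative.

```r
# First-level branches: probability of each type of evening
prior  <- c(academic = 0.35, sporting = 0.20, none = 0.45)

# Second-level branches: probability the garage is full on each type of evening
p_full <- c(academic = 0.25, sporting = 0.70, none = 0.05)

# Probability of each "event type AND garage full" path through the tree
joint <- prior * p_full

# Law of total probability: P(G)
p_G <- sum(joint)
p_G

# Bayes' Theorem: probability of each event type given a full garage
posterior <- joint / p_G
posterior["sporting"]
```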
library(BiocManager)
BiocManager::install("Rgraphviz")
## Bioconductor version 3.22 (BiocManager 1.30.27), R 4.5.1 (2025-06-13)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'Rgraphviz'
## Old packages: 'BH', 'boot', 'broom', 'clock', 'colorspace', 'cpp11',
## 'data.table', 'digest', 'distributional', 'e1071', 'extraDistr', 'fable',
## 'forecast', 'future'
To build the tree, we will use the bnlearn package.
library(bnlearn)
tree = model2network("[Initial][Academic (35%)|Initial][Sport (20%)|Initial][No Event (45%)|Initial][Garage Full (25%)|Academic (35%)][Garage Not Full (75%)|Academic (35%)][Garage Full (70%)|Sport (20%)][Garage Not Full (30%)|Sport (20%)][Garage Full (5%)|No Event (45%)][Garage Not Full (95%)|No Event (45%)]")
graphviz.plot(tree, layout = "dot")
## Loading required namespace: Rgraphviz