Discussion2

1.

In R, an object’s class tells you its abstract classification. Instead of telling you how the object is stored (which would be its type), class tells you how R will be able to use that object. R has basic classes that are native to it (the “atomic” types, such as numeric and character), and because it supports object-oriented programming you can create your own custom classes as well. A custom class is typically defined by its attributes (called “slots” in the S4 system used below), each of which can itself be a custom class instead of an atomic one. For example, a playing card from a standard deck can have as few as two attributes:

setClass("Card", slots = list(name = "character", suit = "character"))
four_spades <- new("Card", name = "four", suit = "spades")
four_spades

## An object of class "Card"
## Slot "name":
## [1] "four"
## 
## Slot "suit":
## [1] "spades"

In this example, we have created a new class, Card, that contains two attributes, both of the “character” class. While this is enough to identify a card, it does not capture everything a playing card needs. Since the name is stored as text, there isn’t an easy way to compare the values of two different cards (something a card game would likely require down the line). Instead, it would make more sense for the Card class to look like this:

setClass("Card", slots = list(name = "character", suit = "character", rank = "numeric", color = "character"))
queen_hearts <- new("Card", name = "queen", suit = "hearts", rank = 12, color = "red")
queen_hearts

## An object of class "Card"
## Slot "name":
## [1] "queen"
## 
## Slot "suit":
## [1] "hearts"
## 
## Slot "rank":
## [1] 12
## 
## Slot "color":
## [1] "red"

This is a more complete representation of a Card (with the exception of handling how an ace can either be the lowest or highest ranked card). All future instances of a Card, however, will just be a single object. To create a deck of cards, we will need to understand data structures.
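Before moving on, here is a minimal sketch of why the numeric rank slot matters. It assumes the four-slot Card class defined above; the rank of 4 for the four of spades and the helper name `outranks` are assumptions made for illustration only.

```r
library(methods)  # attached by default in interactive R; explicit for Rscript

setClass("Card", slots = list(name = "character", suit = "character",
                              rank = "numeric", color = "character"))

queen_hearts <- new("Card", name = "queen", suit = "hearts", rank = 12, color = "red")
# Assumed rank: the four is given rank 4 for this illustration
four_spades <- new("Card", name = "four", suit = "spades", rank = 4, color = "black")

# Hypothetical helper: TRUE when card `a` has a strictly higher rank than card `b`
outranks <- function(a, b) a@rank > b@rank

outranks(queen_hearts, four_spades)  # TRUE
```

With only the text name available, the same comparison would need a lookup from names to values first.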

If class refers to how an object is treated, then data structure refers to how that object is organized or shaped. R comes built with six data structures that range from one-dimensional to three-dimensional and beyond. Of them, three are one-dimensional (vectors, lists, and factors), two are two-dimensional (matrices and data frames), and one can have three or more dimensions (arrays). Other programming languages support data structures like dictionaries, tuples, and linked lists. A deck of playing cards, for example, would be a one-dimensional data structure that holds a single type of object (a Card). All three one-dimensional structures are candidates, but factors are meant for categorical data, and an atomic vector can only hold values of one basic type (such as numeric or character), not custom objects like a Card. A list can hold objects of any class; that flexibility is more than a single-type deck strictly needs, but of the data structures native to R, a list would be most appropriate for representing a deck of playing cards.
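As a rough sketch of that choice, a deck can be built as a list of Card objects. The vectors `suit_colors` and `card_names` are hypothetical helpers, and ranking the ace as 1 (low) is an assumption chosen so that the queen keeps the rank of 12 used earlier.

```r
library(methods)  # explicit for Rscript

setClass("Card", slots = list(name = "character", suit = "character",
                              rank = "numeric", color = "character"))

# Hypothetical lookup tables for this sketch
suit_colors <- c(clubs = "black", diamonds = "red", hearts = "red", spades = "black")
card_names <- c("ace", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "jack", "queen", "king")

# Build the deck one Card at a time; a list can hold S4 objects
deck <- list()
for (suit in names(suit_colors)) {
  for (rank in seq_along(card_names)) {
    deck[[length(deck) + 1]] <- new("Card",
                                    name = card_names[rank],
                                    suit = suit,
                                    rank = as.numeric(rank),  # assumes ace is low
                                    color = suit_colors[[suit]])
  }
}

length(deck)  # 52
```

Each element of `deck` is still a full Card, so slots such as `deck[[12]]@rank` remain available for comparisons.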

1B.

I used the HairEyeColor dataset that comes pre-packaged in R for this question.

HairEyeColor

## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

class(HairEyeColor)

## [1] "table"

str(HairEyeColor)

##  'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
##   ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
##   ..$ Sex : chr [1:2] "Male" "Female"

The class of the dataset is “table”, which is commonly used in R for categorical data. This is what I would expect, since the dataset records how often each combination of hair color, eye color, and sex occurred among the surveyed students. The str() output also reports “table”, along with the numeric storage type, the size of each dimension (4 by 4 by 2), the dimension names, and the category labels within each dimension. This is unsurprising, as the data presents itself as two identically shaped tables (hair color as rows, eye color as columns), differentiated by the sex of the surveyed students.
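The same facts can be pulled out individually with a few base R calls. This is only a sketch of one way to inspect the object, using functions from base R and the built-in datasets package.

```r
# HairEyeColor ships with the datasets package, which is attached by default
dim(HairEyeColor)              # 4 4 2: hair x eye x sex
typeof(HairEyeColor)           # "double": the counts are stored as numeric
is.array(HairEyeColor)         # TRUE: a table is an array with a class attribute
names(dimnames(HairEyeColor))  # "Hair" "Eye" "Sex"
```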

2.

For this question, I wanted to interpret the HairEyeColor dataset both as-is and aggregated over sex, which is how it was originally measured.

HEC <- apply(HairEyeColor, c(1,2), sum)
HairEyeColor

## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

HEC

##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16
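As an aside, the same aggregation can be sketched with margin.table(), which sums over the dropped dimension but keeps the “table” class, whereas apply() returns a plain matrix. The names `HEC_apply` and `HEC_margin` are introduced here only for the comparison.

```r
# Two ways to sum the counts over Sex (the third dimension)
HEC_apply <- apply(HairEyeColor, c(1, 2), sum)
HEC_margin <- margin.table(HairEyeColor, c(1, 2))

inherits(HEC_apply, "table")   # FALSE: apply() drops the table class
inherits(HEC_margin, "table")  # TRUE
all(HEC_apply == HEC_margin)   # TRUE: the counts themselves agree
```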

Since these datasets hold categorical counts, I will focus my speculation on the proportions() function. I will also use mean() to see if anything meaningful can be derived from it.

proportions(HairEyeColor)

## , , Sex = Male
## 
##        Eye
## Hair          Brown        Blue       Hazel       Green
##   Black 0.054054054 0.018581081 0.016891892 0.005067568
##   Brown 0.089527027 0.084459459 0.042229730 0.025337838
##   Red   0.016891892 0.016891892 0.011824324 0.011824324
##   Blond 0.005067568 0.050675676 0.008445946 0.013513514
## 
## , , Sex = Female
## 
##        Eye
## Hair          Brown        Blue       Hazel       Green
##   Black 0.060810811 0.015202703 0.008445946 0.003378378
##   Brown 0.111486486 0.057432432 0.048986486 0.023648649
##   Red   0.027027027 0.011824324 0.011824324 0.011824324
##   Blond 0.006756757 0.108108108 0.008445946 0.013513514

proportions(HEC)

##        Eye
## Hair         Brown       Blue      Hazel       Green
##   Black 0.11486486 0.03378378 0.02533784 0.008445946
##   Brown 0.20101351 0.14189189 0.09121622 0.048986486
##   Red   0.04391892 0.02871622 0.02364865 0.023648649
##   Blond 0.01182432 0.15878378 0.01689189 0.027027027

Instead of returning a single value, the proportions() function returns a transformed table (or transformed array of tables). As the name of the function would suggest, my first speculation is that the value in a given cell represents its proportional frequency relative to the rest of its own table (that is, within one sex). This is immediately disproven, as the values of (Black, Brown, Male) and (Black, Brown, Female) in HairEyeColor add up to the value of (Black, Brown) in HEC (0.054054 + 0.060811 ≈ 0.114865), which could only happen if both were divided by the same total. Clearly, the proportions() function calculates the relative frequency over the whole dataset: each count divided by all 592 surveyed students.
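A short check of this reading, offered as a sketch: with no margin argument, proportions() should match dividing every count by the grand total, and passing `margin = 3` gives the within-sex proportions from the first guess. The variable names `whole` and `within_sex` are assumptions for illustration.

```r
# Whole-dataset proportions: every cell divided by the grand total
whole <- proportions(HairEyeColor)
all.equal(whole, HairEyeColor / sum(HairEyeColor))  # TRUE

# Within-sex proportions: each sex slice is divided by its own total
within_sex <- proportions(HairEyeColor, margin = 3)
apply(within_sex, 3, sum)  # each slice sums to 1
```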

mean(HairEyeColor)

## [1] 18.5

mean(HEC)

## [1] 37

Since the first value is exactly half of the second value, this signals to me that mean() is being calculated by dividing the total number of surveyed students by the number of cells in that dataset’s table. As HEC only has 16 cells (four hair colors, four eye colors, no distinction by sex), it’s easy to see that the mean() is doing \(\frac{592}{16}=37\). And as HairEyeColor has 32 total cells (the same 16 as in HEC, but now one set for each of the sexes), the calculation being performed here is \(\frac{592}{32}=18.5\). While we could interpret this as the expected value for any given cell in the dataset, I don’t think mean() is revealing anything particularly important about the data.
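That arithmetic can be confirmed directly. The sketch below recomputes both means from the total count and the number of cells; `total` and `cells` are names chosen for illustration.

```r
total <- sum(HairEyeColor)     # 592 surveyed students
cells <- length(HairEyeColor)  # 32 cells (4 x 4 x 2)
total / cells                  # 18.5, same as mean(HairEyeColor)

# Aggregating over sex halves the number of cells but keeps the total
HEC <- apply(HairEyeColor, c(1, 2), sum)
sum(HEC) / length(HEC)         # 37, same as mean(HEC)
```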

My function will convert miles to kilometers or kilometers to miles, depending on the unit of the distance that was given.

mi_km_convert <- function(distance, measure){
    # measure is the unit of the input distance: "mi" or "km"
    if (measure == "mi"){
        # miles to kilometers (1 mi = 1.609344 km)
        return(round(distance*1.609344,4))
    } else if (measure == "km"){
        # kilometers to miles
        return(round(distance/1.609344,4))
    } else {
        # unrecognized unit
        return(NA)
    }
}
mi_km_convert(5, "mi")

## [1] 8.0467

mi_km_convert(6.233, "km")

## [1] 3.873

mi_km_convert(18, "lbs")

## [1] NA
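A compact variant of the same idea, offered as a sketch rather than a replacement: it assumes the second argument names the unit of the input distance, uses switch() instead of an if/else chain, and leaves rounding to the caller. The names `convert_distance`, `from`, and `KM_PER_MILE` are hypothetical.

```r
KM_PER_MILE <- 1.609344  # one international mile in kilometers

# Hypothetical helper: `from` is the unit of `distance` ("mi" or "km")
convert_distance <- function(distance, from) {
  switch(from,
         mi = distance * KM_PER_MILE,  # miles to kilometers
         km = distance / KM_PER_MILE,  # kilometers to miles
         NA_real_)                     # unrecognized unit
}

# Converting there and back should return the starting value
there <- convert_distance(5, "mi")
back <- convert_distance(there, "km")
all.equal(back, 5)  # TRUE
```

Using a single constant for both directions keeps the two conversions exact inverses of each other.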

3.

In classical probability, we consider the likelihood of an event happening to be the ratio of favorable outcomes to the total number of possible outcomes. The most common examples are rolling a die, drawing a card out of a deck (with or without replacement), or picking a colored ball out of a bag. These are all cases where we know that our test (the rolling of the die, the drawing of the card, or the picking of the ball) is completely accurate. Bayes’ Theorem allows us to calculate the probability of an occurrence when our tests are not perfectly accurate by using new evidence to update our prior knowledge. We can use Bayes’ Theorem, for example, to find the probability that a die is loaded after seeing it rolled a few times.

Bayes’ Theorem: \(P(A \mid B) = \frac{P(B \mid A)*P(A)}{P(B)}\)
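To make the loaded-die example concrete, here is a minimal sketch. All three input numbers are assumptions invented for illustration: a 10% prior chance that the die is loaded, a loaded die that shows a six half the time, and a fair die that shows a six one time in six.

```r
# Assumed inputs (hypothetical values, not from any dataset)
p_loaded <- 0.10           # prior: P(loaded)
p_six_given_loaded <- 0.5  # likelihood: P(six | loaded)
p_six_given_fair <- 1 / 6  # likelihood: P(six | fair)

# Total probability of rolling a six
p_six <- p_six_given_loaded * p_loaded + p_six_given_fair * (1 - p_loaded)

# Bayes' Theorem: P(loaded | six) = P(six | loaded) * P(loaded) / P(six)
p_loaded_given_six <- p_six_given_loaded * p_loaded / p_six
p_loaded_given_six  # 0.25
```

Under these assumed numbers, a single six raises the probability that the die is loaded from 0.10 to 0.25.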

4.

Events:

Let A denote an academic event, S denote a sporting event, N denote no event, and F denote the parking garage being full.

\[
\begin{aligned}
P(A) &= 0.35\\
P(S) &= 0.2\\
P(N) &= 0.45\\
P(F \mid A) &= 0.25\\
P(F \mid S) &= 0.7\\
P(F \mid N) &= 0.05\\
P(S \mid F) &= \frac{P(F \mid S)*P(S)}{P(F)}\\
P(F) &= P(F \mid A)*P(A) + P(F \mid S)*P(S) + P(F \mid N)*P(N)
\end{aligned}
\]

A convention I’ve seen used for writing probabilities in R is to represent the pipe as the letter “g” (for “given”), so P(F|A) could be rewritten as “pfga”. And since P(A) would be “pa”, that means P(F|A)P(A) would be “pfgapa”.

pfgapa <- 0.35 * 0.25
pfgsps <- 0.2 * 0.7
pfgnpn <- 0.05 * 0.45
pf <- pfgapa + pfgsps + pfgnpn
psgf <- pfgsps / pf
psgf

## [1] 0.56

\[P(S \mid F) = 0.56\] We wouldn’t immediately expect this since the likelihood of a sporting event occurring is, by itself, so low. However, since we know that sporting events have such a high chance of filling up the parking garage, it begins to make more sense that, upon seeing a full parking garage, one’s first assumption would be that a sporting event is happening.
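The same calculation can be sketched with named vectors, which gives the posterior for all three event types at once. The vector names `prior`, `p_full_given`, `joint`, and `posterior` are assumptions for illustration; the numbers are the ones given above.

```r
prior <- c(A = 0.35, S = 0.20, N = 0.45)         # P(A), P(S), P(N)
p_full_given <- c(A = 0.25, S = 0.70, N = 0.05)  # P(F | event)

joint <- prior * p_full_given    # P(F and event) for each event type
posterior <- joint / sum(joint)  # P(event | F), by Bayes' Theorem

round(posterior, 2)  # A 0.35, S 0.56, N 0.09
```

One detail this makes visible: P(A | F) equals P(A), because P(F | A) happens to equal the overall P(F) of 0.25, so a full garage carries no information about academic events.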

For our tree diagram:

library(BiocManager)
BiocManager::install("Rgraphviz")

## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: https://cloud.r-project.org

## Bioconductor version 3.22 (BiocManager 1.30.27), R 4.5.2 (2025-10-31)

## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'Rgraphviz'

## Installation paths not writeable, unable to update packages
##   path: /usr/lib/R/library
##   packages:
##     spatial

library(bnlearn)
tree = model2network("[Start][Academic (0.35)|Start][Sporting (0.2)|Start][None (0.45)|Start][Full (0.25)|Academic (0.35)][Not Full (0.75)|Academic (0.35)][Full (0.7)|Sporting (0.2)][Not Full (0.3)|Sporting (0.2)][Full (0.05)|None (0.45)][Not Full (0.95)|None (0.45)]")
graphviz.plot(tree, layout = "dot")

## Loading required namespace: Rgraphviz

If we multiply the probabilities along each path that ends in a full garage and add the products, we get the probability of the garage being full (0.0875 + 0.14 + 0.0225 = 0.25). If we take the path where there is a sporting event and the garage is full, we can multiply those probabilities together (0.2 × 0.7 = 0.14) and divide by our previous result to get 0.56, the same as before.
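As a sanity check on the tree, the two stages can also be simulated. This is only a sketch: the seed and the number of simulated days are arbitrary choices, and the estimates will land near, not exactly on, 0.25 and 0.56.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100000  # number of simulated days (arbitrary)

prior <- c(A = 0.35, S = 0.20, N = 0.45)         # first branch of the tree
p_full_given <- c(A = 0.25, S = 0.70, N = 0.05)  # second branch: P(F | event)

# Stage 1: draw the event type; stage 2: draw whether the garage fills
event <- sample(names(prior), n, replace = TRUE, prob = prior)
full <- runif(n) < p_full_given[event]

p_full_hat <- mean(full)                         # estimate of P(F)
p_s_given_full_hat <- mean(event[full] == "S")   # estimate of P(S | F)
c(p_full_hat, p_s_given_full_hat)
```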

Jonah Gottfried

2026-01-23
