Week 2 Discussion

Andrew Gregory

2024-01-26

Part I: R Nuts and Bolts/Programming Reading

R Classes:

In R, a class is a label attached to an object that tells R what kind of thing the object is and which methods apply to it. Most R objects are copied when they are modified rather than shared by reference, so classes are better thought of as blueprints than as reference types. Classes are often used to implement data structures. To use an analogy, I think of a tree, where each branch is one specific class, and each leaf is a data structure built from it. R has six basic (atomic) data types: numeric (e.g. 1 or 1.5), integer (e.g. 5L, 10L), character (e.g. “A”, “B”, “C”), logical (i.e. TRUE or FALSE), raw (e.g. “Hello” is stored as the bytes 48 65 6c 6c 6f), and complex (e.g. 5+5i).

R Class Examples:

Numeric

v <- 23.5 # create a numeric value
v # print value of v
## [1] 23.5
class(v) # determine class 
## [1] "numeric"

Integer

y <- 1:10 # create an integer vector
y # print value of y 
##  [1]  1  2  3  4  5  6  7  8  9 10
class(y) # determine class 
## [1] "integer"

Logical

# False logical value 
5==16 
## [1] FALSE
# True logical value 
2==2
## [1] TRUE

Character

x <- "dataset" # create a character value 
x # print the value of x 
## [1] "dataset"
typeof(x) # determine the underlying type
## [1] "character"

Complex

z <- 1 + 2i # create a complex number 
z           # print the value of z 
## [1] 1+2i
class(z)    # Determine class type 
## [1] "complex"

Raw

v <- charToRaw("Hello") # create a raw value
v #print value of v 
## [1] 48 65 6c 6c 6f
class(v) # determine class type 
## [1] "raw"
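
One detail worth checking alongside the examples above is that class() and typeof() answer slightly different questions. The sketch below compares the two for one value of each basic type; the object names values and comparison are chosen only for illustration.

```r
# One value of each basic type (names are illustrative only)
values <- list(
  numeric   = 23.5,
  integer   = 5L,
  logical   = TRUE,
  character = "dataset",
  complex   = 1 + 2i,
  raw       = charToRaw("Hello")
)

# class() reports the high-level class; typeof() reports the internal storage type
comparison <- data.frame(
  class  = sapply(values, class),
  typeof = sapply(values, typeof)
)
comparison
```

The two columns should differ only in the numeric row, where class() reports "numeric" and typeof() reports "double".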

R Data Structures:

Much like a spreadsheet in Excel, data structures are ways of organizing and handling data. The six most common data structures in R are vectors, matrices, arrays, data frames, lists, and factors. Vectors are R’s basic data object: they are one-dimensional, hold one or more elements, and every element has the same data type. A matrix is a two-dimensional data structure, which means it has rows and columns; every element has the same type, and the rows and columns are unnamed unless you name them. What I gathered is that an array is similar to a matrix but can have more than two dimensions; a three-dimensional array is a stack of matrices. Data frames are the data structure that most resembles a spreadsheet: each column can have its own type, but all columns have the same length. Lists are the most flexible because their elements can have any length, class, or structure, and you can have a list within a list. A factor is a data structure that stores categorical data (e.g. “Gender”). Generally speaking, since everything in R is an object, the class of an object tells R which data structure it is and how to handle it.

R Data Structures Examples

Vectors

# Basic Vector Example 
vec <- c(1,2,3) 
vec
## [1] 1 2 3
# Numeric Vector 
numeric_vec <- c(1,2,3) 
numeric_vec
## [1] 1 2 3
# Logical Vector
logical_vec <- c(TRUE,TRUE,FALSE) 
logical_vec
## [1]  TRUE  TRUE FALSE
# Character Vector 
character_vec <- c("A","B","C", "D", "F") 
character_vec
## [1] "A" "B" "C" "D" "F"
# Integer Vector 
integer_vec <- c(1L, 2L, 3L)
integer_vec
## [1] 1 2 3
# Complex Vector 
complex_vec <- c(12+2i, 22i, 4+5i)
complex_vec 
## [1] 12+ 2i  0+22i  4+ 5i
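
Because a vector holds a single data type, R quietly converts mixed inputs to one common type. The sketch below shows that behaviour; the variable names are made up for this illustration.

```r
# Mixing types inside c() forces every element to the most flexible type present
mixed_num_chr <- c(1, "a", TRUE)   # everything becomes character
mixed_num_lgl <- c(1.5, TRUE)      # TRUE becomes 1, result is numeric
mixed_int_num <- c(1L, 2.5)        # the integer is promoted to double

class(mixed_num_chr)
class(mixed_num_lgl)
class(mixed_int_num)
```

This is one reason to check class() after building a vector from several sources: a single stray string turns the whole vector into character.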

Matrix

matrix1 <- matrix(c(1:9), ncol = 3) # create matrix 
matrix1 # print matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Array

arr1 <- array(5:22, dim = c(2, 3, 3)) # create array: 2 rows, 3 columns, 3 layers
arr1 # Print Array
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    5    7    9
## [2,]    6    8   10
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   11   13   15
## [2,]   12   14   16
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   17   19   21
## [2,]   18   20   22
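
To make the difference between two and three dimensions concrete, the sketch below rebuilds a matrix and an array with the same values as the printed output above and then inspects their dimensions and individual elements. The short names m and a are used only for this illustration.

```r
m <- matrix(1:9, ncol = 3)            # filled column by column
a <- array(5:22, dim = c(2, 3, 3))    # three stacked 2 x 3 matrices

dim(m)        # rows, columns
dim(a)        # rows, columns, layers

m[2, 3]       # row 2, column 3
a[1, 2, 3]    # row 1, column 2 of the third layer
a[, , 2]      # the whole second layer, returned as a matrix
```

Reading the values off the printed output, m[2, 3] is 8 and a[1, 2, 3] is 19.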

Data Frame

# Create data frame
athlete_rank <- c(1:5) 
athlete_name <- c("john", "greg", "sarah", "james", "colby")
athlete_score <- c("third", "fifth", "second", "fourth", "first")
athlete.data <- data.frame(athlete_rank , athlete_name, athlete_score)
# Print Data Frame 
athlete.data
##   athlete_rank athlete_name athlete_score
## 1            1         john         third
## 2            2         greg         fifth
## 3            3        sarah        second
## 4            4        james        fourth
## 5            5        colby         first
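
Since each column of a data frame can have its own type, it helps to inspect the columns one by one. The sketch below rebuilds the same athlete data and shows a few common ways to look inside it; top_two is a name introduced here only for illustration.

```r
athlete.data <- data.frame(
  athlete_rank  = 1:5,
  athlete_name  = c("john", "greg", "sarah", "james", "colby"),
  athlete_score = c("third", "fifth", "second", "fourth", "first")
)

sapply(athlete.data, class)   # class of every column
dim(athlete.data)             # number of rows and columns
athlete.data$athlete_name     # one column, returned as a vector

# Rows that satisfy a condition (here: the two best ranks)
top_two <- athlete.data[athlete.data$athlete_rank <= 2, ]
top_two
```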

List

test_list <- list(1, 2, 3, "Surprise", FALSE)
test_list
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] "Surprise"
## 
## [[5]]
## [1] FALSE
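
Lists can also carry names and hold other lists. The sketch below is a small hypothetical example (nested_list and its element names are not from the reading) showing how a nested list is built and how its elements are pulled out.

```r
nested_list <- list(
  numbers = c(1, 2, 3),
  label   = "Surprise",
  flag    = FALSE,
  inner   = list(a = 1L, b = "two")   # a list stored inside a list
)

nested_list[["label"]]        # [[ ]] extracts the element itself
nested_list$inner$b           # $ works for named elements, including nested ones
class(nested_list["label"])   # [ ] keeps the result wrapped in a list
```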

Factor

fac <- factor(c("a", "b", "a", "b", "b")) # create factor 
fac # Print factor 
## [1] a b a b b
## Levels: a b
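
A factor stores its categories as levels and keeps integer codes behind the labels. The sketch below inspects both for the same factor.

```r
fac <- factor(c("a", "b", "a", "b", "b"))

levels(fac)        # the distinct categories
table(fac)         # how many times each category occurs
as.integer(fac)    # the integer codes stored behind the labels
```

Here "a" is coded as 1 and "b" as 2, because levels are sorted alphabetically by default.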

Checking for Class and Data Structures in R

# Load the data set (data() with no arguments only lists the available data sets)
data(UKgas)
UKgas
##        Qtr1   Qtr2   Qtr3   Qtr4
## 1960  160.1  129.7   84.8  120.1
## 1961  160.1  124.9   84.8  116.9
## 1962  169.7  140.9   89.7  123.3
## 1963  187.3  144.1   92.9  120.1
## 1964  176.1  147.3   89.7  123.3
## 1965  185.7  155.3   99.3  131.3
## 1966  200.1  161.7  102.5  136.1
## 1967  204.9  176.1  112.1  140.9
## 1968  227.3  195.3  115.3  142.5
## 1969  244.9  214.5  118.5  153.7
## 1970  244.9  216.1  188.9  142.5
## 1971  301.0  196.9  136.1  267.3
## 1972  317.0  230.5  152.1  336.2
## 1973  371.4  240.1  158.5  355.4
## 1974  449.9  286.6  179.3  403.4
## 1975  491.5  321.8  177.7  409.8
## 1976  593.9  329.8  176.1  483.5
## 1977  584.3  395.4  187.3  485.1
## 1978  669.2  421.0  216.1  509.1
## 1979  827.7  467.5  209.7  542.7
## 1980  840.5  414.6  217.7  670.8
## 1981  848.5  437.0  209.7  701.2
## 1982  925.3  443.4  214.5  683.6
## 1983  917.3  515.5  224.1  694.8
## 1984  989.4  477.1  233.7  730.0
## 1985 1087.0  534.7  281.8  787.6
## 1986 1163.9  613.1  347.4  782.8
#check data type 
class(UKgas)
## [1] "ts"
#check the structure of the data set
str(UKgas)
#check if a variable is a specific data type
is.factor(UKgas)
## [1] FALSE
is.numeric(UKgas)
## [1] TRUE
is.logical(UKgas)
## [1] FALSE
#check the underlying storage type of UKgas
typeof(UKgas)
## [1] "double"

Answer: Using the class() function, I found that the UKgas data set is a time series (“ts”). Furthermore, its underlying data type is double according to typeof(), and is.numeric() confirms that it is numeric, as seen above.
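
Because UKgas is a time series, it carries time attributes on top of its numeric values. The sketch below uses base R accessors to read them; it assumes only the built-in datasets package.

```r
data(UKgas)  # quarterly UK gas consumption from the built-in datasets package

start(UKgas)      # first time point (year, quarter)
end(UKgas)        # last time point (year, quarter)
frequency(UKgas)  # observations per year
length(UKgas)     # total number of observations
is.ts(UKgas)      # TRUE for time-series objects
```

The printed table runs from 1960 to 1986 with four quarters per year, so the series should contain 27 * 4 = 108 observations.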

Part II: Reading and Writing Functions in R

Attempt to use “mad”, “iqr”, & “sd” Functions

# Load the data set (data() with no arguments only lists the available data sets)
data(UKgas)
UKgas
##        Qtr1   Qtr2   Qtr3   Qtr4
## 1960  160.1  129.7   84.8  120.1
## 1961  160.1  124.9   84.8  116.9
## 1962  169.7  140.9   89.7  123.3
## 1963  187.3  144.1   92.9  120.1
## 1964  176.1  147.3   89.7  123.3
## 1965  185.7  155.3   99.3  131.3
## 1966  200.1  161.7  102.5  136.1
## 1967  204.9  176.1  112.1  140.9
## 1968  227.3  195.3  115.3  142.5
## 1969  244.9  214.5  118.5  153.7
## 1970  244.9  216.1  188.9  142.5
## 1971  301.0  196.9  136.1  267.3
## 1972  317.0  230.5  152.1  336.2
## 1973  371.4  240.1  158.5  355.4
## 1974  449.9  286.6  179.3  403.4
## 1975  491.5  321.8  177.7  409.8
## 1976  593.9  329.8  176.1  483.5
## 1977  584.3  395.4  187.3  485.1
## 1978  669.2  421.0  216.1  509.1
## 1979  827.7  467.5  209.7  542.7
## 1980  840.5  414.6  217.7  670.8
## 1981  848.5  437.0  209.7  701.2
## 1982  925.3  443.4  214.5  683.6
## 1983  917.3  515.5  224.1  694.8
## 1984  989.4  477.1  233.7  730.0
## 1985 1087.0  534.7  281.8  787.6
## 1986 1163.9  613.1  347.4  782.8
# confirm dataset is.numeric 
is.numeric(UKgas)
## [1] TRUE
# Calculate sd, IQR, and mad
sd(UKgas)
## [1] 251.3348
IQR(UKgas)
## [1] 316.6
mad(UKgas)
## [1] 149.4461

Answer: The functions mad(), IQR(), and sd() compute the median absolute deviation, the interquartile range, and the standard deviation of a numeric data set, respectively. Note that R is case-sensitive, so the interquartile range function is IQR(), not iqr(). Another way to get related information is to generate descriptive statistics with summary(), which reports the minimum, quartiles, median, mean, and maximum.
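
To see what these three functions actually compute, the sketch below reproduces each one by hand and compares it with the built-in result. The *_manual names are made up for this illustration, and the stats:: prefix is used so the base versions are called even if another package masks them.

```r
x <- as.numeric(datasets::UKgas)  # keep the values, drop the time-series attributes

# Standard deviation: square root of the sample variance
sd_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Interquartile range: 75th percentile minus 25th percentile
iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# Median absolute deviation, scaled by 1.4826 (the default constant in mad())
mad_manual <- 1.4826 * median(abs(x - median(x)))

all.equal(sd_manual,  stats::sd(x))
all.equal(iqr_manual, stats::IQR(x))
all.equal(mad_manual, stats::mad(x))

summary(x)  # minimum, quartiles, median, mean, maximum
```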

Fahrenheit to Celsius Function:

# Fahrenheit to Celsius Function
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

#Boiling point of water in Fahrenheit
fahrenheit_to_celsius(212)
## [1] 100
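
Since the function uses only arithmetic, it also works on a whole vector of temperatures at once. The sketch below shows that, together with a hypothetical inverse helper, celsius_to_fahrenheit, which is not part of the original exercise and is included only as a round-trip check.

```r
fahrenheit_to_celsius <- function(temp_F) {
  (temp_F - 32) * 5 / 9
}

# Hypothetical inverse helper, added only to check the conversion
celsius_to_fahrenheit <- function(temp_C) {
  temp_C * 9 / 5 + 32
}

temps_F <- c(32, 98.6, 212)   # freezing, body temperature, boiling
temps_C <- fahrenheit_to_celsius(temps_F)
temps_C

# Converting back should recover the original values
all.equal(celsius_to_fahrenheit(temps_C), temps_F)
```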

Part III: Bayes Theorem

Bayes’ Theorem: Based on the definition of conditional probability, the main use of Bayes’ Theorem is to update probabilities based on evidence. In essence, the formula for Bayes’ Theorem, shown below, is used when you want to determine the probability of an event given that a related event has occurred. In the formula, P(A | B) is the probability of event A given that event B has occurred. Similarly, P(B | A) is the probability of event B given that event A has occurred. Lastly, P(A) and P(B) are the prior probability of A and the overall probability of observing evidence B, respectively.

Bayes’ Theorem Formula:

\[ P(A \mid B) = \frac{P(B \mid A)*P(A)}{P(B)} \]
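
When P(B) is not given directly, it can usually be built from the law of total probability. As a sketch, assume the events A_1, ..., A_n are mutually exclusive and together cover every possibility (these indexed symbols are introduced here only for illustration). The denominator then expands to:

\[ P(B) = \sum_{i=1}^{n} P(B \mid A_i)*P(A_i) \]

Substituting this into the formula above gives the form that is used most often in practice, for any one of the events A_k:

\[ P(A_k \mid B) = \frac{P(B \mid A_k)*P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i)*P(A_i)} \]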

Part IV: Bayes’ Theorem 3.43 Guided Practice

# Install/load BiocManager and Rgraphviz
# Name a CRAN mirror explicitly so the install does not fail when no default is set
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager", repos = "https://cloud.r-project.org")
}
library(BiocManager)
BiocManager::install("Rgraphviz")
## Bioconductor version 3.18 (BiocManager 1.30.22), R 4.3.2 (2023-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'Rgraphviz'
library(Rgraphviz)
## Loading required package: graph
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: grid

Probability of Type of Event:

P(Academic Events) = 35%

P(Sporting Events) = 20%

P(No Event) = 45%

Probability the garage is full, given the type of event:

P(Full | Academic Event) = 25%

P(Full | Sporting Event) = 70%

P(Full | No Event) = 5%

\[ P(S \mid F) = \frac{P(F \mid S)*P(S)}{P(F)} \]

Results:

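Based on Bayes’ formula, if Jose comes to campus and finds the garage full, there is a 56% probability that a sporting event is taking place. The probability that the garage is full is P(F) = 0.35*0.25 + 0.20*0.70 + 0.45*0.05 = 0.25, so P(S | F) = (0.70*0.20)/0.25 = 0.14/0.25 = 0.56.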

# Solve for 3.43 Guided Practice using Bayes' formula
### Assign each probability to a variable
# Probability of academic event
a <- 0.35
# Probability of garage full when there's an academic event
a_full <- 0.25
# Probability of sporting event
s <- 0.20
# Probability of garage full when there's a sporting event
s_full <- 0.70
# Probability of no event
n <- 0.45
# Probability of garage full when there's no event
n_full <- 0.05


# Calculate the rest of the values based upon the variables above (Not full)
a_nfull  <- 1 - a_full
s_nfull  <- 1 - s_full
n_nfull <- 1 - n_full

# Joint Probabilities of each event being full and not full
aANDa_full    <-   a*a_full
aANDa_nfull   <-   a*a_nfull
sANDs_full    <-   s*s_full
sANDs_nfull   <-   s*s_nfull
nANDn_full    <-   n*n_full
nANDn_nfull   <-   n*n_nfull

# Probability garage is full
gar_full <-  aANDa_full+sANDs_full+nANDn_full 

# P(S|F) 
sGivenf <-   sANDs_full/gar_full

#show answer
print(paste0("Answer ", round(sGivenf, digits = 4)))
## [1] "Answer 0.56"
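
The same calculation can be written more compactly with named vectors, which also gives the posterior for every event type at once. This is only a sketch of an alternative layout using the probabilities stated above; the names prior, likelihood, joint, p_full and posterior are made up for this illustration.

```r
# Priors and likelihoods from the guided practice, stored as named vectors
prior      <- c(academic = 0.35, sporting = 0.20, none = 0.45)  # P(event)
likelihood <- c(academic = 0.25, sporting = 0.70, none = 0.05)  # P(full | event)

joint     <- prior * likelihood   # P(event and full)
p_full    <- sum(joint)           # P(full), by the law of total probability
posterior <- joint / p_full       # P(event | full)

round(posterior, 4)
sum(posterior)                    # posteriors over all events should total 1
```

With these numbers the sporting-event entry matches the 0.56 found above, and the academic and no-event posteriors work out to 0.35 and 0.09.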
## Start coding the parts of the tree 

##Nodes 
node1 <- "P"
node2 <- "Academic"
node3 <- "Sport"
node4 <- "None"
node5 <- "Acad_Full"
node6 <- "Acad_NFull"
node7 <- "Sport_Full"
node8 <- "Sport_NFull"
node9 <- "None_Full"
node10 <-"None_NFull"
nodeNames <- c(node1, node2, node3, node4, node5, node6, node7, node8, node9, node10)


rEG <- new("graphNEL", nodes = nodeNames, edgemode = "directed")



### LINES
# Draw the "lines" or "branches" of the probability Tree
rEG  <- addEdge (nodeNames[1], nodeNames[2], rEG, 1)
rEG  <- addEdge (nodeNames[1], nodeNames[3], rEG, 1)
rEG  <- addEdge (nodeNames[1], nodeNames[4], rEG, 1)

rEG  <- addEdge (nodeNames[2], nodeNames[5], rEG, 1)
rEG  <- addEdge (nodeNames[2], nodeNames[6], rEG, 1)

rEG  <- addEdge (nodeNames[3], nodeNames[7], rEG, 1)
rEG  <- addEdge (nodeNames[3], nodeNames[8], rEG, 1)

rEG  <- addEdge (nodeNames[4], nodeNames[9], rEG, 1)
rEG  <- addEdge (nodeNames[4], nodeNames[10], rEG, 1)

eAttrs  <- list()

q    <-  edgeNames(rEG)



### PROBABILITY VALUES
# Add the probability values to the branch lines
eAttrs$label <- c(toString(a),
                  toString(s),
                  toString(n),
                  toString(a_full),
                  toString(a_nfull),
                  toString(s_full),
                  toString(s_nfull),
                  toString(n_full),
                  toString(n_nfull)
                  )

names(eAttrs$label) <- c( q[1], q[2], q[3], q[4], q[5], q[6], q[7], q[8], q[9] )
edgeAttrs <- eAttrs


### COLOR
# Tree Details 
attributes <- list(node  = list(label    = "foo", 
                              fillcolor = "darkgreen", 
                              fontsize  = "15",
                              fontcolor = "white"
                              ),
                   edge  = list(color   = "darkgreen"),
                   graph = list(rankdir = "LR")
                   )


### PLOT
# Plot the probability tree using Rgraphviz
plot (rEG, edgeAttrs = eAttrs, attrs=attributes)
nodes(rEG)
##  [1] "P"           "Academic"    "Sport"       "None"        "Acad_Full"  
##  [6] "Acad_NFull"  "Sport_Full"  "Sport_NFull" "None_Full"   "None_NFull"
#Add the joint probability values on the leaves
text(570, 400, aANDa_full, cex = 0.8)
text(570, 320, aANDa_nfull, cex = 0.8)

text(570, 250, sANDs_full, cex = 0.8)
text(550, 210, "0.14 / 0.25 = 0.56", cex = 0.6, col="darkgreen")
text(570, 170, sANDs_nfull, cex = 0.8)

text(570, 100, nANDn_full, cex = 0.8)
text(570, 20, nANDn_nfull, cex = 0.8)

#Add the table
text(50,80,  paste("P(A):"   ,a    ), cex = .9, col="darkgreen")
text(46,60,  paste("P(S):"  ,s ), cex = .9, col="darkgreen")
text(50,40,  paste("P(N):"  ,n ), cex = .9, col="darkgreen")

text(141,80,  paste("P(F):"  ,gar_full ), cex = .9, col="darkgreen")
text(141,60,  paste("P(F|S):"  ,s_full ), cex = .9, col="darkgreen")
text(150,40,  paste("P(S∩F):"  ,sANDs_full ), cex = .9, col="darkgreen")
text(149,20,  paste("P(S|F):"  ,sGivenf ), cex = .9, col="darkgreen")