Andrew Gregory
2024-01-26
R Classes:
Classes are a tool that bundles data and methods together for use in the implementation of concepts or objects. Often, classes are used to implement data structures. To use an analogy, I think of a tree, where each branch is one specific class, and each leaf is a data structure. There are six basic (atomic) data types in R: numeric (e.g. 1 or 1.5), integer (e.g. 5L, 10L), character (e.g. “A”, “B”, “C”), logical (i.e. TRUE or FALSE), raw (e.g. “Hello” is stored as the bytes 48 65 6c 6c 6f), and complex (e.g. 5+5i).
R Class Examples:
Numeric
v <- 23.5 # create a numeric value
v # print value of v
## [1] 23.5
class(v) # determine class
## [1] "numeric"
Integer
y <- 1:10 # create an integer vector
y # print value of y
## [1] 1 2 3 4 5 6 7 8 9 10
class(y) # determine class
## [1] "integer"
Logical
# False logical value
5==16
## [1] FALSE
# True logical value
2==2
## [1] TRUE
Character
x <- "dataset" # create a character value
x # print the value of x
## [1] "dataset"
typeof(x) # determine the data type
## [1] "character"
Complex
z <- 1 + 2i # create a complex number
z # print the value of z
## [1] 1+2i
class(z) # Determine class type
## [1] "complex"
Raw
v <- charToRaw("Hello") # create a raw value
v # print value of v
## [1] 48 65 6c 6c 6f
class(v) # determine class type
## [1] "raw"
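The examples above mostly call class(), while the character example calls typeof(). As a minimal sketch of how the two relate, the block below runs both on one value of each of the six atomic types. The list name `values` and its element names are my own, chosen only for illustration.

```r
# One value of each of the six atomic types (names are illustrative)
values <- list(
  numeric   = 23.5,
  integer   = 5L,
  character = "dataset",
  logical   = TRUE,
  complex   = 1 + 2i,
  raw       = charToRaw("Hello")
)

# class() reports the high-level class; typeof() reports the storage type
classes <- sapply(values, class)
types   <- sapply(values, typeof)

# Side-by-side comparison: the two differ only for the numeric value,
# where class() gives "numeric" and typeof() gives "double"
data.frame(class = classes, typeof = types)
```

For the other five types the two functions return the same label, which is why they are easy to confuse.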
R Data Structures:
Similar to data in a spreadsheet such as Excel, data structures are models for organizing and handling data. The six most common data structures in R are vectors, matrices, arrays, data frames, lists, and factors. Vectors are R’s basic data object: they are one-dimensional, hold one or more elements, and every element has the same data type. A matrix is a two-dimensional data structure, which means it has rows and columns; all of its elements share one data type, every column has the same length, and the rows and columns are unnamed by default. What I gathered is that an array is similar to a matrix but can have more than two dimensions; a three-dimensional array, for example, is a stack of matrices. Data frames are the data structure that most resembles a spreadsheet: each column can have its own data type, but all columns have the same length. Lists are the most flexible because their elements can be of any length, class, or structure, and you can have a list within a list. A factor is a data structure that stores categorical data (e.g. “Gender”). Generally speaking, since everything in R is an object, an object’s class acts as the blueprint that tells R how a data structure is organized and how functions should treat it.
R Data Structures Examples
Vectors
# Basic Vector Example
vec <- c(1,2,3)
vec
## [1] 1 2 3
# Numeric Vector
numeric_vec <- c(1,2,3)
numeric_vec
## [1] 1 2 3
# Logical Vector
logical_vec <- c(TRUE,TRUE,FALSE)
logical_vec
## [1] TRUE TRUE FALSE
# Character Vector
character_vec <- c("A","B","C", "D", "F")
character_vec
## [1] "A" "B" "C" "D" "F"
# Integer Vector
integer_vec <- c(1L, 2L, 3L)
integer_vec
## [1] 1 2 3
# Complex Vector
complex_vec <- c(12+2i, 22i, 4+5i)
complex_vec
## [1] 12+ 2i 0+22i 4+ 5i
Matrix
matrix1 <- matrix(c(1:9), ncol = 3) # create matrix
matrix1 # print matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Array
arr1 <- array(5:22, dim = c(2, 3, 3)) # create array: 2 rows, 3 columns, 3 layers (18 values)
arr1 # Print Array
## , , 1
##
## [,1] [,2] [,3]
## [1,] 5 7 9
## [2,] 6 8 10
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 11 13 15
## [2,] 12 14 16
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 17 19 21
## [2,] 18 20 22
Data Frame
# Create data frame
athlete_rank <- c(1:5)
athlete_name <- c("john", "greg", "sarah", "james", "colby")
athlete_score <- c("third", "fifth", "second", "fourth", "first")
athlete.data <- data.frame(athlete_rank , athlete_name, athlete_score)
# Print Data Frame
athlete.data
## athlete_rank athlete_name athlete_score
## 1 1 john third
## 2 2 greg fifth
## 3 3 sarah second
## 4 4 james fourth
## 5 5 colby first
List
test_list <- list(1, 2, 3, "Surprise", FALSE)
test_list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] "Surprise"
##
## [[5]]
## [1] FALSE
Factor
fac <- factor(c("a", "b", "a", "b", "b")) # create factor
fac # Print factor
## [1] a b a b b
## Levels: a b
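To tie the structures together, here is a small sketch of the properties described earlier: vectors coerce mixed input to one type, lists do not, matrices and arrays carry a dim attribute, and factors store integer codes plus levels. The variable names (`mixed_vec`, `mixed_list`, `m`, `a`) are illustrative and not part of the original examples.

```r
# A vector holds one type, so mixed input is coerced to a common type
mixed_vec <- c(1, "Surprise", FALSE)
class(mixed_vec)             # "character"

# A list keeps each element's own type
mixed_list <- list(1, "Surprise", FALSE)
sapply(mixed_list, class)    # "numeric" "character" "logical"

# A matrix is a vector with a dim attribute; an array generalizes this
m <- matrix(1:9, ncol = 3)
dim(m)                       # 3 3
a <- array(1:18, dim = c(2, 3, 3))
dim(a)                       # 2 3 3

# A factor stores integer codes plus a set of levels
fac <- factor(c("a", "b", "a", "b", "b"))
levels(fac)                  # "a" "b"
as.integer(fac)              # 1 2 1 2 2
```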
Checking for Class and Data Structures in R
# Load the data set (UKgas ships with R's built-in datasets package)
data("UKgas")
UKgas
## Qtr1 Qtr2 Qtr3 Qtr4
## 1960 160.1 129.7 84.8 120.1
## 1961 160.1 124.9 84.8 116.9
## 1962 169.7 140.9 89.7 123.3
## 1963 187.3 144.1 92.9 120.1
## 1964 176.1 147.3 89.7 123.3
## 1965 185.7 155.3 99.3 131.3
## 1966 200.1 161.7 102.5 136.1
## 1967 204.9 176.1 112.1 140.9
## 1968 227.3 195.3 115.3 142.5
## 1969 244.9 214.5 118.5 153.7
## 1970 244.9 216.1 188.9 142.5
## 1971 301.0 196.9 136.1 267.3
## 1972 317.0 230.5 152.1 336.2
## 1973 371.4 240.1 158.5 355.4
## 1974 449.9 286.6 179.3 403.4
## 1975 491.5 321.8 177.7 409.8
## 1976 593.9 329.8 176.1 483.5
## 1977 584.3 395.4 187.3 485.1
## 1978 669.2 421.0 216.1 509.1
## 1979 827.7 467.5 209.7 542.7
## 1980 840.5 414.6 217.7 670.8
## 1981 848.5 437.0 209.7 701.2
## 1982 925.3 443.4 214.5 683.6
## 1983 917.3 515.5 224.1 694.8
## 1984 989.4 477.1 233.7 730.0
## 1985 1087.0 534.7 281.8 787.6
## 1986 1163.9 613.1 347.4 782.8
#check data type
class(UKgas)
## [1] "ts"
# check the structure of the object (UKgas is a time series, not a data frame;
# the name df on its own refers to a built-in function in the stats package)
str(UKgas)
#check if a variable is a specific data type
is.factor(UKgas)
## [1] FALSE
is.numeric(UKgas)
## [1] TRUE
is.logical(UKgas)
## [1] FALSE
# check the storage type of UKgas
typeof(UKgas)
## [1] "double"
Answer: After checking with the class() function, the UKgas data set is a time series (class “ts”). Furthermore, I found that its data type is double using the typeof() function, and that it is numeric using the is.numeric() function, as shown above.
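As a further sketch of why class() and typeof() give different answers here, the block below inspects the time-series attributes that sit on top of the stored numbers. It uses only base R functions; the variable name `x` is illustrative.

```r
# UKgas ships with the built-in datasets package
x <- datasets::UKgas

class(x)        # "ts": the class attribute set by ts()
typeof(x)       # "double": how the values are stored
frequency(x)    # 4 observations per year (quarterly)
start(x)        # 1960 1
end(x)          # 1986 4
length(x)       # 108 = 27 years * 4 quarters

# Dropping the class leaves a plain double vector with a tsp attribute
class(unclass(x))   # "numeric"
```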
# Load the data set again (UKgas ships with R's built-in datasets package)
data("UKgas")
UKgas
## Qtr1 Qtr2 Qtr3 Qtr4
## 1960 160.1 129.7 84.8 120.1
## 1961 160.1 124.9 84.8 116.9
## 1962 169.7 140.9 89.7 123.3
## 1963 187.3 144.1 92.9 120.1
## 1964 176.1 147.3 89.7 123.3
## 1965 185.7 155.3 99.3 131.3
## 1966 200.1 161.7 102.5 136.1
## 1967 204.9 176.1 112.1 140.9
## 1968 227.3 195.3 115.3 142.5
## 1969 244.9 214.5 118.5 153.7
## 1970 244.9 216.1 188.9 142.5
## 1971 301.0 196.9 136.1 267.3
## 1972 317.0 230.5 152.1 336.2
## 1973 371.4 240.1 158.5 355.4
## 1974 449.9 286.6 179.3 403.4
## 1975 491.5 321.8 177.7 409.8
## 1976 593.9 329.8 176.1 483.5
## 1977 584.3 395.4 187.3 485.1
## 1978 669.2 421.0 216.1 509.1
## 1979 827.7 467.5 209.7 542.7
## 1980 840.5 414.6 217.7 670.8
## 1981 848.5 437.0 209.7 701.2
## 1982 925.3 443.4 214.5 683.6
## 1983 917.3 515.5 224.1 694.8
## 1984 989.4 477.1 233.7 730.0
## 1985 1087.0 534.7 281.8 787.6
## 1986 1163.9 613.1 347.4 782.8
# confirm dataset is.numeric
is.numeric(UKgas)
## [1] TRUE
# Calculate sd, IQR, and mad
sd(UKgas)
## [1] 251.3348
IQR(UKgas)
## [1] 316.6
mad(UKgas)
## [1] 149.4461
Answer: The functions mad(), IQR(), and sd() determine the median absolute deviation, the interquartile range, and the standard deviation of a numeric data set, respectively. Another way to get related descriptive statistics is summary(), which reports the minimum, quartiles, mean, and maximum of a data set.
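To make the three measures more concrete, the sketch below recomputes each one by hand and compares it with the built-in function. It assumes the default settings of each function (the n - 1 denominator for sd(), the default quantile method for IQR(), and the default scaling constant 1.4826 for mad()); the `*_manual` names are my own.

```r
x <- as.numeric(datasets::UKgas)

# Standard deviation: square root of the sample variance (n - 1 denominator)
sd_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Interquartile range: third quartile minus first quartile
iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# Median absolute deviation, scaled by 1.4826 (the default in mad())
mad_manual <- 1.4826 * median(abs(x - median(x)))

# Each comparison should return TRUE
all.equal(sd_manual,  stats::sd(x))
all.equal(iqr_manual, stats::IQR(x))
all.equal(mad_manual, stats::mad(x))

# summary() gives the minimum, quartiles, mean, and maximum in one call
summary(x)
```

The calls are written as stats::sd() and so on because packages loaded later in a session can mask these names.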
Fahrenheit to Celsius Function:
# Fahrenheit to Celsius Function
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
#Boiling point of water in Fahrenheit
fahrenheit_to_celsius(212)
## [1] 100
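Because arithmetic in R is vectorized, the same function can convert a whole vector at once. The sketch below repeats the function so the block runs on its own, and adds a hypothetical inverse, celsius_to_fahrenheit(), which is not part of the original and is used only to check the round trip.

```r
fahrenheit_to_celsius <- function(temp_F) {
  (temp_F - 32) * 5 / 9
}

# Hypothetical inverse, not part of the original assignment
celsius_to_fahrenheit <- function(temp_C) {
  temp_C * 9 / 5 + 32
}

# Arithmetic in R is vectorized, so a whole vector converts in one call
temps_F <- c(-40, 32, 212)
temps_C <- fahrenheit_to_celsius(temps_F)
temps_C                                       # -40 0 100

# Round trip should recover the inputs (up to floating-point error)
all.equal(celsius_to_fahrenheit(temps_C), temps_F)
```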
Bayes’ Theorem: Based on the law of conditional probability, the main use of Bayes’ Theorem is to update probabilities based on evidence. In essence, the formula for Bayes’ Theorem, depicted below, is used when you want to determine the probability of an event occurring given a related occurrence. In the formula, P(A | B) is the probability of event A happening given that event B has occurred. Similarly, P(B | A) is the probability of event B happening given that event A has occurred. Lastly, P(A) and P(B) are the prior probability of A and the overall probability of observing evidence B, respectively.
Bayes’ Theorem Formula:
\[ P(A \mid B) = \frac{P(B \mid A)*P(A)}{P(B)} \]
## Install/load BiocManager and Rgraphviz
# A CRAN mirror must be set explicitly when installing in a non-interactive session
install.packages("BiocManager", repos = "https://cloud.r-project.org")
library(BiocManager)
BiocManager::install("Rgraphviz")
## Bioconductor version 3.18 (BiocManager 1.30.22), R 4.3.2 (2023-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'Rgraphviz'
library(Rgraphviz)
## Loading required package: graph
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: grid
Probability of Type of Event:
P(Academic Events) = 35%
P(Sporting Events) = 20%
P(No Event) = 45%
Probability the garage is full, given the type of event:
P(Full | Academic Event) = 25%
P(Full | Sporting Event) = 70%
P(Full | No Event) = 5%
\[ P(S \mid F) = \frac{P(F \mid S)*P(S)}{P(F)} \]
Results:
Based on Bayes’ formula, if Jose comes to campus and finds the garage full, there is a 56% probability that there is a sporting event, i.e. P(S | F) = 0.56.
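For reference, the arithmetic behind the 56% figure can be written out with the law of total probability, using A, S, and N for an academic event, a sporting event, and no event, and F for a full garage:

\[ P(F) = P(F \mid A)P(A) + P(F \mid S)P(S) + P(F \mid N)P(N) = (0.25)(0.35) + (0.70)(0.20) + (0.05)(0.45) = 0.0875 + 0.14 + 0.0225 = 0.25 \]

\[ P(S \mid F) = \frac{P(F \mid S)*P(S)}{P(F)} = \frac{0.14}{0.25} = 0.56 \]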
# Solve for 3.43 Guided Practice using Bayes' formula
### Assign each probability to a variable
# Probability of academic event
a <- 0.35
# Probability of garage full when there's an academic event
a_full <- 0.25
# Probability of sporting event
s <- 0.20
# Probability of garage full when there's a sporting event
s_full <- 0.70
# Probability of no event
n <- 0.45
# Probability of garage full when there's no event
n_full <- 0.05
# Calculate the rest of the values based upon the variables above (Not full)
a_nfull <- 1 - a_full
s_nfull <- 1 - s_full
n_nfull <- 1 - n_full
# Joint Probabilities of each event being full and not full
aANDa_full <- a*a_full
aANDa_nfull <- a*a_nfull
sANDs_full <- s*s_full
sANDs_nfull <- s*s_nfull
nANDn_full <- n*n_full
nANDn_nfull <- n*n_nfull
# Probability garage is full
gar_full <- aANDa_full+sANDs_full+nANDn_full
# P(S|F)
sGivenf <- sANDs_full/gar_full
#show answer
print(paste0("Answer ", round(sGivenf, digits = 4)))
## [1] "Answer 0.56"
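The same calculation can be sketched more compactly with named vectors, which also gives the posterior probability of every event type at once rather than only the sporting event. The helper bayes_posterior() is hypothetical (my own name, not from the original or from any package), and it assumes the events are mutually exclusive and cover all cases.

```r
# Hypothetical helper (not from the original): posterior over a set of
# mutually exclusive, exhaustive events given one piece of evidence
bayes_posterior <- function(prior, likelihood) {
  stopifnot(length(prior) == length(likelihood),
            isTRUE(all.equal(sum(prior), 1)))
  joint <- prior * likelihood      # P(event and evidence)
  joint / sum(joint)               # divide by P(evidence)
}

prior      <- c(academic = 0.35, sport = 0.20, none = 0.45)
likelihood <- c(academic = 0.25, sport = 0.70, none = 0.05)  # P(full | event)

posterior <- bayes_posterior(prior, likelihood)
round(posterior, 4)   # academic 0.35, sport 0.56, none 0.09
```

The three posteriors sum to 1, which is a quick check that the denominator P(F) was computed over all event types.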
## Start coding the parts of the tree
##Nodes
node1 <- "P"
node2 <- "Academic"
node3 <- "Sport"
node4 <- "None"
node5 <- "Acad_Full"
node6 <- "Acad_NFull"
node7 <- "Sport_Full"
node8 <- "Sport_NFull"
node9 <- "None_Full"
node10 <-"None_NFull"
nodeNames <- c(node1, node2, node3, node4, node5, node6, node7, node8, node9, node10)
rEG <- new("graphNEL", nodes = nodeNames, edgemode = "directed")
### LINES
# Draw the "lines" or "branches" of the probability Tree
rEG <- addEdge (nodeNames[1], nodeNames[2], rEG, 1)
rEG <- addEdge (nodeNames[1], nodeNames[3], rEG, 1)
rEG <- addEdge (nodeNames[1], nodeNames[4], rEG, 1)
rEG <- addEdge (nodeNames[2], nodeNames[5], rEG, 1)
rEG <- addEdge (nodeNames[2], nodeNames[6], rEG, 1)
rEG <- addEdge (nodeNames[3], nodeNames[7], rEG, 1)
rEG <- addEdge (nodeNames[3], nodeNames[8], rEG, 1)
rEG <- addEdge (nodeNames[4], nodeNames[9], rEG, 1)
rEG <- addEdge (nodeNames[4], nodeNames[10], rEG, 1)
eAttrs <- list()
q <- edgeNames(rEG)
### PROBABILITY VALUES
# Add the probability values to the branch lines
eAttrs$label <- c(toString(a),
toString(s),
toString(n),
toString(a_full),
toString(a_nfull),
toString(s_full),
toString(s_nfull),
toString(n_full),
toString(n_nfull)
)
names(eAttrs$label) <- c( q[1], q[2], q[3], q[4], q[5], q[6], q[7], q[8], q[9] )
edgeAttrs <- eAttrs
### COLOR
# Tree Details
attributes <- list(node = list(label = "foo",
fillcolor = "darkgreen",
fontsize = "15",
fontcolor = "white"
),
edge = list(color = "darkgreen"),
graph = list(rankdir = "LR")
)
### PLOT
# Plot the probability tree using Rgraphviz
plot (rEG, edgeAttrs = eAttrs, attrs=attributes)
nodes(rEG)
## [1] "P" "Academic" "Sport" "None" "Acad_Full"
## [6] "Acad_NFull" "Sport_Full" "Sport_NFull" "None_Full" "None_NFull"
# Add the joint probability values at the leaves
text(570, 400, aANDa_full, cex = 0.8)
text(570, 320, aANDa_nfull, cex = 0.8)
text(570, 250, sANDs_full, cex = 0.8)
text(550, 210, "0.14 / 0.25 = 0.56", cex = 0.6, col="darkgreen")
text(570, 170, sANDs_nfull, cex = 0.8)
text(570, 100, nANDn_full, cex = 0.8)
text(570, 20, nANDn_nfull, cex = 0.8)
#Add the table
text(50,80, paste("P(A):" ,a ), cex = .9, col="darkgreen")
text(46,60, paste("P(S):" ,s ), cex = .9, col="darkgreen")
text(50,40, paste("P(N):" ,n ), cex = .9, col="darkgreen")
text(141,80, paste("P(F):" ,gar_full ), cex = .9, col="darkgreen")
text(141,60, paste("P(F|S):" ,s_full ), cex = .9, col="darkgreen")
text(150,40, paste("P(S∩F):" ,sANDs_full ), cex = .9, col="darkgreen")
text(149,20, paste("P(S|F):" ,sGivenf ), cex = .9, col="darkgreen")
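As a sanity check on the tree that does not need Rgraphviz, the six leaf values can be rebuilt as a small matrix in base R. This is only a sketch; the names `prior`, `p_full`, and `leaves` are illustrative and not from the original code.

```r
prior  <- c(academic = 0.35, sport = 0.20, none = 0.45)
p_full <- c(academic = 0.25, sport = 0.70, none = 0.05)

# Rows are events, columns are garage states; each cell is one leaf
leaves <- cbind(full = prior * p_full, not_full = prior * (1 - p_full))
leaves

sum(leaves)                 # all six leaves together should total 1
colSums(leaves)["full"]     # P(F), the denominator in Bayes' formula
```

If the leaves do not sum to 1, one of the branch probabilities was entered incorrectly.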