ADAN7301.02 Week 2 Discussion 3

Grace Inorio

I. R Nuts and Bolts/Programming Readings

Write one paragraph each (in your own words) describing what are classes and one paragraph on what are data structures (with examples).

Classes in R are closely related to data types. A class defines what kind of object a value is, which in turn determines how functions treat it in object-oriented programming. The basic classes in R include numeric, character, integer, and logical. The class pertains to each individual instance, or observation, of data. For example, if data were collected on the names of students in this data analysis class, each recorded name would be of class character because it is a string of letters. It is not numerical and does not represent a number in any way (like numeric or integer), nor does it amount to TRUE or FALSE (like logical). How that data is stored and displayed to the user then crosses over into data structure territory.
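As a small illustration of the four classes named above, the sketch below calls class() on one value of each kind. The variable names and values are made up for this example.

```r
# Hypothetical values, one for each basic class discussed above
student_name <- "Grace"  # character: a string of letters
gpa          <- 3.7      # numeric: stored as a double
credits      <- 12L      # integer: the L suffix asks R for integer storage
enrolled     <- TRUE     # logical: TRUE or FALSE

# Collect the class of each value in a named character vector
classes <- c(
  student_name = class(student_name),
  gpa          = class(gpa),
  credits      = class(credits),
  enrolled     = class(enrolled)
)
print(classes)
```

Note that a whole number typed without the L suffix, such as 12, is still of class numeric.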

A data structure describes how data is stored and organized in R. Six commonly used data structures in R are the vector, list, matrix, data frame, table, and factor. They can be differentiated according to their dimensions as well as their homogeneity or lack thereof. Data can be manipulated and accessed differently depending on which data structure is used. For example, some, such as data frames, can hold multiple data types, while others, such as matrices, must hold data of a single type. If we tried to build a matrix from both numeric and character data, R would coerce every element to character, and an attempt to multiply that matrix by another would fail. Vectors are often referred to as the most fundamental R data structure, and data frames the most valued. Both are important to our work in R, and one could argue that data frames need vectors to exist. Each column of a data frame holds a single type, although different columns may hold different types, and in this way, a data frame can be viewed as multiple vectors of equal length bound together column-wise to form a more complex data representation.
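The difference between homogeneous and heterogeneous structures can be sketched with a few lines of R. The data below are invented for illustration; the point is only that vectors and matrices settle on one type, while a data frame keeps one type per column.

```r
# A vector holds one type: mixing numbers and text coerces everything to character
mixed <- c(1, "two", 3)
class(mixed)

# A matrix is single-typed as well
m_num <- matrix(1:4, nrow = 2)               # all integers
m_chr <- matrix(c(1, "b", 3, 4), nrow = 2)   # one string makes every cell character
typeof(m_num)
typeof(m_chr)

# Matrix multiplication works on numeric matrices but fails on a character matrix
product <- m_num %*% m_num
attempt <- tryCatch(m_num %*% m_chr, error = function(e) "error")

# A data frame can hold a different type in each column (hypothetical students)
students <- data.frame(
  name     = c("Ana", "Ben"),
  age      = c(21, 24),
  enrolled = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
col_classes <- sapply(students, class)
print(col_classes)
```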

Pick a dataset (from base R, AER package, or even the titanic dataset), and apply the two commands on your data. What do you find, and does it make sense? Please explain.

# Using the HairEyeColor dataset from base R
class(HairEyeColor)

## [1] "table"

typeof(HairEyeColor)

## [1] "double"

head(HairEyeColor)

## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

The HairEyeColor dataset contains the hair and eye color combinations of 592 statistics students at the University of Delaware from 1974. Running the class() and typeof() commands returns that this data is of class ‘table’ and type ‘double.’

When exploring the dataset in R, I discovered that the data is a three-dimensional array. The three variables are Hair, Eye, and Sex. Hair has four levels: Black, Brown, Red, and Blond. Eye has four levels: Brown, Blue, Hazel, and Green. Sex has two levels: Male and Female.

An array holds multi-dimensional data of a single type. A table can be viewed as a kind of array, so it makes sense that the class() output indicates table. The type of double also makes sense because a double is a numeric value that can have decimal points instead of being a whole integer number. Our data contains counts of individuals with different hair and eye colors. The values are all whole number counts, so integer would make sense too, but double allows for floating points and is acceptable here.
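The observations above can be checked directly in R. This is a short sketch using only base functions on the built-in HairEyeColor dataset; the object names total, whole_counts, and by_sex are my own.

```r
# Confirm that the table is also an array, and inspect its shape
is.array(HairEyeColor)
dim(HairEyeColor)               # number of levels of Hair, Eye, Sex
names(dimnames(HairEyeColor))   # names of the three dimensions

# The counts should add up to the 592 students in the dataset
total <- sum(HairEyeColor)

# Stored as double, but every value is a whole-number count
whole_counts <- all(HairEyeColor == round(HairEyeColor))

# Collapse over Hair and Eye to get the number of students of each sex
by_sex <- apply(HairEyeColor, 3, sum)
print(by_sex)
```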

II. Reading and Writing Functions in R. So far you have used functions that have already been written by somebody in R. It is often possible to open up the function and see the underlying math/operations to see the mechanics of the command. EG - type “mad”, “iqr”, or “sd” in R to open up these functions, and try to speculate what are the calculations in at least 2 functions.

sd

## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x12b16f0c0>
## <environment: namespace:stats>

The sd function has two arguments: x and na.rm. x is a numeric vector. na.rm is a logical whose only two possible values are TRUE and FALSE; it indicates whether or not missing values should be removed, and it defaults to FALSE. Working from the inside out, the function first tests whether the input x is a vector or a factor. If it is either, x is passed along unchanged; otherwise, x is converted to double with as.double(). The result is handed to var(), which takes the variance of the numbers we have, removing missing values only if na.rm = TRUE. Finally, we take the square root of the variance, and we have our final estimate of the standard deviation.
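To check this reading of sd(), the sketch below computes the sample standard deviation by hand and compares it with the built-in function. The vector x is made up for the example.

```r
# Hypothetical data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance divides by n - 1; the standard deviation is its square root
by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
matches <- isTRUE(all.equal(by_hand, sd(x)))

# With a missing value, the default na.rm = FALSE returns NA
x_na <- c(x, NA)
sd_default <- sd(x_na)
sd_removed <- sd(x_na, na.rm = TRUE)  # NA dropped before the variance is taken
print(c(by_hand = by_hand, sd_default = sd_default, sd_removed = sd_removed))
```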

mode

## function (x) 
## {
##     if (is.expression(x)) 
##         return("expression")
##     if (is.call(x)) 
##         return(switch(deparse(x[[1L]])[1L], `(` = "(", "call"))
##     if (is.name(x)) 
##         "name"
##     else switch(tx <- typeof(x), double = , integer = "numeric", 
##         closure = , builtin = , special = "function", tx)
## }
## <bytecode: 0x12b5f5440>
## <environment: namespace:base>

The mode function takes only one argument: x, which can be any R object. Once we have x, we go through a waterfall of checks. First, we test whether x is an expression. Expression objects are objects which contain parsed but unevaluated R statements such as “1 + 2”. If x is an expression, then the string “expression” is returned and the function is finished. If it is not, the next if statement is executed, and we test whether or not x is an unevaluated function call, such as quote(median(y)). If it is, the function returns “call” (or “(” when the call is just a parenthesized expression). If it is not, the third if statement is executed, and we test whether x is a name, meaning a symbol such as an unevaluated variable name. If it is, “name” is returned. If it is not, and we have reached the end of our rope, the else statement is executed. The else statement is only executed if x does not meet the criteria for the previous three if clauses. Here, switch() looks at typeof(x): double and integer are both reported as “numeric”; closure, builtin, and special are all reported as “function”; and any other type, such as “character” or “logical”, is returned unchanged.
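Each branch of the waterfall can be exercised with a small input. The sketch below is one way to do that; the labels on the left are my own names for the branch being tested.

```r
# One input per branch of mode(); quote() keeps code unevaluated
results <- c(
  expression = mode(expression(1 + 2)),  # first if: an expression object
  call       = mode(quote(mean(x))),     # second if: an unevaluated function call
  paren      = mode(quote((1 + 2))),     # second if: a call whose function is "("
  name       = mode(quote(x)),           # third if: a symbol
  double     = mode(3.14),               # else: double maps to "numeric"
  integer    = mode(2L),                 # else: integer maps to "numeric"
  closure    = mode(mean),               # else: closure maps to "function"
  builtin    = mode(sum),                # else: builtin maps to "function"
  character  = mode("hello")             # else: typeof() is returned unchanged
)
print(results)
```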

I spent a lot of time working through each line of function code in mode() to determine what was going on and figure out how in the world this was going to return the mode of a set of numbers before I realized this is finding a totally different kind of mode. This mode is used to get or set the ‘mode’ (a kind of ‘type’), or the storage mode of an R object. I explored one more mathematical function below.

IQR

## function (x, na.rm = FALSE, type = 7) 
## diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE, 
##     type = type))
## <bytecode: 0x12b7b73f8>
## <environment: namespace:stats>

The function IQR takes three arguments: x, a numeric vector; na.rm, which can be set to TRUE or FALSE; and type, which can be set to one of nine quantile algorithms. Here, type = 7 indicates that we are using the default quantile calculation method. First, as.numeric() converts x to numeric type, and then quantile() gives the quantiles corresponding to the given probabilities of 0.25 and 0.75. na.rm defaults to FALSE, so NA values are not removed from our data unless we ask for it. names has been set to FALSE, so the quantiles are returned without labels, and we are once again using the default quantile calculation thanks to type = 7. Once the quantile function gives us the values at the first and third quartiles, diff() takes the difference of these numbers. This difference can also be thought of as roughly the length of the box in a boxplot, or the IQR.
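The same steps can be reproduced outside the function. This is a minimal sketch on an invented vector, following the body of IQR() shown above.

```r
# Hypothetical data
x <- c(1, 3, 5, 7, 9, 11, 13)

# First and third quartiles using the default algorithm (type = 7), without labels
q <- quantile(as.numeric(x), c(0.25, 0.75), names = FALSE, type = 7)

# The interquartile range is the difference between them
by_hand <- diff(q)
print(c(q1 = q[1], q3 = q[2], iqr = by_hand, builtin = IQR(x)))
```

For this vector the quartiles are 4 and 10, so both approaches give an IQR of 6.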

Now, you will try to write your own function, like fahrenheit_to_celsius or center. Keep it simple. Ideally, see if you can write your own function for any summary statistic measure, like if you can calculate the arithmetic mean of the elements in a vector instead of using base R “mean()” function.

# my function has two arguments: a vector of (hopefully) numbers, and a number indicating how many decimal places we would like in our final result
my_mean_function <- function(my_vector, digits) {
  # values that cannot be converted to numeric type become NA (with a warning) and are excluded
  nums <- na.exclude(as.numeric(my_vector))
  return(round(sum(nums) / length(nums), digits = digits))
}
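A quick way to test the function is to compare it with base R's mean(). The block below restates the function so that it runs on its own, and the input vectors are made up for the example.

```r
# Same logic as the function above, repeated so this block is self-contained
my_mean_function <- function(my_vector, digits) {
  nums <- na.exclude(as.numeric(my_vector))
  round(sum(nums) / length(nums), digits = digits)
}

# Hypothetical scores with one missing value
scores <- c(90, 85.5, NA, 77.25)
result <- my_mean_function(scores, 2)
check  <- round(mean(scores, na.rm = TRUE), 2)

# Text that cannot be read as a number becomes NA and is dropped;
# suppressWarnings() hides the coercion warning from as.numeric()
text_result <- suppressWarnings(my_mean_function(c("10", "20", "thirty"), 1))
print(c(result = result, check = check, text_result = text_result))
```

One design point worth noting: because unconvertible values are silently dropped, a typo in the data lowers the count instead of raising an error, which may or may not be what we want.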

III. Please explain Bayes Theorem in your own words, and give an example. Less than 10 sentences. Also, write out the formula. Pick up on how to type equations in R Markdown using Latex terminology.

Bayes’ Theorem is a rule for finding a conditional probability, that is, the probability of one event given that we know another event has already occurred. Given two events, A and B, the definition of conditional probability can be read out loud as “The probability of A given B equals the probability of A and B divided by the probability of B.” Bayes’ Theorem follows by rewriting the probability of A and B as the probability of B given A times the probability of A. If I wanted to find the probability that I passed the first round of an interview given that I made a funny joke at the start that made the interviewer laugh, I could divide the probability that I passed the interview and that I made a funny joke by the probability that I made a funny joke. With IP representing “interview passed” and FJ representing “funny joke”, the theorem would look like this: \(\mathrm{P}(IP \mid FJ) = \frac{\mathrm{P}(IP \cap FJ)}{\mathrm{P}(FJ)} = \frac{\mathrm{P}(FJ \mid IP)\,\mathrm{P}(IP)}{\mathrm{P}(FJ)}\).
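To make the interview example concrete, the sketch below plugs in numbers. Every probability here is hypothetical and chosen only so the arithmetic is easy to follow; it shows that the conditional probability definition and the Bayes form give the same answer.

```r
# All probabilities below are hypothetical
p_fj        <- 0.30  # P(FJ): I make a funny joke
p_ip        <- 0.40  # P(IP): I pass the interview
p_ip_and_fj <- 0.24  # P(IP and FJ): both happen

# Definition of conditional probability
p_ip_given_fj <- p_ip_and_fj / p_fj

# Bayes form: go through P(FJ | IP) and P(IP) instead of the joint probability
p_fj_given_ip <- p_ip_and_fj / p_ip
bayes_form    <- p_fj_given_ip * p_ip / p_fj

print(c(definition = p_ip_given_fj, bayes = bayes_form))
```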

IV. Bayes’ Theorem can yield surprising results. Take a look at Open Stats textbook Guided Practice 3.43 and attempt to solve this using Bayes’ formula. Interpret the results. Then use the attached code and solve via a tree diagram in R. You will need to change the initial parameters to the appropriate values. Attach your final graph to your submission, and your code.

library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

DF <- data.frame(in. = c("Full", "Not Full", "Full", "Not Full", "Full", "Not Full"), out. = c("Academic", "Academic", "Sporting", "Sporting", "None", "None")) # input

g <- graph_from_edgelist(as.matrix(DF[2:1]))
lay <- layout_as_tree(g)
plot(as_undirected(g), layout = lay %*% diag(c(1, -1)))

I have spent a lot of time trying to nail down the tree diagrams, but this is the best that I have been able to come up with this week. Luckily, my understanding is better than my visualization.

\(\mathrm{P}(Sporting \mid Full) = \frac{\mathrm{P}(Full \mid Sporting)\,\mathrm{P}(Sporting)}{\mathrm{P}(Full \mid Sporting)\,\mathrm{P}(Sporting) + \mathrm{P}(Full \mid Academic)\,\mathrm{P}(Academic) + \mathrm{P}(Full \mid None)\,\mathrm{P}(None)}\)

Numerically, this is:

(0.7 * 0.2) / ((0.7 * 0.2) + (0.25 * 0.35) + (0.05 * 0.45))

## [1] 0.56
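The same calculation can be laid out with named vectors, which also gives the posterior probability for the other two event types. This is a sketch using the probabilities from the calculation above; the object names are my own.

```r
# Prior probability of each event type
prior <- c(Academic = 0.35, Sporting = 0.20, None = 0.45)

# Probability that the garage is full given each event type
p_full_given <- c(Academic = 0.25, Sporting = 0.70, None = 0.05)

# Numerators of Bayes' formula: P(Full | event) * P(event)
joint <- p_full_given * prior

# Denominator: overall probability that the garage is full
p_full <- sum(joint)

# Posterior probability of each event type given a full garage
posterior <- joint / p_full
print(round(posterior, 2))
```

Read this way, learning that the garage is full raises the probability of a sporting event from 0.20 to 0.56, even though sporting events are the least common of the three scenarios.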