In R, a class can refer to the data type of a variable, though the term is also used more broadly to describe a grouping of data and functions. For our purposes we will focus on the basic data types in R, which include (but are not limited to) integers, doubles, characters, and logical values. The type or class of variable you choose depends on what kind of data you are attempting to represent. Integers and doubles are ideal for numeric data, while characters and logical values are well suited to categorizing.
A data structure, by contrast, is how you choose to organize multiple values into a single object. In R the basic structures are vectors, matrices, arrays, data frames, and lists. The first three (vectors, matrices, and arrays) are groupings of a single data type: vectors are 1D, matrices are 2D, and arrays can have any number of dimensions (3D being a common case). Data frames and lists are more flexible and can include multiple types of data. In the following questions, I will be using a data frame as an example.
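As a rough illustration of the five structures just described, the sketch below builds one of each. The object names and values are made up purely for demonstration and are not part of the assignment data.

```r
# One toy example of each basic structure (all names and values are illustrative)
v  <- c(1.5, 2.5, 3.5)                      # vector: 1D, a single type
m  <- matrix(1:6, nrow = 2, ncol = 3)       # matrix: 2D, a single type
a  <- array(1:24, dim = c(2, 3, 4))         # array: here 3D, a single type
df <- data.frame(id = 1:3,                  # data frame: columns may differ in type
                 name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
l  <- list(1L, "text", TRUE, v)             # list: elements of any type or structure

length(v)          # 3
dim(m)             # 2 3
dim(a)             # 2 3 4
sapply(df, class)  # one class per column
length(l)          # 4
```

Calling str() on any of these objects gives the same kind of overview used for the data frame in the next question.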
For this question I chose to revisit the Titanic data set from last week and applied the str() command to it. From the output we can see that it is a data.frame with 891 observations and twelve variables, along with the data type of each variable. Because data frames are a more complex grouping of data, this command is ideal for a comprehensive overview.
Titanic.disc2 <- read.csv("~/Boston College/Data Analysis/Homework/Homework 1 - Titanic/Titanic HW1.csv")
str(Titanic.disc2)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
That said, we can get more specific. Applying the class() command again shows that Titanic.disc2 is a data.frame, while using typeof() on a specific column in the data frame tells us the data type stored there.
class(Titanic.disc2)
## [1] "data.frame"
typeof(Titanic.disc2$PassengerId)
## [1] "integer"
typeof(Titanic.disc2$Name)
## [1] "character"
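The difference between class() and typeof() is easier to see when both are applied to the same object. The sketch below uses a small made-up data frame (the name toy and its values are assumptions for illustration, not the Titanic file) so that it runs without the CSV.

```r
# Toy data frame standing in for the Titanic data (hypothetical values)
toy <- data.frame(id = 1:3,
                  fare = c(7.25, 71.28, 7.92),
                  stringsAsFactors = FALSE)

class(toy)        # "data.frame"
typeof(toy)       # "list": a data frame is stored internally as a list of columns
class(toy$fare)   # "numeric"
typeof(toy$fare)  # "double"
sapply(toy, typeof)  # storage type of every column at once
```

In other words, class() reports how R treats the object, while typeof() reports how it is stored.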
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x0000025766a18ed8>
## <environment: namespace:stats>
The above function, sd (standard deviation), is a common statistical calculation that shows how volatile a set of data is. The function depends on the var() function, which is the variance function. To get the standard deviation you take the square root of the variance, so this makes sense. The if statement inside handles different kinds of input: if x is a vector or a factor it is passed to var() unchanged, and otherwise it is first converted to a numeric vector with as.double().
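A quick way to check this reading of the source is to compare sd() against the square root of var() directly. The sample values below are arbitrary and chosen only for illustration; the matrix case is included to exercise the as.double() branch.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # arbitrary sample values

sd(x)
sqrt(var(x))
all.equal(sd(x), sqrt(var(x)))     # TRUE

# A matrix is not a plain vector, so sd() flattens it with as.double() first
m <- matrix(c(1, 2, 3, 4), nrow = 2)
all.equal(sd(m), sd(c(1, 2, 3, 4)))  # TRUE
```

Without that conversion, var(m) would return a covariance matrix between the columns rather than a single variance.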
mad
## function (x, center = median(x), constant = 1.4826, na.rm = FALSE,
## low = FALSE, high = FALSE)
## {
## if (na.rm)
## x <- x[!is.na(x)]
## n <- length(x)
## constant * if ((low || high) && n%%2 == 0) {
## if (low && high)
## stop("'low' and 'high' cannot be both TRUE")
## n2 <- n%/%2 + as.integer(high)
## sort(abs(x - center), partial = n2)[n2]
## }
## else median(abs(x - center))
## }
## <bytecode: 0x0000025765e5f810>
## <environment: namespace:stats>
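This function calculates the median absolute deviation of a set of data, another measure of volatility. Here you make a list of how far, in absolute terms, each data point is from the median, and then you take the median of that list. The function signature takes the input, sets the center to the median by default, and sets a constant of 1.4826, a scale factor that makes the result comparable to the standard deviation for normally distributed data. The next block of code removes NA entries if na.rm is TRUE. After that the code becomes a bit more complex: when low or high is TRUE and the number of observations is even, it picks the lower or higher of the two middle absolute deviations instead of averaging them; otherwise it takes the ordinary median of the absolute deviations. Either way, the result is multiplied by the constant.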
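The calculation can be reproduced by hand to confirm this description. The vectors below are made-up examples; the second one has an even length so that the low and high options actually change the result.

```r
x <- c(1, 2, 3, 4, 100)                # illustrative data with one outlier
dev <- abs(x - median(x))              # distances from the median: 2 1 0 1 97
manual <- 1.4826 * median(dev)         # scaled median of those distances
all.equal(manual, mad(x))              # TRUE

# With an even number of points the two middle deviations differ
y <- c(1, 2, 3, 4)                     # deviations from 2.5: 1.5 0.5 0.5 1.5
mad(y, constant = 1)                   # 1   (average of 0.5 and 1.5)
mad(y, constant = 1, low = TRUE)       # 0.5 (lower middle value)
mad(y, constant = 1, high = TRUE)      # 1.5 (upper middle value)
```

Note how the outlier of 100 barely affects the result for x, which is the main appeal of this measure over the standard deviation.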
myVar <- function(x, avg = mean(x), n = length(x)) {  # take the input and compute the mean and sample size
  sum((x - avg)^2) / (n - 1)  # sample variance: sum of squared deviations divided by n - 1
}
For this question I tried to write a function that would find the variance of a given vector using the formula given in the OpenIntro Statistics book. I was unsure how exactly to perform operations on a vector inside a function, so it took some trial and error, but I was eventually able to get results consistent with the base var() function.
x <- c(1,2,3,4,5,6,7,9,10)
myVar(x) # My function
## [1] 9.444444
var(x) # base function for comparison
## [1] 9.444444
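The reason this works without a loop is that arithmetic in R is applied element by element across a vector. The sketch below breaks the same variance calculation into separate steps; the intermediate names deviations and squared are introduced here only for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 9, 10)

deviations <- x - mean(x)         # mean(x) is one number, subtracted from every element
squared    <- deviations^2        # ^ is applied to each element in turn
result     <- sum(squared) / (length(x) - 1)  # collapse to one number, divide by n - 1

all.equal(result, var(x))         # TRUE
```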
Bayes' theorem, simply put, describes how we update our beliefs about the probability of an event given some relevant information. An example I like is the question of whether we think a man named Steve is a librarian, given that he is shy and timid and prefers things neat and orderly. I first encountered this hypothetical in the book Thinking, Fast and Slow by Daniel Kahneman, and I ran into it again in a video by 3Blue1Brown while preparing for this assignment.
In it we have two distinct sets of outcomes. The first is whether or not Steve is a librarian; we can describe the probability of him being a librarian with the notation P(L). The second is whether or not Steve is shy and orderly; we can describe the probability of him being shy and orderly with the notation P(S). Once we know that Steve is in fact shy and orderly, we update our belief about whether or not he is a librarian. This new probability can be written as P(L|S), or the chance that Steve is a librarian given that he is shy and orderly. Bayes' theorem solves for this probability:
\(P(L \mid S) = \displaystyle \frac{P(S \mid L)*P(L)}{P(S)}\)
In short, we take the probability that Steve is both shy and a librarian, P(S|L)*P(L), and divide it by the overall probability that he is shy, P(S), giving us an updated estimate of how likely he is to be a librarian.
The problem from the OpenIntro Statistics book is written as follows:
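To make the formula concrete, here is a minimal sketch with made-up numbers. The three input probabilities and the helper name bayes_posterior are assumptions chosen only for illustration; they do not come from Kahneman's book or the video.

```r
# Hypothetical helper: posterior P(L|S) from a prior and two likelihoods
bayes_posterior <- function(prior, lik_given_true, lik_given_false) {
  evidence <- lik_given_true * prior + lik_given_false * (1 - prior)  # P(S)
  lik_given_true * prior / evidence                                   # P(L|S)
}

p_L             <- 0.05  # assumed: 5% of men in the population are librarians
p_S_given_L     <- 0.40  # assumed: 40% of librarians are shy and orderly
p_S_given_not_L <- 0.10  # assumed: 10% of non-librarians are shy and orderly

posterior <- bayes_posterior(p_L, p_S_given_L, p_S_given_not_L)
posterior  # about 0.174
```

Under these assumed inputs the description raises the probability from 5% to roughly 17%, yet Steve is still far more likely not to be a librarian because librarians are rare to begin with.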
Jose visits campus every Thursday evening. However, some days the parking garage is full, often due to college events. There are academic events on 35% of evenings, sporting events on 20% of evenings, and no events on 45% of evenings. When there is an academic event, the garage fills up about 25% of the time, and it fills up 70% of evenings with sporting events. On evenings when there are no events, it only fills up about 5% of the time. If Jose comes to campus and finds the garage full, what is the probability that there is a sporting event? Use a tree diagram to solve this problem. (OpenIntro Statistics Fourth Edition, p. 107)
Here we are looking for P(SE|FG), where SE = sporting event and FG = full garage. Using Bayes' theorem we can write it as:
\(P(SE \mid FG) = \displaystyle \frac{P(FG \mid SE)*P(SE)}{P(FG)}\)
Let's start by solving for the numerator, a simple operation since both terms are given in the prompt:
\(P(FG \mid SE)*P(SE) = 0.7*0.2 = 0.14\)
Solving for the denominator is a bit more complicated, as we have to find the overall probability of the garage being full across all evenings. We can find this by adding up the probability of a full garage under each kind of evening, where AE = academic event and NE = no event:
\(P(FG) = P(FG \mid SE)*P(SE)+P(FG \mid AE)*P(AE)+P(FG \mid NE)*P(NE) = 0.7*0.2+0.25*0.35+0.05*0.45 = 0.25\)
With both halves solved, we can now solve for P(SE|FG) and see that it comes out to 0.56. Put differently, if the garage is full, there is a 56% chance that there is a sporting event going on.
\(P(SE \mid FG) = \displaystyle \frac{P(FG \mid SE)*P(SE)}{P(FG)} = \displaystyle \frac{0.14}{0.25} = 0.56\)
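The same arithmetic can be checked in R with named vectors. The probabilities come straight from the problem statement; the variable names are my own.

```r
# Probability of each kind of evening, and of a full garage on each kind
prior  <- c(SE = 0.20, AE = 0.35, NE = 0.45)
p_full <- c(SE = 0.70, AE = 0.25, NE = 0.05)

joint <- prior * p_full        # P(FG | event) * P(event) for each event type
p_fg  <- sum(joint)            # P(FG), about 0.25
posterior <- joint / p_fg      # P(event | FG) for each event type

round(posterior, 2)            # SE 0.56, AE 0.35, NE 0.09
```

Working with all three events at once also shows that the posterior probabilities sum to 1, which is a useful sanity check.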
A tree diagram is a great way to visualize conditional probability when working with Bayes' theorem. In the code below I generated a tree plot using an open source tool from DataKwery.com. The GitHub repo with the function and the open source license used can be found here.
As an input, all that was needed was a two-column CSV file set up as follows:
input <- read.csv("~/Boston College/Data Analysis/Discussions/Discussion 2/r-tree-diagram-main/probabilities.csv")
input
## pathString prob
## 1 SE 0.20
## 2 SE/FG 0.70
## 3 SE/EG 0.30
## 4 AE 0.35
## 5 AE/FG 0.25
## 6 AE/EG 0.75
## 7 NE 0.45
## 8 NE/FG 0.05
## 9 NE/EG 0.95
Plugging that into the tool and calling the function from the other script gives us the following graph:
source("~/Boston College/Data Analysis/Discussions/Discussion 2/r-tree-diagram-main/tree-diagram.R")
make_my_tree(mydf = input)
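The probabilities at the ends of the branches can also be cross-checked numerically without the plotting tool. The sketch below recreates the input table inline so that it runs on its own, and multiplies each second-level probability by its parent's probability. It assumes EG denotes a garage that is not full, and it is only a check of the arithmetic, not a description of how make_my_tree() works internally.

```r
# Same table as the CSV, typed inline so the example is self-contained
input <- data.frame(
  pathString = c("SE", "SE/FG", "SE/EG", "AE", "AE/FG", "AE/EG", "NE", "NE/FG", "NE/EG"),
  prob       = c(0.20, 0.70, 0.30, 0.35, 0.25, 0.75, 0.45, 0.05, 0.95),
  stringsAsFactors = FALSE
)

is_leaf     <- grepl("/", input$pathString, fixed = TRUE)      # second-level rows
parent      <- sub("/.*$", "", input$pathString[is_leaf])      # "SE", "AE", or "NE"
parent_prob <- input$prob[match(parent, input$pathString)]     # probability of that event

leaves <- data.frame(path  = input$pathString[is_leaf],
                     joint = parent_prob * input$prob[is_leaf])  # P(event and outcome)
leaves
sum(leaves$joint)  # all six branch ends together should sum to 1

full <- leaves[grepl("/FG$", leaves$path), ]                   # branches ending in a full garage
full$joint[full$path == "SE/FG"] / sum(full$joint)             # P(SE | FG), about 0.56
```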