Part I: Classes
Data classes describe the type of data a variable holds. For example, data on height may be classified as numeric, since height is measured on a continuous scale and can take fractional values. A data point for height may be 5.5 feet (5 feet 6 inches), for example. Numeric data can also be stored as an integer. The difference between a numeric variable and an integer variable is that integer variables only take on whole numbers. An example of this may be age, which is typically measured in years to the nearest whole number (although this could change based on the intent of the data collection, e.g., if we were measuring toddlers as 1.5 or 2.5 years old).
Variables can also be classified as logical, where the variable only takes on values of TRUE or FALSE (or NA if a value is missing). A yes/no characteristic of an observation is a typical example. For instance, if there is a variable for whether a Medicare beneficiary is a new enrollee, its valid values may be TRUE if the beneficiary is categorized as a new enrollee and FALSE if not. Logical values are also called Boolean values.
Finally, there is character data. Character data includes variables that hold qualitative data, such as the name or address of a person. Oftentimes, these data are grouped or aggregated to get a summarized view of the overall population. For example, Medicare beneficiaries may have their state recorded as character data, but we may look at the number of Medicare beneficiaries by state to get a better sense of how many eligible beneficiaries are enrolled by region.
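As a quick illustration of the four classes described above, the following sketch checks a few made-up values with class() and typeof(). The variable names (height, age, new_enrollee, states) are hypothetical and only meant to mirror the examples in the text.

```r
# Hypothetical example values, one per data class
height <- 5.5          # numeric (stored as double)
age <- 30L             # integer; the L suffix asks R for an integer
new_enrollee <- TRUE   # logical
state <- "MD"          # character

class(height)        # "numeric"
typeof(height)       # "double"
class(age)           # "integer"
class(new_enrollee)  # "logical"
class(state)         # "character"

# Without the L suffix, a whole number is still stored as numeric
class(30)            # "numeric"

# Character data are often summarized by counting, e.g. records per state
states <- c("MD", "VA", "MD", "DC", "MD")
state_counts <- table(states)
state_counts
```

Note that class() reports "numeric" for a double, while typeof() reports the underlying storage type.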
Data Structures
These different data classes can be arranged into different data structures, which store collections of values. The first of these data structures is the vector. A vector can be thought of as data in a straight line, or a one-dimensional collection of data. Although other data structures can hold data of different types, a vector must contain data of a single type/class. For example, a vector cannot contain both character and numeric data, i.e., vector <- c("test", 100). If we run this code, R converts the data to the “common denominator” class, which here is character, so 100 is stored as "100". This is called coercion.
One way to get around this is to create a list. A list is more flexible than a vector: it can hold elements of any data class, and each element can be of any length. Writing the previous example as list("test", 100) keeps the differing data types, where "test" is stored as character and 100 as numeric. Lists can also contain other lists.
Next, we have a two-dimensional data structure called a matrix. A matrix consists of both rows and columns. Unlike lists, but similar to vectors, matrices are homogeneous, meaning that all data in the matrix must be of the same class. If you create a matrix with different data classes, R will coerce the data into the same class, just as it does for a vector. See the following example:
#Create a Matrix
matrix <- matrix(c("a","b","c",100),nrow=2,ncol=4)
#Check if the matrix we created is indeed a matrix
is.matrix(matrix)
## [1] TRUE
#Print out our matrix
print(matrix)
##      [,1] [,2]  [,3] [,4]
## [1,] "a"  "c"   "a"  "c"
## [2,] "b"  "100" "b"  "100"
In this example, R coerces the numeric value 100 into the character string "100". Because only four values are supplied for a 2 x 4 matrix, R also recycles them to fill the remaining columns.
After matrices, we have arrays. Arrays are similar to matrices and vectors in that they are homogeneous, but they differ in that an array can have any number of dimensions. A matrix is simply a two-dimensional array, and a vector can be thought of as the one-dimensional case (although R only reports an object as an array when it has a dim attribute, so is.array() returns FALSE for a plain vector). You can create a multi-dimensional array by combining matrix_2 with vector to get a character array.
A data frame is similar to a list, except that it is two-dimensional instead of one-dimensional: it is a list of columns of equal length. Like a list, it is heterogeneous, meaning it can contain multiple data classes (although each column holds a single class). Data frames are typically used when we import CSV or Excel data into R. An example of this is our Titanic data from last week’s homework. The train.csv data contained qualitative data like passenger names (character class), quantitative data like fare price (numeric class), and logical-style variables like survived (although this is technically stored as integer class data, it acts as a Boolean variable where 0 = died and 1 = survived). Every passenger (or observation) has an entry for each of these variables, even if some entries are NA, so the dataset fits naturally in a data frame.
Factors are used to store categorical variables; the distinct categories are called the levels of the factor. An example of this would be storing the levels male and female in a factor vector.
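The coercion and structure rules above can be checked directly in R. This is a minimal sketch using made-up values; the object names (mixed_vector, mixed_list, small_array, people, sex) are assumptions for illustration and do not come from the homework data.

```r
# A vector is homogeneous: mixing character and numeric coerces to character
mixed_vector <- c("test", 100)
class(mixed_vector)              # "character"

# A list keeps each element's own class
mixed_list <- list("test", 100)
sapply(mixed_list, class)        # "character" "numeric"

# An array generalizes a matrix to more than two dimensions
small_array <- array(1:24, dim = c(2, 3, 4))
dim(small_array)                 # 2 3 4

# A matrix is a two-dimensional array; a plain vector has no dim attribute
is.array(matrix(1:4, nrow = 2))  # TRUE
is.array(1:4)                    # FALSE

# A data frame can hold a different class in each column
people <- data.frame(name = c("Ann", "Bob"),
                     fare = c(7.25, 71.28),
                     stringsAsFactors = FALSE)
sapply(people, class)            # name: "character", fare: "numeric"

# A factor stores a categorical variable together with its levels
sex <- factor(c("male", "female", "female", "male"))
levels(sex)                      # "female" "male"
```

Levels are sorted alphabetically by default, which is why "female" is listed first.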
I applied the class() and typeof() commands on the train data from the Titanic exercise. I found the following:
##Read in Titanic Train Data Set as a Data Frame
train_df <- read.csv('/Users/svonmaluski/Documents/Masters in Applied Economics/Data Analysis/Week 1/train.csv')
#Check what kind of data structure the Titanic Dataset is
class(train_df)
## [1] "data.frame"
#Check the data types of some of the vars in the Titanic Dataset
typeof(train_df$PassengerId)
## [1] "integer"
typeof(train_df$Name)
## [1] "character"
typeof(train_df$Age)
## [1] "double"
The class of train_df being a data frame makes sense, because a data frame is heterogeneous and can hold multiple types of data in a single object. It is also two-dimensional, which fits the data: there is one row for each passenger and one column for each data point recorded about that passenger. For typeof(), I looked at three different variables: PassengerId, Name, and Age. PassengerId makes sense as an integer because it is a whole-number identifier assigned to each passenger. Name = character makes sense because this is qualitative data that identifies each passenger, but it can’t really be aggregated in a useful way. Finally, Age = double makes sense because read.csv() stores a numeric column as integer only when every value is a whole number, and the Age column contains some fractional values (for example, infants under one year old), so it is stored as double.
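To see how read.csv() chooses between integer and double, here is a small sketch that reads a made-up CSV from a string instead of train.csv. The column names and values are hypothetical; the point is only that a column is stored as integer when every value is a whole number and as double otherwise.

```r
# Hypothetical three-row CSV supplied inline through the text argument
csv_text <- "id,name,age
1,Ann,22
2,Bob,0.42
3,Cy,38"

toy_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(toy_df)          # "data.frame"
typeof(toy_df$id)      # "integer": every value is a whole number
typeof(toy_df$name)    # "character"
typeof(toy_df$age)     # "double": 0.42 is fractional

# class() labels a double column as "numeric"
class(toy_df$age)      # "numeric"
```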
Part II: Function: IQR
#Part II:
##Check functions:
###Open Function IQR
IQR
## function (x, na.rm = FALSE, type = 7)
## diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE,
## type = type))
## <bytecode: 0x1056d7f48>
## <environment: namespace:stats>
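Since the above function obtains the interquartile range of a variable x, it finds the difference between the 75th percentile and the 25th percentile: quantile() returns the two percentiles and diff() subtracts the first from the second (IQR = 75th percentile of x - 25th percentile of x). The argument na.rm = FALSE means that missing values are not removed by default, so the function stops with an error instead of returning a value if x contains NA values; setting na.rm = TRUE drops them first. The type = 7 argument is passed to quantile() and selects which of its quantile algorithms is used.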
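The steps inside IQR() can be reproduced by hand on a small made-up vector. This is only a sketch to confirm the reading of the function body above; the vector x and the helper names q, x_na, and result are not part of the homework data.

```r
# Made-up data: the whole numbers 1 through 8
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# The two quartiles that IQR() differences (default type = 7)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
q            # 2.75 6.25
diff(q)      # 3.5
IQR(x)       # 3.5, the same value

# With a missing value and the default na.rm = FALSE, quantile() raises an error
x_na <- c(x, NA)
result <- tryCatch(IQR(x_na), error = function(e) "error")
result       # "error"

# Removing the missing value first gives the original answer
IQR(x_na, na.rm = TRUE)   # 3.5
```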
Function: Standard Deviation
###Open Function SD
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x105171328>
## <environment: namespace:stats>
The function takes the square root of the variance, which is the definition of the standard deviation. The if statement only decides what is passed to var(): if x is a vector or a factor, it is passed as is; otherwise x is first converted to a double vector with as.double(). This means a matrix is flattened into a single vector, and the result is one standard deviation for all of its values rather than one per column. The argument na.rm = FALSE is passed on to var(), so by default the result is NA if x contains NA values.
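A short check of this reading of sd(), using a made-up vector. The names x and m are assumptions for illustration only.

```r
# Made-up data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# sd() is the square root of var()
all.equal(sd(x), sqrt(var(x)))        # TRUE

# A matrix is flattened by as.double(), so sd() returns a single value
m <- matrix(x, nrow = 4)
length(sd(m))                         # 1
all.equal(sd(m), sd(as.vector(m)))    # TRUE

# Missing values propagate unless na.rm = TRUE
sd(c(x, NA))                          # NA
all.equal(sd(c(x, NA), na.rm = TRUE), sd(x))   # TRUE
```

To get one standard deviation per column of a matrix, apply(m, 2, sd) is one option.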
Create a function: The two functions below compute the values 3 standard deviations above and below the mean (this is a methodology we use at work to detect outliers):
##Create a function called "Three Positive Standard Deviations from the Mean"
Three_pos_SDs_from_mean <- function(x) {
  ThreeSDs <- mean(x, na.rm = TRUE) + 3 * sd(x, na.rm = TRUE)
  return(ThreeSDs)
}
#Test out function on the Age column
Three_pos_SDs_from_mean(train_df$Age)
## [1] 73.27861
#Create a function called "Three Negative Standard Deviations from the Mean"
Three_neg_SDs_from_mean <- function(x) {
  ThreeSDs <- mean(x, na.rm = TRUE) - 3 * sd(x, na.rm = TRUE)
  return(ThreeSDs)
}
##Test out function on the Age column
Three_neg_SDs_from_mean(train_df$Age)
## [1] -13.88037
Applied to the Age column of the train data, Three_pos_SDs_from_mean returns 73.27861 and Three_neg_SDs_from_mean returns -13.88037. Since ages cannot be negative, the lower bound can never flag an observation here, so Three_neg_SDs_from_mean is less useful for data that cannot go below zero.
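The two bounds could also be combined into one helper that flags the outlying observations directly. The sketch below is one possible way to do that, not the method used at work; the function names sd_bounds and flag_outliers, the argument k, and the toy data are all hypothetical.

```r
# Hypothetical helper: returns both bounds at k standard deviations from the mean
sd_bounds <- function(x, k = 3) {
  center <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  c(lower = center - k * spread, upper = center + k * spread)
}

# Hypothetical helper: TRUE where a non-missing value falls outside the bounds
flag_outliers <- function(x, k = 3) {
  bounds <- sd_bounds(x, k)
  !is.na(x) & (x < bounds[["lower"]] | x > bounds[["upper"]])
}

# Made-up data: nineteen values of 10, one extreme value, and one missing value
toy <- c(rep(10, 19), 100, NA)

sd_bounds(toy)
which(flag_outliers(toy))   # 20, the position of the value 100
```

One caveat with this rule is that an extreme value inflates the standard deviation it is compared against, so in very small samples no observation can lie more than 3 standard deviations from the mean.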
Part III: Bayes’ Theorem
Bayes’ theorem provides the probability that an event occurs given evidence from a related event that has already occurred. In other words, it describes the conditional probability of event A when event B has already occurred.
The official formula is this: \(P(A|B)\) = \(\frac{P(B|A) * P(A)}{P(B)}\)
One example of this could be the probability that it snows given that it’s cloudy outside. If the probability of snow is 10% on any given day in January, the probability of a cloudy day in January is 50%, and we know that there’s a 60% chance it is cloudy on a day when it snows, the formula looks like this:
\(P(Snow|Cloud)\) = \(\frac{P(Cloud|Snow) * P(Snow)}{P(Cloud)}\)
which equates to: \(P(Snow|Cloud)\) = \(\frac{(.6 * .1)}{.5}\) = .12 or 12%
So there is a 12% chance it will snow on any given day in January if it’s cloudy out that day.
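The same arithmetic can be written out in R. This is a small sketch with hypothetical variable names, taking 0.6 as P(Cloud|Snow), which is how it enters the formula.

```r
# Inputs from the snow example
p_snow <- 0.10               # P(Snow)
p_cloud <- 0.50              # P(Cloud)
p_cloud_given_snow <- 0.60   # P(Cloud | Snow)

# Bayes' theorem: P(Snow | Cloud) = P(Cloud | Snow) * P(Snow) / P(Cloud)
p_snow_given_cloud <- p_cloud_given_snow * p_snow / p_cloud
p_snow_given_cloud           # 0.12
```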
Part IV:
Academic Event = .35 chance
Sporting Event = .2 chance
No Event = .45 chance
Academic Event, Garage full = .25; Academic Event, Garage not full = .75. Probability of an Academic event and garage is full: .35 * .25 = 0.0875
Sporting Event, Garage full = .7; Sporting Event, Garage not full = .3. Probability of a Sporting event and garage is full: .2 * .7 = 0.14
No Event, Garage full = .05; No Event, Garage not full = .95. Probability of No event and garage is full: .45 * .05 = 0.0225
Formula: \(P(A|B)\) = \(\frac{P(B|A) * P(A)}{P(B)}\)
Rewrite, where E is the event taking place (here, a sporting event) and F is the garage being full: \(P(E|F)\) = \(\frac{P(F|E) * P(E)}{P(F)}\)
Plug in: \(P(E|F)\) = \(\frac{.7 * .2}{(.0875 + .14 +.0225)}\) = \(\frac{.14}{.25}\) = .56
So there is a 56% chance that, based on the fact that the garage is full, there’s a sporting event going on.
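The calculation can be repeated for all three event types at once. The sketch below uses the probabilities listed in this part; the vector names (prior, p_full_given_event, joint, posterior) are hypothetical.

```r
# Prior probability of each event type
prior <- c(Academic = 0.35, Sporting = 0.20, None = 0.45)

# Probability that the garage is full given each event type
p_full_given_event <- c(Academic = 0.25, Sporting = 0.70, None = 0.05)

# Joint probability of each event type and a full garage
joint <- prior * p_full_given_event
joint                 # 0.0875 0.1400 0.0225

# Total probability that the garage is full, P(F)
p_full <- sum(joint)
p_full                # 0.25

# Posterior probability of each event type given a full garage
posterior <- joint / p_full
round(posterior, 2)   # 0.35 0.56 0.09
```

The three posteriors sum to 1, which is a useful check that the denominator covers every way the garage can be full.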
Below is the way I was able to plot the probability tree for this problem. Note that I tried to find the code mentioned in the discussion instructions, but I couldn’t find it. In addition, I tried to use the package mentioned in the discussion instructions and searched Google to try to figure it out, but it kept crashing my R program, so I had to switch to using DiagrammeR.
#Part IV: Bayes Theorem
#install.packages("DiagrammeR")
library(DiagrammeR)
nodes <- create_node_df(
  n = 10,
  label = c(
    "Garage Full/\nEvent Probability",
    "Academic \nEvent,\n 35%",
    "Sporting \nEvent,\n 20%",
    "No \nEvent, \n45%",
    "AE Garage\n Full, \n25%\nP(E|F) = 35%",
    "AE Garage\n Not Full, \n75%",
    "SE Garage\n Full, \n70%\nP(E|F) = 56%",
    "SE Garage\n Not Full,\n30%",
    "NE Garage\n Full, \n5%\nP(E|F) = 9%",
    "NE Garage\n Not Full, \n95%"
  ),
  shape = "box"
)
edges <- create_edge_df(
  from = c(1, 1, 1,
           2, 2,
           3, 3,
           4, 4),
  to = c(2, 3, 4,
         5, 6,
         7, 8,
         9, 10),
  rel = "arrow"
)
tree_graph <- create_graph(
  nodes_df = nodes,
  edges_df = edges
)
render_graph(tree_graph, layout = "tree")