# clear workspace
rm(list = ls())
cat("\f")
data("Nile")
NR <- Nile #assign the built-in Nile time series to the variable NR
str(NR) #shows the compact structure of the object: its class, length, time span, and first few values
## Time-Series [1:100] from 1871 to 1970: 1120 1160 963 1210 1160 1160 813 1230 1370 1140 ...
class(NR) #gives class type of 'NR'
## [1] "ts"
This function tells us the high-level data structure of the object. Here it reports that ‘NR’ (from the dataset Nile) is a time series ("ts").
typeof(NR) #Returns the low-level object data type
## [1] "double"
This function tells us the low-level storage type of ‘NR’ (for Nile River). The values in NR are stored as doubles.
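To make the difference between class() and typeof() more concrete, here is a small sketch that compares the two on a few built-in objects. The helper name `describe` is made up for this illustration and is not part of base R.

```r
# Compare the high-level class with the low-level storage type.
# `describe` is a hypothetical helper, not a base R function.
describe <- function(obj) {
  c(class = class(obj)[1], typeof = typeof(obj))
}

describe(Nile)                  # class "ts",         typeof "double"
describe(mtcars)                # class "data.frame", typeof "list"
describe(factor(c("a", "b")))   # class "factor",     typeof "integer"
```

The class describes how R treats the object, while typeof() describes how the values are stored underneath.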
There are 5 commonly used basic (atomic) classes of data in R. These are logical, integer, numeric, complex, and character. (A sixth type, raw, stores bytes and is rarely needed in everyday analysis.)
Logical values are either TRUE or FALSE (or NA when a value is missing). In other words, logical is R's Boolean data type.
#Boolean Example
typeof(TRUE)
## [1] "logical"
Numeric data are numbers. R stores them as one of two types: integers and doubles. Integers are 32-bit and can only hold whole numbers; doubles are 64-bit floating-point values. Unlike some languages, R has no separate single-precision (32-bit floating-point) storage type. By default R stores numbers as doubles, so a whole number is only stored as an integer if you add the L suffix, and most math in R is performed on doubles. Doubles take up more memory than integers (8 bytes versus 4 bytes per value).
#Integer Example
typeof(4L)
## [1] "integer"
typeof(4)
## [1] "double"
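As a quick sketch of how integers and doubles behave differently, the lines below use only base R functions. The specific values are arbitrary and chosen just for illustration.

```r
# Integers and doubles both count as "numeric", but are stored differently.
is.numeric(4L)            # TRUE
is.numeric(4)             # TRUE
is.integer(4)             # FALSE: a bare 4 is stored as a double

# Ordinary division returns a double, even for two integers.
typeof(6L / 3L)           # "double"

# Integer division with %/% keeps the integer type.
typeof(7L %/% 2L)         # "integer"

# The largest value a 32-bit integer can hold in R.
.Machine$integer.max      # 2147483647
```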
Character data types are used for text (strings) in R. Character values must be wrapped in quotation marks. Numbers can also be stored as characters by putting them in quotation marks, for example "4". Note that if a vector contains even one character value, every element is coerced to character, including any numbers.
#Character Example
typeof("a")
## [1] "character"
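The coercion rule described above can be checked directly. This is a minimal sketch with made-up values; `mixed` is just an illustrative variable name.

```r
# Mixing numbers and text in c() coerces every element to character.
mixed <- c(1, 2, "three")
mixed                     # "1" "2" "three"
typeof(mixed)             # "character"

# Text that looks like a number can be converted back explicitly.
as.numeric("4") + 1       # 5

# Text that does not look like a number becomes NA (with a warning).
suppressWarnings(as.numeric("three"))   # NA
```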
Data Structures:
Vector: A vector is a one-dimensional data structure that holds elements of a single data type (numeric, character, logical, etc.). All the elements in a vector must be of the same type; a single value is simply a vector of length 1.
#Vector Example
my_vector <- c(1,2,3,4,5)
my_vector
## [1] 1 2 3 4 5
Factor: Factors represent categorical data. They are stored as integer codes underneath, with a label (level) for each unique code. The factor() function takes a vector and, optionally, a levels argument that sets the allowed categories and their order; if levels is omitted, the sorted unique values are used.
#Factor Example
f <- c("yes", "no", "yes", "no", "maybe", "no")
my_factors <- factor(f)
print(my_factors)
## [1] yes no yes no maybe no
## Levels: maybe no yes
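To see the integer codes that sit underneath a factor, we can strip the labels with as.integer(). The sketch below repeats the same input vector so it runs on its own; `releveled` is an illustrative name.

```r
f <- c("yes", "no", "yes", "no", "maybe", "no")

# Default: levels are the sorted unique values.
my_factors <- factor(f)
levels(my_factors)        # "maybe" "no" "yes"
as.integer(my_factors)    # 3 2 3 2 1 2

# Supplying `levels` sets the order (and therefore the codes) explicitly.
releveled <- factor(f, levels = c("no", "maybe", "yes"))
levels(releveled)         # "no" "maybe" "yes"
as.integer(releveled)     # 3 1 3 1 2 1
```

The labels shown when printing are the same in both cases; only the underlying codes and the level order change.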
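Tables: A table stores counts for each category (or combination of categories). The table() function is used to create frequency tables and cross-tabulations of categorical data, among other uses. A table can be converted to a data frame, but it is itself an array of counts rather than a list of columns.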
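A minimal frequency-table sketch, using made-up survey-style answers (the vector `answers` is hypothetical):

```r
# Table Example (hypothetical data)
answers <- c("yes", "no", "yes", "no", "maybe", "no")

counts <- table(answers)
counts                    # one count per category: maybe 1, no 3, yes 2
class(counts)             # "table"
counts["no"]              # the count for "no" is 3

# Proportions instead of counts; they sum to 1.
shares <- prop.table(counts)

# Converting to a data frame gives one row per category.
counts_df <- as.data.frame(counts)
```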
Data Frame: Data frames are a type of list with some restrictions. Each element (column) of a data frame is a vector, and all columns must have the same length. Different columns can hold different data types, but every value within a single column must be of the same type. Data frames are similar to spreadsheets.
#Data Frame Example
data_frame <- data.frame(id = 1:5, names = c("Ryan", "Matt", "Sam", "Carter", "Brain"), weight = c(120, 135, 187, 123, 150))
data_frame
## id names weight
## 1 1 Ryan 120
## 2 2 Matt 135
## 3 3 Sam 187
## 4 4 Carter 123
## 5 5 Brain 150
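To check that each column keeps its own type while all columns share one length, we can inspect a small data frame column by column. The data below are made up for illustration.

```r
# Hypothetical data, only for illustration.
df <- data.frame(
  id     = 1:3,
  name   = c("A", "B", "C"),
  weight = c(120.5, 135, 187),
  stringsAsFactors = FALSE
)

sapply(df, class)         # id "integer", name "character", weight "numeric"
is.list(df)               # TRUE: a data frame is a list of columns
lengths(df)               # every column has length 3
```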
List: A list is often considered the most flexible data structure in R. It is an ordered collection of elements. These elements can be of any data type, length, or structure. You can store vectors, matrices, data frames, and even other lists inside a list.
#List Example
my_list <- list(1, "Hello", 4L, "a", TRUE)
my_list
## [[1]]
## [1] 1
##
## [[2]]
## [1] "Hello"
##
## [[3]]
## [1] 4
##
## [[4]]
## [1] "a"
##
## [[5]]
## [1] TRUE
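Since lists can hold elements of different lengths and even other lists, here is a short sketch of a nested list and the usual ways to pull elements out. The object `nested_list` and its element names are invented for this example.

```r
# Hypothetical nested list, only for illustration.
nested_list <- list(
  numbers = c(1, 2, 3),
  label   = "Hello",
  inner   = list(flag = TRUE, id = 4L)
)

nested_list[["numbers"]]          # the vector itself: 1 2 3
nested_list$inner$flag            # TRUE
class(nested_list["label"])       # "list": single brackets return a sub-list
class(nested_list[["label"]])     # "character": double brackets return the element
lengths(nested_list)              # numbers 3, label 1, inner 2
```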
X <- c(12, 4, 6, 17, 8, 3, 14) #Creating a Vector with 7 elements
X
## [1] 12 4 6 17 8 3 14
#Calculates the standard deviation of the vector X using R's built-in function
R_StandardDeviation_InBuilt <- sd(X)
print(R_StandardDeviation_InBuilt)
## [1] 5.304984
#This is the standard deviation of X calculated 'by hand'
R_Standard_Deviation_Hand <- sqrt(sum((X-mean(X))^2)/(length(X)-1))
print(R_Standard_Deviation_Hand)
## [1] 5.304984
As we can see, the by-hand calculation and R’s built-in sd() function return the same value (5.304984), which confirms that the by-hand calculation was done correctly.
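To show where that number comes from, the by-hand formula can be broken into its intermediate steps. This sketch repeats the vector so it runs on its own; the variable names are illustrative.

```r
X <- c(12, 4, 6, 17, 8, 3, 14)
n <- length(X)

deviations <- X - mean(X)          # distance of each value from the mean
sum_sq     <- sum(deviations^2)    # about 168.857
sample_var <- sum_sq / (n - 1)     # about 28.143 (divide by n - 1, not n)
sample_sd  <- sqrt(sample_var)     # about 5.305, the same as sd(X)

# Dividing by n instead gives the (smaller) population version.
population_sd <- sqrt(sum_sq / n)  # about 4.911
```

The n - 1 in the denominator is what makes this the sample standard deviation, which is the version sd() computes.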
# Showing the code of the built in function 'sd'
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x000002caa4088c10>
## <environment: namespace:stats>
This prints the full source of the standard deviation function. The first line lists the arguments: x (the data) and na.rm, which controls whether NA values are removed and defaults to FALSE. The next part is the meat of the function: it takes the square root of the variance. Before computing the variance, the function checks whether x is a vector or a factor; if it is, x is passed along unchanged, and otherwise x is converted to a double with as.double() so the math can be performed.
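Two things from the source above can be checked quickly: that sd() is the square root of var(), and what the na.rm argument does. The vector with a missing value, `X_missing`, is made up for this sketch.

```r
X <- c(12, 4, 6, 17, 8, 3, 14)

# sd() is just the square root of the variance.
all.equal(sd(X), sqrt(var(X)))     # TRUE

# With a missing value, the default na.rm = FALSE returns NA.
X_missing <- c(X, NA)
sd(X_missing)                      # NA

# Setting na.rm = TRUE drops the NA first, so the result matches sd(X).
sd(X_missing, na.rm = TRUE)
```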
#Showing code in built in IQR Function
IQR
## function (x, na.rm = FALSE, type = 7)
## diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE,
## type = type))
## <bytecode: 0x000002caa6249590>
## <environment: namespace:stats>
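This is the IQR function. The arguments are x (the data), na.rm (whether to remove NA values; the default FALSE keeps them), and type (which selects the quantile algorithm; the default is 7).
The meat of the function takes the difference between the 3rd and 1st quartiles (the 75th and 25th percentiles), which gives the width of the middle 50% of the data.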
#Finding the IQR of our vector X
IQR(X)
## [1] 8
The IQR of vector X is 8. Note that 8 is a single number, not a pair of endpoints: the IQR function returns the width of the interquartile range (the 75th percentile minus the 25th percentile), unlike range(), which returns the minimum and maximum themselves. The two quartiles can still be calculated separately with quantile().
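As a sketch of how the endpoints themselves can be recovered, quantile() with its default type 7 (the same default IQR() uses) returns the two quartiles, and their difference reproduces the IQR.

```r
X <- c(12, 4, 6, 17, 8, 3, 14)

# The 25th and 75th percentiles (the endpoints of the interquartile range).
quartiles <- quantile(X, c(0.25, 0.75), names = FALSE)
quartiles                 # 5 13

# Their difference is the value IQR() reports.
diff(quartiles)           # 8
```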
#I love sports, especially hockey. Shooting percentage is a pretty basic stat: the percentage of a player's shots on net that result in goals. We will define it here.
#We will use David Pastrnak, who had 407 shots and 61 goals last season
Shooting_Percentage <- function(shots, goals){
(goals/shots) * 100
}
Shooting_Percentage(407, 61)
## [1] 14.98771
This is correct, as David Pastrnak’s shooting percentage last season rounds to 14.99%.
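One possible extension, sketched here as an assumption rather than part of the original function, is to guard against players with zero shots (which would otherwise give NaN) and to accept whole vectors of players at once. The name `Shooting_Percentage_Safe` and the zero-shot row are hypothetical.

```r
# Hypothetical variant: returns NA instead of NaN when a player has no shots.
Shooting_Percentage_Safe <- function(shots, goals) {
  stopifnot(is.numeric(shots), is.numeric(goals),
            length(shots) == length(goals))
  ifelse(shots > 0, (goals / shots) * 100, NA_real_)
}

# Same numbers as above, rounded to two decimals.
round(Shooting_Percentage_Safe(407, 61), 2)       # 14.99

# Works on vectors; the second "player" (0 shots) is made up.
Shooting_Percentage_Safe(c(407, 0), c(61, 0))     # 14.98771 NA
```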
# Define the function for mean using our vector X
Mean_By_Hand <- function(vector){
sum(vector)/length(vector)
}
#Comparison of the two means
Mean_By_Hand(X)
## [1] 9.142857
mean(X)
## [1] 9.142857
#Load the data
data("mtcars")
library(moments)
# load the plotting packages for the density chart of car weights
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Car_Weight <- ggplot(mtcars, aes(x = wt)) +
geom_density() +
labs(x = "Weight",
y = "Density",
title = "Car Weight Distribution")
Car_Weight
From a visual inspection, the data appear to be slightly right skewed. Most of the data sit on the left side of the distribution, and the tail stretches out further to the right of the peak than to the left.
#skew of car weights
skewness(mtcars$wt)
## [1] 0.4437855
By the common rule of thumb, car weight skew is not an issue because its value is between -0.5 and 0.5, which is treated as approximately symmetric. The positive sign indicates a slight lean to the right.
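To see what a skewness value measures, it can also be computed by hand. The sketch below uses the moment-based definition (third central moment divided by the second central moment to the power 3/2); that the moments package uses this same definition is my assumption here, and `Skew_By_Hand` is an illustrative name.

```r
# Moment-based skewness, written out by hand.
# Assumption: this matches the definition used by moments::skewness().
Skew_By_Hand <- function(x) {
  n <- length(x)
  d <- x - mean(x)                       # deviations from the mean
  (sum(d^3) / n) / (sum(d^2) / n)^(3/2)  # m3 / m2^(3/2)
}

Skew_By_Hand(mtcars$wt)          # compare with skewness(mtcars$wt) above

# Sanity checks on made-up vectors:
Skew_By_Hand(c(1, 2, 3, 4, 5))   # 0: perfectly symmetric
Skew_By_Hand(c(1, 1, 1, 10))     # positive: long tail to the right
```

A positive value means the long tail is on the right, and a negative value means it is on the left.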
#density plot of mpg
MPG_Plot <- ggplot(mtcars, aes(x = mpg)) + #named MPG_Plot so it does not mask ggplot2's built-in mpg dataset
geom_density() +
labs(x = "MPG",
y = "Density",
title = "MPG Distribution")
MPG_Plot
Looking at the plot, the data appear to be right skewed. There are more extreme values on the right side of the distribution curve than on the left.
#skew of mpg
skewness(mtcars$mpg)
## [1] 0.6404399
Since the skew value is between 0.5 and 1, mpg is moderately skewed to the right. One plausible explanation is that minimum standards set by the government limit how low mpg can go, while competition pushes some cars toward unusually high mpg, so the extreme values fall on the high side.
#density plot of Horse Power
Horse_Power <- ggplot(mtcars, aes(x = hp)) +
geom_density(fill = 'red') +
labs(x = "Horse Power",
y = "Density",
title = "Horse Power Distribution")
Horse_Power
Looking at the plot, horsepower appears to be right skewed, due to the extreme values on the right side.
#skew of Horse Power
skewness(mtcars$hp)
## [1] 0.7614356
With a skew value between 0.5 and 1, horsepower is moderately skewed to the right. This makes sense because (similar to the example of grade skew in class) there tend to be more outliers with abnormally high horsepower than with abnormally low horsepower, largely due to competition.
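The rule of thumb used for all three variables can be written as a small helper. This is only a sketch: the function name `Classify_Skew` is made up, the cutoffs of 0.5 and 1 are the rule-of-thumb values used above rather than formal critical values, and the inputs are the skewness values reported earlier.

```r
# Hypothetical helper applying the 0.5 / 1 rule of thumb.
Classify_Skew <- function(s) {
  direction <- ifelse(s > 0, "right", "left")
  ifelse(abs(s) < 0.5,
         "approximately symmetric",
         ifelse(abs(s) <= 1,
                paste("moderately", direction, "skewed"),
                paste("highly", direction, "skewed")))
}

# Skewness values reported above for wt, mpg, and hp.
skews <- c(wt = 0.4437855, mpg = 0.6404399, hp = 0.7614356)
Classify_Skew(skews)
# wt:  "approximately symmetric"
# mpg: "moderately right skewed"
# hp:  "moderately right skewed"
```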