Let's start by computing some basics.
Addition
2+3
## [1] 5
Subtraction
4-3
## [1] 1
Division
9/3
## [1] 3
Exponentiation
32^45
## [1] 5.391989e+67
2^3
## [1] 8
Square Roots
sqrt(9)
## [1] 3
sqrt(99)
## [1] 9.949874
Logarithms
log10(10)
## [1] 1
log10(100)
## [1] 2
Question_1: Compute the log base 5 of 10 and the log of 10.
logb(10, base = 5)
## [1] 1.430677
log10(10)
## [1] 1
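As a quick check on the logarithm functions above, the following sketch shows that log() without a base argument is the natural logarithm, and that a base-5 logarithm can be obtained either with the base argument or through the change-of-base identity. The comparison uses all.equal() because floating-point results may differ in their last digits.

```r
# log() without a base argument is the natural logarithm (base e)
log(10)
# log10() and log2() are shortcuts for bases 10 and 2
log10(1000)
log2(8)
# Change of base: log base 5 of 10 equals log(10) / log(5)
log(10, base = 5)
log(10) / log(5)
# Compare floating-point results with a tolerance rather than ==
all.equal(log(10, base = 5), log(10) / log(5))
```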
Computing some offensive metrics in Baseball
# Batting Average = (No. of Hits) / (No. of At Bats)
# What is the batting average of a player that gets 29 hits in 112 at bats?
BA=(29)/(112)
BA
## [1] 0.2589286
# This code is rounding the result of the variable BA to 3 decimal places.
Batting_Average=round(BA,digits = 3)
Batting_Average
## [1] 0.259
Question_2: What is the batting average of a player that gets 42 hits in 212 at bats?
# Calculate the batting average
BA <- (42) / (212)
# Round to 3 decimal places (the value stays a proportion, not a percentage)
Batting_Average_Percentage <- round(BA, digits = 3)
# Print the result
Batting_Average_Percentage
## [1] 0.198
The batting average would be 0.198.
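The calculation and the rounding can also be combined in a single expression. The sketch below is one way to do that. sprintf() is used here only to control how the number is displayed (three decimals, as batting averages are usually quoted), and it returns text rather than a number.

```r
# Compute and round in one step
round(42 / 212, digits = 3)
# sprintf() formats the value as text with exactly three decimals
sprintf("%.3f", 42 / 212)
# round() returns a number, sprintf() returns a character string
is.numeric(round(42 / 212, digits = 3))
is.character(sprintf("%.3f", 42 / 212))
```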
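On-Base Percentage
OBP = (H + BB + HBP) / (At Bats + H + BB + HBP + SF)
Note that the official definition of OBP does not count hits in the denominator: OBP = (H + BB + HBP) / (At Bats + BB + HBP + SF). The calculations below keep the denominator as written in this exercise.
Let us compute the OBP for a player with the following general stats: AB = 515, H = 172, BB = 84, HBP = 5, SF = 6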
OBP=(172+84+5)/(515+172+84+5+6)
OBP
## [1] 0.3337596
On_Base_Percentage=round(OBP,digits = 3)
On_Base_Percentage
## [1] 0.334
Question_3: Compute the OBP for a player with the following general stats: AB = 565, H = 156, BB = 65, HBP = 3, SF = 7
# Calculate the On-Base Percentage (OBP)
OBP <- (156 + 65 + 3) / (565 + 156 + 65 + 3 + 7)
# Round the OBP to 3 decimal places
On_Base_Percentage <- round(OBP, digits = 3)
# Print the rounded OBP
On_Base_Percentage
## [1] 0.281
Often you will want to test whether something is less than, greater than or equal to something.
Here we are using the equality operator in R
3 == 8 # Does 3 equal 8?
## [1] FALSE
Here we are using the inequality operator in R
3 != 8 # Is 3 different from 8?
## [1] TRUE
Here we are using the “less than or equal to” operator in R
3 <= 8 # Is 3 less than or equal to 8?
## [1] TRUE
The logical operators are & for logical AND, | for logical OR, and ! for NOT. These are some examples:
Here we are using the | logical OR operator in R. It returns TRUE if at least one of the operands is TRUE. Otherwise, it returns FALSE.
# Logical disjunction (or)
FALSE | FALSE # False OR False
## [1] FALSE
# Logical Conjunction (and)
TRUE & FALSE #True AND False
## [1] FALSE
# Negation
! FALSE # Not False
## [1] TRUE
This expression is evaluated as follows: 2 < 3 is TRUE, 1 == 5 is FALSE, TRUE OR FALSE is TRUE
# Combination of statements
2 < 3 | 1 == 5 # 2<3 is True, 1==5 is False, True OR False is True
## [1] TRUE
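Comparison operators are evaluated before the logical operators, which is why the combined statement above works without parentheses. The sketch below makes the grouping explicit and also shows any() and all(), two base R functions that reduce several conditions to a single logical value.

```r
# Parentheses make the evaluation order explicit; the result is the same
(2 < 3) | (1 == 5)
# ! is applied after the comparison, so this reads as !(3 == 8)
!3 == 8
# any() is TRUE if at least one condition holds, all() only if every one does
any(2 < 3, 1 == 5)
all(2 < 3, 1 == 5)
```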
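#Assigning Values to Variables
In R, you create a variable and assign it a value using <- or = as follows.
Here we assign the variable Total_Bases the value 11 (6 + 5). Multiplying it by 3 prints 33 but does not change the stored value, because the result is not assigned back to the variable.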
Total_Bases <- 6 + 5
Total_Bases
## [1] 11
Total_Bases*3
## [1] 33
To see the variables that are currently defined, use ls (as in “list”)
ls()
## [1] "BA" "Batting_Average"
## [3] "Batting_Average_Percentage" "OBP"
## [5] "On_Base_Percentage" "Total_Bases"
Here we list the variables together with their structures and values. This can be helpful when running many models with different algorithms on the same data set.
ls.str()
## BA : num 0.198
## Batting_Average : num 0.259
## Batting_Average_Percentage : num 0.198
## OBP : num 0.281
## On_Base_Percentage : num 0.281
## Total_Bases : num 11
To delete a variable, use rm (as in “remove”). A call to ls() afterwards confirms that the variable Total_Bases was deleted.
rm(Total_Bases)
ls()
## [1] "BA" "Batting_Average"
## [3] "Batting_Average_Percentage" "OBP"
## [5] "On_Base_Percentage"
The basic type of object in R is a vector, which is an ordered collection of values of the same type. You can create a vector using the c() function (as in “concatenate”).
pitches_by_innings <- c(12, 15, 10, 20, 10)
pitches_by_innings
## [1] 12 15 10 20 10
Here c is the function that combines (or concatenates) its arguments into a single vector; 12, 15, 10, 20 and 10 are the elements being combined.
strikes_by_innings <- c(9, 12, 6, 14, 9)
strikes_by_innings
## [1] 9 12 6 14 9
Question_4: Define two vectors, runs_per_9innings and hits_per_9innings, each with five elements.
# Define the runs_per_9innings vector with five random integers between 1 and 10
set.seed(123) # Setting a seed for reproducibility
runs_per_9innings <- sample(1:10, 5, replace = TRUE)
runs_per_9innings
## [1] 3 3 10 2 6
# Define the hits_per_9innings vector with five random integers between 1 and 20
set.seed(123) # Setting a seed for reproducibility
hits_per_9innings <- sample(1:20, 5, replace = TRUE)
hits_per_9innings
## [1] 15 19 14 3 10
There are also some functions that will create vectors with regular patterns, like repeated elements.
Here we are using the rep() function to replicate the number 2 five times
# replicate function
rep(2, 5)
## [1] 2 2 2 2 2
In R, when you use the colon operator with two numbers, it generates a sequence starting from the first number and ending at the second number, inclusive.
# consecutive numbers
1:5
## [1] 1 2 3 4 5
The seq() function generates sequences of numbers. Its arguments include the starting value, the ending value, and the step size (by), which specifies the increment between consecutive numbers. In this case the start value is 1, the end value is 10, and the step is 2, so the sequence stops at 9, the last value that does not exceed 10.
# sequence from 1 to 10 with a step of 2
seq(1, 10, by=2)
## [1] 1 3 5 7 9
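The pattern functions above accept a few more arguments that are often useful. The following sketch shows seq() with length.out (the number of values wanted instead of a step size) and rep() with times and each. All of these are base R arguments, and the inputs are made up for illustration.

```r
# seq() can take the number of values instead of a step size
seq(0, 1, length.out = 5)
# The sequence stops at the last value that does not exceed the end point
seq(1, 10, by = 2)
# rep() can repeat a whole vector or each element in turn
rep(c(1, 2), times = 2)
rep(c(1, 2), each = 2)
```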
Many functions and operators like + or - will work on all elements of the vector.
Here we are adding the values of the vectors together.
# add vectors
pitches_by_innings + strikes_by_innings
## [1] 21 27 16 34 19
Here we are using the == operator to compare two vectors element-wise to see if the corresponding elements are equal.
# compare vectors
pitches_by_innings == strikes_by_innings
## [1] FALSE FALSE FALSE FALSE FALSE
Here we are using the length() function to see how many elements are in a vector.
# find length of vector
length(pitches_by_innings)
## [1] 5
Here we are using the min() function to see what is the smallest value in a variable.
# find minimum value in vector
min(pitches_by_innings)
## [1] 10
Here we are using the function mean() to average the elements in a variable
# find average value in vector
mean(pitches_by_innings)
## [1] 13.4
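Element-wise arithmetic makes it easy to derive a new statistic from two vectors. As a small sketch, the code below divides strikes by pitches to get the share of strikes in each inning. The name strike_rate is introduced here only for illustration.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)
strikes_by_innings <- c(9, 12, 6, 14, 9)
# Element-wise division: share of pitches that were strikes in each inning
strike_rate <- strikes_by_innings / pitches_by_innings
strike_rate
# Total number of pitches over the five innings
sum(pitches_by_innings)
# Position (inning) with the highest strike rate
which.max(strike_rate)
```

With these values the strike rates are 0.75, 0.80, 0.60, 0.70 and 0.90, the total is 67 pitches, and the fifth inning has the highest rate.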
You can access parts of a vector by using [. Recall what the value is of the vector pitches_by_innings.
pitches_by_innings
## [1] 12 15 10 20 10
In this case the square brackets [ ] are used to specify the position of the element you want to access, and the element at the first position is 12.
# If you want to get the first element:
pitches_by_innings[1]
## [1] 12
Question_5: Get the first element of hits_per_9innings.
hits_per_9innings
## [1] 15 19 14 3 10
# Get the first element of hits_per_9innings
first_element <- hits_per_9innings[1]
# Print the first element
first_element
## [1] 15
If you want to get the last element of pitches_by_innings without explicitly typing the number of elements of pitches_by_innings, make use of the length function, which calculates the length of a vector:
Here we index the vector at the position given by length(pitches_by_innings), which is 5, so the result is the last element of the vector, 10.
pitches_by_innings[length(pitches_by_innings)]
## [1] 10
Question_6: Get the last element of hits_per_9innings.
hits_per_9innings
## [1] 15 19 14 3 10
hits_per_9innings[length(hits_per_9innings)]
## [1] 10
You can also extract multiple values from a vector. For instance, to get the 2nd through 4th values, use:
Here we are using the square brackets [] to specify which elements to access, and the c() function to combine the indices into a vector.
pitches_by_innings[c(2, 3, 4)]
## [1] 15 10 20
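There are a few other common ways to index a vector. The sketch below shows the colon operator as a shorter way to write consecutive positions, a negative index to drop an element, and a logical condition to keep only the matching elements.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)
# The colon operator gives the same indices as c(2, 3, 4)
pitches_by_innings[2:4]
# A negative index drops that position instead of selecting it
pitches_by_innings[-1]
# A logical condition keeps only the elements where it is TRUE
pitches_by_innings[pitches_by_innings > 12]
```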
Vectors can also contain strings or logical values.
The same approach works for text: the elements are separated by commas and each one is wrapped in quotation marks.
player_positions <- c("catcher", "pitcher", "infielders", "outfielders")
player_positions
## [1] "catcher" "pitcher" "infielders" "outfielders"
In statistical applications, data is often stored as a data frame, which is like a spreadsheet, with rows as observations and columns as variables.
To manually create a data frame, use the data.frame() function.
Here we are creating 3 vectors and using the data.frame() function to arrange them into columns, with the vector names as column headers. In the rendered table, the data type of each column is shown under its header.
data.frame(bonus = c(2, 3, 1),                    # in millions
           active_roster = c("yes", "no", "yes"),
           salary = c(1.5, 2.5, 1))               # in millions
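Storing the data frame in a variable makes it possible to inspect and reuse it. The following is a minimal sketch. The name players is hypothetical and is not used elsewhere in this document.

```r
# Store the data frame under a (hypothetical) name so it can be reused
players <- data.frame(
  bonus = c(2, 3, 1),                      # in millions
  active_roster = c("yes", "no", "yes"),
  salary = c(1.5, 2.5, 1)                  # in millions
)
# str() lists each column together with its data type
str(players)
# $ extracts a single column as a vector
mean(players$salary)
# A logical condition inside [ , ] keeps only the matching rows
players[players$active_roster == "yes", ]
```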
Most often you will be using data frames loaded from a file, for example the results of a fan survey. The functions load() or read.table() can be used for this.
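As a sketch of what loading from a file can look like, the code below writes a small made-up survey to a temporary CSV file and reads it back with read.csv(), which is a wrapper around read.table() for comma-separated files. The column names and values are hypothetical, and in practice the path would point to an existing file.

```r
# Hypothetical survey data, written to a temporary CSV file for illustration
survey <- data.frame(fan_id = 1:3, supports_home_team = c("Yes", "No", "Yes"))
survey_file <- tempfile(fileext = ".csv")
write.csv(survey, survey_file, row.names = FALSE)

# Read the file back into a data frame
loaded_survey <- read.csv(survey_file)
head(loaded_survey)
nrow(loaded_survey)
```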
To randomly select a sample, use the function sample(). The following code selects 5 numbers between 1 and 10 at random (without replacement, so no number is repeated):
sample(1:10, size=5)
## [1] 2 6 3 5 4
The first argument gives the vector of data to select elements from. The second argument (size=) gives the size of the sample to select.
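Because sample() is random, the numbers change on every run. The sketch below shows how set.seed() makes a draw repeatable, and how replace = TRUE allows values to repeat. The seed value 42 is arbitrary, and no particular output is assumed.

```r
# Setting the same seed before each draw gives the same sample
set.seed(42)   # the seed value is arbitrary
first_draw <- sample(1:10, size = 5)
set.seed(42)
second_draw <- sample(1:10, size = 5)
identical(first_draw, second_draw)

# Without replacement (the default) no value can appear twice
anyDuplicated(first_draw) == 0

# With replace = TRUE the sample may be larger than the population
with_replacement <- sample(1:10, size = 15, replace = TRUE)
length(with_replacement)
```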
Taking a simple random sample from a data frame is only slightly more complicated, having two steps:
1. Use sample() to select a sample of size n from a vector of the row numbers of the data frame.
2. Use the index operator [ to select those rows from the data frame.
Consider the following example with fake data. First, make up a data frame with two columns. (LETTERS is a character vector of length 26 with capital letters A to Z; LETTERS is automatically defined and pre-loaded in R)
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
# Check data frame
bar
Suppose you want to select a random sample of size 5. First, define a variable n with the size of the sample, i.e. 5
n <- 5
n
## [1] 5
Now, select a sample of size 5 from the vector with 1 to 10 (the number of rows in bar). Use the function nrow() to find the number of rows in bar instead of manually entering that number.
Use : to create a vector with all the integers between 1 and the number of rows in bar.
Note that nrow(bar) only supplies the upper end of the range: 1:nrow(bar) is the vector of row numbers that sample() draws from.
samplerows <- sample(1:nrow(bar), size=n)
# print sample rows
samplerows
## [1] 6 9 2 3 5
The variable samplerows contains the row numbers of bar that make up a random sample of its rows. As a check, nrow(bar) confirms that bar has 10 rows, so every sampled row number is valid:
nrow(bar)
## [1] 10
This R code is used to extract specific rows from a data frame bar using indices stored in samplerows and then print the resulting subset of the data frame.
# extract rows
barsample <- bar[samplerows, ]
# print sample
print(barsample)
## var1 var2
## 6 F 6
## 9 I 9
## 2 B 2
## 3 C 3
## 5 E 5
The code above creates a new data frame called barsample with a random sample of rows from bar.
In a single line of code:
This code extracts a random sample of n rows from the data frame bar. Here 1:nrow(bar) creates the sequence of row numbers from 1 to the number of rows in bar, and sample() draws n of them.
bar[sample(1:nrow(bar), n), ]
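A slightly more defensive variant of the one-liner is sketched below. It uses seq_len(nrow(bar)) instead of 1:nrow(bar), because seq_len() returns an empty vector for a data frame with zero rows, whereas 1:0 would give the vector 1, 0. The seed is arbitrary and only makes the draw repeatable.

```r
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
n <- 5
set.seed(2024)  # arbitrary seed, only for repeatability
# seq_len(nrow(bar)) matches 1:nrow(bar) whenever bar has at least one row
barsample <- bar[sample(seq_len(nrow(bar)), size = n), ]
nrow(barsample)
# Every sampled row comes from bar, and no row is repeated
all(barsample$var2 %in% bar$var2)
anyDuplicated(barsample$var2) == 0
```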
The table() command builds frequency tables. Its simplest usage looks like table(x) where x is a categorical variable.
For example, a survey asks people if they support the home team or not. The data is
Yes, No, No, Yes, Yes
We can enter this into R with the c() command, and summarize with the table() command as follows
The interesting part here is that the table(x) function creates a frequency table of the elements in the vector x. It counts the number of occurrences of each unique value in the vector.
x <- c("Yes","No","No","Yes","Yes")
table(x)
## x
## No Yes
## 2 3
x
## [1] "Yes" "No" "No" "Yes" "Yes"
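Counts are often easier to compare as proportions. As a sketch, prop.table() converts the table of counts into shares, and which.max() picks out the most frequent category. The name counts is introduced here only for illustration.

```r
x <- c("Yes", "No", "No", "Yes", "Yes")
counts <- table(x)
# prop.table() converts counts to proportions (here 0.4 for No, 0.6 for Yes)
prop.table(counts)
# names() and which.max() together give the most frequent category
names(which.max(counts))
# Individual counts can be looked up by name
counts["Yes"]
```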
Suppose the yearly compensations of MLB teams' CEOs are sampled and the following values are found (in millions):
12 .4 5 2 50 8 3 1 4 0.25
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# the average
mean(sals)
## [1] 8.565
The var() function calculates the variance of a numeric vector in R.
# the variance
var(sals)
## [1] 225.5145
The sd() function calculates the standard deviation of a numeric vector in R.
# the standard deviation
sd(sals)
## [1] 15.01714
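To see what var() and sd() compute, the sketch below rebuilds the sample variance by hand: the sum of squared deviations from the mean, divided by n - 1. The names n_obs and manual_var are introduced only for this illustration.

```r
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
n_obs <- length(sals)
# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((sals - mean(sals))^2) / (n_obs - 1)
all.equal(manual_var, var(sals))
# The standard deviation is the square root of the variance
all.equal(sqrt(manual_var), sd(sals))
```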
The median() function calculates the median of a numeric vector in R.
# the median
median(sals)
## [1] 3.5
Function fivenum():
The fivenum() function in R calculates Tukey's five-number summary of a numeric vector, a concise summary of the distribution of the data. The five numbers are:
1. Minimum (the smallest value)
2. Lower hinge (close to the first quartile, Q1, but not always equal to it)
3. Median (the middle value)
4. Upper hinge (close to the third quartile, Q3, but not always equal to it)
5. Maximum (the largest value)
# Tukey's five number summary, useful for boxplots
# five numbers: min, lower hinge, median, upper hinge, max
fivenum(sals)
## [1] 0.25 1.00 3.50 8.00 50.00
The function summary() calculates and returns the following statistics:
1. Minimum: The smallest value in the vector.
2. 1st Quartile (Q1): The value below which 25% of the data falls (also known as the 25th percentile).
3. Median: The middle value of the dataset (50th percentile).
4. Mean: The average of all values in the dataset.
5. 3rd Quartile (Q3): The value below which 75% of the data falls (also known as the 75th percentile).
6. Maximum: The largest value in the vector.
# summary statistics
summary(sals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.250 1.250 3.500 8.565 7.250 50.000
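The outputs of fivenum() and summary() above differ in their second and fourth values (1.00 and 8.00 versus 1.25 and 7.25). The sketch below shows where the difference comes from: fivenum() reports Tukey's hinges, while summary() uses the interpolated quantiles that quantile() returns by default.

```r
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# fivenum() reports the lower and upper hinges: 1 and 8
fivenum(sals)[c(2, 4)]
# quantile() with its default method interpolates: 1.25 and 7.25
quantile(sals, probs = c(0.25, 0.75))
# The interquartile range is based on those quantiles: 7.25 - 1.25 = 6
IQR(sals)
```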
In R we can write our own functions. A first example, shown below, computes the mode of a vector of observations x.
This code is short but dense, so let's unpack it:
1. getMode <- function(x) { … }: This defines a new function named getMode that takes a single argument x.
2. x: The input vector (numeric, character, or any other type) for which the mode is to be calculated.
3. unique(x): This function returns a vector of the unique values in x, removing any duplicates.
4. ux: A variable that stores the unique values from the input vector x.
5. match(x, ux): This function returns a vector of the positions of (first) matches of the elements of x in ux. Essentially, it maps each element in x to its position in ux.
6. tabulate(match(x, ux)): This function counts how many times each position returned by match(x, ux) occurs, which gives the frequency of each unique value.
7. which.max(tabulate(match(x, ux))): This function returns the index of the maximum count, that is, the position in ux of the most frequent value. If several values are tied, it returns the first of them.
8. ux[…]: This retrieves the value from ux corresponding to the index of the most frequent value.
9. The function returns this value as the mode of the input vector x.
# Function to find the mode, i.e. most frequent value
getMode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
x
## [1] "Yes" "No" "No" "Yes" "Yes"
getMode(x)
## [1] "Yes"
As an example, we can use the function defined above to find the most frequent value in pitches_by_innings.
Here we are using the function getMode() to get the mode of the pitches_by_innings vector.
# Most frequent value in pitches_by_innings
getMode(pitches_by_innings)
## [1] 10
Question_7: Find the most frequent value of hits_per_9innings.
getMode(hits_per_9innings)
## [1] 15
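In hits_per_9innings every value occurs exactly once, so there is no single most frequent value; getMode() returns 15 only because which.max() picks the first of the tied counts. The sketch below adds a hypothetical variant, getAllModes(), that returns every value reaching the highest count. It is an illustration, not part of the original exercise.

```r
getMode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical variant: return every value that reaches the highest count
getAllModes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}

hits_per_9innings <- c(15, 19, 14, 3, 10)
getMode(hits_per_9innings)          # 15, the first of five tied values
getAllModes(hits_per_9innings)      # all five values, each appears once
getAllModes(c(12, 15, 10, 20, 10))  # 10, the only value that appears twice
```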
Question_8: Summarize the following survey with the table() command:
What is your favorite day of the week to watch baseball? A total of 10 fans submitted this survey.
Saturday, Saturday, Sunday, Monday, Saturday, Tuesday, Sunday, Friday, Friday, Monday
# Create a vector of survey responses
survey_responses <- c("Saturday", "Saturday", "Sunday", "Monday", "Saturday", "Tuesday", "Sunday", "Friday", "Friday", "Monday")
# Summarize the survey with the table() function
survey_summary <- table(survey_responses)
# Print the summary
survey_summary
## survey_responses
## Friday Monday Saturday Sunday Tuesday
## 2 2 3 2 1
Question_9: What is the most frequent answer recorded in the survey? Use the getMode function to compute results.
# Use the getMode function to find the most frequent answer
most_frequent_day <- getMode(survey_responses)
# Print the most frequent answer
most_frequent_day
## [1] "Saturday"
The End, Thank you.