Manuel Madalena
You can use R for the basic computations you would perform on a calculator.
Addition
2+3
## [1] 5
Subtraction
2-3
## [1] -1
Division
2/3
## [1] 0.6666667
Exponentiation
2^3
## [1] 8
Square root
sqrt(2)
## [1] 1.414214
Logarithms
log(2)
## [1] 0.6931472
Question_1: Compute the log base 5 of 10 and the log of 10.
# Compute log base 5 of 10
log_5_10 <- log(10, base = 5)
log_5_10
## [1] 1.430677
# Compute the natural log of 10 (log() defaults to base e, not base 10)
log_10_10 <- log(10)
log_10_10
## [1] 2.302585
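Note that log(10) above is the natural logarithm, which is why the result is 2.302585 rather than 1. As a small sketch, base R also provides log10() for the common logarithm, and the change-of-base identity gives the same answer as the base argument. Nothing is assigned here, so no new variables are created.

```r
# log() with no base argument is the natural logarithm (base e)
log(10)

# Common (base 10) logarithm, two equivalent spellings
log10(10)
log(10, base = 10)

# Change of base: log base 5 of 10 equals log(10) / log(5)
log(10) / log(5)
```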
Computing some offensive metrics in Baseball
Batting Average = (No. of Hits) / (No. of At Bats)
What is the batting average of a player with 29 hits in 112 at bats?
BA=(29)/(112)
BA
## [1] 0.2589286
Batting_Average=round(BA,digits = 3)
Batting_Average
## [1] 0.259
Question_2: What is the batting average of a player with 42 hits in 212 at bats?
BA2=(42)/(212)
BA2
## [1] 0.1981132
On Base Percentage
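OBP = (H + BB + HBP) / (AB + BB + HBP + SF), where H is hits, BB is walks, HBP is hit by pitch, AB is at bats and SF is sacrifice flies.
Let us compute the OBP for a player with the following stats: AB=515, H=172, BB=84, HBP=5, SF=6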
OBP=(172+84+5)/(515+84+5+6)
OBP
## [1] 0.4278689
On_Base_Percentage=round(OBP,digits = 3)
On_Base_Percentage
## [1] 0.428
Question_3: Compute the OBP for a player with the following general stats:
AB=565,H=156,BB=65,HBP=3,SF=7
OBP2=(156+65+3)/(565+65+3+7)
OBP2
## [1] 0.35
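Typing the numbers straight into the formula makes it easy to put one in the wrong place. One possible alternative, shown here only as a sketch, is to name each stat in a list and evaluate the same OBP expression with with(). This approach is an illustration, not part of the original exercise, and it creates no new variables.

```r
# Name each stat (values from the first OBP example above), then evaluate
# the OBP formula using those names instead of raw numbers.
with(
  list(AB = 515, H = 172, BB = 84, HBP = 5, SF = 6),
  round((H + BB + HBP) / (AB + BB + HBP + SF), digits = 3)
)
```

Swapping in the stats from Question_3 reuses the expression unchanged.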
Often you will want to test whether something is less than, greater than or equal to something.
3 == 8  # Is 3 equal to 8?
## [1] FALSE
3 != 8  # Is 3 different from 8?
## [1] TRUE
3 <= 8  # Is 3 less than or equal to 8?
## [1] TRUE
3 > 4   # Is 3 greater than 4?
## [1] FALSE
The logical operators are & for logical AND, | for logical OR, and ! for NOT. These are some examples:
Logical Disjunction (or)
FALSE | FALSE # False OR False
## [1] FALSE
Logical Conjunction (and)
TRUE & FALSE #True AND False
## [1] FALSE
Negation
! FALSE # Not False
## [1] TRUE
Combination of statements
2 < 3 | 1 == 5 # 2<3 is True, 1==5 is False, True OR False is True
## [1] TRUE
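Comparisons and logical operators are most useful together. As a small sketch using the batting average computed earlier, the following checks whether BA falls in a given range. The cutoffs 0.250 and 0.300 are arbitrary values chosen for illustration.

```r
BA <- 29 / 112              # same value as the BA computed earlier

# TRUE only when both comparisons are TRUE
BA >= 0.250 & BA < 0.300

# Negating a comparison: "BA is not above 0.300"
!(BA > 0.300)
```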
Assigning Values to Variables
In R, you create a variable and assign it a value using <- as follows
Total_Bases <- 6 + 5
Total_Bases*3
## [1] 33
To see the variables that are currently defined, use ls (as in “list”)
ls()
## [1] "BA" "BA2" "Batting_Average"
## [4] "log_10_10" "log_5_10" "OBP"
## [7] "OBP2" "On_Base_Percentage" "Total_Bases"
To delete a variable, use rm (as in “remove”)
rm(Total_Bases)
ls() # to verify Total_Bases was removed
## [1] "BA" "BA2" "Batting_Average"
## [4] "log_10_10" "log_5_10" "OBP"
## [7] "OBP2" "On_Base_Percentage"
Either <- or = can be used to assign a value to a variable, but I prefer <- because it is less likely to be confused with the comparison operator ==.
Vectors
The basic type of object in R is a vector, which is an ordered collection of values of the same type. You can create a vector using the c() function (as in “concatenate”).
pitches_by_innings <- c(12, 15, 10, 20, 10)
pitches_by_innings
## [1] 12 15 10 20 10
strikes_by_innings <- c(9, 12, 6, 14, 9)
strikes_by_innings
## [1] 9 12 6 14 9
Question_4: Define two vectors, runs_per_9innings and hits_per_9innings, each with five elements.
runs_per_9innings <- c(8, 6, 10, 7, 3)
runs_per_9innings
## [1] 8 6 10 7 3
hits_per_9innings <- c(5, 4, 8, 11, 2)
hits_per_9innings
## [1] 5 4 8 11 2
There are also some functions that will create vectors with regular patterns, like repeated elements.
repeat a value with rep()
rep(2, 5)
## [1] 2 2 2 2 2
rep(1,4)
## [1] 1 1 1 1
consecutive numbers
1:5
## [1] 1 2 3 4 5
2:10
## [1] 2 3 4 5 6 7 8 9 10
sequence from 1 to 10 with a step of 2
seq(1, 10, by=2)
## [1] 1 3 5 7 9
seq(2,13,by=3)
## [1] 2 5 8 11
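Both rep() and seq() accept further arguments for other regular patterns. A brief sketch of three common ones:

```r
# Repeat a whole vector several times
rep(c(1, 2), times = 3)     # 1 2 1 2 1 2

# Repeat each element before moving on to the next
rep(c(1, 2), each = 3)      # 1 1 1 2 2 2

# Ask for a number of evenly spaced values instead of a step size
seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```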
Many functions and operators like + or - will work on all elements of the vector.
add vectors
pitches_by_innings+strikes_by_innings
## [1] 21 27 16 34 19
compare vectors
pitches_by_innings == strikes_by_innings
## [1] FALSE FALSE FALSE FALSE FALSE
find length of vector
length(pitches_by_innings)
## [1] 5
find minimum value in vector
min(pitches_by_innings)
## [1] 10
find average value in vector
mean(pitches_by_innings)
## [1] 13.4
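Element-wise arithmetic also works with division, which is handy for rates. The sketch below computes the share of pitches that were strikes in each inning. The name strike_rate is made up for this illustration.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)
strikes_by_innings <- c(9, 12, 6, 14, 9)

# Divide the two vectors element by element
strike_rate <- strikes_by_innings / pitches_by_innings
strike_rate                 # 0.75 0.80 0.60 0.70 0.90

# Overall strike rate across all five innings
sum(strikes_by_innings) / sum(pitches_by_innings)
```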
You can access parts of a vector with square brackets, [ ]. Recall the value of the vector pitches_by_innings.
pitches_by_innings
## [1] 12 15 10 20 10
If you want to get the first element:
pitches_by_innings[1]
## [1] 12
Question_5: Get the first element of hits_per_9innings.
hits_per_9innings
## [1] 5 4 8 11 2
hits_per_9innings[1]
## [1] 5
If you want to get the last element of pitches_by_innings without explicitly typing the number of elements of pitches_by_innings, make use of the length function, which calculates the length of a vector:
pitches_by_innings[length(pitches_by_innings)]
## [1] 10
Question_6: Get the last element of hits_per_9innings.
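One possible answer, mirroring the approach used for pitches_by_innings above. The tail() alternative is an extra, not something the exercise asks for.

```r
hits_per_9innings <- c(5, 4, 8, 11, 2)

# Index with the length of the vector to reach the last element
hits_per_9innings[length(hits_per_9innings)]   # 2

# tail() with n = 1 returns the same element
tail(hits_per_9innings, 1)                     # 2
```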
You can also extract multiple values from a vector. For instance, to get the 2nd through 4th values, use:
pitches_by_innings[c(2, 3, 4)]
## [1] 15 10 20
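A few other indexing forms are worth knowing. This short sketch shows a range, a negative index, and a logical condition. The threshold of 12 is an arbitrary choice for illustration.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)

# The range 2:4 is shorthand for c(2, 3, 4)
pitches_by_innings[2:4]                        # 15 10 20

# A negative index drops that position
pitches_by_innings[-1]                         # 15 10 20 10

# A logical condition keeps only the elements where it is TRUE
pitches_by_innings[pitches_by_innings > 12]    # 15 20
```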
Vectors can also hold strings or logical values
player_positions <- c("catcher", "pitcher", "infielders", "outfielders")
player_positions
## [1] "catcher" "pitcher" "infielders" "outfielders"
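For a logical vector, one simple sketch is to test each position against a set of values with %in%. The name is_battery is hypothetical ("battery" being the pitcher and catcher).

```r
player_positions <- c("catcher", "pitcher", "infielders", "outfielders")

# TRUE where the position is catcher or pitcher, FALSE otherwise
is_battery <- player_positions %in% c("catcher", "pitcher")
is_battery                  # TRUE TRUE FALSE FALSE

class(player_positions)     # "character"
class(is_battery)           # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(is_battery)             # 2
```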
Data Frames
In statistical applications, data is often stored as a data frame, which is like a spreadsheet, with rows as observations and columns as variables.
To manually create a data frame, use the data.frame() function.
data.frame(bonus = c(2, 3, 1),                    # in millions
           active_roster = c("yes", "no", "yes"),
           salary = c(1.5, 2.5, 1))               # in millions
##   bonus active_roster salary
## 1     2           yes    1.5
## 2     3            no    2.5
## 3     1           yes    1.0
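The data frame above is printed but not stored. As a sketch, assigning it to a variable lets you pull out columns with $, count rows, filter rows, and add new columns. The names contracts and total are made up for this illustration.

```r
contracts <- data.frame(
  bonus = c(2, 3, 1),                     # in millions
  active_roster = c("yes", "no", "yes"),
  salary = c(1.5, 2.5, 1)                 # in millions
)

contracts$salary                          # one column, as a vector
nrow(contracts)                           # number of observations

# Keep only the rows where the player is on the active roster
contracts[contracts$active_roster == "yes", ]

# Add a new column computed from existing ones
contracts$total <- contracts$bonus + contracts$salary
contracts
```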
Most often you will work with data frames loaded from a file, for example the results of a fan survey. Two functions can be used for this: load() restores objects saved in an R data file, while read.table() reads a delimited text file into a data frame.
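No survey file accompanies this document, so the sketch below first writes a tiny made-up file to a temporary location and then reads it back with read.table(). The file contents, the column names, and the variable names survey_file and fan_survey are all invented for illustration.

```r
# Write a small, made-up survey to a temporary file (illustration only)
survey_file <- tempfile(fileext = ".csv")
writeLines(c("fan_id,supports_home_team",
             "1,Yes",
             "2,No",
             "3,Yes"),
           survey_file)

# header = TRUE: the first line holds column names; sep = ",": comma-delimited
fan_survey <- read.table(survey_file, header = TRUE, sep = ",")
fan_survey
str(fan_survey)   # shows the type of each column
```

With a real file you would pass its path in place of survey_file.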
How to Make a Random Sample
To randomly select a sample, use the function sample(). The following code selects 5 numbers between 1 and 10 at random, without replacement (so no number appears twice):
sample(1:10, size=5)
## [1] 3 10 8 9 4
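Because the draw is random, your output will differ from the one shown. Here is a sketch of how to make a sample reproducible with set.seed(), and how to allow duplicates with replace = TRUE. The seed value 2024 and the variable names are arbitrary choices for illustration.

```r
set.seed(2024)                          # any fixed integer works
first_draw <- sample(1:10, size = 5)

set.seed(2024)                          # reset to the same seed
second_draw <- sample(1:10, size = 5)

identical(first_draw, second_draw)      # same seed, same sample

# Sampling with replacement: the same number may be drawn more than once
sample(1:10, size = 5, replace = TRUE)
```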
Taking a simple random sample from a data frame is only slightly more complicated: first sample the row numbers, then extract those rows. Consider the following data frame:
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
bar
##    var1 var2
## 1     A    1
## 2     B    2
## 3     C    3
## 4     D    4
## 5     E    5
## 6     F    6
## 7     G    7
## 8     H    8
## 9     I    9
## 10    J   10
Suppose you want to select a random sample of size 5. First, define a variable n with the size of the sample, i.e. 5
n <- 5
Now, select a sample of size 5 from the vector with 1 to 10 (the number of rows in bar). Use the function nrow() to find the number of rows in bar instead of manually entering that number.
Use : to create a vector with all the integers between 1 and the number of rows in bar.
samplerows <- sample(1:nrow(bar), size=n)
samplerows
## [1] 6 9 2 1 7
The variable samplerows contains the row numbers of bar that make up a random sample of its rows. Extract those rows from bar with:
# extract rows
barsample <- bar[samplerows, ]
barsample
##   var1 var2
## 6    F    6
## 9    I    9
## 2    B    2
## 1    A    1
## 7    G    7
The code above creates a new data frame called barsample with a random sample of rows from bar.
In a single line of code:
bar[sample(1:nrow(bar), n), ]
##    var1 var2
## 10    J   10
## 5     E    5
## 6     F    6
## 8     H    8
## 1     A    1
Using Tables
The table() function counts how many times each value occurs. Its simplest usage is table(x), where x is a categorical variable.
For example, a survey asks people if they support the home team or not. The data is
Yes, No, No, Yes, Yes
We can enter this into R with the c() command, and summarize with the table() command as follows
Support_HomeTeam <- c("Yes","No","No","Yes","Yes")
table(Support_HomeTeam)
## Support_HomeTeam
##  No Yes
##   2   3
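Counts are often easier to compare as proportions. As a brief sketch, prop.table() converts a table of counts into shares of the total. The name support_counts is made up for this illustration.

```r
Support_HomeTeam <- c("Yes", "No", "No", "Yes", "Yes")

support_counts <- table(Support_HomeTeam)

# Share of each answer: 2 of 5 said No, 3 of 5 said Yes
prop.table(support_counts)

# A single count can be looked up by name
support_counts["Yes"]
```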
Numerical measures of center and spread
Suppose the yearly compensations of MLB team CEOs are sampled, and the following values are found (in millions):
12, 0.4, 5, 2, 50, 8, 3, 1, 4, 0.25
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
sals
## [1] 12.00 0.40 5.00 2.00 50.00 8.00 3.00 1.00 4.00 0.25
the average
mean(sals)
## [1] 8.565
the variance
var(sals)
## [1] 225.5145
the standard deviation
sd(sals)
## [1] 15.01714
the median
median(sals)
## [1] 3.5
Tukey’s five-number summary, useful for boxplots
five numbers: min, lower hinge, median, upper hinge, max
fivenum(sals)
## [1] 0.25 1.00 3.50 8.00 50.00
summary statistics
summary(sals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.250   1.250   3.500   8.565   7.250  50.000
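To see what var() and sd() compute, the sketch below reproduces them by hand. R uses the sample variance, which divides by n - 1 rather than n. The names n_sals, manual_var and manual_sd are made up for this illustration.

```r
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
n_sals <- length(sals)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((sals - mean(sals))^2) / (n_sals - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

c(manual_var, var(sals))    # the two values agree
c(manual_sd, sd(sals))      # the two values agree
```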
How about the mode?
In R we can write our own functions. As a first example, the function below computes the mode of a vector of observations x.
# Function to find the mode, i.e. the most frequent value
getMode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]  # value with the highest count (the first one if tied)
}
As an example, we can use the function defined above to find the most frequent value in pitches_by_innings
Most frequent value in pitches_by_innings
getMode(pitches_by_innings)
## [1] 10
Question_7: Find the most frequent value of hits_per_9innings.
getMode(hits_per_9innings)
## [1] 5
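Note that when several values tie for the highest count, getMode returns only the first of them. In hits_per_9innings every value occurs exactly once, so the 5 above is simply the first element rather than a meaningful mode. Below is a minimal sketch of a variant that returns all tied values. The name getModes is hypothetical and not part of base R.

```r
# Return every value that reaches the highest count (hypothetical helper)
getModes <- function(x) {
  ux <- unique(x)                      # distinct values
  counts <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[counts == max(counts)]            # keep all values tied for the maximum
}

getModes(c(12, 15, 10, 20, 10))   # 10
getModes(c(5, 4, 8, 11, 2))       # all five values, since each occurs once
```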
Question_8: Summarize the following survey with the table() command:
What is your favorite day of the week to watch baseball? A total of 10 fans submitted this survey.
Saturday, Saturday, Sunday, Monday, Saturday, Tuesday, Sunday, Friday, Friday, Monday
game_day <- c("Saturday", "Saturday", "Sunday", "Monday", "Saturday", "Tuesday", "Sunday", "Friday", "Friday", "Monday")
game_day
## [1] "Saturday" "Saturday" "Sunday" "Monday" "Saturday" "Tuesday"
## [7] "Sunday" "Friday" "Friday" "Monday"
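The code above stores and prints the responses but does not yet summarize them. One possible way to finish the question is to pass the vector to table(). The name day_counts is made up for this illustration.

```r
game_day <- c("Saturday", "Saturday", "Sunday", "Monday", "Saturday",
              "Tuesday", "Sunday", "Friday", "Friday", "Monday")

# Count how many fans chose each day
day_counts <- table(game_day)
day_counts

# Same counts, ordered from most to least popular
sort(day_counts, decreasing = TRUE)
```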
Question_9: What is the most frequent answer recorded in the survey? Use the getMode function to compute results.
getMode(game_day)
## [1] "Saturday"