Basic calculations

You can use R for the basic computations you would perform on a calculator.

# Subtraction
2-3
## [1] -1
# Division
2/3
## [1] 0.6666667
# Exponentiation
2^3 
## [1] 8
# Square root
sqrt(2)
## [1] 1.414214
# Logarithms
log(2)
## [1] 0.6931472

Question_1: Compute the log base 5 of 10 and the natural log of 10.

log(10, base = 5)
## [1] 1.430677
log(10)
## [1] 2.302585
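
The base argument applies the change-of-base rule, log_b(x) = log(x) / log(b). A minimal sketch that checks this for the values above and shows the built-in shortcuts log10() and log2():

```r
# Change of base: log base 5 of 10 equals log(10) / log(5)
log(10) / log(5)
# all.equal() compares the two results up to floating-point rounding
all.equal(log(10, base = 5), log(10) / log(5))
## [1] TRUE
# Shortcuts for the two most common bases
log10(1000)
## [1] 3
log2(8)
## [1] 3
```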

Computing some offensive metrics in Baseball

# Batting Average = (No. of Hits) / (No. of At Bats)
# What is the batting average of a player who gets 29 hits in 112 at bats?

BA=(29)/(112)
BA
## [1] 0.2589286
Batting_Average=round(BA,digits = 3)
Batting_Average
## [1] 0.259

Question_2: What is the batting average of a player who gets 42 hits in 212 at bats?

The batting average of a player with 42 hits in 212 at bats is 0.1981132, or about 19.81%.

hits = 42
at_bats = 212
batting_average = hits / at_bats

print(batting_average)
## [1] 0.1981132
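
Batting averages are usually quoted to three digits, as in the worked example earlier. A short sketch applying round() to this result; the variable name batting_average_3 is introduced here only for illustration:

```r
hits <- 42
at_bats <- 212
batting_average <- hits / at_bats
# Round to three digits, matching the earlier worked example
batting_average_3 <- round(batting_average, digits = 3)
batting_average_3
## [1] 0.198
```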
# On Base Percentage
# OBP=(H+BB+HBP)/(At Bats+BB+HBP+SF)
# Let us compute the OBP for a player with the following general stats
# AB=515,H=172,BB=84,HBP=5,SF=6

OBP=(172+84+5)/(515+84+5+6)
OBP
## [1] 0.4278689
On_Base_Percentage=round(OBP,digits = 3)
On_Base_Percentage
## [1] 0.428

Question_3: Compute the OBP for a player with the following general stats:

AB=565,H=156,BB=65,HBP=3,SF=7

Following the earlier example, the OBP formula gives this player an OBP of exactly 0.35 (224/640).

OBP=(156+65+3)/(565+65+3+7)

OBP
## [1] 0.35
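
Because the same formula has now been typed out twice, it could be wrapped in a function. The sketch below is one possible way to do that; the function name obp and its argument names are made up for this illustration and are not part of R.

```r
# Hypothetical helper: on-base percentage from a player's counting stats
obp <- function(H, BB, HBP, AB, SF) {
  (H + BB + HBP) / (AB + BB + HBP + SF)
}
# Player from the worked example
round(obp(H = 172, BB = 84, HBP = 5, AB = 515, SF = 6), digits = 3)
## [1] 0.428
# Player from Question_3
obp(H = 156, BB = 65, HBP = 3, AB = 565, SF = 7)
## [1] 0.35
```

Naming the arguments in each call makes it harder to mix up the order of the five statistics.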

Often you will want to test whether one value is less than, greater than, or equal to another.

3 == 8 # Does 3 equal 8?
## [1] FALSE
3 != 8 # Is 3 different from 8?
## [1] TRUE
3 <= 8 # Is 3 less than or equal to 8?
## [1] TRUE
3 > 4 # Is 3 greater than 4?
## [1] FALSE

The logical operators are & for logical AND, | for logical OR, and ! for NOT. These are some examples:

# Logical Disjunction (or)
FALSE | FALSE # False OR False
## [1] FALSE
# Logical Conjunction (and)
TRUE & FALSE #True AND False
## [1] FALSE
# Negation
! FALSE # Not False
## [1] TRUE
# Combination of statements
2 < 3 | 1 == 5 # 2<3 is True, 1==5 is False, True OR False is True
## [1] TRUE
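
Comparisons are evaluated before & and |, so several tests can be combined in one line. As a small illustration, the sketch below checks whether a batting average falls inside or outside a range; the variable name ba and the cutoffs are arbitrary values chosen for this example.

```r
ba <- 0.259
# Is ba at least 0.250 AND below 0.300?
ba >= 0.250 & ba < 0.300
## [1] TRUE
# Is ba below 0.200 OR above 0.300?
ba < 0.200 | ba > 0.300
## [1] FALSE
# Parentheses make it explicit what is being negated
!(ba < 0.200)
## [1] TRUE
```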

Assigning Values to Variables

In R, you create a variable and assign it a value using <- as follows:

Total_Bases <- 6 + 5
Total_Bases*3
## [1] 33

To see the variables that are currently defined, use ls (as in “list”):

ls()
## [1] "at_bats"            "BA"                 "batting_average"   
## [4] "Batting_Average"    "hits"               "OBP"               
## [7] "On_Base_Percentage" "Total_Bases"

To delete a variable, use rm (as in “remove”):

rm(Total_Bases)

Either <- or = can be used to assign a value to a variable, but I prefer <- because it is less likely to be confused with the equality operator ==.
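
One practical difference shows up inside function calls, where = names an argument instead of creating a variable. The sketch below illustrates this under the assumption of a fresh session; show_value, demo_arg and demo_var are hypothetical names used only here.

```r
# Hypothetical one-argument function that simply returns its input
show_value <- function(demo_arg) demo_arg

# Inside a call, = only names the argument; no variable demo_arg is created
show_value(demo_arg = 5)
## [1] 5
exists("demo_arg")
## [1] FALSE

# Inside a call, <- assigns 5 to demo_var first and then passes it along
show_value(demo_var <- 5)
## [1] 5
demo_var
## [1] 5
```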

Vectors

The basic type of object in R is a vector, which is an ordered collection of values of the same type. You can create a vector using the c() function (as in “concatenate”).

pitches_by_innings <- c(12, 15, 10, 20, 10) 
pitches_by_innings
## [1] 12 15 10 20 10
strikes_by_innings <- c(9, 12, 6, 14, 9)
strikes_by_innings
## [1]  9 12  6 14  9

Question_4: Define two vectors, runs_per_9innings and hits_per_9innings, each with five elements.

The code below defines the two vectors “runs_per_9innings” and “hits_per_9innings”, each with five arbitrarily chosen elements.

runs_per_9innings <- c(4, 11, 7, 1, 6)
runs_per_9innings
## [1]  4 11  7  1  6
hits_per_9innings <- c(7, 15, 3, 12, 4)
hits_per_9innings
## [1]  7 15  3 12  4

There are also some functions that will create vectors with regular patterns, like repeated elements.

# replicate function
rep(2, 5)
## [1] 2 2 2 2 2
rep(1,4)
## [1] 1 1 1 1
# consecutive numbers
1:5
## [1] 1 2 3 4 5
2:10
## [1]  2  3  4  5  6  7  8  9 10
# sequence from 1 to 10 with a step of 2
seq(1, 10, by=2)
## [1] 1 3 5 7 9
seq(2,13,by=3)
## [1]  2  5  8 11
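
Both rep() and seq() take further arguments that cover other regular patterns. A brief sketch of three common variants (times, each and length.out are standard arguments of these functions):

```r
# Repeat a whole vector three times
rep(c(1, 2), times = 3)
## [1] 1 2 1 2 1 2
# Repeat each element twice before moving on to the next
rep(c(1, 2), each = 2)
## [1] 1 1 2 2
# Five evenly spaced values from 0 to 1
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
```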

Many functions and operators like + or - will work on all elements of the vector.

# add vectors
pitches_by_innings+strikes_by_innings
## [1] 21 27 16 34 19
# compare vectors element by element
pitches_by_innings == strikes_by_innings
## [1] FALSE FALSE FALSE FALSE FALSE
# find length of vector
length(pitches_by_innings)
## [1] 5
# find minimum value in vector
min(pitches_by_innings)
## [1] 10
# find average value in vector
mean(pitches_by_innings)
## [1] 13.4
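
Element-wise arithmetic makes per-inning rates easy to compute. The sketch below divides strikes by pitches inning by inning; the variable name strike_rate is introduced only for this example.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)
strikes_by_innings <- c(9, 12, 6, 14, 9)
# Element-wise division: share of pitches that were strikes, per inning
strike_rate <- strikes_by_innings / pitches_by_innings
strike_rate
## [1] 0.75 0.80 0.60 0.70 0.90
# A single number is applied to every element of the vector
pitches_by_innings * 2
## [1] 24 30 20 40 20
# Overall strike rate across all five innings
sum(strikes_by_innings) / sum(pitches_by_innings)
```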

You can access parts of a vector by using [. Recall the value of the vector pitches_by_innings:

pitches_by_innings
## [1] 12 15 10 20 10
# If you want to get the first element:
pitches_by_innings[1]
## [1] 12

Question_5: Get the first element of hits_per_9innings.

Using the example code above, we find that the first element of hits_per_9innings in this scenario is 7.

hits_per_9innings[1]
## [1] 7

If you want to get the last element of pitches_by_innings without explicitly typing the number of elements of pitches_by_innings, make use of the length function, which calculates the length of a vector:

pitches_by_innings[length(pitches_by_innings)]
## [1] 10

Question_6: Get the last element of hits_per_9innings.

Using the example code above, we find that the last element of hits_per_9innings in this scenario is 4.

hits_per_9innings[length(hits_per_9innings)]
## [1] 4

You can also extract multiple values from a vector. For instance, to get the 2nd through 4th values, use:

pitches_by_innings[c(2, 3, 4)]
## [1] 15 10 20
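
The index inside [ can take other forms as well. A short sketch of three common ones, using the same vector: a range built with :, a negative index that drops an element, and a logical condition that keeps only matching elements.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)
# 2:4 is shorthand for c(2, 3, 4)
pitches_by_innings[2:4]
## [1] 15 10 20
# A negative index drops that position (here, the first inning)
pitches_by_innings[-1]
## [1] 15 10 20 10
# A logical condition keeps only the elements where it is TRUE
pitches_by_innings[pitches_by_innings > 12]
## [1] 15 20
```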

Vectors can also hold strings or logical values:

player_positions <- c("catcher", "pitcher", "infielders", "outfielders")

Data Frames

In statistical applications, data is often stored as a data frame, which is like a spreadsheet, with rows as observations and columns as variables.

To manually create a data frame, use the data.frame() function.

data.frame(bonus = c(2, 3, 1),                     # in millions
           active_roster = c("yes", "no", "yes"),
           salary = c(1.5, 2.5, 1))                # in millions
##   bonus active_roster salary
## 1     2           yes    1.5
## 2     3            no    2.5
## 3     1           yes    1.0
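
Storing the data frame in a variable makes its columns available for further work. In the sketch below, the variable name players is an assumption made for this example; $ extracts a column as a vector, and a logical condition inside [ selects rows.

```r
players <- data.frame(bonus = c(2, 3, 1),                    # in millions
                      active_roster = c("yes", "no", "yes"),
                      salary = c(1.5, 2.5, 1))               # in millions
# Extract one column as a vector
players$salary
## [1] 1.5 2.5 1.0
# Columns are vectors, so arithmetic works element by element
players$bonus + players$salary
## [1] 3.5 5.5 2.0
# Number of rows (observations)
nrow(players)
## [1] 3
# Keep only the rows for players on the active roster
players[players$active_roster == "yes", ]
##   bonus active_roster salary
## 1     2           yes    1.5
## 3     1           yes    1.0
```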

Most often you will be using data frames loaded from a file, for example the results of a fan survey. The function load() (for saved R objects) or read.table() (for text files) can be used for this.
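
Since no survey file accompanies this text, the sketch below first writes a small made-up data set to a temporary file and then reads it back with read.table(). The survey contents, column names and variable names are all hypothetical.

```r
# Hypothetical survey data, written to a temporary file so the example is self-contained
survey <- data.frame(fan_id = 1:3,
                     favorite_day = c("Saturday", "Sunday", "Friday"))
survey_file <- tempfile(fileext = ".csv")
write.csv(survey, survey_file, row.names = FALSE)

# header = TRUE: the first line holds the column names
# sep = ",":     the values are separated by commas
survey_loaded <- read.table(survey_file, header = TRUE, sep = ",")
survey_loaded
```

With a real file, the path to that file would take the place of survey_file.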

How to Make a Random Sample

To randomly select a sample, use the function sample(). The following code selects 5 numbers between 1 and 10 at random (without duplication):

sample(1:10, size=5)
## [1] 3 6 9 7 2
The first argument gives the vector of data to select elements from.
The second argument (size=) gives the size of the sample to select.
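
Because the draw is random, the output above will differ from run to run. One common way to make a draw repeatable is to call set.seed() first. The sketch below assumes an arbitrary seed value of 123 and also shows the replace argument, which allows duplicates.

```r
# Setting the same seed before each draw reproduces the same sample
set.seed(123)
first_draw <- sample(1:10, size = 5)
set.seed(123)
second_draw <- sample(1:10, size = 5)
identical(first_draw, second_draw)
## [1] TRUE

# replace = TRUE allows the same number to be drawn more than once
sample(1:10, size = 5, replace = TRUE)
```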

Taking a simple random sample from a data frame is only slightly more complicated, having two steps:

Use sample() to select a sample of size n from a vector of the row numbers of the data frame.
Use the index operator [ to select those rows from the data frame.

Consider the following example with fake data. First, make up a data frame with two columns. (LETTERS is a character vector of length 26 with capital letters “A” to “Z”; LETTERS is automatically defined and pre-loaded in R.)

bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
# Check data frame
bar
##    var1 var2
## 1     A    1
## 2     B    2
## 3     C    3
## 4     D    4
## 5     E    5
## 6     F    6
## 7     G    7
## 8     H    8
## 9     I    9
## 10    J   10

Suppose you want to select a random sample of size 5. First, define a variable n with the size of the sample, i.e. 5:

n <- 5

Now, select a sample of size 5 from the vector with 1 to 10 (the number of rows in bar). Use the function nrow() to find the number of rows in bar instead of manually entering that number.

Use : to create a vector with all the integers between 1 and the number of rows in bar.

samplerows <- sample(1:nrow(bar), size=n) 
# print sample rows
samplerows
## [1]  6  8  9  2 10

The variable samplerows contains the row numbers of bar that make up a random sample of its rows. Extract those rows from bar with:

# extract rows
barsample <- bar[samplerows, ]
# print sample
print(barsample)
##    var1 var2
## 6     F    6
## 8     H    8
## 9     I    9
## 2     B    2
## 10    J   10

The code above creates a new data frame called barsample with a random sample of rows from bar.

In a single line of code:

bar[sample(1:nrow(bar), n), ]
##    var1 var2
## 3     C    3
## 10    J   10
## 4     D    4
## 5     E    5
## 7     G    7
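
If this is needed for several data frames, the one-liner could be wrapped in a small function. The sketch below is one possible version; the name sample_rows is hypothetical, and seq_len(nrow(df)) is used in place of 1:nrow(df) as an equivalent way to build the row numbers.

```r
# Hypothetical helper: return n randomly chosen rows of the data frame df
sample_rows <- function(df, n) {
  df[sample(seq_len(nrow(df)), size = n), ]
}

bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
barsample <- sample_rows(bar, 5)
# The sample always has 5 rows, though which rows they are varies per run
nrow(barsample)
## [1] 5
```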

Using Tables

The table() command counts how often each value occurs. Its simplest usage looks like table(x) where x is a categorical variable.

For example, a survey asks people if they support the home team or not. The data is

Yes, No, No, Yes, Yes

We can enter this into R with the c() command, and summarize with the table() command as follows

x <- c("Yes","No","No","Yes","Yes") 
table(x)
## x
##  No Yes 
##   2   3
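
Counts are often easier to compare as proportions or in sorted order. A short sketch using prop.table() and sort() on the same survey; the variable name counts is introduced only for this example.

```r
x <- c("Yes", "No", "No", "Yes", "Yes")
counts <- table(x)
# Convert counts to proportions of all responses
prop.table(counts)
## x
##  No Yes 
## 0.4 0.6
# Order the categories from most to least frequent
sort(counts, decreasing = TRUE)
## x
## Yes  No 
##   3   2
```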

Numerical measures of center and spread

Suppose the yearly compensations of MLB teams’ CEOs are sampled and the following values are found (in millions):

12 .4 5 2 50 8 3 1 4 0.25

sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# the average
mean(sals) 
## [1] 8.565
# the variance
var(sals)
## [1] 225.5145
# the median
median(sals)
## [1] 3.5
# Tukey's five number summary, useful for boxplots
# five numbers: min, lower hinge, median, upper hinge, max
fivenum(sals)
## [1]  0.25  1.00  3.50  8.00 50.00
# summary statistics
summary(sals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.250   1.250   3.500   8.565   7.250  50.000
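
A few more built-in measures of spread complement the ones above. The sketch below shows sd(), IQR() and range() on the same data, and then drops the largest salary to illustrate why the mean (8.565) sits so far above the median (3.5).

```r
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# The standard deviation is the square root of the variance
all.equal(sd(sals), sqrt(var(sals)))
## [1] TRUE
# Interquartile range: 3rd quartile minus 1st quartile (7.25 - 1.25)
IQR(sals)
## [1] 6
# Smallest and largest value
range(sals)
## [1]  0.25 50.00
# Without the 50-million outlier the mean drops sharply; the median barely moves
mean(sals[sals < 50])
median(sals[sals < 50])
## [1] 3
```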

How about the mode?

In R we can write our own functions. As a first example, the function below computes the mode of a vector of observations x:

# Function to find the mode, i.e. most frequent value
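getMode <- function(x) {
  ux <- unique(x)                         # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]   # value with the highest count (the first one on ties)
}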

As an example, we can use the function defined above to find the most frequent value in pitches_by_innings:

# Most frequent value in pitches_by_innings
getMode(pitches_by_innings)
## [1] 10

Question_7: Find the most frequent value of hits_per_9innings.

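Using the example code above, getMode() returns 7 for hits_per_9innings. Note that every value in this vector appears exactly once, so there is no unique mode; getMode() simply returns the first of the tied values.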

getMode(hits_per_9innings)
## [1] 7
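
When several values tie for the highest count, it can be more informative to return all of them. The sketch below is one possible variant of getMode() that does this; the name getAllModes is hypothetical.

```r
# Hypothetical variant of getMode(): return every value that reaches the highest count
getAllModes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[counts == max(counts)]          # keep all values tied for the maximum
}

hits_per_9innings <- c(7, 15, 3, 12, 4)
# All five values occur once, so all five are returned
getAllModes(hits_per_9innings)
## [1]  7 15  3 12  4

pitches_by_innings <- c(12, 15, 10, 20, 10)
# Only 10 occurs twice
getAllModes(pitches_by_innings)
## [1] 10
```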

Question_8: Summarize the following survey with the table() command: What is your favorite day of the week to watch baseball? A total of 10 fans submitted this survey.

Using the table() command, we can see that the most common favorite day of the week to watch baseball is Saturday with 3 responses out of 10.

game_day<-c("Saturday", "Saturday", "Sunday", "Monday", "Saturday","Tuesday", "Sunday", "Friday", "Friday", "Monday")

table(game_day)
## game_day
##   Friday   Monday Saturday   Sunday  Tuesday 
##        2        2        3        2        1

Question_9: What is the most frequent answer recorded in the survey? Use the getMode function to compute results.

The getMode() function confirms what the table showed: Saturday is the most frequent answer.

getMode(game_day)
## [1] "Saturday"