First Steps with R

Basic calculations

You can use R for the same basic computations you would perform on a calculator.

# Addition and subtraction
2-3
[1] -1
4+5
[1] 9
# Division
2/3
[1] 0.6666667
5/2
[1] 2.5
# Exponentiation
2^3 
[1] 8
3^3
[1] 27
# Square root
sqrt(2)
[1] 1.414214
sqrt(16)
[1] 4
# Logarithms
log(2)
[1] 0.6931472
log10(10)
[1] 1
log10(100)
[1] 2
#Question_1: Compute the log base 5 of 10 and the log of 10.
log(10, base = 5)
[1] 1.430677
log(10)
[1] 2.302585

Computing some offensive metrics in Baseball

#Batting Average=(No. of Hits)/(No. of At Bats)
#What is the batting average of a player who gets 29 hits in 112 at bats?

BA=(29)/(112)
BA
[1] 0.2589286
Batting_Average=round(BA,digits = 3)
Batting_Average
[1] 0.259

Question_2: What is the batting average of a player who gets 42 hits in 212 at bats?

#On Base Percentage
#OBP=(H+BB+HBP)/(AB+BB+HBP+SF)
#Let us compute the OBP for a player with the following general stats
#AB=515,H=172,BB=84,HBP=5,SF=6
OBP=(172+84+5)/(515+84+5+6)
OBP
[1] 0.4278689
#Answer to Question_2: batting average for 42 hits in 212 at bats
BA=(42)/(212)
BA
[1] 0.1981132
#Note: despite its name, On_Base_Average stores the rounded batting average
On_Base_Average=round(BA,digits = 3)
On_Base_Average
[1] 0.198
On_Base_Percentage=round(OBP,digits = 3)
On_Base_Percentage
[1] 0.428

Question_3: Compute the OBP for a player with the following general stats: AB=565, H=156, BB=65, HBP=3, SF=7


#AB=565,H=156,BB=65,HBP=3,SF=7
#Hits are already counted in at bats, so the denominator is AB+BB+HBP+SF
OBP2=(156+65+3)/(565+65+3+7)
OBP2
[1] 0.35

Often you will want to test whether one value is less than, greater than, or equal to another.

3 == 8 # Does 3 equal 8?
[1] FALSE
3 != 8 # Is 3 different from 8?
[1] TRUE
3 <= 8 # Is 3 less than or equal to 8?
[1] TRUE

The logical operators are & for logical AND, | for logical OR, and ! for NOT. These are some examples:

# Logical Disjunction (or)
FALSE | FALSE # False OR False
[1] FALSE
# Logical Conjunction (and)
TRUE & FALSE #True AND False
[1] FALSE
# Negation
! FALSE # Not False
[1] TRUE
# Combination of statements
2 < 3 | 1 == 5 # 2<3 is True, 1==5 is False, True OR False is True
[1] TRUE

Assigning Values to Variables

In R, you create a variable and assign it a value using <- as follows:

Total_Bases <- 6 + 5
Total_Bases*3
[1] 33

To see the variables that are currently defined, use ls (as in “list”)

ls()
 [1] "allcontracts"               "BA"                         "Batting_Average"           
 [4] "con"                        "contract_length"            "contract_years"            
 [7] "contracts_mean"             "contracts_median"           "contracts_n"               
[10] "contracts_sd"               "contracts_w1sd"             "contracts_w2sd"            
[13] "contracts_w3sd"             "hits_per_9innings"          "HR_before"                 
[16] "JSn_seasons"                "n_1"                        "n_2"                       
[19] "n_3"                        "n_4"                        "n_seasons"                 
[22] "OBP"                        "OBP2"                       "On_Base_Average"           
[25] "On_Base_Average_Percentage" "On_Base_Percentage"         "pitches_by_innings"        
[28] "Robert_HRs"                 "runs_per_9innings"          "salary_ave"                
[31] "salary_ave_bask_nfl"        "strikes_by_innings"         "Total_Bases"               
[34] "Walks_before"               "wanted_HR"                  "wanted_walks"              
[37] "x_4"                        "x_6"                        "y_1"                       
[40] "y_2"                        "y_3"                        "y_4"                       

To delete a variable, use rm (as in “remove”)

rm(Total_Bases)
ls()
 [1] "allcontracts"               "BA"                         "Batting_Average"           
 [4] "con"                        "contract_length"            "contract_years"            
 [7] "contracts_mean"             "contracts_median"           "contracts_n"               
[10] "contracts_sd"               "contracts_w1sd"             "contracts_w2sd"            
[13] "contracts_w3sd"             "hits_per_9innings"          "HR_before"                 
[16] "JSn_seasons"                "n_1"                        "n_2"                       
[19] "n_3"                        "n_4"                        "n_seasons"                 
[22] "OBP"                        "OBP2"                       "On_Base_Average"           
[25] "On_Base_Average_Percentage" "On_Base_Percentage"         "pitches_by_innings"        
[28] "Robert_HRs"                 "runs_per_9innings"          "salary_ave"                
[31] "salary_ave_bask_nfl"        "strikes_by_innings"         "Walks_before"              
[34] "wanted_HR"                  "wanted_walks"               "x_4"                       
[37] "x_6"                        "y_1"                        "y_2"                       
[40] "y_3"                        "y_4"                       

Either <- or = can be used to assign a value to a variable, but I prefer <- because it is less likely to be confused with the equality operator ==.

Vectors

The basic type of object in R is a vector, which is an ordered collection of values of the same type. You can create a vector using the c() function (as in “concatenate”).

pitches_by_innings <- c(12, 15, 10, 20, 10) 
pitches_by_innings
[1] 12 15 10 20 10
strikes_by_innings <- c(9, 12, 6, 14, 9)
strikes_by_innings
[1]  9 12  6 14  9

Question_4: Define two vectors, runs_per_9innings and hits_per_9innings, each with five elements.

runs_per_9innings <- c(0, 1, 0, 3, 0)
runs_per_9innings
[1] 0 1 0 3 0
hits_per_9innings <- c(5, 0, 0, 1, 2)
hits_per_9innings
[1] 5 0 0 1 2

There are also some functions that will create vectors with regular patterns, like repeated elements.

# replicate function
rep(2, 5)
[1] 2 2 2 2 2
# consecutive numbers
1:5
[1] 1 2 3 4 5
# sequence from 1 to 10 with a step of 2
seq(1, 10, by=2)
[1] 1 3 5 7 9

Many functions and operators like + or - will work on all elements of the vector.

# add vectors
pitches_by_innings
[1] 12 15 10 20 10
strikes_by_innings
[1]  9 12  6 14  9
pitches_by_innings+strikes_by_innings
[1] 21 27 16 34 19
# compare vectors
pitches_by_innings == strikes_by_innings
[1] FALSE FALSE FALSE FALSE FALSE
# find length of vector
length(pitches_by_innings)
[1] 5
# find minimum value in vector
min(pitches_by_innings)
[1] 10
# find average value in vector
mean(pitches_by_innings)
[1] 13.4

You can access parts of a vector by using [. Recall what the value is of the vector pitches_by_innings.

pitches_by_innings
[1] 12 15 10 20 10
# If you want to get the first element:
pitches_by_innings[1]
[1] 12

Question_5: Get the first element of hits_per_9innings.

hits_per_9innings
[1] 5 0 0 1 2
hits_per_9innings[1]
[1] 5

If you want to get the last element of pitches_by_innings without explicitly typing the number of elements of pitches_by_innings, make use of the length function, which calculates the length of a vector:

pitches_by_innings[length(pitches_by_innings)]
[1] 10

Question_6: Get the last element of hits_per_9innings.

hits_per_9innings
[1] 5 0 0 1 2
hits_per_9innings[length(hits_per_9innings)]
[1] 2

You can also extract multiple values from a vector. For instance, to get the 2nd through 4th values, use

pitches_by_innings[c(2, 3, 4)]
[1] 15 10 20

Vectors can also hold strings or logical values.

player_positions <- c("catcher", "pitcher", "infielders", "outfielders")
player_positions
[1] "catcher"     "pitcher"     "infielders"  "outfielders"

Data Frames

In statistical applications, data is often stored as a data frame, which is like a spreadsheet, with rows as observations and columns as variables.

To manually create a data frame, use the data.frame() function.

data.frame(bonus = c(2, 3, 1),#in millions 
           active_roster = c("yes", "no", "yes"), 
           salary = c(1.5, 2.5, 1))#in millions 

Most often you will work with data frames loaded from a file, for example the results of a fan survey. Functions such as load() or read.table() can be used for this.

How to Make a Random Sample

To randomly select a sample, use the function sample(). The following code selects 5 numbers between 1 and 10 at random (without duplication):

sample(1:10, size=5)
[1] 2 8 7 9 3

The first argument gives the vector of data to select elements from. The second argument (size=) gives the size of the sample to select. Taking a simple random sample from a data frame is only slightly more complicated, having two steps:

1. Use sample() to select a sample of size n from a vector of the row numbers of the data frame.
2. Use the index operator [ to select those rows from the data frame.

Consider the following example with fake data. First, make up a data frame with two columns. (LETTERS is a character vector of length 26 with capital letters A to Z; LETTERS is automatically defined and pre-loaded in R)

bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
# Check data frame
bar

Suppose you want to select a random sample of size 5. First, define a variable n with the size of the sample, i.e. 5

n <- 5

Now, select a sample of size 5 from the vector with 1 to 10 (the number of rows in bar). Use the function nrow() to find the number of rows in bar instead of manually entering that number.

Use : to create a vector with all the integers between 1 and the number of rows in bar.

samplerows <- sample(1:nrow(bar), size=n) 
# print sample rows
samplerows
[1] 10  2  6  4  8

The variable samplerows contains the row numbers of bar that make up a random sample from all the rows in bar. Extract those rows from bar with

# extract rows
barsample <- bar[samplerows, ]
# print sample
print(barsample)

The code above creates a new data frame called barsample with a random sample of rows from bar.

In a single line of code:

bar[sample(1:nrow(bar), n), ]

Using Tables

The table() command counts how often each value occurs and returns the counts as a table. Its simplest usage looks like table(x), where x is a categorical variable.

For example, a survey asks people if they support the home team or not. The data is

Yes, No, No, Yes, Yes

We can enter this into R with the c() command, and summarize with the table() command as follows

x <- c("Yes","No","No","Yes","Yes") 
table(x)
x
 No Yes 
  2   3 

Numerical measures of center and spread

Suppose the yearly compensations of MLB teams’ CEOs are sampled and the following values are found (in millions):

12 .4 5 2 50 8 3 1 4 0.25

sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# the average
mean(sals) 
[1] 8.565
# the variance
var(sals)
[1] 225.5145
# the standard deviation
sd(sals)
[1] 15.01714
# the median
median(sals)
[1] 3.5
# Tukey's five number summary, useful for boxplots
# five numbers: min, lower hinge, median, upper hinge, max
fivenum(sals)
[1]  0.25  1.00  3.50  8.00 50.00
# summary statistics
summary(sals)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.250   1.250   3.500   8.565   7.250  50.000 

How about the mode?

In R we can write our own functions. A first example, shown below, computes the mode of a vector of observations x.

# Function to find the mode, i.e. most frequent value
getMode <- function(x) {
     ux <- unique(x)
     ux[which.max(tabulate(match(x, ux)))]
 }

As an example, we can use the function defined above to find the most frequent value in pitches_by_innings

# Most frequent value in pitches_by_innings

pitches_by_innings
[1] 12 15 10 20 10
getMode(pitches_by_innings)
[1] 10

Question_7: Find the most frequent value of hits_per_9innings.

hits_per_9innings
[1] 5 0 0 1 2
getMode(hits_per_9innings)
[1] 0

Question_8: Summarize the following survey with the table() command:
What is your favorite day of the week to watch baseball? A total of 10 fans submitted this survey.
Saturday, Saturday, Sunday, Monday, Saturday, Tuesday, Sunday, Friday, Friday, Monday

survey <- c("Saturday","Saturday","Sunday","Monday","Saturday","Tuesday","Sunday","Friday","Friday","Monday")
summary_table <-table(survey)
summary_table
survey
  Friday   Monday Saturday   Sunday  Tuesday 
       2        2        3        2        1 
favorite_day <- names(summary_table[summary_table==max(summary_table)])
favorite_day
[1] "Saturday"

Question_9: What is the most frequent answer recorded in the survey? Use the getMode function to compute results.

frequent_answer <- getMode(survey)
frequent_answer
[1] "Saturday"
---
title: "In-class activity #4(HW): Introduction to R for Sports"
output: html_notebook
---

__First Steps with R__

__Basic calculations__

You can use R for the same basic computations you would perform on a calculator.


```{r}
# Addition and subtraction
2-3
4+5
```

```{r}
# Division
2/3
5/2
```


```{r}
# Exponentiation
2^3 
3^3
```

```{r}
# Square root
sqrt(2)
sqrt(16)
```
```{r}
# Logarithms
log(2)
log10(10)
log10(100)
```

```{r}
#Question_1: Compute the log base 5 of 10 and the log of 10.
log(10, base = 5)
log(10)
```
__Computing some offensive metrics in Baseball__

```{r}
#Batting Average=(No. of Hits)/(No. of At Bats)
#What is the batting average of a player who gets 29 hits in 112 at bats?

BA=(29)/(112)
BA
```
```{r}
Batting_Average=round(BA,digits = 3)
Batting_Average
```
__Question_2__: What is the batting average of a player who gets 42 hits in 212 at bats?

```{r}
#On Base Percentage
#OBP=(H+BB+HBP)/(AB+BB+HBP+SF)
#Let us compute the OBP for a player with the following general stats
#AB=515,H=172,BB=84,HBP=5,SF=6
OBP=(172+84+5)/(515+84+5+6)
OBP
#Answer to Question_2: batting average for 42 hits in 212 at bats
BA=(42)/(212)
BA
#Note: despite its name, On_Base_Average stores the rounded batting average
On_Base_Average=round(BA,digits = 3)
On_Base_Average
```

```{r}
On_Base_Percentage=round(OBP,digits = 3)
On_Base_Percentage
```
__Question_3__: Compute the OBP for a player with the following general stats: AB=565, H=156, BB=65, HBP=3, SF=7

```{r}

#AB=565,H=156,BB=65,HBP=3,SF=7
#Hits are already counted in at bats, so the denominator is AB+BB+HBP+SF
OBP2=(156+65+3)/(565+65+3+7)
OBP2
```

Often you will want to test whether one value is less than, greater than, or equal to another.

```{r}
3 == 8 # Does 3 equal 8?
```

```{r}
3 != 8 # Is 3 different from 8?
```

```{r}
3 <= 8 # Is 3 less than or equal to 8?
```
The logical operators are & for logical AND, | for logical OR, and ! for NOT. These are some examples:

```{r}
# Logical Disjunction (or)
FALSE | FALSE # False OR False
```

```{r}
# Logical Conjunction (and)
TRUE & FALSE #True AND False
```


```{r}
# Negation
! FALSE # Not False
```

```{r}
# Combination of statements
2 < 3 | 1 == 5 # 2<3 is True, 1==5 is False, True OR False is True
```
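
As a small illustration of how comparisons and logical operators combine, the sketch below checks whether a value falls inside a range. The variable names `ba` and `in_range` and the cutoffs .250 and .300 are made up for this example.

```{r}
# Hypothetical example: is a batting average at least .250 but below .300?
ba <- 0.259
in_range <- ba >= 0.250 & ba < 0.300  # both comparisons must be TRUE
in_range
```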
__Assigning Values to Variables__
In R, you create a variable and assign it a value using <- as follows


```{r}
Total_Bases <- 6 + 5
Total_Bases*3
```
To see the variables that are currently defined, use ls (as in “list”)


```{r}
ls()
```

To delete a variable, use rm (as in “remove”)

```{r}
rm(Total_Bases)
```

```{r}
ls()
```


Either <- or = can be used to assign a value to a variable, but I prefer <- because it is less likely to be confused with the equality operator ==.
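
To make the difference concrete, here is a short sketch contrasting assignment with comparison. The variable names `total_hits` and `at_bats` are invented for the illustration.

```{r}
# Assignment: both forms store a value in a variable
total_hits <- 29
at_bats = 112

# Comparison: == asks a question and returns TRUE or FALSE
hits_check <- total_hits == 29
bats_check <- at_bats == 100
hits_check
bats_check
```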

___Vectors___
The basic type of object in R is a vector, which is an ordered collection of values of the same type. You can create a vector using the c() function (as in “concatenate”).


```{r}
pitches_by_innings <- c(12, 15, 10, 20, 10) 
pitches_by_innings
```

```{r}
strikes_by_innings <- c(9, 12, 6, 14, 9)
strikes_by_innings
```

__Question_4__: Define two vectors, runs_per_9innings and hits_per_9innings, each with five elements.

```{r}
runs_per_9innings <- c(0, 1, 0, 3, 0)
runs_per_9innings
hits_per_9innings <- c(5, 0, 0, 1, 2)
hits_per_9innings

```

There are also some functions that will create vectors with regular patterns, like repeated elements.

```{r}
# replicate function
rep(2, 5)
```

```{r}
# consecutive numbers
1:5
```

```{r}
# sequence from 1 to 10 with a step of 2
seq(1, 10, by=2)
```
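
The same functions accept a few extra arguments that may be handy. The sketch below shows `times` and `each` for rep() and `length.out` for seq(); the variable names are only there for the illustration.

```{r}
# repeat the whole vector three times
alternating <- rep(c(1, 2), times = 3)   # 1 2 1 2 1 2
# repeat each element twice before moving on
paired <- rep(c(1, 2), each = 2)         # 1 1 2 2
# five equally spaced values from 0 to 1
spaced <- seq(0, 1, length.out = 5)      # 0.00 0.25 0.50 0.75 1.00
alternating
paired
spaced
```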

Many functions and operators like + or - will work on all elements of the vector.


```{r}
# add vectors
pitches_by_innings
strikes_by_innings
pitches_by_innings+strikes_by_innings

```

```{r}
# compare vectors
pitches_by_innings == strikes_by_innings
```

```{r}
# find length of vector
length(pitches_by_innings)
```

```{r}
# find minimum value in vector - in the case of the 5 innings game
min(pitches_by_innings)
```

```{r}
# find average value in vector - the average pitches per inning in the game
mean(pitches_by_innings)
```
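
Division works element by element too. As a sketch, the share of pitches that were strikes in each inning can be computed from the two vectors defined earlier (repeated here so the chunk runs on its own); `strike_rate` and `overall_rate` are names chosen for this example.

```{r}
pitches_by_innings <- c(12, 15, 10, 20, 10)
strikes_by_innings <- c(9, 12, 6, 14, 9)

# strikes divided by pitches, inning by inning
strike_rate <- strikes_by_innings / pitches_by_innings
strike_rate

# share of strikes over the whole game
overall_rate <- sum(strikes_by_innings) / sum(pitches_by_innings)
overall_rate
```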

You can access parts of a vector by using [. Recall what the value is of the vector pitches_by_innings.

```{r}
pitches_by_innings
```

```{r}
# If you want to get the first element:
pitches_by_innings[1]
```

__Question_5__: Get the first element of hits_per_9innings.

```{r}
hits_per_9innings
hits_per_9innings[1]
```
If you want to get the last element of pitches_by_innings without explicitly typing the number of elements of pitches_by_innings, make use of the length function, which calculates the length of a vector:

```{r}
pitches_by_innings[length(pitches_by_innings)]
```

__Question_6__: Get the last element of hits_per_9innings.

```{r}
hits_per_9innings
hits_per_9innings[length(hits_per_9innings)]
```

You can also extract multiple values from a vector. For instance, to get the 2nd through 4th values, use

```{r}
pitches_by_innings[c(2, 3, 4)]
```
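
There are a few other common ways to index a vector. The sketch below, which redefines pitches_by_innings so it runs on its own, shows a range, a negative index, and a logical condition; the result names are invented for the example.

```{r}
pitches_by_innings <- c(12, 15, 10, 20, 10)

# 2:4 is shorthand for c(2, 3, 4)
middle <- pitches_by_innings[2:4]
# a negative index drops that element
without_first <- pitches_by_innings[-1]
# a logical condition keeps only the elements where it is TRUE
busy_innings <- pitches_by_innings[pitches_by_innings > 12]

middle
without_first
busy_innings
```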

Vectors can also hold strings or logical values.

```{r}
player_positions <- c("catcher", "pitcher", "infielders", "outfielders")
player_positions
```
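
A logical vector works the same way, and class() reports what type a vector holds. In this sketch the vector `extra_innings` and its values are made up; the last lines show that mixing numbers and strings in c() turns everything into strings.

```{r}
# Hypothetical data: did each of three games go to extra innings?
extra_innings <- c(TRUE, FALSE, FALSE)
class(extra_innings)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_extra <- sum(extra_innings)
n_extra

# a vector holds one type only, so the number 1 becomes the string "1"
mixed <- c(1, "catcher")
class(mixed)
```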

___Data Frames___
In statistical applications, data is often stored as a data frame, which is like a spreadsheet, with rows as observations and columns as variables.

To manually create a data frame, use the data.frame() function.


```{r}
data.frame(bonus = c(2, 3, 1),#in millions 
           active_roster = c("yes", "no", "yes"), 
           salary = c(1.5, 2.5, 1))#in millions 
```
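
Once a data frame is stored in a variable, its columns can be pulled out with $ and its rows filtered with [. The sketch below reuses the same made-up values; the names `players` and `active_players` are chosen for this example.

```{r}
players <- data.frame(bonus = c(2, 3, 1),          # in millions
                      active_roster = c("yes", "no", "yes"),
                      salary = c(1.5, 2.5, 1))     # in millions

# one column as a vector
players$salary
# number of rows (observations)
nrow(players)
# keep only the rows where active_roster is "yes"
active_players <- players[players$active_roster == "yes", ]
active_players
```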

Most often you will work with data frames loaded from a file, for example the results of a fan survey. Functions such as load() or read.table() can be used for this.
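
No survey file ships with this notebook, so the sketch below reads comma-separated text from a string instead, using read.csv() (a wrapper around read.table()). The column names and values are invented; with a real file you would pass its path in place of the `text` argument.

```{r}
# Hypothetical survey contents, written inline instead of read from disk
survey_text <- "fan,favorite_day
1,Saturday
2,Sunday
3,Saturday"

fan_survey <- read.csv(text = survey_text)
fan_survey
```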

___How to Make a Random Sample___
To randomly select a sample use the function sample(). The following code selects 5 numbers between 1 and 10 at random (without duplication)

```{r}
sample(1:10, size=5)
```
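
Because the draw is random, the numbers will differ each time the chunk runs. If you need the same sample again, one option is to call set.seed() first. The seed value 2024 below is arbitrary, and the variable names are only for this sketch.

```{r}
set.seed(2024)                       # fix the random number generator
first_draw <- sample(1:10, size = 5)

set.seed(2024)                       # same seed again
second_draw <- sample(1:10, size = 5)

# the two draws match because the seed was the same
identical(first_draw, second_draw)
```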

The first argument gives the vector of data to select elements from.
The second argument (size=) gives the size of the sample to select.
Taking a simple random sample from a data frame is only slightly more complicated, having two steps:

1. Use sample() to select a sample of size n from a vector of the row numbers of the data frame.
2. Use the index operator [ to select those rows from the data frame.

Consider the following example with fake data. First, make up a data frame with two columns. (LETTERS is a character vector of length 26 with capital letters A to Z; LETTERS is automatically defined and pre-loaded in R)

```{r}
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
# Check data frame
bar
```

Suppose you want to select a random sample of size 5. First, define a variable n with the size of the sample, i.e. 5

```{r}
n <- 5
```

Now, select a sample of size 5 from the vector with 1 to 10 (the number of rows in bar). Use the function nrow() to find the number of rows in bar instead of manually entering that number.

Use : to create a vector with all the integers between 1 and the number of rows in bar.

```{r}
samplerows <- sample(1:nrow(bar), size=n) 
# print sample rows
samplerows
```
The variable samplerows contains the row numbers of bar that make up a random sample from all the rows in bar. Extract those rows from bar with

```{r}
# extract rows
barsample <- bar[samplerows, ]
# print sample
print(barsample)
```

The code above creates a new data frame called barsample with a random sample of rows from bar.

In a single line of code:


```{r}
bar[sample(1:nrow(bar), n), ]
```
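
A slightly more defensive variant, offered as a sketch, uses seq_len() for the row numbers. For a data frame with zero rows, 1:nrow(bar) would give c(1, 0), whereas seq_len(nrow(bar)) gives an empty vector. The data frame is rebuilt here so the chunk runs on its own.

```{r}
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
n <- 5

# seq_len(nrow(bar)) is 1, 2, ..., 10 here
barsample <- bar[sample(seq_len(nrow(bar)), size = n), ]
nrow(barsample)
```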

___Using Tables___
The table() command counts how often each value occurs and returns the counts as a table. Its simplest usage looks like table(x), where x is a categorical variable.

For example, a survey asks people if they support the home team or not. The data is

Yes, No, No, Yes, Yes

We can enter this into R with the c() command, and summarize with the table() command as follows

```{r}
x <- c("Yes","No","No","Yes","Yes") 
table(x)
```
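
Counts can be turned into proportions with prop.table(). The sketch below repeats the survey data so it runs on its own; `counts` and `shares` are names chosen for this example.

```{r}
x <- c("Yes", "No", "No", "Yes", "Yes")
counts <- table(x)

# divide each count by the total number of answers
shares <- prop.table(counts)
shares
```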

___Numerical measures of center and spread___
Suppose the yearly compensations of MLB teams’ CEOs are sampled and the following values are found (in millions):

12 .4 5 2 50 8 3 1 4 0.25


```{r}
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# the average
mean(sals) 
```

```{r}
# the variance
var(sals)
```

```{r}
# the standard deviation
sd(sals)
```

```{r}
# the median
median(sals)
```

```{r}
# Tukey's five number summary, useful for boxplots
# five numbers: min, lower hinge, median, upper hinge, max
fivenum(sals)
```

```{r}
# summary statistics
summary(sals)
```
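
A few of these measures are related to each other, which the sketch below checks on the same data: the standard deviation is the square root of the variance, and the interquartile range is the distance between the 1st and 3rd quartiles reported by summary(). The result names are invented for the example.

```{r}
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)

# sd() is the square root of var()
sd_from_var <- sqrt(var(sals))
sd_from_var

# smallest and largest value
sal_range <- range(sals)
sal_range

# interquartile range: 3rd quartile minus 1st quartile
sal_iqr <- IQR(sals)
sal_iqr
```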

__How about the mode?__
In R we can write our own functions. A first example, shown below, computes the mode of a vector of observations x.

```{r}
# Function to find the mode, i.e. most frequent value
getMode <- function(x) {
     ux <- unique(x)
     ux[which.max(tabulate(match(x, ux)))]
 }
```
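
To see what the function does, its steps can be run one at a time. The sketch below uses the pitch counts from earlier as input; the intermediate names `idx` and `counts` are added here for illustration. Note that which.max() returns the first maximum, so when several values tie, the one that appears first in x is returned.

```{r}
x <- c(12, 15, 10, 20, 10)

ux <- unique(x)             # distinct values: 12 15 10 20
idx <- match(x, ux)         # position of each x in ux: 1 2 3 4 3
counts <- tabulate(idx)     # how often each position occurs: 1 1 2 1
mode_value <- ux[which.max(counts)]  # value with the highest count
mode_value
```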


As an example, we can use the function defined above to find the most frequent value in pitches_by_innings


```{r}
# Most frequent value in pitches_by_innings

pitches_by_innings
getMode(pitches_by_innings)
```


__Question_7__: Find the most frequent value of hits_per_9innings.

```{r}
hits_per_9innings
getMode(hits_per_9innings)
```
__Question_8__: Summarize the following survey with the `table()` command:
What is your favorite day of the week to watch baseball? A total of 10 fans submitted this survey.
Saturday, Saturday, Sunday, Monday, Saturday,Tuesday, Sunday, Friday, Friday, Monday

```{r}
survey <- c("Saturday","Saturday","Sunday","Monday","Saturday","Tuesday","Sunday","Friday","Friday","Monday")
summary_table <-table(survey)
summary_table
favorite_day <- names(summary_table[summary_table==max(summary_table)])
favorite_day
```
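
Another way to read off the most popular day, shown here as a sketch, is to sort the table so the largest count comes first. The survey vector is repeated so the chunk runs on its own; `sorted_table` is a name chosen for this example.

```{r}
survey <- c("Saturday", "Saturday", "Sunday", "Monday", "Saturday",
            "Tuesday", "Sunday", "Friday", "Friday", "Monday")

# largest count first
sorted_table <- sort(table(survey), decreasing = TRUE)
sorted_table

# the name of the first entry is the most frequent answer
names(sorted_table)[1]
```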

__Question_9__: What is the most frequent answer recorded in the survey? Use the getMode function to compute results. 

```{r}
frequent_answer <- getMode(survey)
frequent_answer
```

