This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

plot(cars)

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

mean(cars$speed)
[1] 15.4
mean(cars$dist)
[1] 42.98
max(cars$speed)
[1] 25
max(cars$dist)
[1] 120

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

2^5
[1] 32

LOGARITHMIC FUNCTION: AS X GETS LARGER, THE FUNCTION FLATTENS OUT; IT DOES NOT KEEP INCREASING AT THE SAME RATE. SO FOR PROBLEMS WHERE THE ATTRIBUTES HAVE VERY DIFFERENT SCALES, THE LOG FUNCTION IS MORE SUITABLE. WE ALSO USE IT TO LINEARIZE PROBLEMS.

FOR EXAMPLE, A SCIENTIST IN THE HEALTH CARE INDUSTRY WANTS TO DETERMINE THE AGE OF A PARTICULAR SPECIES. THE UNDERLYING FUNCTION IS EXPONENTIAL, BUT APPLYING THE NATURAL LOG (LN) TURNS IT INTO A LINEAR ONE.

log(2)
[1] 0.6931472
log10(100)
[1] 2
log10(0.5)
[1] -0.30103
log10(10)
[1] 1
log(10,base=5)
[1] 1.430677
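
The linearization idea can be sketched with made-up numbers. The data below are hypothetical (an exact exponential curve with a starting value of 3 and a growth rate of 0.5, chosen only for illustration); taking the natural log of y should turn the curve into a straight line whose slope is the growth rate.

```r
# Hypothetical data: y grows exponentially in x, y = 3 * exp(0.5 * x)
x <- 1:10
y <- 3 * exp(0.5 * x)

# On the log scale the relationship is linear: log(y) = log(3) + 0.5 * x
fit <- lm(log(y) ~ x)

# The intercept should be close to log(3) and the slope close to 0.5
coef(fit)
```

With real measurements the points would not fall exactly on the line, so the fitted slope would only estimate the growth rate.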

COMPUTING OFFENSIVE METRICS IN BASEBALL

BA=(29)/(112)
BA
[1] 0.2589286
Batting_average = round(BA, digits=3)
Batting_average
[1] 0.259
#ON BASE PERCENTAGE
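#OBP=(H+BB+HBP)/(At Bats+H+BB+HBP+SF)
#NOTE: THE STANDARD DEFINITION IS (H+BB+HBP)/(AB+BB+HBP+SF); THE FORMULA USED HERE ALSO COUNTS H IN THE DENOMINATOR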
#LET US COMPUTE THE OBP FOR A PLAYER WITH THE FOLLOWING GENERAL STATS
#AB=515,H=172,BB=84,HBP=5,SF=6
OBP=(172+84+5)/(515+172+84+5+6)
OBP
[1] 0.3337596
On_Base_Percentage=round(OBP,digits=3)
On_Base_Percentage
[1] 0.334
#Question_3:Compute the OBP for a player with the following general stats:
#AB=565,H=156,BB=65,HBP=3,SF=7
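
Question 3 is left unanswered in the notebook. One possible way to answer it is sketched below, using the OBP formula exactly as this notebook writes it (with H in the denominator). The helper name obp_notebook is hypothetical and not part of the original exercise.

```r
# Hypothetical helper: OBP as defined in this notebook
# (H + BB + HBP) / (AB + H + BB + HBP + SF)
obp_notebook <- function(AB, H, BB, HBP, SF) {
  (H + BB + HBP) / (AB + H + BB + HBP + SF)
}

# Stats given in Question 3
q3_obp <- obp_notebook(AB = 565, H = 156, BB = 65, HBP = 3, SF = 7)

# Round to three digits, as done for the earlier player
round(q3_obp, digits = 3)
```

Wrapping the formula in a function avoids retyping it for each player; calling it with the earlier stats (AB=515, H=172, BB=84, HBP=5, SF=6) should reproduce the 0.334 shown above.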
3==8
[1] FALSE
3!=8
[1] TRUE
3<=8
[1] TRUE
3>4
[1] FALSE

LOGICAL OPERATORS

#LOGICAL DISJUNCTION (OR)
FALSE|FALSE #FALSE OR FALSE
[1] FALSE
#LOGICAL CONJUNCTION (AND)
TRUE&FALSE #TRUE AND FALSE
[1] FALSE
# NEGATION
!FALSE #NOT FALSE
[1] TRUE
#COMBINATION OF STATEMENTS
2<3 | 1==5 # 2<3 IS TRUE, 1==5 IS FALSE, TRUE OR FALSE IS TRUE
[1] TRUE
Total_Bases <- 6+5
Total_Bases*3
[1] 33
ls()
[1] "BA"                 "Batting_average"   
[3] "OBP"                "On_Base_Percentage"
[5] "Total_Bases"       
#TO DELETE A VARIABLE, USE rm() (SHORT FOR REMOVE)
rm(Total_Bases)

VECTORS

pitches_by_innings <- c(12,15,10,20,10)
pitches_by_innings
[1] 12 15 10 20 10
strikes_by_innings <- c(9,12,6,14,9)
strikes_by_innings
[1]  9 12  6 14  9
runs_per_9innings <- c(21,3,17,6,9)
hits_per_9innings <- c(15,4,26,10,7)

CREATE VECTORS WITH REGULAR PATTERNS

#REPLICATE FUNCTION
rep(2,5)
[1] 2 2 2 2 2
rep(1,4)
[1] 1 1 1 1
#CONSECUTIVE NUMBERS
1:5
[1] 1 2 3 4 5
2:10
[1]  2  3  4  5  6  7  8  9 10
#SEQUENCE FROM 1 TO 10 WITH A STEP OF 2
seq(1,10,by=2)
[1] 1 3 5 7 9
seq(2,13,by=3)
[1]  2  5  8 11
#ADD VECTORS
pitches_by_innings+strikes_by_innings
[1] 21 27 16 34 19
#COMPARE VECTORS
pitches_by_innings == strikes_by_innings
[1] FALSE FALSE FALSE FALSE FALSE
#FIND LENGTH OF VECTOR
length(pitches_by_innings)
[1] 5
#FIND MINIMUM VALUE IN VECTOR
min(pitches_by_innings)
[1] 10
#FIND AVERAGE VALUE IN VECTOR
mean(pitches_by_innings)
[1] 13.4
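
Because arithmetic on vectors works element by element, the two vectors above can be combined into per-inning rates. The sketch below is only an illustration; the names strike_rate and non_strikes are hypothetical.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)
strikes_by_innings <- c(9, 12, 6, 14, 9)

# Element-wise division: share of pitches in each inning that were strikes
strike_rate <- strikes_by_innings / pitches_by_innings
round(strike_rate, digits = 2)

# Element-wise subtraction: pitches in each inning that were not strikes
non_strikes <- pitches_by_innings - strikes_by_innings
non_strikes

# Overall strike rate across all five innings
sum(strikes_by_innings) / sum(pitches_by_innings)
```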

YOU CAN ACCESS PARTS OF A VECTOR BY USING []. RECALL THE VALUES STORED IN THE VECTOR:

pitches_by_innings
[1] 12 15 10 20 10
#GET THE FIRST ELEMENT
pitches_by_innings[1]
[1] 12
#FIRST ELEMENT OF HITS_PER_9INNINGS
hits_per_9innings[1]
[1] 15

IF YOU WANT THE LAST ELEMENT OF A VECTOR, FIRST FIND ITS LENGTH, THEN USE THAT LENGTH AS THE INDEX

pitches_by_innings[length(pitches_by_innings)]
[1] 10
#GET THE LAST ELEMENT OF hits_per_9innings
hits_per_9innings[length(hits_per_9innings)]
[1] 7

YOU CAN ALSO EXTRACT MULTIPLE VALUES FROM A VECTOR. FOR INSTANCE, TO GET THE 2ND THROUGH 4TH VALUES:

pitches_by_innings[c(2,3,4)]
[1] 15 10 20
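
A few equivalent ways to index are sketched below as an illustration; they use only base R and the vector defined earlier.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)

# The range 2:4 is shorthand for c(2, 3, 4)
pitches_by_innings[2:4]

# A negative index drops that position instead of selecting it
pitches_by_innings[-1]

# tail() returns the last n elements without computing the length first
tail(pitches_by_innings, n = 1)
```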

VECTORS CAN ALSO HOLD STRINGS OR LOGICAL VALUES

player_positions <- c("catcher","pitcher","infielders","outfielders")
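
The notebook shows a string vector but not a logical one. As a small sketch, a comparison on a numeric vector produces a logical vector, which can then be used as an index. The name is_long_inning and the threshold of 12 pitches are hypothetical choices made for illustration.

```r
player_positions <- c("catcher", "pitcher", "infielders", "outfielders")
pitches_by_innings <- c(12, 15, 10, 20, 10)

# A comparison returns one TRUE/FALSE per element
is_long_inning <- pitches_by_innings > 12
is_long_inning

# A logical vector used as an index keeps only the TRUE positions
pitches_by_innings[is_long_inning]

# TRUE counts as 1, so sum() gives the number of matching innings
sum(is_long_inning)

# class() confirms the type of each vector
class(player_positions)
class(is_long_inning)
```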

DATA FRAMES

In statistical applications, data is often stored as a data frame, which is like a spreadsheet, with rows as observations and columns as variables.

To manually create a data frame, use the data.frame() function.
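
As a minimal sketch, the pitch and strike counts from the vectors section can be combined into one data frame. The name pitching and the inning column are hypothetical additions for illustration.

```r
# Each argument becomes a column; all columns must have the same length
pitching <- data.frame(
  inning  = 1:5,
  pitches = c(12, 15, 10, 20, 10),
  strikes = c(9, 12, 6, 14, 9)
)

# Inspect the structure: one row per inning, one column per variable
str(pitching)

# A single column is extracted with $
pitching$pitches

# Number of observations and variables
nrow(pitching)
ncol(pitching)
```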

Most often you will be using data frames loaded from a file. For example, you might load the results of a fan survey. The functions load() or read.table() can be used for this.

HOW TO MAKE A RANDOM SAMPLE

To randomly select a sample, use the function sample(). The following code selects 5 numbers between 1 and 10 at random (without duplication):

sample(1:10,size=5)
[1] 5 4 9 2 7
bar <- data.frame(var1=LETTERS[1:10],var2=1:10)
# CHECK DATA FRAME
bar
n<-5
samplerows <- sample(1:nrow(bar),size=n)
#print sample rows
samplerows
[1] 9 1 6 5 4
#THE VARIABLE SAMPLEROWS CONTAINS THE ROWS OF BAR WHICH MAKE A RANDOM SAMPLE FROM ALL THE ROWS IN BAR. EXTRACT THOSE ROWS FROM BAR WITH:
#extract rows
barsample <- bar[samplerows, ]
#print sample
print(barsample)

THE CODE ABOVE CREATES A NEW DATA FRAME CALLED barsample WITH A RANDOM SAMPLE OF ROWS FROM bar.

#IN A SINGLE LINE OF CODE:
bar[sample(1:nrow(bar),n), ]
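
The sampled values shown above will generally differ on every run. If a reproducible sample is wanted, one common approach is to call set.seed() before sampling, as sketched below. The seed value 2024 is an arbitrary choice for illustration.

```r
# Setting the same seed before each call makes the draws repeatable
set.seed(2024)
first_draw <- sample(1:10, size = 5)
set.seed(2024)
second_draw <- sample(1:10, size = 5)
identical(first_draw, second_draw)

# The same idea applied to sampling rows of a data frame
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
set.seed(2024)
barsample <- bar[sample(nrow(bar), size = 5), ]
nrow(barsample)
```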

USING TABLES

The table() command counts how often each distinct value occurs. Its simplest usage is table(x), where x is a categorical variable.

For example, a survey asks people whether they support the home team or not. The responses are:

Yes, No, No, Yes, Yes

We can enter this into R with the c() command and summarize it with the table() command as follows:

x<-c("Yes","No","No","Yes","Yes")
table(x)
x
 No Yes 
  2   3 
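
Counts are often easier to compare as proportions. As a short sketch, base R's prop.table() converts the counts from table() into shares of the total.

```r
x <- c("Yes", "No", "No", "Yes", "Yes")
counts <- table(x)

# Convert counts into proportions that sum to 1
prop.table(counts)

# Name of the most common answer
names(counts)[which.max(counts)]
```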

NUMERICAL MEASURES OF CENTER AND SPREAD

Suppose the yearly compensations of MLB teams’ CEOs are sampled and the following values are found (in millions):

12, 0.4, 5, 2, 50, 8, 3, 1, 4, 0.25

sals <- c(12,.4,5,2,50,8,3,1,4,0.25)
# the average
mean(sals)
[1] 8.565
#THE VARIANCE
var(sals)
[1] 225.5145
#THE STANDARD DEVIATION
sd(sals)
[1] 15.01714
#THE MEDIAN
median(sals)
[1] 3.5
#TUKEY'S FIVE NUMBER SUMMARY, USEFUL FOR BOXPLOTS
#FIVE NUMBER: MIN,LOWER HINGE, MEDIAN, UPPER HINGE, MAX
fivenum(sals)
[1]  0.25  1.00  3.50  8.00 50.00
#SUMMARY STATISTICS
summary(sals)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.250   1.250   3.500   8.565   7.250  50.000 
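
The mean (8.565) sits well above the median (3.5) because of the single 50 million value. The sketch below illustrates how much more the mean reacts to that one observation than the median does, and checks that the standard deviation is the square root of the variance. The name sals_trimmed is hypothetical.

```r
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)

# The standard deviation is the square root of the variance
all.equal(sd(sals), sqrt(var(sals)))

# Drop the largest salary and compare the measures of center
sals_trimmed <- sals[sals < 50]
mean(sals) - mean(sals_trimmed)      # the mean shifts by several million
median(sals) - median(sals_trimmed)  # the median shifts by much less
```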

HOW ABOUT THE MODE?

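#FUNCTION TO FIND THE MODE, I.E. THE MOST FREQUENT VALUE
#IF SEVERAL VALUES TIE, THE ONE THAT APPEARS FIRST IN x IS RETURNED
getmode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x,ux)))]   # count each value and pick the first maximum
}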

As an example, we can use the function defined above to find the most frequent value in pitches_by_innings.

#MOST FREQUENT VALUE IN pitches_by_innings
getmode(pitches_by_innings)
[1] 10
# 7 FIND THE MOST FREQUENT VALUE OF hits_per_9innings
getmode(hits_per_9innings)
[1] 15
# 8 SUMMARIZE THE FOLLOWING SURVEY WITH THE 'table()' command:
# What is your favorite day of the week to watch baseball? A total of 10 fans submitted this survey.
#Saturday, Saturday, Sunday, Monday, Saturday,Tuesday, Sunday, Friday, Friday, Monday
game_day<-c("Saturday", "Saturday", "Sunday", "Monday", "Saturday","Tuesday", "Sunday", "Friday", "Friday", "Monday")
# 9 WHAT IS THE MOST FREQUENT ANSWER RECORDED IN THE SURVEY? USE THE GETMODE FUNCTION TO COMPUTE RESULTS
getmode(game_day)
[1] "Saturday"
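
Question 8 asks for a table() summary, which the notebook does not show. A possible sketch is given below; the names day_counts and top_days are hypothetical. It also lists every value that reaches the highest count, which can serve as a cross-check on the getmode() result.

```r
game_day <- c("Saturday", "Saturday", "Sunday", "Monday", "Saturday",
              "Tuesday", "Sunday", "Friday", "Friday", "Monday")

# Count how many fans chose each day
day_counts <- table(game_day)
day_counts

# Same counts, most popular day first
sort(day_counts, decreasing = TRUE)

# All days that share the highest count (more than one if there is a tie)
top_days <- names(day_counts)[day_counts == max(day_counts)]
top_days
```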
