This is an R Markdown
Notebook. When you execute code within the notebook, the results appear
beneath the code.
Try executing this chunk by clicking the Run button within
the chunk or by placing your cursor inside it and pressing
Ctrl+Shift+Enter.
plot(cars)

Add a new chunk by clicking the Insert Chunk button on the
toolbar or by pressing Ctrl+Alt+I.
mean(cars$speed)
[1] 15.4
mean(cars$dist)
[1] 42.98
max(cars$speed)
[1] 25
max(cars$dist)
[1] 120
When you save the notebook, an HTML file containing the code and
output will be saved alongside it (click the Preview button or
press Ctrl+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the
editor. Consequently, unlike Knit, Preview does not
run any R code chunks. Instead, the output of the chunk when it was last
run in the editor is displayed.
2^5
[1] 32
LOGARITHMIC FUNCTION: AS X GETS LARGER THE FUNCTION FLATTENS OUT; IT DOES
NOT INCREASE AT A CONSTANT RATE. SO FOR PROBLEMS WHERE THE ATTRIBUTES
HAVE VERY DIFFERENT SCALES, THE LOG FUNCTION IS MORE SUITABLE. WE USE IT TO
LINEARIZE PROBLEMS.
FOR EXAMPLE, A SCIENTIST IN THE HEALTH CARE INDUSTRY WANTS TO DETERMINE
THE AGE OF A PARTICULAR SPECIES. THE UNDERLYING FUNCTION IS EXPONENTIAL, BUT
APPLYING LOG PROPERTIES (LN) TURNS IT INTO A LINEAR ONE.
log(2)
[1] 0.6931472
log10(100)
[1] 2
log10(0.5)
[1] -0.30103
log10(10)
[1] 1
log(10,base=5)
[1] 1.430677
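To make the "linearize" remark concrete, here is a minimal sketch with made-up numbers. The coefficients `a` and `b` and the vector names are assumptions chosen only for illustration; they are not part of the original notes.

```r
# Hypothetical exponential data: y = a * exp(b * x)
a <- 3
b <- 0.5
x <- 0:5
y <- a * exp(b * x)

# On the raw scale the step from one point to the next keeps growing
raw_steps <- diff(y)
raw_steps

# On the log scale, log(y) = log(a) + b * x, so every step equals b
log_steps <- diff(log(y))
log_steps
```

Because the log-scale steps are constant, a straight line describes `log(y)` against `x`, which is what linearizing the problem means here.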
COMPUTING OFFENSIVE METRICS IN BASEBALL
BA=(29)/(112)
BA
[1] 0.2589286
Batting_average = round(BA, digits=3)
Batting_average
[1] 0.259
#ON BASE PERCENTAGE
#OBP=(H+BB+HBP)/(AB+BB+HBP+SF)
#LET US COMPUTE THE OBP FOR A PLAYER WITH THE FOLLOWING GENERAL STATS
#AB=515,H=172,BB=84,HBP=5,SF=6
OBP=(172+84+5)/(515+84+5+6)
OBP
[1] 0.4278689
On_Base_Percentage=round(OBP,digits=3)
On_Base_Percentage
[1] 0.428
#Question_3:Compute the OBP for a player with the following general stats:
#AB=565,H=156,BB=65,HBP=3,SF=7
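One possible way to answer Question 3 is to wrap the calculation in a small helper. This is only a sketch: the function name `on_base_percentage` is made up for illustration, and it assumes the standard OBP definition, in which hits appear in the numerator only and the denominator is AB + BB + HBP + SF.

```r
# Hypothetical helper, assuming OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
on_base_percentage <- function(H, BB, HBP, AB, SF) {
  (H + BB + HBP) / (AB + BB + HBP + SF)
}

# Question 3 stats: AB=565, H=156, BB=65, HBP=3, SF=7
obp_q3 <- on_base_percentage(H = 156, BB = 65, HBP = 3, AB = 565, SF = 7)
round(obp_q3, digits = 3)  # 224 / 640 = 0.35
```

Passing the stats as named arguments keeps the call readable and avoids mixing up the order of the five numbers.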
3==8
[1] FALSE
3!=8
[1] TRUE
3<=8
[1] TRUE
3>4
[1] FALSE
LOGICAL OPERATORS
#LOGICAL DISJUNCTION (OR)
FALSE|FALSE #FALSE OR FALSE
[1] FALSE
#LOGICAL CONJUNCTION (AND)
TRUE&FALSE #TRUE AND FALSE
[1] FALSE
# NEGATION
!FALSE #NOT FALSE
[1] TRUE
#COMBINATION OF STATEMENTS
2<3 | 1==5 # 2<3 IS TRUE, 1==5 IS FALSE, TRUE OR FALSE IS TRUE
[1] TRUE
Total_Bases <- 6+5
Total_Bases*3
[1] 33
ls()
[1] "BA" "Batting_average"
[3] "OBP" "On_Base_Percentage"
[5] "Total_Bases"
#TO DELETE A VARIABLE, USE rm() (SHORT FOR REMOVE)
rm(Total_Bases)
VECTORS
pitches_by_innings <- c(12,15,10,20,10)
pitches_by_innings
[1] 12 15 10 20 10
strikes_by_innings <- c(9,12,6,14,9)
strikes_by_innings
[1] 9 12 6 14 9
runs_per_9innings <- c(21,3,17,6,9)
hits_per_9innings <- c(15,4,26,10,7)
CREATE VECTORS WITH REGULAR PATTERNS
#REPLICATE FUNCTION
rep(2,5)
[1] 2 2 2 2 2
rep(1,4)
[1] 1 1 1 1
#CONSECUTIVE NUMBERS
1:5
[1] 1 2 3 4 5
2:10
[1] 2 3 4 5 6 7 8 9 10
#SEQUENCE FROM 1 TO 10 WITH A STEP OF 2
seq(1,10,by=2)
[1] 1 3 5 7 9
seq(2,13,by=3)
[1] 2 5 8 11
#ADD VECTORS
pitches_by_innings+strikes_by_innings
[1] 21 27 16 34 19
#COMPARE VECTORS
pitches_by_innings == strikes_by_innings
[1] FALSE FALSE FALSE FALSE FALSE
#FIND LENGTH OF VECTOR
length(pitches_by_innings)
[1] 5
#FIND MINIMUM VALUE IN VECTOR
min(pitches_by_innings)
[1] 10
#FIND AVERAGE VALUE IN VECTOR
mean(pitches_by_innings)
[1] 13.4
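Element-wise arithmetic works for division too. As a small sketch, the two vectors defined earlier can be combined into a strike rate per inning; the name `strike_rate` is an assumption introduced here for illustration.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)
strikes_by_innings <- c(9, 12, 6, 14, 9)

# Each strike count is divided by the pitch count in the same position
strike_rate <- strikes_by_innings / pitches_by_innings
strike_rate             # 0.75 0.80 0.60 0.70 0.90

# Position (inning) with the highest strike rate
which.max(strike_rate)  # 5
```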
YOU CAN ACCESS PARTS OF A VECTOR BY USING []. FIRST, RECALL THE
VALUES STORED IN THE VECTOR
pitches_by_innings
[1] 12 15 10 20 10
#GET THE FIRST ELEMENT
pitches_by_innings[1]
[1] 12
#FIRST ELEMENT OF HITS_PER_9INNINGS
hits_per_9innings[1]
[1] 15
IF YOU WANT THE LAST ELEMENT OF A VECTOR, FIRST FIND ITS LENGTH,
THEN USE THAT NUMBER AS THE INDEX
pitches_by_innings[length(pitches_by_innings)]
[1] 10
#GET THE LAST ELEMENT OF THE hits_per_9innings
hits_per_9innings[length(hits_per_9innings)]
[1] 7
YOU CAN ALSO EXTRACT MULTIPLE VALUES FROM A VECTOR. FOR
INSTANCE, TO GET THE 2ND THROUGH 4TH VALUES:
pitches_by_innings[c(2,3,4)]
[1] 15 10 20
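A few other indexing forms give the same kind of result. The sketch below uses only base R; the variable names `middle`, `busy`, and `without_first` are assumptions introduced for illustration.

```r
pitches_by_innings <- c(12, 15, 10, 20, 10)

# A range is shorthand for c(2, 3, 4)
middle <- pitches_by_innings[2:4]
middle         # 15 10 20

# A logical condition keeps only the elements where it is TRUE
busy <- pitches_by_innings[pitches_by_innings > 12]
busy           # 15 20

# A negative index drops that position instead of selecting it
without_first <- pitches_by_innings[-1]
without_first  # 15 10 20 10
```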
VECTORS CAN ALSO HOLD STRINGS OR LOGICAL VALUES
player_positions <- c("catcher","pitcher","infielders","outfielders")
DATA FRAMES
In statistical applications, data is
often stored as a data frame, which is like a spreadsheet, with rows as
observations and columns as variables.
To manually create a data frame, use the data.frame()
function.
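A minimal sketch of building a data frame by hand follows. The data frame name `pitching` and its column names are assumptions for illustration; the pitch and strike values are the ones used in the vector examples.

```r
# Each argument becomes a column; all columns must have the same length
pitching <- data.frame(
  inning  = 1:5,
  pitches = c(12, 15, 10, 20, 10),
  strikes = c(9, 12, 6, 14, 9)
)

str(pitching)      # structure: 5 observations of 3 variables
pitching$pitches   # a single column, returned as a vector
nrow(pitching)     # 5
ncol(pitching)     # 3
```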
Most often you will be using data frames loaded from a file, for
example the results of a fan survey. The functions load() or
read.table() can be used for this.
HOW TO MAKE A RANDOM SAMPLE
To randomly select a sample, use the function sample(). The following
code selects 5 numbers between 1 and 10 at random (without duplication):
sample(1:10,size=5)
[1] 5 4 9 2 7
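Because sample() is random, rerunning the chunk will usually print different numbers than the ones shown. If a repeatable draw is wanted, one common approach is to fix the seed first. The seed value 2024 and the variable names below are arbitrary choices for illustration.

```r
set.seed(2024)                       # arbitrary seed, for illustration only
first_draw <- sample(1:10, size = 5)

set.seed(2024)                       # same seed again
second_draw <- sample(1:10, size = 5)

identical(first_draw, second_draw)   # TRUE: same seed, same draw
anyDuplicated(first_draw)            # 0: sampling is without replacement by default
```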
bar <- data.frame(var1=LETTERS[1:10],var2=1:10)
# CHECK DATA FRAME
bar
n<-5
samplerows <- sample(1:nrow(bar),size=n)
#print sample rows
samplerows
[1] 9 1 6 5 4
#THE VARIABLE samplerows CONTAINS THE ROW NUMBERS OF bar THAT MAKE UP THE RANDOM SAMPLE.
#EXTRACT THOSE ROWS FROM bar WITH:
barsample <- bar[samplerows, ]
#print sample
print(barsample)
THE CODE ABOVE CREATES A NEW DATA FRAME CALLED barsample WITH
A RANDOM SAMPLE OF ROWS FROM bar
#IN A SINGLE LINE OF CODE:
bar[sample(1:nrow(bar),n), ]
USING TABLES
The table() command counts how often each value occurs. Its simplest
usage looks like table(x), where x is a categorical variable.
For example, a survey asks people if they support the home team
or not. The data is:
Yes, No, No, Yes, Yes
We can enter this into R with the c() command, and summarize with
the table() command as follows:
x<-c("Yes","No","No","Yes","Yes")
table(x)
x
No Yes
2 3
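The counts can also be turned into proportions. This sketch uses base R's prop.table(); the names `counts` and `shares` are assumptions introduced here.

```r
x <- c("Yes", "No", "No", "Yes", "Yes")

counts <- table(x)            # No: 2, Yes: 3
shares <- prop.table(counts)  # each count divided by the total
shares                        # No: 0.4, Yes: 0.6
```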
NUMERICAL MEASURES OF CENTER AND SPREAD
Suppose the yearly compensations of MLB teams' CEOs are sampled and the
following values are found (in millions):
12 .4 5 2 50 8 3 1 4 0.25
sals <- c(12,.4,5,2,50,8,3,1,4,0.25)
# the average
mean(sals)
[1] 8.565
#THE VARIANCE
var(sals)
[1] 225.5145
#THE STANDARD DEVIATION
sd(sals)
[1] 15.01714
#THE MEDIAN
median(sals)
[1] 3.5
#TUKEY'S FIVE NUMBER SUMMARY, USEFUL FOR BOXPLOTS
#FIVE NUMBER: MIN,LOWER HINGE, MEDIAN, UPPER HINGE, MAX
fivenum(sals)
[1] 0.25 1.00 3.50 8.00 50.00
#SUMMARY STATISTICS
summary(sals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.250 1.250 3.500 8.565 7.250 50.000
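To see what var() and sd() compute, the sample variance can be reproduced by hand: sum the squared deviations from the mean and divide by n - 1. The names `manual_var` and `manual_sd` are assumptions for illustration.

```r
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
n <- length(sals)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((sals - mean(sals))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(sals))  # TRUE
all.equal(manual_sd, sd(sals))    # TRUE
```

Note that the mean (8.565) sits well above the median (3.5) because the single value of 50 pulls the average up, while the median is unaffected by it.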
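HOW ABOUT THE MODE?
#FUNCTION TO FIND THE MODE, i.e. MOST FREQUENT VALUE
getmode <- function(x) {
  ux <- unique(x)  # distinct values, in order of first appearance
  # count how often each distinct value occurs and return the most frequent one
  ux[which.max(tabulate(match(x, ux)))]
}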
As an example, we can use the function defined above to find the
most frequent value in pitches_by_innings
#MOST FREQUENT VALUE IN pitches_by_innings
getmode(pitches_by_innings)
[1] 10
# 7 FIND THE MOST FREQUENT VALUE OF hits_per_9innings
getmode(hits_per_9innings)
[1] 15
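The result 15 deserves a caveat: every value in hits_per_9innings occurs exactly once, so there is no real mode. which.max() returns the first maximum, so getmode() falls back to the first element. The sketch below repeats the function so it runs on its own and checks the counts with table().

```r
getmode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

hits_per_9innings <- c(15, 4, 26, 10, 7)

# Every value appears once, so the highest count is 1
hit_counts <- table(hits_per_9innings)
max(hit_counts)             # 1

# With all counts tied, the first value in the vector is returned
getmode(hits_per_9innings)  # 15
```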
# 8 SUMMARIZE THE FOLLOWING SURVEY WITH THE 'table()' command:
# What is your favorite day of the week to watch baseball? A total of 10 fans submitted this survey.
#Saturday, Saturday, Sunday, Monday, Saturday,Tuesday, Sunday, Friday, Friday, Monday
game_day<-c("Saturday", "Saturday", "Sunday", "Monday", "Saturday","Tuesday", "Sunday", "Friday", "Friday", "Monday")
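Question 8 asks for a table() summary of these answers. A minimal sketch of that step follows; the name `day_counts` is an assumption introduced here, and the vector is repeated so the block runs on its own.

```r
game_day <- c("Saturday", "Saturday", "Sunday", "Monday", "Saturday",
              "Tuesday", "Sunday", "Friday", "Friday", "Monday")

# Count how many fans chose each day (names are sorted alphabetically)
day_counts <- table(game_day)
day_counts
# Friday 2, Monday 2, Saturday 3, Sunday 2, Tuesday 1

# The day(s) with the highest count
names(day_counts)[day_counts == max(day_counts)]  # "Saturday"
```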
# 9 WHAT IS THE MOST FREQUENT ANSWER RECORDED IN THE SURVEY? USE THE GETMODE FUNCTION TO COMPUTE RESULTS
getmode(game_day)
[1] "Saturday"
