In this Case Study, you will refresh your memory of STOR 155 while you learn some basic commands and tools for analyzing data with R. We’ll be looking at some data from college basketball games last year.
Run the following R code to load the data into your RStudio and take a look at it.
# Load dataset
bball = read.csv("http://kbodwin.web.unc.edu/files/2016/06/basketball.csv")
# Look at dataset
head(bball)
## X Date Team Team.Location Team.Score Opponent
## 1 1 11/13/15 Old Dominion Neutral 67 Niagara
## 2 2 11/13/15 Niagara Neutral 50 Old Dominion
## 3 3 11/13/15 Sacred Heart Neutral 76 Quinnipiac
## 4 4 11/13/15 Quinnipiac Neutral 64 Sacred Heart
## 5 5 11/13/15 Texas Neutral 71 Washington
## 6 6 11/13/15 Wright State Neutral 77 South Dakota
## Opponent.Score Team.Result
## 1 50 Win
## 2 67 Loss
## 3 64 Win
## 4 76 Loss
## 5 77 Loss
## 6 69 Win
summary(bball)
## X Date Team
## Min. : 1 11/13/15: 310 Akron : 35
## 1st Qu.: 2914 2/6/16 : 306 Austin Peay : 35
## Median : 5827 1/9/16 : 300 Bowling Green State: 35
## Mean : 5827 1/16/16 : 288 Buffalo : 35
## 3rd Qu.: 8740 1/30/16 : 286 Fresno State : 35
## Max. :11653 2/13/16 : 286 Green Bay : 35
## (Other) :9877 (Other) :11443
## Team.Location Team.Score Opponent
## Away :5240 Min. : 25.00 Akron : 35
## Home :5251 1st Qu.: 64.00 Austin Peay : 35
## Neutral:1162 Median : 72.00 Bowling Green State: 35
## Mean : 72.54 Buffalo : 35
## 3rd Qu.: 81.00 Fresno State : 35
## Max. :144.00 Green Bay : 35
## (Other) :11443
## Opponent.Score Team.Result
## Min. : 25.00 Loss:5821
## 1st Qu.: 64.00 Win :5832
## Median : 72.00
## Mean : 72.51
## 3rd Qu.: 81.00
## Max. :144.00
##
The command read.csv( ) will read a dataset into R from your computer or from online. “csv” stands for “comma-separated values”, a common file type where the data are listed in a text file, with variables separated by commas. For now, you don’t need to worry about the details of read.csv( ). Once you have loaded the data, the command summary( ) will tell you about the variables in the dataset and their values. Another useful function is head( ), which shows you the first 6 rows of the dataset.
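As a small sketch of what read.csv( ) does, the example below reads a tiny made-up CSV from a text string instead of a file. The object names csv_text and toy and the three rows are hypothetical, loosely modeled on the output above. Note that in R 4.0 and later, text columns are read as plain strings unless you ask for stringsAsFactors = TRUE, so summary( ) may not show category counts like those above by default.

```r
# A tiny made-up CSV, supplied as text instead of a file or URL
csv_text <- "Team,Team.Score,Team.Result
Old Dominion,67,Win
Niagara,50,Loss
Texas,71,Loss"

# stringsAsFactors = TRUE stores text columns as categorical variables (factors)
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

head(toy)     # first rows (here, all 3)
summary(toy)  # counts for factors; quartiles and mean for numbers
```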
Look at the outputs of summary(bball) and head(bball), and describe the variables using vocabulary from STOR 155.
One label, X. Categorical variables: Date, Team, Team.Location, Opponent, Team.Result. Quantitative variables: Team.Score and Opponent.Score.
If head( ) shows the first 6 rows of the dataset, what command do you think might show the last 6 rows? Try out your proposed function and see what happens.
tail(bball)
Try the commands ncol( ), nrow( ), and dim( ). What do these do? How could you get the same information from head( ), summary( ), and/or the command you figured out in part (b)?
ncol() and nrow() count the number of columns or rows in a dataset; dim() gives both dimensions. You can also figure out the number of columns from how many variables there are in head() or summary(). You can find the number of rows by checking the label of the last row in tail().
Sometimes, we will want to look at individual entries, rows, or columns of our data matrix. We can do this using brackets [ ] after our dataset. We can also look at variables (columns) by name using the $ symbol. Try the following examples.
# Look at a single row
bball[123, ]
## X Date Team Team.Location Team.Score Opponent
## 123 123 11/22/15 Old Dominion Neutral 64 Saint Joseph's
## Opponent.Score Team.Result
## 123 66 Loss
# Look at a single column
head(bball[ , 4])
## [1] Neutral Neutral Neutral Neutral Neutral Neutral
## Levels: Away Home Neutral
head(bball$Team.Score)
## [1] 67 50 76 64 71 77
# Look at a single entry
bball[123, 4]
## [1] Neutral
## Levels: Away Home Neutral
bball$Team.Score[123]
## [1] 64
# Calculate mean, median, variance, and standard deviation
mean(bball$Team.Score)
## [1] 72.53832
median(bball$Team.Score)
## [1] 72
var(bball$Team.Score)
## [1] 171.8493
sd(bball$Team.Score)
## [1] 13.10913
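The bracket and $ forms can be compared side by side on a small made-up data frame. This is only a minimal sketch: the name toy and all of its values are hypothetical.

```r
# A small made-up data frame (hypothetical values)
toy <- data.frame(
  Team = c("A", "B", "C", "D"),
  Team.Score = c(67, 50, 76, 64)
)

dim(toy)              # rows and columns together: 4 2
nrow(toy)             # 4
ncol(toy)             # 2

toy[2, ]              # second row, all columns
toy[, 2]              # second column, by position
toy$Team.Score        # the same column, by name
toy[2, "Team.Score"]  # a single entry: 50

mean(toy$Team.Score)  # 64.25
```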
What is the difference between mean(bball$Team.Score) and mean(bball[,5])? Why might it be useful to have two ways to access the variable Team.Score?
The first one accesses the variable by name, the other one by column number. Access by number is useful because sometimes you don't know the column names, or the data has none. You can also do things like `bball[, 1:3]` to get a range of columns.
In plain English, what were the events of the game represented by the first row of the dataset?
Old Dominion played against Niagara at a neutral location. Old Dominion won 67-50.
(Note: If you don’t know much about basketball - for example, if you don’t know what it means to play a game “Home” versus “Away” - ask people around you.)
All these commands we have been using, like summary( ) and mean( ), are called functions. A function can take all different kinds of input depending on what you are trying to do: datasets, vectors such as bball$Team.Score, etc. An important skill in R is figuring out for yourself how functions work.
For example, type ?boxplot into your R console. A help page will pop up telling you about this function. Notice that under Usage, it says boxplot(x, ...). This tells you that you need to supply something called x to the function, and the rest of the input is optional. But what is x? Ah-ha! There is a section called Arguments, which tells us that x is the vector of values you want to put in a boxplot.
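The same pattern of one required input plus optional named inputs shows up in most functions. Here is a minimal sketch using mean( ), whose help page lists an optional trim input; the vector x is made up for illustration.

```r
x <- c(1, 2, 3, 100)      # made-up values with one outlier

args(mean.default)        # lists the inputs: x, trim, na.rm, ...

mean(x)                   # only the required input: 26.5
mean(x, trim = 0.25)      # optional input given by name: 2.5
```

With trim = 0.25, the smallest and largest quarter of the values are dropped before averaging, which is why the outlier no longer matters.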
Run the code below to make a boxplot of the team scores of college basketball games.
# make boxplot of team scores
boxplot(bball$Team.Score)
Check out ?hist, a function for making histograms. Below is basic code to make a histogram of Team.Scores, and also code for the same histogram but with a lot of the optional input changed. Mess around with these inputs until you understand what each is doing.
# Boring histogram
hist(bball$Team.Score)
# Fancy histogram
hist(bball$Team.Score, breaks = 5, main = "I am a title", xlab = "I am an x-axis label", col = "grey", freq = FALSE)
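One way to see what breaks and freq do is to ask hist( ) for its numbers instead of a picture. The sketch below uses simulated scores (made up, not the bball data) and the plot = FALSE input described in ?hist.

```r
set.seed(1)                              # made-up data, for reproducibility
scores <- rnorm(500, mean = 72, sd = 13)

# plot = FALSE returns the histogram's numbers instead of drawing it
h <- hist(scores, breaks = 5, plot = FALSE)

h$breaks     # bin edges; R treats breaks = 5 as a suggestion, not an exact count
h$counts     # bar heights when freq = TRUE
h$density    # bar heights when freq = FALSE

# On the density scale, the bar areas add up to 1
sum(h$density * diff(h$breaks))
```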
Explain in your own words what breaks and freq change about the histogram.
"breaks" changes how many bins the histogram uses. "freq = FALSE" changes the y-axis from raw counts to densities (this is almost always desirable).
main, xlab, ylab, and col are common to most plotting functions. Use what you learned in (a) to make a boxplot of Team.Scores with proper axis label(s) and title.
# make boxplot of team scores
boxplot(bball$Team.Score, main = "Team scores in college basketball games \n (2015-2016)", ylab ="Score", col = "light blue")
# Boring histogram
hist(bball$Team.Score, freq = FALSE)
# overlay Normal Curve
curve(dnorm(x, mean=120, sd=20),
add = TRUE, col = "blue", lwd = 2)
Explain the roles of the functions curve( ) and dnorm( ). Why did we put add = TRUE in the inputs?
dnorm() calculates the density equation, in terms of "x", of a normal distribution with a certain mean and s.d. curve() takes an equation and adds it as a line on a plot. "add = TRUE" tells R to add the curve on top of the current plot, instead of making a new one.
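To make the "density equation" concrete, the sketch below writes out the Normal density formula by hand and checks it against dnorm( ). The values of x0, mu, and sigma are made up.

```r
x0 <- 1; mu <- 0; sigma <- 2   # made-up values

# Normal density written out by hand
by_hand <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x0 - mu)^2 / (2 * sigma^2))

by_hand
dnorm(x0, mean = mu, sd = sigma)   # same value

# dnorm( ) accepts a whole vector of x values at once,
# which is what lets curve( ) evaluate it along the x-axis
dnorm(c(-1, 0, 1))
```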
# Boring histogram
hist(bball$Team.Score, freq = FALSE)
# overlay smoothed curve
lines(density(bball$Team.Score),
col = "red", lwd = 2, lty = 2)
What is the difference between lines( ) and curve( )? When might we want to use density( ), and when would it be better to overlay a Normal curve on a histogram?
(This is a tough question!) lines() always adds on top of an existing plot, and it can add a smooth line from discrete data points, while curve() needs its input to be an equation, not numbers. We might want to use density() when the distribution of the data is not known, to visualize its true density. However, if we know (or want to check) that the data is Normal, it makes more sense to force the overlay to be Normal.
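It may help to look at what density( ) actually returns: a set of (x, y) points rather than an equation, which is why lines( ) is the natural way to draw it. A minimal sketch on simulated scores (made up, not the bball data):

```r
set.seed(1)
scores <- rnorm(500, mean = 72, sd = 13)   # made-up data

d <- density(scores)

length(d$x)      # number of grid points (512 by default)
head(d$x)        # where the estimate is evaluated
head(d$y)        # estimated density at those points

# The points are equally spaced, so this sum approximates the area under the curve
sum(d$y) * diff(d$x)[1]    # close to 1
```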
# hist
hist(bball$Team.Score, breaks = 50, main = "Team scores in college basketball games \n (2015-2016)", xlab = "Scores", col = "grey", freq = FALSE)
# Find params
mu = mean(bball$Team.Score)
sigma = sd(bball$Team.Score)
# add curve
curve(dnorm(x, mean=mu, sd=sigma),
add = TRUE, col = "blue", lwd = 2)
Yep, looks pretty Normal.
One of the most powerful qualities of R is the ability to subset a dataset. Suppose we want to look only at games involving UNC or Duke. We would need to figure out which rows of bball involve one of those teams, and then make a new dataset out of only those rows.
For this, we will use booleans, which are variables with the value TRUE or FALSE. Play around with the following code until you feel comfortable with ==, >, <, and %in% as well as & (and) and | (or).
# booleans practice
1 == 1
## [1] TRUE
1 == 2
## [1] FALSE
1 < 2
## [1] TRUE
1 == 1 | 1 > 2
## [1] TRUE
1 == 1 & 1 > 2
## [1] FALSE
You can make up your own vector using the function c( ), which stands for “concatenate”. This is like making a new variable - the variable can contain anything you want, such as numbers, strings, booleans. Try the example below to make a vector and subset it. Note that we can use either <- or = to store information in a variable.
vec <- c("cat", "dog", "horned toad", "Her Majesty Queen Elizabeth", "dog")
vec
## [1] "cat" "dog"
## [3] "horned toad" "Her Majesty Queen Elizabeth"
## [5] "dog"
# Some more booleans
vec == "dog"
## [1] FALSE TRUE FALSE FALSE TRUE
"dog" == vec
## [1] FALSE TRUE FALSE FALSE TRUE
vec %in% c("dog", "cat")
## [1] TRUE TRUE FALSE FALSE TRUE
c("dog", "cat") %in% vec
## [1] TRUE TRUE
# Finding indices
which(vec == "dog")
## [1] 2 5
which(vec %in% c("dog", "cat"))
## [1] 1 2 5
which(c("dog", "cat") %in% vec)
## [1] 1 2
# Subsetting
new = vec[vec %in% c("dog", "cat")]
new
## [1] "cat" "dog" "dog"
vec = c(1, 2, 3, "4")
vec + 2
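In the vector "vec", the fourth entry is a string containing the character "4", not the actual number 4. Because a vector can hold only one type of value, c( ) converts every entry to a string, so we can't add 2 to it and R returns an error.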
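A quick way to check what happened is class( ). The sketch below also shows as.numeric( ), one way to turn the strings back into numbers; the name nums is made up.

```r
vec <- c(1, 2, 3, "4")
class(vec)                 # "character": every entry became a string
vec

# Converting back to numbers makes arithmetic work again
nums <- as.numeric(vec)
class(nums)                # "numeric"
nums + 2                   # 3 4 5 6
```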
vec = c(TRUE, FALSE, FALSE, TRUE)
vec + 2
## [1] 3 2 2 3
"TRUE" is considered to be 1, and "FALSE" is considered to be 0. This can be useful sometimes if you get fancy with subsetting.
Now apply these ideas to bball. Try running each of the following lines of code. None of them will make the dataset we want. What was the problem with each one?
# Make new dataset with only UNC or Duke games
#A
my_subset = bball[Team == "North Carolina" | Team == "Duke", ]
#B
my_subset = bball[bball$Team == "North Carolina", bball$Team == "Duke"]
#C
my_subset = bball[bball$Team = "North Carolina" | bball$Team = "Duke", ]
#D
my_subset = bball[bball$Team == "North Carolina" & bball$Team == "Duke", ]
#E
unc_games = which(bball$Team == "North Carolina")
my_subset = bball[unc_games | bball$Team == "Duke", ]
#F
my_subset = bball[bball$Team == "North Carolina" | bball$Team == "Duke"]
A: Need "bball$Team", not just "Team"
B: Misplaced comma
C: "==" rather than "="
D: "&" rather than "|"
E: "which" gives indices. It is okay for subsetting, but you can't mix and match with booleans.
F: No comma at the end.
my_subset = bball[bball$Team == "North Carolina" | bball$Team == "Duke", ]
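The same subsetting pattern can be tried on a small made-up data frame, where it is easy to check the result by eye. In this sketch, toy and its values are hypothetical; %in% and which( ) are shown as alternatives that select the same rows.

```r
toy <- data.frame(
  Team = c("North Carolina", "Duke", "Texas", "Duke"),
  Team.Score = c(80, 75, 71, 90)
)

# Logical vector with one TRUE/FALSE per row, placed before the comma
keep <- toy$Team == "North Carolina" | toy$Team == "Duke"
sub1 <- toy[keep, ]

# %in% gives the same rows with less typing
sub2 <- toy[toy$Team %in% c("North Carolina", "Duke"), ]

# which( ) gives row numbers instead, which also work on their own
sub3 <- toy[which(keep), ]

nrow(sub1)    # 3
```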
Alright, enough of that data wrangling. Time to do some statistics.
Check out ?Normal. These are some functions that will help us calculate probabilities about the Normal distribution. (No more using Table A!) The most important ones are pnorm and qnorm.
pnorm(q) will tell you the probability of a standard Normal being below the value q
qnorm(p) will tell you the z-score that has area p below it on a standard Normal curve
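A minimal sketch of how the two functions relate, using familiar values from Table A. The values of x0, mu, and sigma in the last part are made up.

```r
pnorm(1.96)               # area below z = 1.96, about 0.975
qnorm(0.975)              # z-score with area 0.975 below it, about 1.96

pnorm(1) - pnorm(-1)      # area within 1 s.d. of the mean, about 0.68

# With mean and sd supplied, pnorm( ) standardizes for you
x0 <- 85; mu <- 72; sigma <- 13     # made-up values
pnorm(x0, mean = mu, sd = sigma)
pnorm((x0 - mu) / sigma)            # same answer via the z-score
```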
Try the following code to practice with pnorm and qnorm.
# practice with Normal densities in R
#i
pnorm(0)
## [1] 0.5
qnorm(0)
## [1] -Inf
#ii
pnorm(100)
## [1] 1
qnorm(100)
## Warning in qnorm(100): NaNs produced
## [1] NaN
#iii
qnorm(pnorm(0))
## [1] 0
qnorm(pnorm(7))
## [1] 6.999994
#iv
pnorm(qnorm(0))
## [1] 0
pnorm(qnorm(0.5))
## [1] 0.5
#v
pnorm(0, sd = 10)
## [1] 0.5
pnorm(0, mean = 1, sd = 10)
## [1] 0.4601722
#vi
qnorm(0.05)
## [1] -1.644854
qnorm(0.05, sd = 10)
## [1] -16.44854
qnorm(0.05, mean = 1, sd = 10)
## [1] -15.44854
Why did you get an error in part (ii)?
The value given to qnorm() is a probability, so it needs to be between 0 and 1.
Now use this code to make a new variable for the total score of a game:
# Make new variable
bball$Total.Score = bball$Team.Score + bball$Opponent.Score
We will use z-scores and t-scores to think about whether a game is unusually high scoring.
bball actually displays each game twice: once for each team. Make a new dataset with each game listed only once by subsetting bball.
bball2 = bball[bball$Team.Result == "Win", ]
(Assume the mean and standard deviation of Total.Score are actually the population mean and standard deviation.)
# Find params
mu = mean(bball2$Total.Score)
sigma = sd(bball2$Total.Score)
# observed value
x = bball2[bball2$Team == "North Carolina" & bball2$Opponent == "Duke", "Total.Score"]
# area of normal dist ABOVE x
1 - pnorm(x, mean = mu, sd = sigma)
## [1] 0.4411833
The functions sum( ) and length( ) will help you answer this question.
sum(bball2$Total.Score > x)/nrow(bball2)
## [1] 0.412037
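The two calculations above, one from a Normal model and one from counting, can be compared on simulated data. This is only a sketch: totals and x0 are made up, and because TRUE counts as 1, sum( ) of a comparison gives the number of values that satisfy it.

```r
set.seed(1)
totals <- rnorm(2000, mean = 145, sd = 20)   # made-up total scores
x0 <- 148                                    # made-up observed value

# Proportion above x0 according to a Normal model
model_prop <- 1 - pnorm(x0, mean = mean(totals), sd = sd(totals))

# Proportion above x0 actually observed in the data
empirical_prop <- sum(totals > x0) / length(totals)

model_prop
empirical_prop
```

When the data really are close to Normal, the two proportions should be close to each other; a large gap is a hint that the Normal model fits poorly.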
Recall that t-scores are used instead of z-scores when the population standard deviation is unknown. The functions pt and qt work almost the same way as pnorm and qnorm, but for the t-distribution instead of the Normal. However, be careful, and read ?pt for help! These functions don’t let you enter the mean and standard deviation as input - you need to figure out what to do about that!
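One way to handle the missing mean and standard deviation inputs is to standardize by hand before calling pt( ). The sketch below uses made-up values for x0, x.bar, s, and n; it also shows the lower.tail input from ?pt and how the t-distribution approaches the Normal as the degrees of freedom grow.

```r
x0 <- 160; x.bar <- 150; s <- 20; n <- 30   # made-up values

# pt( ) has no mean or sd inputs, so standardize first
t.val <- (x0 - x.bar) / s                   # 0.5

1 - pt(t.val, df = n - 1)                   # area above t.val
pt(t.val, df = n - 1, lower.tail = FALSE)   # same area, requested directly

# With many degrees of freedom the t-distribution is close to the Normal
pt(t.val, df = 1e6)
pnorm(t.val)
```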
Use all your new R skills to answer this question: Was the March 5th game between UNC and Duke particularly high scoring for a UNC game? (Do not assume that population values for mean and s.d. are known.)
# make subset
unc_games = bball[bball$Team == "North Carolina",]
# estimates of params
x.bar = mean(unc_games$Total.Score)
sx = sd(unc_games$Total.Score)
n = nrow(unc_games)
# how many games more high scoring, by t-dist
t.val = (x - x.bar)/sx
dof = n - 1
1 - pt(t.val, df = dof)
## [1] 0.5964984
Not high scoring: by the t-distribution, about 60% of UNC games would be expected to have a higher total score.