In this Case Study, you will refresh your memory of STOR 155 while you learn some basic commands and tools for analyzing data with R. We’ll be looking at some data from college basketball games last year.

Run the following R code to load the data into RStudio and take a look at it.

Summarizing data

# Load dataset
bball = read.csv("http://kbodwin.web.unc.edu/files/2016/06/basketball.csv")

# Look at dataset
head(bball)
##   X     Date         Team Team.Location Team.Score     Opponent
## 1 1 11/13/15 Old Dominion       Neutral         67      Niagara
## 2 2 11/13/15      Niagara       Neutral         50 Old Dominion
## 3 3 11/13/15 Sacred Heart       Neutral         76   Quinnipiac
## 4 4 11/13/15   Quinnipiac       Neutral         64 Sacred Heart
## 5 5 11/13/15        Texas       Neutral         71   Washington
## 6 6 11/13/15 Wright State       Neutral         77 South Dakota
##   Opponent.Score Team.Result
## 1             50         Win
## 2             67        Loss
## 3             64         Win
## 4             76        Loss
## 5             77        Loss
## 6             69         Win
summary(bball)
##        X               Date                       Team      
##  Min.   :    1   11/13/15: 310   Akron              :   35  
##  1st Qu.: 2914   2/6/16  : 306   Austin Peay        :   35  
##  Median : 5827   1/9/16  : 300   Bowling Green State:   35  
##  Mean   : 5827   1/16/16 : 288   Buffalo            :   35  
##  3rd Qu.: 8740   1/30/16 : 286   Fresno State       :   35  
##  Max.   :11653   2/13/16 : 286   Green Bay          :   35  
##                  (Other) :9877   (Other)            :11443  
##  Team.Location    Team.Score                    Opponent    
##  Away   :5240   Min.   : 25.00   Akron              :   35  
##  Home   :5251   1st Qu.: 64.00   Austin Peay        :   35  
##  Neutral:1162   Median : 72.00   Bowling Green State:   35  
##                 Mean   : 72.54   Buffalo            :   35  
##                 3rd Qu.: 81.00   Fresno State       :   35  
##                 Max.   :144.00   Green Bay          :   35  
##                                  (Other)            :11443  
##  Opponent.Score   Team.Result
##  Min.   : 25.00   Loss:5821  
##  1st Qu.: 64.00   Win :5832  
##  Median : 72.00              
##  Mean   : 72.51              
##  3rd Qu.: 81.00              
##  Max.   :144.00              
## 

The command read.csv( ) reads a dataset into R, either from a file on your computer or from a URL. “csv” stands for “comma separated values”, a common file type in which the data are stored as plain text with the variables separated by commas. For now, you don’t need to worry about the details of read.csv( ). Once you have loaded the data, the command summary( ) will tell you about the variables in the dataset and their values. Another useful function is head( ), which shows you the first 6 rows of the dataset.

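For example, if you had downloaded a copy of the file to your own computer, you could read it from a local path instead (the file name below is hypothetical - use whatever name you saved the file under). The str( ) function is another quick way to inspect a dataset, one line per variable.

# read a local copy of the data (hypothetical file name)
bball = read.csv("basketball.csv")

# one-line-per-variable overview of the dataset
str(bball)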

Question 1:

  1. Look at the outputs of summary(bball) and head(bball), and describe the variables using vocabulary from STOR 155.

    X is just a row label.  Categorical variables:  Date, Team, Team.Location, Opponent, Team.Result.  Quantitative variables: Team.Score and Opponent.Score.
  2. If head( ) shows the first 6 rows of the dataset, what command do you think might show the last 6 rows? Try out your proposed function and see what happens.

    tail(bball)
  3. Try the commands ncol( ), nrow( ), and dim( ). What do these do? How could you get the same information from head( ), summary( ), and/or the command you figured out in part (b)?

    ncol() and nrow() count the number of columns or rows in a dataset; dim() gives both dimensions at once.  You can also get the number of columns by counting the variables shown in head() or summary(), and the number of rows from the row number of the last row shown by tail().  (See the quick check below.)

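A quick check of those commands (the row and column counts can be read off the summary output above):

# dimensions of the dataset
nrow(bball)    # 11653 games
ncol(bball)    # 8 variables
dim(bball)     # both at once
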
Sometimes, we will want to look at individual entries, rows, or columns of our data matrix. We can do this using brackets [ ] after the name of our dataset. We can also look at a variable (column) by name using the $ symbol. Try the following examples.

# Look at a single row
bball[123, ]
##       X     Date         Team Team.Location Team.Score       Opponent
## 123 123 11/22/15 Old Dominion       Neutral         64 Saint Joseph's
##     Opponent.Score Team.Result
## 123             66        Loss
# Look at a single column
head(bball[ , 4])
## [1] Neutral Neutral Neutral Neutral Neutral Neutral
## Levels: Away Home Neutral
head(bball$Team.Score)
## [1] 67 50 76 64 71 77
# Look at a single entry
bball[123, 4]
## [1] Neutral
## Levels: Away Home Neutral
bball$Team.Score[123]
## [1] 64
# Calculate mean, median, variance, and standard deviation
mean(bball$Team.Score)
## [1] 72.53832
median(bball$Team.Score)
## [1] 72
var(bball$Team.Score)
## [1] 171.8493
sd(bball$Team.Score)
## [1] 13.10913

Question 2:

  1. What is the difference between mean(bball$Team.Score) and mean(bball[, 5])? Why might it be useful to have two ways to access the variable Team.Score?

    The first accesses the column by name, the second by its position (Team.Score is the 5th column).  Access by position is handy when you don't know the column names (or the data has none), and it lets you grab a range such as `bball[, 1:3]`; access by name is clearer and still works if the columns get reordered.  (See the short sketch after this question.)
  2. In plain English, what were the events of the game represented by the first row of the dataset?

    Old Dominion played against Niagara at a neutral location.  Old Dominion won 67-50.

    (Note: If you don’t know much about basketball - for example, if you don’t know what it means to play a game “Home” versus “Away” - ask people around you.)
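A minimal sketch of the two access styles from part 1 (output omitted):

# same column, two ways
head(bball$Team.Score)    # by name
head(bball[, 5])          # by position

# position-based indexing also lets you grab several columns at once
head(bball[, 1:3])        # the first three columns: X, Date, Team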


All these commands we have been using, like summary( ) and mean( ), are called functions. A function can take many different kinds of input depending on what you are trying to do: datasets, vectors such as bball$Team.Score, and so on. An important skill in R is figuring out for yourself how functions work.

For example, type ?boxplot into your R console. A help page will pop up telling you about this function. Notice that under Usage, it says boxplot(x, ...). This tells you that you need to supply something called x to the function, and the rest of the input is optional. But what is x? Ah-ha! There is a section called Arguments, which tells us that x is the vector of values you want to put in a boxplot.

Run the code below to make a boxplot of the team scores of college basketball games.

# make boxplot of team scores
boxplot(bball$Team.Score)


Question 3:

  1. Now check out ?hist, a function for making histograms. Below is basic code to make a histogram of Team.Score, and also code for the same histogram but with a lot of the optional inputs changed. Mess around with these inputs until you understand what each one is doing.
# Boring histogram
hist(bball$Team.Score)

# Fancy histogram
hist(bball$Team.Score, breaks = 5, main = "I am a title", xlab = "I am an x-axis label", col = "grey", freq = FALSE)

Explain in your own words what breaks and freq change about the histogram.

"breaks" changes how many bins the histogram uses.  "freq = FALSE" changes the y-axis from raw counts to densities (this is almost always desirable).
  2. The optional inputs main, xlab, ylab, and col are common to most plotting functions. Use what you learned in (a) to make a boxplot of Team.Score with proper axis label(s) and title.
# make boxplot of team scores
boxplot(bball$Team.Score, main = "Team scores in college basketball games \n (2015-2016)", ylab ="Score", col = "light blue")

  3. To check whether the data look Normal, or to help visualize the shape of the distribution, we might want to overlay a Normal curve on top of the histogram. The code below will do so - but the curve doesn’t fit very well.
# Boring histogram
hist(bball$Team.Score, freq = FALSE)

# overlay Normal Curve

curve(dnorm(x, mean=120, sd=20), 
      add = TRUE, col = "blue", lwd = 2)

Explain the roles of the functions curve( ) and dnorm( ). Why did we put add = TRUE in the inputs?

dnorm() computes the density function, in terms of "x", of a Normal distribution with the given mean and s.d.  curve() takes an expression in x and draws it as a curve on a plot.  "add = TRUE" tells R to draw the curve on top of the current plot instead of starting a new one.
  4. Alternatively, we can overlay a line that is a “smoothed” version of the histogram of the data, as follows:
# Boring histogram
hist(bball$Team.Score, freq = FALSE)

# overlay smoothed curve
lines(density(bball$Team.Score),
      col = "red", lwd = 2, lty = 2)

What is the difference between lines( ) and curve( )? When might we want to use density( ), and when would it be better to overlay a Normal curve on a histogram?

(This is a tough question!)  lines() always adds to an existing plot, and it can draw a line through a set of points (discrete data), while curve() needs its input to be an expression in x, not numbers.  density() is useful when we don't want to assume a particular distribution and just want to see the shape of the data.  However, if we know (or want to check) that the data are Normal, it makes more sense to force the overlay to be a Normal curve.
  5. Now make your own histogram with well-chosen inputs and with a Normal overlay that fits better. Would you say the data looks Normal?
# hist
hist(bball$Team.Score, breaks = 50, main = "Team scores in college basketball games \n (2015-2016)", xlab = "Scores", col = "grey", freq = FALSE)

# Find params
mu = mean(bball$Team.Score)
sigma = sd(bball$Team.Score)

# add curve
curve(dnorm(x, mean=mu, sd=sigma), 
      add = TRUE, col = "blue", lwd = 2)

Yep, looks pretty Normal.

Subsetting

One of the most powerful qualities of R is the ability to take a subset of a dataset. Suppose we want to look only at games involving UNC or Duke. We would need to figure out which rows of bball involve one of those teams, and then make a new dataset out of only those rows.

For this, we will use booleans, which are variables with the value TRUE or FALSE. Play around with the following code until you feel comfortable with ==, >, <, and %in% as well as & (and) and | (or).

# booleans practice

1 == 1
## [1] TRUE
1 == 2
## [1] FALSE
1 < 2
## [1] TRUE
1 == 1 | 1  > 2
## [1] TRUE
1 == 1 & 1 > 2
## [1] FALSE
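
The practice block above covers ==, <, |, and &. Here is a minimal sketch of %in%, which asks, for each value on the left, whether it appears anywhere in the vector on the right (it shows up again with the vec examples below):

# %in% practice
1 %in% c(1, 2, 3)
## [1] TRUE
3 %in% c(1, 2)
## [1] FALSE
c(1, 5) %in% c(1, 2, 3)
## [1]  TRUE FALSE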

You can make up your own vector using the function c( ), which stands for “combine” (you will also hear “concatenate”). This is like making a new variable - it can contain anything you want: numbers, strings, booleans, and so on (though, as you will see in Question 4, everything in a single vector must end up being the same type). Try the example below to make a vector and subset it. Note that we can use either <- or = to store information in a variable.

vec <- c("cat", "dog", "horned toad", "Her Majesty Queen Elizabeth", "dog")
vec
## [1] "cat"                         "dog"                        
## [3] "horned toad"                 "Her Majesty Queen Elizabeth"
## [5] "dog"
# Some more booleans
vec == "dog"
## [1] FALSE  TRUE FALSE FALSE  TRUE
"dog" == vec
## [1] FALSE  TRUE FALSE FALSE  TRUE
vec %in% c("dog", "cat")
## [1]  TRUE  TRUE FALSE FALSE  TRUE
c("dog", "cat") %in% vec
## [1] TRUE TRUE
# Finding indices

which(vec == "dog")
## [1] 2 5
which(vec %in% c("dog", "cat"))
## [1] 1 2 5
which(c("dog", "cat") %in% vec)
## [1] 1 2
# Subsetting
new = vec[vec %in% c("dog", "cat")]
new
## [1] "cat" "dog" "dog"

Question 4:

  1. The following code will give you an error. What happened?
vec = c(1, 2, 3, "4")
vec + 2
In the vector "vec", the fourth entry is string containing the character "4", not the actual number 4.  Thus, we can't add 2 to it.
  2. The following code will NOT give you an error. What is going on here?
vec = c(TRUE, FALSE, FALSE, TRUE)
vec + 2
## [1] 3 2 2 3
"TRUE" is considered to be 1, and "FALSE" is considered to be 0.  This can be useful sometimes if you get fancy with subsetting.
  3. Now we are ready to make a new dataset. We’ll get a vector of booleans telling us where UNC’s or Duke’s games are, and use it to subset the dataset bball.

Try running each of the following lines of code. None of them will make the dataset we want. What was the problem with each one?

# Make new dataset with only UNC or Duke games


#A
my_subset = bball[Team == "North Carolina" | Team == "Duke", ]

#B
my_subset = bball[bball$Team == "North Carolina", bball$Team == "Duke"]

#C
my_subset = bball[bball$Team = "North Carolina" | bball$Team = "Duke", ]

#D
my_subset = bball[bball$Team == "North Carolina" & bball$Team == "Duke", ]

#E
unc_games = which(bball$Team == "North Carolina")
my_subset = bball[unc_games | bball$Team == "Duke", ]

#F
my_subset = bball[bball$Team == "North Carolina" | bball$Team == "Duke"]
A: Needs "bball$Team", not just "Team" - R doesn't know to look inside bball for the variable.
B: The comma is in the wrong place, so the second condition is used to select columns instead of being part of the row condition.
C: Comparisons need "=="; a single "=" means assignment.
D: Should be "|" (or), not "&" (and) - no row can have Team equal to both schools, so this selects nothing.
E: "which" gives row indices (numbers).  Those are fine for subsetting on their own, but you can't mix and match them with booleans using |.
F: Missing the comma before the closing bracket, so R tries to use the condition to pick columns rather than rows.
  4. Now write your own code to make the correct dataset.
my_subset = bball[bball$Team == "North Carolina" | bball$Team == "Duke", ]
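An equivalent, slightly shorter condition uses %in%, and a couple of quick checks can confirm the subset looks right (a sketch; output omitted):

# an equivalent subset using %in%
my_subset = bball[bball$Team %in% c("North Carolina", "Duke"), ]

# quick checks on the result
nrow(my_subset)
table(as.character(my_subset$Team))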

Z-Scores and t-scores

Alright, enough of that data wrangling. Time to do some statistics.

Check out ?Normal. These are some functions that will help us calculate probabilities about the Normal distribution. (No more using Table A!) The most important ones are pnorm and qnorm.

pnorm(q) will tell you the probability that a standard Normal is below the value q.

qnorm(p) will tell you the z-score that has area p below it on a standard Normal curve.

Both functions also accept optional mean and sd inputs if you want a Normal other than the standard one.
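For example (these are standard facts about the Normal distribution, not numbers from the dataset):

# about 97.5% of a standard Normal curve lies below 1.96
pnorm(1.96)
## [1] 0.9750021

# the z-score with area 0.975 below it
qnorm(0.975)
## [1] 1.959964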


Question 5

  1. For each of the following lines of code, think about what the result will be before running the code. Draw a picture for each one to visualize what is going on with pnorm and qnorm.
# practice with Normal densities in R

#i
pnorm(0)
## [1] 0.5
qnorm(0)
## [1] -Inf
#ii
pnorm(100)
## [1] 1
qnorm(100)
## Warning in qnorm(100): NaNs produced
## [1] NaN
#iii
qnorm(pnorm(0))
## [1] 0
qnorm(pnorm(7))
## [1] 6.999994
#iv
pnorm(qnorm(0))
## [1] 0
pnorm(qnorm(0.5))
## [1] 0.5
#v
pnorm(0, sd = 10)
## [1] 0.5
pnorm(0, mean = 1, sd = 10)
## [1] 0.4601722
#vi
qnorm(0.05)
## [1] -1.644854
qnorm(0.05, sd = 10)
## [1] -16.44854
qnorm(0.05, mean = 1, sd = 10)
## [1] -15.44854
  2. Why did you get a warning (and NaN) in part (ii)?

    The input to qnorm( ) is a probability, so it must be between 0 and 1; 100 is not a valid probability, so R returns NaN.

Now use this code to make a new variable for the total score of a game:

# Make new variable
bball$Total.Score = bball$Team.Score + bball$Opponent.Score
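
A quick check that the new column looks right (a sketch; output omitted):

# the new column should equal the sum of the two score columns
head(bball[, c("Team.Score", "Opponent.Score", "Total.Score")])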

We will use z-scores and t-scores to think about whether a game is unusually high scoring.


Question 6:

  1. As you may have noticed, the dataset bball actually displays each game twice: once for each team. Make a new dataset with each game listed only once by subsetting bball.
bball2 = bball[bball$Team.Result == "Win", ]
  2. On March 5, 2016, UNC beat Duke. Using the Normal distribution, what percent of games have higher scores than this UNC/Duke game? (Assume that the mean and standard deviation of Total.Score are actually the population mean and standard deviation.)
# Find params
mu = mean(bball2$Total.Score)
sigma = sd(bball2$Total.Score)

# observed value
x = bball2[bball2$Team == "North Carolina" & bball2$Opponent == "Duke", "Total.Score"]

# area of normal dist ABOVE x
1 - pnorm(x, mean = mu, sd = sigma)
## [1] 0.4411833
  3. What percentage of games in the dataset did we observe to be higher scoring than the UNC/Duke game? The functions sum( ) and length( ) will help you answer this question.
sum(bball2$Total.Score > x)/nrow(bball2)
## [1] 0.412037
  4. What is the difference between what we did in (b) and (c)? Do you think the Normal approximation is reasonable for this data? Why or why not?

In (b), we assumed the data was continuous and approximated its density with a Normal distribution, then calculated the area under that curve above "x".  In (c), we treated the data as discrete and used the observed proportion of games scoring above "x".  The two answers (about 44% and 41%) are close, so the Normal approximation still seems reasonable.


Recall that t-scores are used instead of z-scores when the population standard deviation is unknown. The functions pt and qt work almost the same way as pnorm and qnorm, but for the t-distribution instead of the Normal. However, be careful, and read ?pt for help! These functions don’t let you enter the mean and standard deviation as input - you need to figure out what to do about that!
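
A minimal sketch of the workaround: standardize by hand, then give pt a t-score. All the numbers below are made up, purely for illustration.

# hypothetical values, for illustration only
obs = 180          # a single observed value
m = 150            # sample mean
s = 20             # sample standard deviation
n = 25             # sample size

# standardize first, then hand the t-score to pt()
t.val = (obs - m)/s
pt(t.val, df = n - 1)      # area below t.val on a t curve with n - 1 degrees of freedom

# qt() works the same way: it returns a t-score, not a raw value
qt(0.975, df = n - 1)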


Question 7:

Use all your new R skills to answer this question: Was the March 5th game between UNC and Duke particularly high scoring for a UNC game? (Do not assume that population values for mean and s.d. are known.)

# make subset
unc_games = bball[bball$Team == "North Carolina",]

# estimates of params
x.bar = mean(unc_games$Total.Score)
sx = sd(unc_games$Total.Score)
n = nrow(unc_games)

# how many games would be more high scoring, according to the t-distribution
# (x is the UNC/Duke total score computed back in Question 6)
t.val = (x - x.bar)/sx
dof = n - 1

1 - pt(t.val, df = dof)
## [1] 0.5964984
Not particularly high scoring - according to the t-distribution, about 60% of UNC games would be expected to score higher than this one.