The gpa1 dataset in the wooldridge package has data on collegiate academic performance, including ACT scores. I was interested in finding the probability and percentile of each ACT score, then plotting them as a discrete probability distribution.

However, this is actually not a straightforward thing to do in R, at least with the hist() function and a cursory Google search. By default, hist() creates arbitrary bins that span multiple levels, even when the input variable is discrete with relatively few levels. So multiple ACT scores get binned into the same bar as an aggregated probability, which isn't very meaningful.

In this tutorial, we will go over how to manually set the breaks in hist() so that it produces a probability and a bar for each individual level of a discrete variable. While we're at it, we will use this example to cover some other useful operations for data analysis in R, including: extracting an element of a list into a vector, creating a new variable of cumulative probabilities, recoding probabilities as percentages and percentiles, writing a function, and merging data sets together.

It is also done entirely in Base R. A lot of resources today make use of the Tidyverse, which is touted as being easier to learn than Base R. But I think the mass adoption of the Tidyverse is an enormous mistake. While the Tidyverse is easier to learn, that ease comes at the cost of reduced flexibility. Its functions are black boxes that hide a lot of the programming going on under the hood, so when your code doesn't work (which becomes more likely as your data manipulations get more complex and unusual) it's harder to understand why. Base R has a steep learning curve, but once you get it, it is incredibly flexible. There is a clear logic to how Base R functions process different data types, so it is relatively easy to troubleshoot issues. I feel this is true even for Base R graphics. ggplot2's only real advantage in this regard is that basic graphs look a lot better out of the box, but customized graph building is WAY better with Base R.

Flexibility is what made R great in the first place, so I have a hard time seeing the point of Tidyverse R compared to some other stripped-down statistical application. I will say that sources like R for Data Science provide a nice overview of practical data analysis topics with the Tidyverse, which is good for getting going. It would not be much harder to learn from an analogous text in Base R, but unfortunately that book doesn't exist, or at least I am not aware of it. In my opinion, treat the Tidyverse as training wheels to get the job done while you figure out how to use Base R like a real programmer.

library(wooldridge)
## Warning: package 'wooldridge' was built under R version 4.1.1
dat<-gpa1
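
To see the problem described above, you can run hist() with its default breaks and inspect the bins it picks. This is just a quick check, not part of the main workflow, and the exact breaks hist() chooses may differ by R version:

h_default<-hist(dat$ACT, main="Default Binning of ACT")
h_default$breaks # the default bins span more than one ACT score, so scores get lumped together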

The first thing we are going to do is manually set the size of the bins by setting the breaks parameter to a vector of cutoff points. We also set probability=T, which puts the probability of each bin on the y axis rather than the frequency.

A somewhat tricky thing to keep in mind when setting up the intervals is how hist() assigns a value that sits exactly on a break point. By default (right = TRUE), the bins are closed on the right, so a bin like (16, 17] contains the 17s but not the 16s, while the 16s fall into (15, 16]. In other words, each score lands in the bin whose right endpoint equals it, which is why the breaks below start at 15: the lowest observed score, 16, then gets a bin of its own.
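
Here is a quick check of that behavior on a tiny made-up vector (not the ACT data):

hist(c(17, 17, 18), breaks=c(16, 17, 18), plot=FALSE)$counts
## [1] 2 1

Both 17s land in the (16, 17] bin and the single 18 lands in (17, 18].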

b<-c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33) # cutoff points: one bin per ACT score from 16 to 33


h<-hist(dat$ACT, probability=T, breaks=b, xlab="ACT", main = "ACT Histogram")

As you can see, each bar represents a single ACT score, and there are spaces representing ACT scores with no observations.

Now we are going to move on to building a data frame of ACT scores and their associated probabilities.

One of the things that makes R so versatile is that the "meta" output of Base R functions is stored as a list of vectors that can be extracted for other purposes. You define the output of a function as an object in the local environment, and from there you can inspect its elements and pull them out. In this case, we can save the output of hist() as an object and extract the density vector, which is a vector of probabilities.

h
## $breaks
##  [1] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 
## $counts
##  [1]  2  0  0  6  2 13 13 23 25 11 18 13  6  4  3  1  0  1
## 
## $density
##  [1] 0.014184397 0.000000000 0.000000000 0.042553191 0.014184397 0.092198582
##  [7] 0.092198582 0.163120567 0.177304965 0.078014184 0.127659574 0.092198582
## [13] 0.042553191 0.028368794 0.021276596 0.007092199 0.000000000 0.007092199
## 
## $mids
##  [1] 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
## [16] 30.5 31.5 32.5
## 
## $xname
## [1] "dat$ACT"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

What we have here is the list of vectors containing the meta-information of a histogram, saved as an object in the local environment. $density is the vector of probabilities associated with each bin. Because we set up the bins to represent individual levels, these probabilities correspond to individual ACT scores.
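
As an aside, any of these elements can be pulled out of h as a plain vector with $ or [[ ]]:

probs<-h$density # the probability of each bin, as a numeric vector
identical(probs, h[["density"]]) # [[ ]] is an equivalent way to extract a list element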

We will now create a data frame of these probabilities. We will also create a vector of cumulative probabilities, which we will use to create a variable representing percentile rank.

dat2<-as.data.frame(h$density)
names(dat2)<-"probability" #equivalent to names(dat2)[1]<-"probability"


dat2.1<-as.data.frame(c(16:33)) # the 18 bins correspond to ACT scores 16 through 33
names(dat2.1)<-"ACT"

dat2<-cbind(dat2, dat2.1)
rm(dat2.1)

dat2$probability<-round(dat2$probability, 3)

dat2$cumulative_probability<-round(cumsum(dat2$probability), 3)

head(dat2,5)
##   probability ACT cumulative_probability
## 1       0.014  16                  0.014
## 2       0.000  17                  0.014
## 3       0.000  18                  0.014
## 4       0.043  19                  0.057
## 5       0.014  20                  0.071

Now that we have variables for probability and cumulative probability, we will transform them into percentages and percentiles, which are easier to interpret.


dat2$percentage<-paste(dat2$probability*100, "%", sep="")


dat2$percentile<-paste(dat2$cumulative_probability*100, "th", sep=" ")


dat2<-dat2[,c(2,1,3,4,5)]

Now let’s build a function that will yield the probability and percentile associated with a particular ACT score.

ACT_Prob<-function(score){
  dat2[dat2$ACT==score, c("percentage", "percentile")]}


ACT_Prob(27)
##    percentage percentile
## 12       9.2%    89.3 th

For the hell of it, let's practice merging the data frame we created with the original data frame. Essentially, you add variables from one data frame to another by matching observations between the two data frames on a key variable contained in both. In this case, ACT is the key variable. We are going to add the probability variables from dat2 to dat, saving the result as dat1. This is a case of a "one to many" merge, because each ACT score in dat2 matches many rows of dat.

This webpage gives a nice, simple overview of how to do the different types of merging.
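
As a quick reference (a sketch using the same data frames and key; only the left merge is actually used below), the all arguments to merge() control which unmatched rows are kept:

inner<-merge(dat, dat2, by="ACT")              # keep only ACT values present in both
left<-merge(dat, dat2, by="ACT", all.x=TRUE)   # keep every row of dat
right<-merge(dat, dat2, by="ACT", all.y=TRUE)  # keep every row of dat2
full<-merge(dat, dat2, by="ACT", all=TRUE)     # keep unmatched rows from both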

dat1<-merge(dat, dat2, by="ACT", all.x = T)

dat1<-dat1[,c(1,30:33,2:29)] # move the new probability variables next to ACT

head(dat1[,1:8], 10)
##    ACT probability cumulative_probability percentage percentile age soph junior
## 1   16       0.014                  0.014       1.4%     1.4 th  21    0      0
## 2   16       0.014                  0.014       1.4%     1.4 th  22    0      0
## 3   19       0.043                  0.057       4.3%     5.7 th  21    0      0
## 4   19       0.043                  0.057       4.3%     5.7 th  20    0      1
## 5   19       0.043                  0.057       4.3%     5.7 th  21    0      0
## 6   19       0.043                  0.057       4.3%     5.7 th  21    0      0
## 7   19       0.043                  0.057       4.3%     5.7 th  21    0      0
## 8   19       0.043                  0.057       4.3%     5.7 th  20    0      0
## 9   20       0.014                  0.071       1.4%     7.1 th  20    0      1
## 10  20       0.014                  0.071       1.4%     7.1 th  21    0      1

Another interesting thing we can do is group dat1 by ACT and then find the average college GPA (colGPA) given ACT. Essentially, what we are doing when we aggregate is splitting the data into groups and then performing some function within those groups. So we get an average colGPA for each level of ACT.

dat3 = aggregate(dat1[15],            # dat1[15] is the colGPA column
                by = list(dat1$ACT),  # split the rows by ACT score
                FUN = mean, na.rm=T)  # then average colGPA within each group
                
names(dat3)[1]<-"ACT"

head(dat3, 10)
##    ACT   colGPA
## 1   16 3.300000
## 2   19 2.850000
## 3   20 3.050000
## 4   21 2.869231
## 5   22 2.953846
## 6   23 3.039130
## 7   24 3.052000
## 8   25 2.872727
## 9   26 3.344444
## 10  27 3.092308

If I didn’t subset dat1, then this code would have given me averages for each variable in dat1, grouped according to ACT.
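
For example, something like the line below (not run here) would average every column of dat1 within each ACT group; the character columns percentage and percentile would come back as NA with warnings, since you can't take the mean of text:

aggregate(dat1, by=list(dat1$ACT), FUN=mean, na.rm=TRUE)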

Let's get a sense of what the data look like by plotting the aggregated colGPA against ACT, and then plotting a regression line over that data.

plot(dat3$ACT, dat3$colGPA, main="Average College GPA Conditional on ACT Score", xlab="ACT", ylab="colGPA")


abline(lm(colGPA~ACT, data=dat3), col="blue")

Keep in mind that this isn't a proper regression, because it gives each level of ACT equal weight regardless of how many students actually have that score.
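
If you wanted the aggregated fit to reflect how many students sit at each score, one option (a sketch, not part of the original tutorial) is to weight each ACT level by its count:

w<-as.numeric(table(dat1$ACT)) # number of students at each observed ACT score
abline(lm(colGPA~ACT, data=dat3, weights=w), col="red", lty=2)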


There are other ways of creating a data frame of ACT score probabilities that are less efficient and require more code than what we did here. However, they are good exercises if you are new to R. Below are some suggested exercises. Feel free to email me for the answers or if you get stuck.

  1. Create a vector of probabilities by aggregating dat1 by ACT and using a formula for probability.

  2. How would you create an ordered vector of individual ACT scores from dat$ACT?

  3. In the previous problem, your vector will not contain levels for the ACT scores that were not observed: 17, 18, and 32. How would you insert those levels into the vector?

  4. The density() function estimates a continuous probability density curve for its input data. Like hist(), it produces probabilities for the input data, except that it imputes (creates) a lot of values of x and y (over 500 in our case) and smooths out the probabilities as though dat$ACT were a continuous variable. Another way it is similar to hist() is that it produces under-the-hood meta data that you can extract into vectors. The first two vectors in density()'s output list are the imputed values of your input variable and the associated probabilities. Extract these vectors, figure out a function that lets you select the values of density(dat$ACT)[1] closest to integer values (e.g. 26.99) along with the associated estimated probability, create a data frame from that, recode ACT in a way that makes sense, and follow the other steps of the original tutorial.

  5. Repeat the example of adding a regression line to a plot, but use the data from dat1 instead of dat3. Plot the lines from both regressions. Are they the same or different? Why?

  6. Create a function that takes two ACT values as inputs for an interval, and tells you what the probability is of getting an ACT score in that interval.