HW9

Q1. Consider the fraction variable a.grad.rate (percentage of freshmen who graduated within a six-year period). Compare the fraction variable across tiers and also compare the froots of the variable across tiers and also compare the flogs of the variable across tiers. Is it necessary to reexpress the data by froots or flogs in this example? Explain.

library(dplyr)  # provides group_by() and summarize() used below

college.ratings <- read.delim("~/data/college.ratings.txt")
boxplot(a.grad.rate ~ Tier, data=college.ratings,
        horizontal=TRUE, main="Fraction Scale",
        xlab="Graduate Rate", ylab="Tier")

summarize(group_by(college.ratings, Tier),
          IQR=IQR(a.grad.rate, na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier   IQR
##   <int> <dbl>
## 1     1 0.153
## 2     2 0.11 
## 3     3 0.1  
## 4     4 0.118

# Folded root and folded log reexpressions of a fraction p
froot <- function(p) sqrt(p) - sqrt(1 - p)
flog <- function(p) log(p) - log(1 - p)
boxplot(froot(a.grad.rate) ~ Tier, data=college.ratings,
        horizontal=TRUE, main="Froot Scale",
        xlab="Graduate Rate", ylab="Tier")

boxplot(flog(a.grad.rate) ~ Tier, data=college.ratings,
        horizontal=TRUE, main="Flog Scale",
        xlab="Graduate Rate", ylab="Tier")
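
As a quick check on what the two reexpressions do, the sketch below evaluates them at a few illustrative fractions (these values are made up and are not taken from college.ratings). Both functions map 0.5 to 0 and treat p and 1 - p symmetrically, and the flog stretches fractions near 0 or 1 more strongly than the froot.

```r
# Folded reexpressions, repeated here so the example runs on its own
froot <- function(p) sqrt(p) - sqrt(1 - p)
flog  <- function(p) log(p) - log(1 - p)

p <- c(0.10, 0.50, 0.90)  # illustrative fractions, not from the data

froot(p)  # 0.5 maps to 0; 0.1 and 0.9 map to equal and opposite values
flog(p)   # same pattern, but the tails are stretched further

# Both are antisymmetric about p = 0.5: f(1 - p) = -f(p)
all.equal(froot(1 - p), -froot(p))
all.equal(flog(1 - p), -flog(p))
```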

summarize(group_by(college.ratings, Tier),
          IQR=IQR(froot(a.grad.rate), na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier   IQR
##   <int> <dbl>
## 1     1 0.272
## 2     2 0.161
## 3     3 0.142
## 4     4 0.171

summarize(group_by(college.ratings, Tier),
          IQR=IQR(flog(a.grad.rate), na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier   IQR
##   <int> <dbl>
## 1     1 1.13 
## 2     2 0.483
## 3     3 0.402
## 4     4 0.508

Tier 1 has a noticeably larger spread than the other tiers on every scale, so we focus on comparisons between Tiers 2, 3, and 4. For these tiers, the original fraction scale does the best job of equalizing spreads. The ratio of the largest IQR to the smallest IQR is 0.118/0.1 = 1.18 for fractions, 0.171/0.142 = 1.204 for froots, and 0.508/0.402 = 1.264 for flogs.
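
The ratio comparison can also be reproduced in a few lines. This is a minimal sketch that copies the IQR values from the summarize() output above into a small data frame; the helper name spread_ratio is hypothetical and is not part of the original analysis.

```r
# IQRs by tier, copied from the summarize() output above
iqr <- data.frame(
  Tier     = 1:4,
  fraction = c(0.153, 0.110, 0.100, 0.118),
  froot    = c(0.272, 0.161, 0.142, 0.171),
  flog     = c(1.13, 0.483, 0.402, 0.508)
)

# Hypothetical helper: largest IQR divided by smallest IQR
spread_ratio <- function(x) max(x) / min(x)

# Tiers 2 to 4 only (drop the Tier column before applying the helper)
ratios_234 <- sapply(iqr[iqr$Tier %in% 2:4, -1], spread_ratio)
round(ratios_234, 3)

# All four tiers, for comparison
ratios_all <- sapply(iqr[, -1], spread_ratio)
round(ratios_all, 3)
```

A ratio close to 1 means the spreads are nearly equal, so the scale with the smallest ratio is preferred. Keeping the values in one data frame makes it easy to change which tiers enter the comparison.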

Since the spreads are already approximately equal on the original fraction scale, we can compare the graduation rates of Tiers 2, 3, and 4 directly by computing medians. There is no need to reexpress the data by froots or flogs in this example.

summarize(group_by(college.ratings, Tier),
          M=median(a.grad.rate, na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier     M
##   <int> <dbl>
## 1     1 0.845
## 2     2 0.64 
## 3     3 0.48 
## 4     4 0.37

On the original fraction scale, the Tier 2 “Graduate Rate” fractions tend to be 0.64 - 0.48 = 0.16 higher than the Tier 3 fractions. Similarly, the Tier 3 “Graduate Rate” fractions tend to be 0.48 - 0.37 = 0.11 higher than the Tier 4 fractions.

Q2. Think of a second variable that you think will distinguish these four groups of colleges. (Explain why you chose this variable.) For the variable that you chose, compare the four tiers of schools using stemplots, parallel boxplots, and any needed reexpression. How do the tiers compare with respect to the variable? Are there unusual schools with respect to the variable?

aplpack::stem.leaf(college.ratings$F.retention,depths = TRUE)

## 1 | 2: represents 0.12
##  leaf unit: 0.01
##             n: 249
##     2     5. | 88
##     5     6* | 000
##    11      t | 222223
##    16      f | 44555
##    20      s | 6777
##    26     6. | 888999
##    40     7* | 00000111111111
##    57      t | 22222223333333333
##    74      f | 44444445555555555
##    96      s | 6666666777777777777777
##   114     7. | 888888889999999999
##   (11)    8* | 00000111111
##   124      t | 222222233333333333333
##   103      f | 44444444444455555555
##    83      s | 66666666677777777
##    66     8. | 88888888999999
##    52     9* | 000001111111
##    40      t | 222222333
##    31      f | 4444444455
##    21      s | 666666666667777
##     6     9. | 888889

From the stemplot, the distribution of “F.retention” (the average freshman retention rate) looks roughly symmetric and mound-shaped. There is a small irregularity in the middle (the 8* stem holds only 11 values, fewer than its neighbors) and a slight left skew, but considering how the sample was selected, the distribution is quite well behaved.

boxplot(F.retention ~ Tier, data=college.ratings,
        horizontal=TRUE, main="Fraction Scale",
        xlab="Retention", ylab="Tier")

summarize(group_by(college.ratings, Tier),
          IQR=IQR(F.retention, na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier    IQR
##   <int>  <dbl>
## 1     1 0.0400
## 2     2 0.0475
## 3     3 0.0600
## 4     4 0.0800

From the parallel boxplots above, the spread of Tier 4 is noticeably larger than that of the other three tiers. In terms of fourth-spreads, Tier 1 has the smallest variation (0.04), followed by Tier 2 (0.0475), Tier 3 (0.06), and Tier 4 (0.08). The boxplots also show four outliers.
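
To list the outlying values rather than read them off the plot, one option is boxplot.stats(), which applies the same 1.5 x fourth-spread rule that boxplot() uses. The sketch below runs on a made-up data frame named toy, since the actual values are not reproduced here. Under the assumption that college.ratings is loaded, replacing toy with college.ratings should give the per-tier outliers for the real data.

```r
# Made-up data for illustration only; these are not the college.ratings values
toy <- data.frame(
  Tier = rep(1:2, each = 9),
  F.retention = c(0.93, 0.94, 0.95, 0.95, 0.96, 0.96, 0.97, 0.98, 0.80,
                  0.84, 0.85, 0.86, 0.87, 0.87, 0.88, 0.89, 0.90, 0.88)
)

# For each tier, keep the values lying beyond 1.5 fourth-spreads of the hinges
out_by_tier <- tapply(toy$F.retention, toy$Tier,
                      function(x) boxplot.stats(x)$out)
out_by_tier
```

In this toy example only the value 0.80 in Tier 1 is flagged, and Tier 2 has no outliers.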

If we focus on comparisons of Tiers 1, 2, and 3, the spread of the Tier 3 values (0.06) is about 1.5 times the spread of the Tier 1 values (0.04). Next, we transform the “F.retention” variable to the froot and flog scales.

froot <- function(p) sqrt(p) - sqrt(1- p)
flog <- function(p) log(p) - log(1 - p)
boxplot(froot(F.retention) ~ Tier, data=college.ratings,
        horizontal=TRUE, main="Froot Scale",
        xlab="Retention", ylab="Tier")

boxplot(flog(F.retention) ~ Tier, data=college.ratings,
        horizontal=TRUE, main="Flog Scale",
        xlab="Retention", ylab="Tier")

summarize(group_by(college.ratings, Tier),
          IQR=IQR(froot(F.retention), na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier    IQR
##   <int>  <dbl>
## 1     1 0.103 
## 2     2 0.0865
## 3     3 0.0958
## 4     4 0.122

summarize(group_by(college.ratings, Tier),
          IQR=IQR(flog(F.retention), na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier   IQR
##   <int> <dbl>
## 1     1 0.736
## 2     2 0.367
## 3     3 0.330
## 4     4 0.390

Focusing again on comparisons between Tiers 1, 2, and 3, the froot scale appears best for equalizing spreads. The ratio of the largest IQR to the smallest IQR is 0.06/0.04 = 1.5 for fractions, 0.103/0.0865 = 1.191 for froots, and 0.736/0.330 = 2.230 for flogs.

In addition, the froot boxplots show one outlier in each of Tiers 1, 2, and 3, for a total of three unusual schools on the froot scale.

We can then compare the retention rates of Tiers 1, 2, and 3 on the froot scale by computing medians.

summarize(group_by(college.ratings, Tier),
          M=median(froot(F.retention), na.rm=TRUE))

## # A tibble: 4 x 2
##    Tier     M
##   <int> <dbl>
## 1     1 0.725
## 2     2 0.526
## 3     3 0.398
## 4     4 0.304

On the froot scale, the Tier 1 “Retention” values tend to be 0.725 - 0.526 = 0.199 higher than the Tier 2 values. Similarly, the Tier 2 values tend to be 0.526 - 0.398 = 0.128 higher than the Tier 3 values, and the Tier 3 values tend to be 0.398 - 0.304 = 0.094 higher than the Tier 4 values.
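
Differences on the froot scale are hard to read as retention rates, so it can help to map the medians back to fractions. The sketch below is not part of the original analysis: froot_inv is a hypothetical helper obtained by solving f = sqrt(p) - sqrt(1 - p) for p, and it assumes f lies between -1 and 1.

```r
froot <- function(p) sqrt(p) - sqrt(1 - p)

# Hypothetical inverse of froot, derived algebraically (valid for -1 <= f <= 1)
froot_inv <- function(f) 0.5 + f * sqrt(2 - f^2) / 2

# Median froots by tier, copied from the summarize() output above
m_froot <- c(Tier1 = 0.725, Tier2 = 0.526, Tier3 = 0.398, Tier4 = 0.304)

# Back-transformed medians on the original fraction scale
m_frac <- froot_inv(m_froot)
round(m_frac, 3)

# Round-trip check: applying froot again should recover the froot medians
all.equal(froot(m_frac), m_froot)
```

Because froot is strictly increasing, the ordering of the tiers is the same on both scales.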

Therefore, the variable “F.retention” clearly distinguishes these four groups of colleges (Tiers 1 to 4).

Xuejun Gu

11/28/2021