1. How many observations are recorded in the dataset? How many colleges are recorded?
setwd("C:/Users/qizhe/Desktop/STA 141A/HW1") # set working directory

cs <- readRDS("college_scorecard_2013.rds") # give a name to the data

# 1 Total of 3312 observations, 2431 different colleges.
dim(cs) # 3312 observations
## [1] 3312   51
# number of colleges
sum(cs$main_campus) # 2431 main campuses
## [1] 2431

There are a total of 3,312 observations within the data. For number of colleges there are 2,431.

  1. How many features are there? How many of these are categorical? How many are discrete? Are there any other kinds of features in this dataset?
# 2
# examine the number of different classes of variables
tab1 <- sapply(cs, class)
table(tab1)
## tab1
## character    factor   integer   logical   numeric 
##         4         4        15         3        25
# examine each variable type using str()
str(cs)
## 'data.frame':    3312 obs. of  51 variables:
##  $ unit_id                : int  100654 100663 100690 100706 100724 100751 100812 100830 100858 100937 ...
##  $ ope_id                 : chr  "001002" "001052" "025034" "001055" ...
##  $ main_campus            : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ branches               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ open_admissions        : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ name                   : chr  "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
##  $ city                   : chr  "Normal" "Birmingham" "Montgomery" "Huntsville" ...
##  $ state                  : Factor w/ 56 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ zip                    : chr  "35762" "35294-0110" "36117-3553" "35899" ...
##  $ online_only            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ primary_degree         : Factor w/ 5 levels "Other","Certificate",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ highest_degree         : Factor w/ 5 levels "Other","Certificate",..: 5 5 5 5 5 5 4 5 5 4 ...
##  $ ownership              : Factor w/ 3 levels "Public","Nonprofit",..: 1 1 2 1 1 1 1 1 1 2 ...
##  $ avg_sat                : int  823 1146 NA 1180 830 1171 NA 970 1215 1177 ...
##  $ undergrad_pop          : int  4051 11200 322 5525 5354 28692 2999 4322 19761 1181 ...
##  $ grad_pop               : int  969 7066 309 1680 719 5312 NA 762 5065 NA ...
##  $ cost                   : int  18888 19990 12300 20306 17400 26717 NA 16556 23788 44167 ...
##  $ tuition                : int  7182 7206 6870 9192 8720 9450 NA 8750 9852 30690 ...
##  $ tuition_nonresident    : int  12774 16398 6870 21506 15656 23950 NA 24950 26364 30690 ...
##  $ revenue_per_student    : int  9063 9033 12057 8322 7813 12198 5094 7625 13186 11928 ...
##  $ spend_per_student      : int  7459 17208 5123 9352 7393 9817 6176 6817 11324 9990 ...
##  $ avg_faculty_salary     : int  7079 10170 3849 9341 6557 9605 7672 7173 9429 7513 ...
##  $ ft_faculty             : num  0.886 0.911 0.672 0.655 0.664 ...
##  $ admission              : num  0.899 0.867 NA 0.806 0.512 ...
##  $ retention              : num  0.631 0.802 0.375 0.81 0.622 ...
##  $ completion             : num  0.291 0.538 0.667 0.483 0.252 ...
##  $ fed_loan               : num  0.82 0.54 0.763 0.473 0.874 ...
##  $ pell_grant             : num  0.712 0.35 0.684 0.328 0.827 ...
##  $ avg_family_inc         : num  29590 50153 22714 51013 28113 ...
##  $ med_family_inc         : num  21281 35163 16619 33787 20269 ...
##  $ avg_10yr_salary        : int  34300 46400 46100 50500 29500 49900 42200 37900 54100 51000 ...
##  $ sd_10yr_salary         : int  25300 36300 34900 32700 20500 42600 28000 27800 41800 36600 ...
##  $ med_10yr_salary        : int  29900 40200 40100 45600 26700 42700 38500 33500 47100 45600 ...
##  $ med_debt               : num  30968 21282 24897 23106 32000 ...
##  $ med_debt_withdraw      : num  9500 8925 6731 8068 9500 ...
##  $ default_3yr_rate       : num  0.163 0.08 0.089 0.077 0.191 0.081 0.094 0.131 0.061 0.099 ...
##  $ repay_5yr_rate_withdraw: num  0.324 0.606 0.449 0.613 0.261 ...
##  $ repay_5yr_rate         : num  0.53 0.731 0.529 0.794 0.446 ...
##  $ avg_entry_age          : num  20.5 23.5 33.9 24.1 20.6 ...
##  $ veteran                : num  NA 0.00354 NA 0.01013 0.00353 ...
##  $ first_gen              : num  0.388 0.336 0.54 0.326 0.374 ...
##  $ male                   : num  0.486 0.413 0.391 0.558 0.408 ...
##  $ female                 : num  0.514 0.587 0.609 0.442 0.592 ...
##  $ race_white             : num  0.0279 0.5987 0.2919 0.7012 0.0161 ...
##  $ race_black             : num  0.95 0.259 0.422 0.131 0.928 ...
##  $ race_hispanic          : num  0.0089 0.0258 0.0093 0.0338 0.0114 0.0313 0.0213 0.0079 0.0253 0.0305 ...
##  $ race_asian             : num  0.0022 0.0518 0.0031 0.0364 0.0015 0.0112 0.0047 0.0245 0.0213 0.039 ...
##  $ race_native            : num  0.0012 0.0026 0.0031 0.0145 0.0009 0.0044 0.019 0.0037 0.0077 0.0102 ...
##  $ race_pacific           : num  0.001 0.0007 0.0031 0.0002 0.0007 0.0011 0.0007 0.0002 0 0 ...
##  $ net_cost               : int  13415 14805 7455 17520 11936 21513 NA 11915 17541 21406 ...
##  $ race_other             : num  0.0086 0.0614 0.2671 0.0828 0.0409 ...

In total there are 51 variables, so that makes for 51 features for this dataset. The categorical variables are: unit_id, ope_id, name, city, state, and zip. The numerical discrete variables are: branches, undergrad_pop, and grad_pop. Additionally, there are various numerical variables which are continuous: avg_sat, cost, tuition, tuition_nonresident, revenue_per_student, spend_per_student, avg_faculty_salary, ft_faculty, admission, retention, completion, fed_loan, pell_grant, avg_family_inc, med_family_inc, avg_10yr_salary, sd_10yr_salary, med_10yr_salary, med_debt, med_debt_withdraw, default_3yr_rate, repay_5yr_rate_withdraw, repay_5yr_rate, avg_entry_age, veteran, first_gen, male, female, race_white, race_black, race_hispanic, race_asian, race_native, race_pacific, net_cost, and race_other. Ordinal variables are: primary_degree and highest_degree. Logical variables (R only): main_campus, open_admissions, online_only.

  1. How many missing values are in the dataset? Which feature has the most missing values? Are there any patterns?
# 3
sum(is.na(cs)) # number of NA's 23197
## [1] 23197
# find number of NA's per variable, find the largest
count_na_by_features = sapply(cs, function(x) sum(is.na(x)))
t(t(count_na_by_features)) # avg_sat is largest with 1,923
##                         [,1]
## unit_id                    0
## ope_id                     0
## main_campus                0
## branches                   0
## open_admissions            0
## name                       0
## city                       0
## state                      0
## zip                        0
## online_only              199
## primary_degree             0
## highest_degree             0
## ownership                  0
## avg_sat                 1923
## undergrad_pop            490
## grad_pop                1149
## cost                     717
## tuition                  472
## tuition_nonresident      635
## revenue_per_student      201
## spend_per_student        201
## avg_faculty_salary       317
## ft_faculty               558
## admission               1416
## retention                966
## completion               864
## fed_loan                 510
## pell_grant               510
## avg_family_inc           267
## med_family_inc           267
## avg_10yr_salary          480
## sd_10yr_salary           480
## med_10yr_salary          480
## med_debt                 487
## med_debt_withdraw        489
## default_3yr_rate         183
## repay_5yr_rate_withdraw  538
## repay_5yr_rate           792
## avg_entry_age            267
## veteran                 1735
## first_gen                505
## male                     490
## female                   490
## race_white               490
## race_black               490
## race_hispanic            490
## race_asian               490
## race_native              490
## race_pacific             490
## net_cost                 689
## race_other               490
# graphically display the NA's per feature
NA_counts<-colSums(is.na(cs)) # save NA's to variable
plot(NA_counts, main="Number of Missing Data per Feature",
     xlab = "Feature Index",
     ylab = "Number NA's", cex = 1, pch=15)
abline(h=c(500, 0), col="Green") # many around this region

There are 23,197 NA’s in the dataset. The greatest number of NA’s are in the variable avg_sat, with a total of 1,923 NA’s. Revenue_per_student and spend_per_student both have 201 NA’s. They both are related to the money used by each student, so it makes sense that information either exists for both or none of them. The variables fed_loan and pell_grant both have 510 NA’s, both are related to the grant money provided by the government, so it makes sense that data exists for either both or neither. The variables avg_family_inc and med_family_inc both have 267 NA’s, and it makes sense that data either exists for both the average and median family income or neither. The variables avg_10yr_salary, sd_10yr_salary, med_10yr_salary all have 480 NA’s, and it makes sense that average, standard deviation, and median information exists for 10-year salaries or none exists at all. The variables med_debt and med_debt_withdraw are similar at 487 and 489 NA’s. The definitions are similar, so the difference between the two is likely an anomaly.

When plotting the number of NA’s for each feature index, it seems that most of the NA’s are within the range of 0-500 for each feature that has NA’s.

  1. Are there more public colleges or private colleges recorded? For each of these, what are the proportions of highest degree awarded? Display this information in one graph and comment on what you see.
# 4
# number of private versus public colleges
length(which(cs$ownership=="Public")) # public 716
## [1] 716
length(which(cs$ownership!="Public")) # private aka For Profit and Nonprofit 2,596
## [1] 2596
# separate public and private colleges into variables
cs_public = cs[cs$ownership=="Public",]
cs_private = cs[cs$ownership!="Public",]

There are 715 public colleges, while there are 2,596 private colleges which means that there are many more private than public colleges.

# graphically display proportions of degrees for private and public colleges
public_highest_deg_table = table(cs_public$highest_degree)/sum(table(cs_public$highest_degree))
private_highest_deg_table = table(cs_private$highest_degree)/sum(table(cs_private$highest_degree))
par(mfrow=c(1,2))
barplot(public_highest_deg_table, main = "Degree Proportion in Public College", ylim=c(0,0.8))
barplot(private_highest_deg_table, main = "Degree Proportion in Private College", ylim=c(0,0.8))

dev.off() # reset frames
## null device 
##           1

From the plot it seems that public colleges have a larger proportion of graduate in comparison to private colleges where the proportion between the graduate and bachelor students is more balanced. It seems that public universities in general offer the traditional track of a bachelor’s degree and afterwards have some graduate programs within their possible education tracks. Private universities seem more balanced in that they offer students a variety of other degrees, possibly professional degrees which balance the number of bachelors and graduate students.

  1. What is the average undergraduate population? What is the median? What are the deciles? Display these statistics and the distribution graphically. Do you notice anything unusual?
# 5
# mean, median, and decile for undergraduate populations
ugmean <- mean(cs$undergrad_pop,na.rm = TRUE) #3599.502
ugmed <- median(cs$undergrad_pop,na.rm = TRUE) #1295
decile <- quantile(cs$undergrad_pop, prob = seq(0,1,length=11),type = 5,na.rm = TRUE)
decile
##       0%      10%      20%      30%      40%      50%      60%      70% 
##      0.0    152.7    319.0    536.0    847.2   1295.0   1812.1   2677.5 
##      80%      90%     100% 
##   4551.0   9650.5 166816.0

The average undergraduate population is 3599.502 when the mean of the category is taken. The median for this category is 1295.

hist(cs$undergrad_pop, main = "Histogram of undergraduate population", xlab = "Undergraduate Population")
abline(v = ugmean, col='red', lwd=5)
abline(v = decile, col='pink', lwd=2)
abline(v = ugmed, col='blue', lwd=5)
legend(x=120000, y=2400, c("mean", "median", "decile"), col = c("red", "blue", "pink"), lty=c(1,1,1))

#There is an extreme outlier, the Universit of Phoenix, Arizona whose population of over
#150,000 students creates an extreme skew to the histogram.

# testing for without the previous outlier, little difference in overall skew
cs1 = cs[-which(cs$undergrad_pop==166816),] # removes outlier
ugmean <- mean(cs1$undergrad_pop,na.rm = TRUE) #3599.502
ugmed <- median(cs1$undergrad_pop,na.rm = TRUE) #1295
decile <- quantile(cs1$undergrad_pop, prob = seq(0,1,length=11),type = 5,na.rm = TRUE)
hist(cs1$undergrad_pop, main = "Histogram of undergraduate population")
abline(v = ugmean, col='red', lwd=5)
abline(v = decile, col='pink', lwd=2)
abline(v = ugmed, col='blue', lwd=5)
legend(x=120000, y=2400, c("mean", "median", "decile"), col = c("red", "blue", "pink"), lty=c(1,1,1))

Above is a histogram of the undergraduate population, along with the frequency for the populations. Most of the population seems to be on the far left side, with only a few exceptions towards the right. There is one extreme population to the right. Much of the skew seems attributable to the University of Phoenix, Arizona. This school has a population of over 150,000 students in their undergraduate population. Removing this piece of data and redoing the histogram plot seems to spread out the statistics on the data somewhat more evenly. However, the data still appears quite skew in appearance.

  1. Compare tuition graphically in the 5 most populous states. Discuss conclusions you can draw from your results.
# 6
# 5 most populous states: California, Texas, New York, Illinois, Florida
top = cs[cs$state %in% c("CA", "TX", "NY", "IL", "FL"),]
top$state <- droplevels(top$state)
boxplot(top$tuition~top$state, main = "Boxplot of 5 Most Populous States", ylab = "Tuition per State")

From the boxplot it seems that all the states have a skew towards higher tuition. New York seems to have both the highest and lowest tuition. Florida and Texas tend to have a greater number of outliers than the other states. Texas and Florida also seem to have a lower tuition than others. California, the largest state seems to have the most balanced boxplot, with the median and the different quartiles looking slightly more balanced than the others. This makes more sense since it has almost twice the population of other states, so it has more data which can begin to appear more normal and balanced. The variance in New York is also the largest.

  1. For the following questions, use code to justify your answer: Part a. What is the name of the university with the largest value of avg_sat?
# 7
# part a
# name of university with largest avg_sat
cs[which.max(cs$avg_sat), 'name'] #California Institute of Technology
## [1] "California Institute of Technology"

Part a) Using the following line in R: cs[which.max(cs$avg_sat), ‘name’], it is possible to determine that the college with the largest ‘avg_sat’ is ‘California Institute of Technology.’

Part b. Does the university with the largest amount of undergrad_pop have open admissions?

# part b
# largest university & open admissions
cs[which.max(cs$undergrad_pop), c('name', 'open_admissions')]
##                                     name open_admissions
## 2371 University of Phoenix-Online Campus            TRUE

Part b) Using the following line in R: cs[which.max(cs$undergrad_pop), c(‘name’, ‘open_admissions’)], it is possible to determine that the largest university with open admissions, the ‘University of Phoenix-Online Campus’ does have open admissions set to ‘TRUE’, so no.

Part c. List the zip code of the public university with the smallest value of avg_family_inc.

# part c
# zip code of smallest avg_family_inc for public schools
publicUni = subset(cs,ownership=="Public") # subset by public
publicUni$zip[which.min(publicUni$avg_family_inc)] # 11101
## [1] "11101"

Part c) After using the subset function to create a variable which selects public universities, it is then possible to determine the index of the university with the smallest ‘avg_family_inc.’ The zip code of this school is ‘11101.’

Part d. Does the university you found in part b. also have the largest amount of grad_pop?

# part d
# also largest grad_pop (referring to part b)
cs[which.max(cs$undergrad_pop), 
   'grad_pop'] == max(cs$grad_pop, na.rm=TRUE) # checks if part b is the max grad pop
## [1] FALSE

Part d) The index of the school in part B was ‘2371.’ After using this to check the ‘grad_pop’, the same school was determined to have ’41,900’ for ‘grad_pop.’ Then checking the dataset again to determine what school had the largest ‘grad_pop’ in the data, it was found out that there was a different larger number for the maximum ‘grad_pop.’ So the answer for this question is no.

  1. For schools that are for-profit in ownership and issue Bachelor’s degrees as their primary_degree, do the following: Part a. Visualize revenue_per_student and spending_per_student and describe the relationship. What issues may arise when fitting a linear regression model?
# 8
# subset For Profit & Bachelor primary degree schools
profbach = subset(cs, ownership=="For Profit" & primary_degree=="Bachelor")
money_per_student <- profbach[,c('revenue_per_student', 'spend_per_student')]

# part a
plot(money_per_student, main = "Revenue versus Spending per Student", xlab = "Revenue per student"
     , ylab = "Spending per student")

# There are a few outliers which may violate the assumptions of linearity.

The relationship between the variables spend_per_student and revenue_per_student seems to have a somewhat linear relationship. The plot seems to imply that the greater the revenue that a college receives from a student, the greater the college will also be spending on each student. The majority of this happens around the $10,000 - $20,000 revenue per student and the $0 – 5,000 range for spending per student. This seems to imply that a college will receive up to twice however much it may spend on each student, which makes sense since they still need to make some sort of profit. There are some outliers, such as where certain colleges spend much more on their students than others, and some colleges make much more revenue than others per student. Some of the outliers to the top left of the plot may possibly violate certain assumptions of linearity.

Part b. Create a new variable called total_net_income. Think carefully about how this variable would be calculated. Visualize the top 5 earning schools.

# part b
# create new variable to use
profbach[is.na(profbach)] <- 0 # set NA's to 0
net_income <- (profbach$revenue_per_student - profbach$spend_per_student) # net income for colleges
total_net_income <- (net_income*(profbach$undergrad_pop + profbach$grad_pop)) # total net income
# order profbach, and then choose top 5
profbach_top = profbach[order(total_net_income, decreasing=TRUE),]
profbach_top5 = profbach_top[1:5,] #University of Phoenix-Online Campus, Ashford University,
#Capella University, Grand Canyon University, Kaplan University-Davenport Campus

# plot net income and number of students
# create total student population, net income, and total net income categories for sub8_top5
profbach_top5$student_pop <- profbach_top5$grad_pop + profbach_top5$undergrad_pop
profbach_top5$net_income <- (profbach_top5$revenue_per_student - profbach_top5$spend_per_student)
profbach_top5$total_net_income <- (profbach_top5$net_income*(profbach_top5$undergrad_pop + profbach_top5$grad_pop))
plot(profbach_top5$student_pop, profbach_top5$total_net_income, xlab = "Student population", 
     ylab = "Total income per student", main = "Top 5 Total Net Income for Colleges", 
     pch=1:5)
legend("topleft", profbach_top5$name, pch=1:5)

The variable for total net income comes from subtracting the spending per student from the revenue per student for each college. This gives the total profit received from each student received by the college. After this calculation, the total profit per student is multiplied by the entire student population which consists of graduate students summed together with undergraduate students. These top 5 are then plotted which are: University of Phoenix-Online Campus, Ashford University, Capella University, Grand Canyon University, and Kaplan UniversityDavenport Campus. There is one outlier here which is the University of Phoenix-Online Campus which seems to make a great deal more revenue than the others which seem quite near to each other in terms of their total net income. The reason for this is likely that as an online school it can enroll many more students at one time for certain courses, giving them an ability to receive much more revenue from these online students, and the money can be focused on improving their online experience rather than building infrastructure or maintaining a campus. Much of the teaching for these classes can likely also be done in a large online classroom setting, giving more profit for enrolled students in comparison to fewer teachers being hired.

  1. Now, examine the relationship between avg_sat and admission for all schools. Part a.Use an appropriate plot to visualize the relationship. Split the data into two groups based on their combination of avg_sat and admission. Justify your answer. Hint: How does the variance of admission depend on values of avg_sat?. Define this variable as group.
# 9
plot(cs[,c("avg_sat", "admission")], main = "Average SAT versus Admissions", xlab = "Average SAT",
     ylab = "Admissions")

# Around avg_sat = 1200, the admission begins to narrow.

There seems to be a density around 800-1200 for average SAT scores which make up most of the colleges. An appropriate cutoff for the data from this plot seems to be around 1200 for SAT scores. For admissions it seems to be below 0.4 when it narrows. Therefore, the variable group can be created by distinguish low SAT scores (<1200) versus high SAT scores (>= 1200) and admissions which are (<=0.4) or (>0.4). As admissions become much narrower in the 1200+ range, there is a narrowing of colleges, and this seems to imply that the admissions rate for colleges become less varied, or in a sense are much more difficult to get accepted into depending on the SAT score of the students that enroll.

Part b. Using code to justify your answers, comment on how the following continuous variables change depending on group:

  1. med_10yr_salary
# split according to avg_sat >= 1200 and admission <= 0.4
group <- ifelse(cs$avg_sat >= 1200 & cs$admission <= 0.4, "High_SAT", "Low_SAT")
group <- ifelse(is.na(group),"Low_SAT",group)
cs$group <- group
boxplot(cs$med_10yr_salary~cs$group, ylab = "Median 10 year salary", main = "Median Salary for High/Low SAT")

The 10 year median salary for students graduating from high SAT colleges tend to earn more than those with lower SAT’s. There are still outliers in the low SAT colleges which make it so that their students can also make higher salaries. It is still more common however for high SAT college graduates to be earning more than their counterparts.

  1. The percentage of race_white and race_asian combined
cs$race_aw <- cs$race_asian + cs$race_white # create asian and white variable
boxplot(cs$race_aw~group, ylab = "Asian and White Proportion", main = "Asian and White Combination for High/Low SAT")

Colleges which have lower proportions of Asian and Whites can also have lower SAT scores. In high SAT average schools, this sort of pattern is not common and doesn’t become apparent.

  1. The percentage of graduate students enrolled at a university
cs$grad_proportion <- (cs$grad_pop/(cs$grad_pop + cs$undergrad_pop)) # create proportion variable
boxplot(cs$grad_proportion~group, ylab = "Graduate Student Proportion", 
        main = "SAT Scores According to Graduate Student Proportion")

lsat=subset(cs, cs$group=="Low_SAT") #low SAT's
lsat$grad_proportion <- (lsat$grad_pop/(lsat$grad_pop + lsat$undergrad_pop))
lsat[order(lsat$grad_proportion, decreasing = TRUE),][1:5,]
##      unit_id ope_id main_campus branches open_admissions
## 93    109086 020992        TRUE        1           FALSE
## 182   117672 001229        TRUE        1           FALSE
## 1000  173683 040443        TRUE        1           FALSE
## 2571  439394 034297        TRUE        1           FALSE
## 43    103778 001070        TRUE        2           FALSE
##                                                          name
## 93                              American Conservatory Theater
## 182         Southern California University of Health Sciences
## 1000 Hazelden Betty Ford Graduate School of Addiction Studies
## 2571                    East West College of Natural Medicine
## 43                    Thunderbird School of Global Management
##               city state        zip online_only primary_degree
## 93   San Francisco    CA 94108-5834       FALSE    Certificate
## 182       Whittier    CA      90604       FALSE       Graduate
## 1000   Center City    MN 55012-0011       FALSE    Certificate
## 2571      Sarasota    FL 34234-0000       FALSE       Graduate
## 43        Glendale    AZ 85306-6000       FALSE    Certificate
##      highest_degree  ownership avg_sat undergrad_pop grad_pop cost tuition
## 93         Graduate  Nonprofit      NA             0       31   NA      NA
## 182        Graduate  Nonprofit      NA             0      564   NA      NA
## 1000       Graduate  Nonprofit      NA             0       90   NA      NA
## 2571       Graduate For Profit      NA             0      123   NA      NA
## 43         Graduate  Nonprofit      NA             2     1020   NA      NA
##      tuition_nonresident revenue_per_student spend_per_student
## 93                    NA               31619             21970
## 182                   NA               20054             10158
## 1000                  NA               20791              4983
## 2571                  NA                8592              4901
## 43                    NA               62758             53366
##      avg_faculty_salary ft_faculty admission retention completion fed_loan
## 93                 8434     1.0000        NA        NA         NA       NA
## 182                5620     0.4188        NA        NA         NA       NA
## 1000               6337         NA        NA        NA         NA        0
## 2571               4287         NA        NA        NA         NA       NA
## 43                10953     0.6027        NA        NA         NA        0
##      pell_grant avg_family_inc med_family_inc avg_10yr_salary
## 93           NA             NA             NA              NA
## 182          NA       36583.54        31521.0              NA
## 1000          0             NA             NA              NA
## 2571         NA       30865.32        12000.0              NA
## 43            0       25728.15        22590.5          136600
##      sd_10yr_salary med_10yr_salary med_debt med_debt_withdraw
## 93               NA              NA       NA                NA
## 182              NA              NA     8144              4541
## 1000             NA              NA       NA                NA
## 2571             NA              NA       NA                NA
## 43           122800          111600     8000                NA
##      default_3yr_rate repay_5yr_rate_withdraw repay_5yr_rate avg_entry_age
## 93              0.023                      NA             NA            NA
## 182             0.032                      NA             NA      33.77143
## 1000            0.018                      NA             NA            NA
## 2571            0.023                      NA             NA      37.57895
## 43              0.012                      NA             NA      30.33871
##      veteran first_gen male female race_white race_black race_hispanic
## 93        NA        NA    0      0          0          0             0
## 182       NA        NA    0      0          0          0             0
## 1000      NA        NA    0      0          0          0             0
## 2571      NA        NA    0      0          0          0             0
## 43        NA        NA    1      0          1          0             0
##      race_asian race_native race_pacific net_cost race_other   group
## 93            0           0            0       NA          0 Low_SAT
## 182           0           0            0       NA          0 Low_SAT
## 1000          0           0            0       NA          0 Low_SAT
## 2571          0           0            0       NA          0 Low_SAT
## 43            0           0            0       NA          0 Low_SAT
##      race_aw grad_proportion
## 93         0       1.0000000
## 182        0       1.0000000
## 1000       0       1.0000000
## 2571       0       1.0000000
## 43         1       0.9980431

It follows that colleges with higher SAT’s will have a greater proportion of Graduate students than colleges that have a lower SAT average. However, there are some outliers in the lower SAT range. The exceptional outliers in the low SAT group seem to come from colleges which are focused on giving graduate level degrees, which explains how their student population can have both lower SAT’s on average while still consisting mainly of graduate students. Part c) (in the following mosaic plots the table() function was used to check whether a variable had 0 for a particular category, e.g., in the next plot Open Admissions which are High_SAT have 0 true values)

Part c. Using code to justify your answers, comment on whether the categorical variables are dependent or independent of group:

  1. open_admission
# part C
mosaicplot(~  group + open_admissions, data = cs, main = "Open Admissions vs. Group",
           xlab = "Group", ylab = "Open Admissions")

table(as.character(cs$group), as.character(cs$open_admissions))
##           
##            FALSE TRUE
##   High_SAT    83    0
##   Low_SAT   2445  784
# examine table of 'group' and 'open_admissions, also view the distribution'
subgroup = cs$group
subgroup[is.na(subgroup)] = "NA"
suboa = cs$open_admissions
suboa[is.na(suboa)] = "NA"
table(suboa,subgroup)
##        subgroup
## suboa   High_SAT Low_SAT
##   FALSE       83    2445
##   TRUE         0     784
subas = cs$avg_sat
subas[is.na(subas)] = "NA"
hist(cs$avg_sat)

From the mosaic plot it becomes clear that if a school has a high SAT average, then it doesn’t have open admissions. This makes sense, since those high SAT schools tend to be more exclusive and require high SAT’s for acceptance. This wouldn’t necessarily have to apply for schools with low SAT averages and accept most people who apply. From the table and plot it’s clear that the two ratios are quite different.

  1. main_campus
# part C, part B
mosaicplot(~  group + main_campus, data = cs, main = "Main Campus vs. Group",
           xlab = "Group", ylab = "Main Campus")

table(as.character(cs$group), as.character(cs$main_campus)) # High_SAT FALSE = 0
##           
##            FALSE TRUE
##   High_SAT     0   83
##   Low_SAT    881 2348

Here it’s apparent that if a school is the main campus for a college, then it’ll have a high SAT average which is an interesting observation. This seems to imply that schools which have high SAT averages are never off-campuses for a college. Although this may not have any special meaning, it is strange that no off-campuses have a high SAT average, despite which main campus it may come from. So clearly here, the main campus and group variable are dependent. From the table and plot it’s clear that the two ratios are quite different.

  1. ownership
# part C, part C
mosaicplot(~  group + ownership, data = cs, main = "Ownership vs. Group",
           xlab = "Group", ylab = "Ownership")

table(as.character(cs$group), as.character(cs$ownership)) # High_SAT For Profit = 0
##           
##            For Profit Nonprofit Public
##   High_SAT          0        72     11
##   Low_SAT         886      1638    705

Here it becomes apparent that high SAT average schools are either Nonprofit or Public universities. Most them which have high SAT averages are actually Nonprofit schools, while a much smaller proportion belong to the Public university system. For low SAT average schools, they can be either of the three, with about an equal amount being either Public or For Profit. Still the largest portion is the Nonprofit for low SAT schools. So, in this case again the two variables ownership and group are dependent on each other. From the table and plot it’s clear that the three ratios are quite different.

  1. Whether the university has more than 1 branch or not
# part C, part D
branches_counts = ifelse(cs$branches == 1, "one branches", "more than one branches")
mosaicplot(~ group + branches_counts, data=cs, main = "Branches Count vs. Group",
           xlab = "Group", ylab = "Branches Counts")

cs$branches_counts <- branches_counts
table(as.character(cs$group), as.character(cs$branches_counts)) # High_SAT For Profit = 0
##           
##            more than one branches one branches
##   High_SAT                      3           80
##   Low_SAT                    1087         2142

According to the mosaic plot, it seems that there is bias towards high SAT schools having only one branch, while low SAT schools are more likely to have more than one branch. Therefore, the fact that a school has one or multiple branches is dependent on the school belonging to a high or low SAT group. From the table and plot it’s clear that the two ratios are quite different.

  1. Examine the relationship between avg_10yr_salary using avg_family_inc for all schools.

Part a. Use an appropriate plot for these two variables. Fit a linear regression model that predicts avg_10yr_salary using avg_family_inc. Add this line to the plot you used. Investigate the groups of points that may be affecting the regression line.

# 10
# part A
plot(cs$avg_family_inc, cs$avg_10yr_salary, main = "Avg. Family Income vs. Avg. 10 yr. Salary",
     xlab = "Avg. Family Income", ylab = "Avg. 10 yr. Salary")
abline(lm(avg_10yr_salary~avg_family_inc, data=cs))

outliers = cs[cs$avg_10yr_salary > 150000 & !is.na(cs$avg_10yr_salary) & !is.na(cs$avg_family_inc),]
dim(outliers) # 9 observations
## [1]  9 55
outliers$name
## [1] "University of Massachusetts Medical School Worcester"        
## [2] "Icahn School of Medicine at Mount Sinai"                     
## [3] "SUNY Downstate Medical Center"                               
## [4] "Northeast Ohio Medical University"                           
## [5] "Philadelphia College of Osteopathic Medicine"                
## [6] "University of North Texas Health Science Center"             
## [7] "West Virginia School of Osteopathic Medicine"                
## [8] "Medical College of Wisconsin"                                
## [9] "Louisiana State University Health Sciences Center-Shreveport"

There is also a regression line through the data which is created by predicting the average 10 year salary according to average family income. After looking at the regression line, it seems that there are a few points in the top left region of the plot which may be affecting the usefulness of the regression line. A possible cutoff region to separate these pieces of data would be when the average 10 year salary is greater than 150,000. There are only 9 of these data points, after creating a variable for the outliers, and checking the number of observations. After looking at the 9 data points in that region it is clear from the names they all seem to be medical-related colleges.

Part b. Describe a categorical variable that would improve the fit of the regression line based on your investigation in part a. What would the levels of this variable be?

# part B
# check the number of factor levels
levels(cs$highest_degree)
## [1] "Other"       "Certificate" "Associate"   "Bachelor"    "Graduate"
# optional R/Stats practice, test the new categorical variable's factor levels to see
# if there's an improvement of fit
plot(cs$avg_family_inc, cs$avg_10yr_salary, main = "Avg. Family Income vs. Avg. 10 yr. Salary",
     xlab = "Avg. Family Income", ylab = "Avg. 10 yr. Salary")
fit1 = lm(avg_10yr_salary~avg_family_inc + highest_degree, data=cs)
fit1$coef
##               (Intercept)            avg_family_inc 
##              3.874917e+04              2.177299e-01 
## highest_degreeCertificate   highest_degreeAssociate 
##             -1.486046e+04             -9.377607e+03 
##    highest_degreeBachelor    highest_degreeGraduate 
##             -8.754538e+03             -1.313234e+03
abline(a = fit1$coef[1], b = fit1$coef[2], col="orange")
abline(a = fit1$coef[1]+fit1$coef[3], b = fit1$coef[2], lty=2, col="pink")
abline(a = fit1$coef[1]+fit1$coef[4], b = fit1$coef[2], lty=3, col="green")
abline(a = fit1$coef[1]+fit1$coef[5], b = fit1$coef[2], lty=4, col="blue")
abline(a = fit1$coef[1]+fit1$coef[6], b = fit1$coef[2], lty=5, col="red")
abline(lm(avg_10yr_salary~avg_family_inc, data=cs))
legend("topright", c("other", "Certificate", "Associate", "Bachelor", "Graduate",
                     "Original regression line"),
       col = c("orange", "pink", "green", "blue", "red", "black"), lty = c(1,2,3,4,5,1))

Looking further into the data it seems that all of them have a Graduate degree as their highest available degree, and all except two of them have a Graduate program as their primary degree offered. This would help explain the high salary, since it is sensible to conclude that students graduating from medical schools can obtain high salary jobs despite their family’s income. An interesting variable that could be used to identify a variable to help improve the fit is ‘highest_degree’ which has 5 factor levels: Other, Certificate, Associate, Bachelor, and Graduate.