1. Introduction

This document analyzes the data available at https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv. The data contains information about four-year colleges, including several statistics for each school. The goal is to explore different aspects of these colleges and gain an understanding of their key metrics.

I propose the following 10 questions based on my understanding of the data.

  1. What is the mean debt from all colleges in the data?
  2. Is there a correlation between the cost of the school and the median income?
  3. What is the distribution of faculty salaries for all schools within the data?
  4. Is there a correlation between the average SAT score and the admission rate?
  5. What is the distribution of part-time students across colleges?
  6. What is the median completion rate among schools?
  7. What is the distribution of different ethnicities of students attending schools?
  8. How much does the net price of school vary across all the schools in the data?
  9. What is the spread in the enrollment size across colleges?
  10. What is the distribution of schools among different regions?

Analysis

We will explore the questions in detail and summarize the results.

college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

Q1: What is the mean debt from all colleges in the data?

The mean debt from all the colleges in the data is $2365.65.

Q2: What is the correlation between the cost of the school and the median income?

The correlation between the cost of the school and the median income is about 0.59. This is a moderate positive correlation, meaning that schools with a higher cost tend to have a higher median income, although the relationship is far from exact.
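
One way to gauge how precisely a correlation like this is estimated is to pair it with a confidence interval. The sketch below is an illustration rather than part of the original analysis: `cor_summary` is a hypothetical helper, and the six (Cost, MedIncome) pairs are copied from the head(college) output above only so that the example runs on its own.

```r
# Hypothetical helper (not part of the original analysis): Pearson correlation
# with a 95% confidence interval. cor.test() drops incomplete pairs itself.
cor_summary <- function(x, y) {
  test <- cor.test(x, y, method = "pearson")
  c(r     = unname(test$estimate),
    lower = test$conf.int[1],
    upper = test$conf.int[2],
    n     = unname(test$parameter) + 2)  # degrees of freedom are n - 2
}

# First six rows from the head(college) output, for illustration only
cost       <- c(22886, 24129, 15080, 22108, 19413, 28836)
med_income <- c(23.6, 34.5, 15.0, 44.8, 22.1, 66.7)

cor_summary(cost, med_income)
```

On the full data the call would be cor_summary(college$Cost, college$MedIncome). With only six rows the interval is very wide, so the output here says little about the data set as a whole.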

Q3: What is the distribution of faculty salaries for all schools within the data?

The histogram shows the distribution of faculty salaries for the schools in the data. The median appears to be around $7,500. The graph roughly resembles a normal distribution, meaning that most schools pay faculty salaries near the median, with a few outliers in both directions.
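
A median read off a histogram can be cross-checked numerically. The following sketch assumes the `FacSalary` column is used as-is; `center_summary` is a hypothetical helper, and the six values come from the head(college) output, so the numbers describe only those rows and not the full data set.

```r
# Hypothetical helper: center and quartiles of a numeric vector, ignoring NAs
center_summary <- function(x) {
  x <- x[!is.na(x)]
  c(median = median(x),
    mean   = mean(x),
    q1     = unname(quantile(x, 0.25)),
    q3     = unname(quantile(x, 0.75)))
}

# FacSalary for the first six schools shown in head(college)
fac_salary <- c(6983, 10640, 3866, 9391, 7399, 10016)

center_summary(fac_salary)
```

A mean close to the median is consistent with a roughly symmetric distribution, while a large gap between the two would suggest skew. On the full data the call would be center_summary(college$FacSalary).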

Q4: Is there a correlation between the average SAT score and the admission rate?

The correlation between average SAT score and admission rate is about -0.42, a moderate negative correlation. This means that schools with higher average SAT scores tend to have lower admission rates. This is probably because schools with lower acceptance rates have higher standards and mostly accept students with higher SAT scores.

Q5: What is the distribution of part-time students across colleges?

The histogram shows the distribution of part-time students across schools. Most schools have between 0% and 20% part-time attendance, with some having more. This suggests that at most schools in the dataset, the large majority of students are full-time.

Q6: What is the median completion rate among schools?

The median completion rate among schools is 52.45%.

Q7: What is the distribution of different ethnicities of students attending schools?

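The pie chart shows the distribution of student ethnicity across the schools in the data. White students account for over half of the total, while several other ethnicities make up smaller portions. Note that the chart sums each school's percentages without weighting by enrollment and leaves out the Other column, so it reflects average shares across schools rather than exact shares of all students.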
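
The Question 7 code in the appendix sums each school's percentages, which gives every school equal weight regardless of its size. If the question is about the share of students, one option is to weight each school's percentage by its enrollment. This is a sketch under the assumption that the ethnicity columns are percentages of the `Enrollment` count; `weighted_shares` is a hypothetical helper, and the rows are the six shown in head(college).

```r
# Hypothetical helper: enrollment-weighted percentage for each group.
# Assumes each group column is a percentage of the school's Enrollment.
weighted_shares <- function(df,
                            groups = c("White", "Black", "Hispanic", "Asian", "Other")) {
  keep <- complete.cases(df[, c("Enrollment", groups)])  # drop rows with NAs
  d <- df[keep, ]
  sapply(d[groups], weighted.mean, w = d$Enrollment)
}

# First six rows from the head(college) output, for illustration only
first_six <- data.frame(
  Enrollment = c(4824, 12866, 322, 6917, 4189, 32387),
  White      = c(2.5, 57.8, 7.1, 74.2, 1.5, 78.5),
  Black      = c(90.7, 25.9, 14.3, 10.7, 93.8, 10.1),
  Hispanic   = c(0.9, 3.3, 0.6, 4.6, 1.0, 4.7),
  Asian      = c(0.2, 5.9, 0.3, 4.0, 0.3, 1.2),
  Other      = c(5.6, 7.1, 77.6, 6.5, 3.5, 5.6)
)

weighted_shares(first_six)
```

On the full data the call would be weighted_shares(college), and the result could be passed to pie() in place of the column sums.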

Q8: What is the standard deviation of the net price for schools?

The net price of schools has a standard deviation of about $7,854, so a school's net price typically differs from the mean by several thousand dollars.

Q9: What is the spread in the enrollment size across colleges?

Enrollment has a standard deviation of about 7,473 students, indicating how widely enrollment varies across schools.
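
The standard deviation is sensitive to a few very large values, so it can help to report the range and interquartile range next to it. The following is a minimal sketch, with `spread_summary` as a hypothetical helper and the six `Enrollment` values taken from the head(college) output.

```r
# Hypothetical helper: several measures of spread, ignoring NAs
spread_summary <- function(x) {
  x <- x[!is.na(x)]
  c(sd = sd(x), iqr = IQR(x), min = min(x), max = max(x))
}

# Enrollment for the first six schools shown in head(college)
enrollment <- c(4824, 12866, 322, 6917, 4189, 32387)

spread_summary(enrollment)
```

On the full data the call would be spread_summary(college$Enrollment). The same helper applies to college$NetPrice from Question 8.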

Q10: What is the distribution of schools among different regions?

The pie chart shows the distribution of schools among different regions. The distribution of schools is relatively even among the four main regions of the U.S., while only a few schools are located in the U.S. territories.
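
Slices of similar size are hard to compare by eye, so a table of percentages can back up the visual impression. The sketch below uses an invented `region` vector purely for illustration (its counts are not from the data set), and `region_percent` is a hypothetical helper.

```r
# Hypothetical helper: percentage of schools in each region, to one decimal
region_percent <- function(region) {
  counts <- table(region)  # NA values are dropped by default
  round(100 * prop.table(counts), 1)
}

# Invented values for illustration only, not taken from the data set
region <- c("Midwest", "Midwest", "Northeast", "Southeast",
            "Southeast", "Southeast", "West", "Territory")

region_percent(region)
```

On the real data the call would be region_percent(college$Region).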

Appendix

Question 1 code:

mean(college$Debt, na.rm = TRUE)
## [1] 2365.655

Question 2 code:

cor(college$Cost, college$MedIncome, use = "complete.obs")
## [1] 0.589288

Question 3 code:

hist(college$FacSalary, 
     main = "Distribution of Faculty Salary", 
     ylab = "Number of Schools",  # each observation is a school, not an employee
     xlab = "Salary in Dollars", 
     col = "lightblue")

Question 4 code:

cor(college$AvgSAT, college$AdmitRate, use = "complete.obs")
## [1] -0.4221255

Question 5 code:

hist(college$PartTime, 
     main = "Distribution of Part-Time Students", 
     xlab = "Percentage of Part-Time Students", 
     col = "orange")

Question 6 code:

median(college$CompRate, na.rm = TRUE)
## [1] 52.45

Question 7 code:

ethnicity <- colSums(college[, c("White", "Black", "Hispanic", "Asian")], na.rm = TRUE)

# pie() takes slice labels through `labels`; `names.arg` belongs to barplot()
pie(ethnicity, 
        labels = names(ethnicity), 
        main = "Distribution of Ethnicities Across Colleges", 
        col = c("lightblue", "lightgreen", "purple", "yellow"))

Question 8 code:

sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096

Question 9 code:

sd(college$Enrollment, na.rm = TRUE)
## [1] 7473.072

Question 10 code:

region_counts <- table(college$Region)

# Labels come from the table itself, so they always match the slice order
pie(region_counts, 
        labels = names(region_counts), 
        main = "Distribution of Schools Among Regions", 
        col = c("lightblue", "lightgreen", "purple", "yellow", "lightcoral"))