Introduction

This is the first project for STAT 353, Statistical Methods for Engineering. In this project, we use the “CollegeScores4yr” dataset from the Lock5 Datasets website. The CSV file is available at: https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv

Questions proposed by me

I propose the following 10 questions based on my understanding of the data:

  1. What is the mean of the average combined SAT scores for all the colleges in the data?
  2. What is the variance of the cost for all the colleges in the data?
  3. What is the standard deviation of the enrollment number for all the colleges in the data?
  4. What is the mean for the net price of a public college?
  5. What is the mean for the net price of a private college?
  6. What is the distribution of the median family income? Create a histogram for the data.
  7. What is the mean for the percentage of students who are reported to be White, Black, Hispanic, Asian and Other? Represent the data in a pie chart.
  8. What is the correlation between the average SAT and the admission rate? Plot a scatter diagram for the data.
  9. What is the correlation between faculty salary and the completion rate of the students? Plot a scatter diagram for the data.
  10. Calculate the minimum value, first quartile, median, third quartile, and maximum value for the debt of students. Plot a box plot for the data.

Questions proposed by ChatGPT

According to the assignment requirement, ChatGPT also proposes the following 10 questions from the data:

  1. What are the mean, median, and standard deviation of the admission rates across all colleges?
  2. What is the distribution of average SAT scores among the colleges?
  3. How does the median net price of attendance vary by college control (public vs. private)?
  4. What is the average and range of in-state tuition and fees?
  5. How do enrollment numbers vary by state, and what are the top 5 states with the highest average enrollment?
  6. What percentage of students receive Pell Grants, and how does this compare across different institution types?
  7. What is the average student debt upon graduation, and how does it vary by region?
  8. How does the proportion of part-time students compare to full-time students across different institutions?
  9. What is the median instructional expenditure per full-time equivalent student?
  10. How does the percentage of first-generation college students compare across institutions with different levels of selectivity (admission rate categories)?

The final 10 questions

After careful consideration, I chose the following 10 questions to implement and analyze in R:

  1. What is the mean of the average combined SAT scores for all the colleges in the data?
  2. What is the variance of the cost for all the colleges in the data?
  3. What is the standard deviation of the enrollment number for all the colleges in the data?
  4. What is the distribution of average SAT scores among the colleges?
  5. What is the distribution of the median family income? Create a histogram for the data.
  6. What is the average and range of in-state tuition and fees?
  7. What is the correlation between the average SAT and the admission rate? Plot a scatter diagram for the data.
  8. What is the correlation between faculty salary and the completion rate of the students? Plot a scatter diagram for the data.
  9. Calculate the minimum value, first quartile, median, third quartile, and maximum value for the debt of students. Plot a box plot for the data.
  10. What is the mean for the percentage of students who are reported to be White, Black, Hispanic, Asian and Other? Represent the data in a pie chart.

Analysis

First, we need to import the dataset and store the dataframe in a variable. We can do that with the following R code:

college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")

Then we can start to dive deeper into each question.

Q1: What is the mean of the average combined SAT scores for all the colleges in the data?

mean(college$AvgSAT, na.rm = TRUE)
## [1] 1135.25

As we can see, the mean of the average combined SAT scores is 1135.25 for the given data. Missing values (NA) are ignored because na.rm is set to TRUE.
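To illustrate why na.rm matters, here is a minimal sketch on a small made-up vector (`sat_toy` is hypothetical and not taken from the dataset). It shows that a single missing value makes the plain mean NA, while na.rm = TRUE averages only the observed values.

```r
# Hypothetical toy scores (not from the dataset); one value is missing
sat_toy <- c(1000, 1100, NA, 1200)

m_all     <- mean(sat_toy)                # NA: the missing value propagates
m_obs     <- mean(sat_toy, na.rm = TRUE)  # 1100: mean of the three observed values
n_missing <- sum(is.na(sat_toy))          # 1: how many values were dropped

print(m_all)
print(m_obs)
print(n_missing)
```

Counting the missing values with sum(is.na(...)) is a cheap way to see how much data a summary statistic actually rests on.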

Q2: What is the variance of the cost for all the colleges in the data?

var(college$Cost, na.rm = TRUE)
## [1] 233433900

The variance for the cost of colleges in the data is 233,433,900.
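Variance is expressed in squared units (here, squared dollars), so it is often easier to interpret through its square root, the standard deviation. The sketch below shows the relationship on a small made-up vector (`cost_toy` is hypothetical and not from the dataset).

```r
# Hypothetical toy costs in dollars (not from the dataset)
cost_toy <- c(20000, 35000, 50000)

v <- var(cost_toy)  # 225000000, in squared dollars
s <- sd(cost_toy)   # 15000, in dollars

# sd() is the square root of var()
same <- isTRUE(all.equal(sqrt(v), s))
print(c(variance = v, sd = s))
print(same)
```

Applied to the result above, the square root of 233,433,900 is roughly 15,280, which puts the spread of college costs back on a dollar scale.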

Q3: What is the standard deviation of the enrollment number for all the colleges in the data?

sd(college$Enrollment, na.rm = TRUE)
## [1] 7473.072

The standard deviation for the enrollment number is 7,473.072.

Q4: What is the distribution of average SAT scores among the colleges?

hist(college$AvgSAT, main = "Distribution of average SAT", xlab = "Average SAT", col = "red")

As we can see, the majority of average SAT scores fall between 1000 and 1200, accounting for about two-thirds of the reported values (excluding NA).
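A share like this can be computed directly rather than read off the histogram. The following is a minimal sketch on made-up values (`avg_sat_toy` is hypothetical); the same expression with college$AvgSAT in its place would give the proportion for the real data.

```r
# Hypothetical toy scores (not from the dataset), including one missing value
avg_sat_toy <- c(950, 1020, 1080, 1150, 1190, 1310, NA)

# TRUE where the score lies in [1000, 1200]; NA stays NA
in_band <- avg_sat_toy >= 1000 & avg_sat_toy <= 1200

# The mean of a logical vector is the proportion of TRUE values;
# na.rm = TRUE excludes colleges with no reported score
share <- mean(in_band, na.rm = TRUE)
print(share)  # 4 of the 6 reported values
```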

Q5: What is the distribution of the median family income? Create a histogram for the data.

hist(college$MedIncome, breaks = 15, main = "Distribution of students median family income", xlab="Median family income (in $1000)", col = "blue")

We can clearly observe that the histogram is right-skewed. Most colleges report a median family income below $75,000, with values concentrated between about $30,000 and $40,000. This might suggest that many students at these colleges come from lower-middle-income families.
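One simple numeric check for right skew is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below uses made-up incomes in $1000 (`income_toy` is hypothetical, not from the dataset).

```r
# Hypothetical toy median incomes in $1000 (not from the dataset)
income_toy <- c(25, 30, 32, 35, 38, 40, 60, 110)

mean_inc <- mean(income_toy)    # 46.25, pulled up by the large values
med_inc  <- median(income_toy)  # 36.5

# mean > median is consistent with a right-skewed distribution
right_skew_hint <- mean_inc > med_inc
print(c(mean = mean_inc, median = med_inc))
print(right_skew_hint)
```

This is only a rule of thumb, so it complements the histogram rather than replacing it.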

Q6: What is the average and range of in-state tuition and fees?

mean(college$TuitionIn, na.rm = TRUE)
## [1] 21948.55

The average in-state tuition and fees is $21,948.55.

range(college$TuitionIn, na.rm = TRUE)
## [1]   480 88000

In-state tuition and fees range from a minimum of $480 to a maximum of $88,000. Based on my understanding of the data, I believe the maximum of $88,000 is an outlier and belongs to a private college.
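Whether a value counts as an outlier can be checked with the common 1.5 × IQR rule, which is also the default rule behind box plot whiskers. The sketch below applies it to made-up tuition values (`tuition_toy` is hypothetical); it is one possible check, not the only definition of an outlier.

```r
# Hypothetical toy tuition values in dollars (not from the dataset)
tuition_toy <- c(5000, 9000, 12000, 30000, 35000, 88000)

# First and third quartiles (R's default quantile type)
q <- quantile(tuition_toy, c(0.25, 0.75), na.rm = TRUE)

# Upper fence: third quartile plus 1.5 times the interquartile range
upper_fence <- unname(q[2]) + 1.5 * IQR(tuition_toy, na.rm = TRUE)

# Values above the fence are flagged as potential outliers
flagged <- tuition_toy[tuition_toy > upper_fence]
print(upper_fence)  # 69750
print(flagged)      # 88000
```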

Q7: What is the correlation between the average SAT and the admission rate? Plot a scatter diagram for the data.

cor(college$AvgSAT, college$AdmitRate, use = "complete.obs")
## [1] -0.4221255

The correlation between the average SAT and the admission rate is -0.4221.

plot(college$AvgSAT, college$AdmitRate, main ="Average SAT vs Admission Rate", xlab = "Average SAT", ylab = "Admission Rate")

We can observe a moderate negative relationship between the average SAT score and the admission rate of a college. A plausible explanation is that more selective schools attract more applicants, so they can require higher SAT scores and admit a smaller share of the applicant pool.
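The use = "complete.obs" argument keeps only the rows where both variables are present. The sketch below reproduces that behavior by hand on made-up vectors (`sat_toy` and `admit_toy` are hypothetical) to show how many pairs the correlation is actually based on.

```r
# Hypothetical toy data (not from the dataset); each vector has one NA
sat_toy   <- c(1000, 1100, 1200, NA, 1400)
admit_toy <- c(0.80, 0.70, NA, 0.50, 0.30)

# Rows where both values are observed
keep   <- complete.cases(sat_toy, admit_toy)
n_used <- sum(keep)  # 3 complete pairs

# Dropping incomplete rows by hand matches use = "complete.obs"
r_manual  <- cor(sat_toy[keep], admit_toy[keep])
r_builtin <- cor(sat_toy, admit_toy, use = "complete.obs")
print(n_used)
print(c(manual = r_manual, builtin = r_builtin))
```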

Q8: What is the correlation between faculty salary and the completion rate of the students? Plot a scatter diagram for the data.

cor(college$FacSalary, college$CompRate, use = "complete.obs")
## [1] 0.577221

The correlation between faculty monthly salary and the percentage of students who finish their program within 150% of normal time is 0.5772.

plot(college$FacSalary, college$CompRate, main ="Faculty Salary vs Students Completion Rate", xlab = "Faculty Salary", ylab = "Completion Rate")

This represents a moderately strong positive relationship between the two variables, computed from the complete cases only. One possible explanation is that well-paid faculty put more effort into their teaching and are more likely to promote student success, although a correlation alone cannot show this. The relationship could also be driven by a third variable: colleges in low-income areas tend to be cheaper and pay faculty less, while students in those areas also have less access to quality materials and advanced technologies for their education. This would also result in lower completion rates.

Q9: Calculate the minimum value, first quartile, median, third quartile, and maximum value for the debt of students. Plot a box plot for the data.

summary(college$Debt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0   325.0   713.5  2365.7  2203.2 48216.0     152

The minimum value of student debt is 10, the first quartile is 325.0, the median is 713.5, the third quartile is 2203.2, and the maximum value is 48216.0. The summary function also displays the mean and the number of missing values (NA) in the data.

boxplot(college$Debt, main = "Distribution of Students Debt", col="yellow", xlab="Students Debt", horizontal = TRUE, outline = FALSE)

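The distribution is right-skewed, which is consistent with the mean (2365.7) being far above the median (713.5). Note that outline = FALSE hides the outliers, so the plot does not show the maximum of 48216.0. I think the summary function is one of the most useful ones since it provides so much information compared to computing each descriptive statistic separately. The box plot also proves useful for displaying the data.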
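Since the box plot is drawn with outline = FALSE, points beyond the whiskers are not shown. The sketch below uses boxplot.stats() on made-up values (`debt_toy` is hypothetical, not from the dataset) to list the numbers behind a box plot, including the points that outline = FALSE would hide.

```r
# Hypothetical toy debt values (not from the dataset)
debt_toy <- c(10, 300, 700, 900, 2200, 48000)

bs <- boxplot.stats(debt_toy)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
five <- bs$stats
# Points beyond the whiskers; these are the ones outline = FALSE hides
hidden <- bs$out

print(five)    # 10 300 800 2200 2200
print(hidden)  # 48000
```

Here the upper whisker stops at 2200 because the next value, 48000, lies more than 1.5 times the hinge spread above the upper hinge.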

Q10: What is the mean for the percentage of students who are reported to be White, Black, Hispanic, Asian and Other? Represent the data in a pie chart.

mean(college$White, na.rm = TRUE)
## [1] 55.10905
mean(college$Black, na.rm = TRUE)
## [1] 13.92342
mean(college$Hispanic, na.rm = TRUE)
## [1] 13.10273
mean(college$Asian, na.rm = TRUE)
## [1] 4.422476
mean(college$Other, na.rm = TRUE)
## [1] 13.46579

On average, the student body of a college is 55.11% White, 13.92% Black, 13.10% Hispanic, 4.42% Asian, and 13.47% Other.

slices <- c(mean(college$White, na.rm = TRUE), mean(college$Black, na.rm = TRUE), mean(college$Hispanic, na.rm = TRUE), mean(college$Asian, na.rm = TRUE), mean(college$Other, na.rm = TRUE))
lbls <- c("White", "Black", "Hispanic", "Asian", "Other")
pie(slices, labels = lbls, main="Pie Chart of Student Body")

It is much easier to visualize the student body with a pie chart. On average, students who reported being White form the majority, making up just over half of a college's student population. The five means sum to approximately 100%.
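As a quick check of what the five reported means add up to, and as one possible way to show the percentages on the chart, the sketch below hard-codes the means printed above (the names `slices_pct`, `total`, and `lbls_pct` are illustrative, not part of the original code).

```r
# The five means reported above, in percent
slices_pct <- c(White = 55.10905, Black = 13.92342, Hispanic = 13.10273,
                Asian = 4.422476, Other = 13.46579)

# The means add up to about 100.02, not exactly 100
total <- sum(slices_pct)

# Labels that carry the rounded percentage, e.g. "White (55.1%)"
lbls_pct <- paste0(names(slices_pct), " (", round(slices_pct, 1), "%)")

print(total)
print(lbls_pct)
```

Passing `lbls_pct` as the labels argument of pie() would display the percentages next to each slice.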

Summary

In conclusion, this was quite an interesting project, as I discovered how useful R is for statistical analysis. It is powerful enough to handle a large data set and also provides visual results for the data. I love it since it is a great tool and has definitely shortened the time I need to compute descriptive statistics.

Appendix


# Q1 code
mean(college$AvgSAT, na.rm = TRUE)

# Q2 code
var(college$Cost, na.rm = TRUE)

# Q3 code
sd(college$Enrollment, na.rm = TRUE)

# Q4 code
hist(college$AvgSAT, main = "Distribution of average SAT", xlab = "Average SAT", col = "red")

# Q5 code
hist(college$MedIncome, breaks = 15, main = "Distribution of students median family income", xlab="Median family income (in $1000)", col = "blue")

# Q6 code
mean(college$TuitionIn, na.rm = TRUE)
range(college$TuitionIn, na.rm = TRUE)

# Q7 code
cor(college$AvgSAT, college$AdmitRate, use = "complete.obs")
plot(college$AvgSAT, college$AdmitRate, main ="Average SAT vs Admission Rate", xlab = "Average SAT", ylab = "Admission Rate")

# Q8 code
cor(college$FacSalary, college$CompRate, use = "complete.obs")
plot(college$FacSalary, college$CompRate, main ="Faculty Salary vs Students Completion Rate", xlab = "Faculty Salary", ylab = "Completion Rate")

# Q9 code
summary(college$Debt)
boxplot(college$Debt, main = "Distribution of Students Debt", col="yellow", xlab="Students Debt", horizontal = TRUE, outline = FALSE)

# Q10 code
mean(college$White, na.rm = TRUE)
mean(college$Black, na.rm = TRUE)
mean(college$Hispanic, na.rm = TRUE)
mean(college$Asian, na.rm = TRUE)
mean(college$Other, na.rm = TRUE)
slices <- c(mean(college$White, na.rm = TRUE), mean(college$Black, na.rm = TRUE), mean(college$Hispanic, na.rm = TRUE), mean(college$Asian, na.rm = TRUE), mean(college$Other, na.rm = TRUE))
lbls <- c("White", "Black", "Hispanic", "Asian", "Other")
pie(slices, labels = lbls, main="Pie Chart of Student Body")