This is the first project for STAT 353, Statistical Methods for Engineering. It uses the “CollegeScores4yr” dataset from the Lock5 Datasets website. The CSV file is available at: https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv
I propose the following 10 questions based on my understanding of the data:
According to the assignment requirement, ChatGPT also proposes the following 10 questions from the data:
After careful consideration, I chose to answer the following 10 questions and analyze them in R:
First, we need to import the dataset and store the dataframe in a variable. We can do that with the following R code:
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
Then we can start to dive deeper into each question.
mean(college$AvgSAT, na.rm = TRUE)
## [1] 1135.25
As we can see, the average combined SAT score in the data is 1135.25. Setting na.rm = TRUE drops missing values (NA) before the mean is computed; without it, any missing value would make the result NA.
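To make the effect of na.rm concrete, here is a minimal sketch on a toy vector. The values in `sat` are made up for illustration and are not taken from the dataset.

```r
# Toy stand-in for college$AvgSAT (hypothetical values, not from the dataset)
sat <- c(1100, NA, 1250, 980, NA)

sat_mean_default <- mean(sat)                # NA: one missing value is enough
sat_mean         <- mean(sat, na.rm = TRUE)  # 1110: mean of the 3 reported values
n_missing        <- sum(is.na(sat))          # 2: number of values that were dropped

sat_mean_default
sat_mean
n_missing
```

Checking `sum(is.na(...))` alongside the mean is a cheap way to see how many observations the statistic is actually based on.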
var(college$Cost, na.rm = TRUE)
## [1] 233433900
The variance of college cost in the data is 233,433,900. Variance is expressed in squared units of the variable, so this number is hard to interpret directly.
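Since variance is in squared units, it can help to take its square root and get back to the original units. The sketch below uses the variance printed above; the toy vector `x` is hypothetical and only checks the identity sd = sqrt(var).

```r
# Variance of Cost as printed above (rounded by R's default output)
cost_var <- 233433900

# The standard deviation is the square root of the variance,
# which brings the spread back to the units of Cost
cost_sd <- sqrt(cost_var)
round(cost_sd)  # 15279

# Identity check on a small made-up vector (hypothetical values)
x <- c(12000, 25000, 40000, 58000)
same <- isTRUE(all.equal(sd(x), sqrt(var(x))))
same  # TRUE
```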
sd(college$Enrollment, na.rm = TRUE)
## [1] 7473.072
The standard deviation for the enrollment number is 7,473.072.
hist(college$AvgSAT, main = "Distribution of average SAT", xlab = "Average SAT", col = "red")
As we can see, most average SAT scores fall between 1000 and 1200; this range accounts for about 2/3 of the reported values (excluding NA).
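A share like the one read off the histogram can also be computed directly. This is a minimal sketch on a made-up vector; applied to the real data, `sat` would be replaced with college$AvgSAT, and the result would differ from the toy value here.

```r
# Hypothetical stand-in for college$AvgSAT
sat <- c(950, 1020, 1080, 1150, 1190, 1240, 1310, NA)

# A logical vector averages to the proportion of TRUE values;
# na.rm = TRUE leaves the missing score out of the denominator
share_in_range <- mean(sat >= 1000 & sat <= 1200, na.rm = TRUE)
share_in_range  # 4 of the 7 reported values
```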
hist(college$MedIncome, breaks = 15, main = "Distribution of students median family income", xlab="Median family income (in $1000)", col = "blue")
We can clearly observe that the histogram is right-skewed. Most colleges report a median family income below $75,000, with values concentrated between $30,000 and $40,000. This suggests that many students at these colleges come from lower-middle-income families.
mean(college$TuitionIn, na.rm = TRUE)
## [1] 21948.55
The average in-state tuition and fees is $21,948.55.
range(college$TuitionIn, na.rm = TRUE)
## [1] 480 88000
In-state tuition ranges from a minimum of $480 to a maximum of $88,000. Based on my understanding of the data, I believe the maximum of $88,000 is an outlier that belongs to a private college.
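One common way to check whether an extreme value counts as an outlier is the 1.5 × IQR rule. The sketch below applies it to a made-up vector; only the endpoints 480 and 88000 come from the output above, and the other values are hypothetical.

```r
# Hypothetical tuition values; only the endpoints 480 and 88000 are from the report
tuition <- c(480, 9500, 15000, 21000, 28000, 35000, 88000)

# Quartiles and fences for the 1.5 * IQR rule
q <- unname(quantile(tuition, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
flagged <- tuition[tuition < lower_fence | tuition > upper_fence]
flagged  # 88000
```

On the real column, the same lines with `tuition <- college$TuitionIn` would show whether 88,000 is actually flagged; that result is not reproduced here.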
cor(college$AvgSAT, college$AdmitRate, use = "complete.obs")
## [1] -0.4221255
The correlation between the average SAT and the admission rate is -0.4221.
plot(college$AvgSAT, college$AdmitRate, main ="Average SAT vs Admission Rate", xlab = "Average SAT", ylab = "Admission Rate")
There is a moderate negative relationship (r ≈ -0.42) between the average SAT score and the admission rate of a college. One plausible explanation is that more selective schools attract more applicants, so they can require higher SAT scores while admitting a smaller share of the applicant pool. The correlation alone does not prove this.
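The argument use = "complete.obs" in the cor() call above tells R to keep only the rows where both variables are present. The following sketch, on made-up vectors, shows that this matches filtering with complete.cases() by hand.

```r
# Hypothetical SAT / admission-rate pairs with missing values
sat   <- c(1000, 1100, NA,   1300, 1400)
admit <- c(0.80, 0.70, 0.60, NA,   0.30)

r_default <- cor(sat, admit)                        # NA: missing values propagate
r_complete <- cor(sat, admit, use = "complete.obs") # uses complete pairs only

# Manual equivalent: keep rows where neither value is missing
keep <- complete.cases(sat, admit)
r_manual <- cor(sat[keep], admit[keep])

sum(keep)                                  # 3 pairs were actually used
isTRUE(all.equal(r_complete, r_manual))    # TRUE
```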
cor(college$FacSalary, college$CompRate, use = "complete.obs")
## [1] 0.577221
The correlation between faculty monthly salary and the percentage of students who finish their program within 150% of normal time is 0.5772.
plot(college$FacSalary, college$CompRate, main ="Faculty Salary vs Students Completion Rate", xlab = "Faculty Salary", ylab = "Completion Rate")
This represents a moderately strong positive relationship between the two variables, computed on complete cases only. One possible reading is that well-paid faculty put more effort into their teaching and are more likely to promote student success, but a correlation cannot establish that. The relationship could also be driven by a third variable: colleges in low-income areas tend to be cheaper and pay faculty less, and students in those areas may have less access to quality materials and advanced technology, which would also lower the completion rate.
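A rough way to judge the strength of the two correlations reported above is to square them: in a simple linear regression, r² is the share of variance in one variable accounted for by the other. The sketch below only reuses the two printed coefficients; the variable names are my own.

```r
# Correlation coefficients as printed above
r_sat_admit         <- -0.4221255
r_salary_completion <- 0.577221

# Squared correlations: share of variance accounted for by a linear fit
r_squared <- c(sat_admit = r_sat_admit^2,
               salary_completion = r_salary_completion^2)
round(r_squared, 3)  # 0.178 and 0.333
```

So faculty salary accounts for about a third of the variation in completion rate, which leaves plenty of room for other factors.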
summary(college$Debt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    10.0   325.0   713.5  2365.7  2203.2 48216.0     152
The minimum value of student debt is 10, the first quartile is 325.0, the median is 713.5, the third quartile is 2203.2, and the maximum value is 48216.0. The summary function also displays the mean (2365.7) and the number of missing values (152 NA's).
boxplot(college$Debt, main = "Distribution of Students Debt", col="yellow", xlab="Students Debt", horizontal = TRUE, outline = FALSE)
The box plot shows a right-skewed distribution; note that outline = FALSE hides the points beyond the whiskers. I think the summary function is one of the most useful ones, since it provides much more information in one call than computing each descriptive statistic separately. The box plot is also useful for displaying the data.
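Because the box plot above is drawn with outline = FALSE, the extreme points are not shown. A minimal sketch of how to list them with boxplot.stats() follows; the `debt` vector is made up and merely mimics the skewed shape of the real column.

```r
# Hypothetical stand-in for college$Debt (right-skewed, with one NA)
debt <- c(10, 300, 700, 900, 2200, 48000, NA)

# boxplot.stats() returns the numbers behind a box plot; NA values are dropped
bs <- boxplot.stats(debt)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond the whiskers, i.e. what outline = FALSE hides
```

Here `bs$out` contains only 48000, the single value far above the upper hinge.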
mean(college$White, na.rm = TRUE)
## [1] 55.10905
mean(college$Black, na.rm = TRUE)
## [1] 13.92342
mean(college$Hispanic, na.rm = TRUE)
## [1] 13.10273
mean(college$Asian, na.rm = TRUE)
## [1] 4.422476
mean(college$Other, na.rm = TRUE)
## [1] 13.46579
On average across colleges, the student body is 55.11% White, 13.92% Black, 13.10% Hispanic, 4.42% Asian, and 13.47% Other.
slices <- c(mean(college$White, na.rm = TRUE), mean(college$Black, na.rm = TRUE), mean(college$Hispanic, na.rm = TRUE), mean(college$Asian, na.rm = TRUE), mean(college$Other, na.rm = TRUE))
lbls <- c("White", "Black", "Hispanic", "Asian", "Other")
pie(slices, labels = lbls, main="Pie Chart of Student Body")
A pie chart makes the student body much easier to visualize. Students who reported being White form the majority, making up over half of the student population on average. The five slices are average percentages that sum to about 100; pie() rescales them so that they fill the whole circle.
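The five separate mean() calls can be collapsed into one, and the percentages can be written into the slice labels. This is a sketch on a made-up data frame `toy` with the same column names as the dataset; the numbers in it are hypothetical.

```r
# Hypothetical data frame with the same column names as the dataset
groups <- c("White", "Black", "Hispanic", "Asian", "Other")
toy <- data.frame(
  White    = c(60, 50, NA),
  Black    = c(10, 20, 15),
  Hispanic = c(15, 10, 20),
  Asian    = c(5, 5, 3),
  Other    = c(10, 15, 12)
)

# One call computes every column mean, skipping missing values
slices <- colMeans(toy[groups], na.rm = TRUE)

# Rescale to shares that sum to 1, then build labels with percentages
share <- slices / sum(slices)
lbls <- paste0(groups, " (", round(100 * share, 1), "%)")
lbls
```

With the real data, `toy` would be replaced by `college`, and `lbls` could be passed to pie() in place of the plain group names.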
In conclusion, this was an interesting project, as I discovered how useful R is for statistical analysis. It is powerful enough to handle a large data set and to produce visual results for it. It is a great tool and has definitely shortened the time I need to compute descriptive statistics.
# Q1 code
mean(college$AvgSAT, na.rm = TRUE)
# Q2 code
var(college$Cost, na.rm = TRUE)
# Q3 code
sd(college$Enrollment, na.rm = TRUE)
# Q4 code
hist(college$AvgSAT, main = "Distribution of average SAT", xlab = "Average SAT", col = "red")
# Q5 code
hist(college$MedIncome, breaks = 15, main = "Distribution of students median family income", xlab="Median family income (in $1000)", col = "blue")
# Q6 code
mean(college$TuitionIn, na.rm = TRUE)
range(college$TuitionIn, na.rm = TRUE)
# Q7 code
cor(college$AvgSAT, college$AdmitRate, use = "complete.obs")
plot(college$AvgSAT, college$AdmitRate, main ="Average SAT vs Admission Rate", xlab = "Average SAT", ylab = "Admission Rate")
# Q8 code
cor(college$FacSalary, college$CompRate, use = "complete.obs")
plot(college$FacSalary, college$CompRate, main ="Faculty Salary vs Students Completion Rate", xlab = "Faculty Salary", ylab = "Completion Rate")
# Q9 code
summary(college$Debt)
boxplot(college$Debt, main = "Distribution of Students Debt", col="yellow", xlab="Students Debt", horizontal = TRUE, outline = FALSE)
# Q10 code
mean(college$White, na.rm = TRUE)
mean(college$Black, na.rm = TRUE)
mean(college$Hispanic, na.rm = TRUE)
mean(college$Asian, na.rm = TRUE)
mean(college$Other, na.rm = TRUE)
slices <- c(mean(college$White, na.rm = TRUE), mean(college$Black, na.rm = TRUE), mean(college$Hispanic, na.rm = TRUE), mean(college$Asian, na.rm = TRUE), mean(college$Other, na.rm = TRUE))
lbls <- c("White", "Black", "Hispanic", "Asian", "Other")
pie(slices, labels = lbls, main="Pie Chart of Student Body")