The College data set is a built-in data set in R’s “ISLR” package that contains information on 777 US colleges and universities. The data set has 18 variables, including demographic information, college statistics, and financial data.
We remove seven columns that are less relevant to the variables we want to study and have no counterpart in the second data set: the percentage of new students from the top 10% and top 25% of their high-school class, estimated book costs, the percentages of faculty with PhDs and with terminal degrees, the percentage of alumni who donate, and instructional expenditure per student. We do not treat the names of the universities as data; each row is simply named after the corresponding university.
library(ISLR)   # provides the College data set
College <- College[-c(5, 6, 11, 13, 14, 16, 17)]   # drop Top10perc, Top25perc, Books, PhD, Terminal, perc.alumni, Expend
summary(College)
## Private Apps Accept Enroll F.Undergrad
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 139
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.: 992
## Median : 1558 Median : 1110 Median : 434 Median : 1707
## Mean : 3002 Mean : 2019 Mean : 780 Mean : 3700
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.: 4005
## Max. :48094 Max. :26330 Max. :6392 Max. :31643
## P.Undergrad Outstate Room.Board Personal
## Min. : 1.0 Min. : 2340 Min. :1780 Min. : 250
## 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 850
## Median : 353.0 Median : 9990 Median :4200 Median :1200
## Mean : 855.3 Mean :10441 Mean :4358 Mean :1341
## 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.:1700
## Max. :21836.0 Max. :21700 Max. :8124 Max. :6800
## S.F.Ratio Grad.Rate
## Min. : 2.50 Min. : 10.00
## 1st Qu.:11.50 1st Qu.: 53.00
## Median :13.60 Median : 65.00
## Mean :14.09 Mean : 65.46
## 3rd Qu.:16.50 3rd Qu.: 78.00
## Max. :39.80 Max. :118.00
From the scatterplots below we can see that colleges with higher out-of-state tuition tend to have relatively low enrollment, although the points are quite spread out. The plot of room and board costs against enrollment shows a similar pattern, and given how many points fall outside the clustered region, these two variables likely have only a small effect on enrollment. Enrollment and the number of applicants accepted, on the other hand, show a clear positive relationship: as the number of applicants a college accepts increases, the number of enrolled students also tends to increase. Private colleges also generally have higher room and board costs than public colleges.
par(mfrow = c(2, 2))   # four scatterplots on one device
plot(College$Outstate, College$Enroll, xlab = "Outstate", ylab = "Enroll")
plot(College$Room.Board, College$Enroll, xlab = "Room.Board", ylab = "Enroll")
plot(College$Accept, College$Enroll, xlab = "Accept", ylab = "Enroll")
plot(College$Apps, College$Accept, xlab = "Apps", ylab = "Accept")
Creating histograms for a few of the variables:
par(mfrow = c(2, 2))   # four histograms on one device
hist(College$Enroll, col = "purple")
hist(College$Accept, col = "purple")
hist(College$Outstate, col = "purple")
hist(College$Room.Board, col = "purple")
I performed k-means clustering on the data to divide it into four clusters and then used a linear regression model, with the cluster labels as an extra predictor, to calculate the RMSE.
K-means clustering is an unsupervised machine learning technique used to partition a dataset into k clusters based on the similarity between the observations. The algorithm iteratively assigns each observation to the cluster whose mean is closest, and updates the mean of each cluster based on the new assignments until convergence.
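Note that these columns are on very different scales (Apps runs into the tens of thousands while S.F.Ratio is around 14), so the k-means distance calculations are dominated by the largest-scale variables. The clustering below is run on the raw columns; a minimal sketch of the alternative, standardizing the columns first (not used for the results reported here), would be:
scaled_cols <- scale(College[, 2:11])   # center and scale each numeric column
km_scaled <- kmeans(scaled_cols, centers = 4, nstart = 25)   # same settings, standardized data
table(km_scaled$cluster)   # compare cluster sizes with the unscaled solution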
cols <- c(2:11)
k <- 4
set.seed(123)
km <- kmeans(College[,cols], centers=k, nstart=25)
College$cluster <- as.factor(km$cluster)
library(ggplot2)
ggplot(College, aes(Accept, Enroll, color=cluster)) +
geom_point() +
labs(title="College Clustering", x="Accept", y="Enroll")
ggplot(College, aes(Outstate, Enroll, color=cluster)) +
geom_point() +
labs(title="College Clustering", x="Outstate", y="Enroll")
College$cluster <- as.factor(km$cluster)
set.seed(123)
library(caTools)   # for sample.split
split <- sample.split(College$Apps, SplitRatio = 0.7)
train <- subset(College, split == TRUE)
test <- subset(College, split == FALSE)
model <- lm(Accept ~ . + cluster, data = train)
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$Accept - pred)^2))
rmse
## [1] 626.574
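To put this RMSE in context, a simple baseline sketch (predicting every test college's Accept with the training-set mean) can be compared against it:
baseline_rmse <- sqrt(mean((test$Accept - mean(train$Accept))^2))   # RMSE of a mean-only baseline
baseline_rmse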
summary(model)
##
## Call:
## lm(formula = Accept ~ . + cluster, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5006.9 -125.5 35.9 177.0 3462.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.010e+02 2.563e+02 -0.784 0.43337
## PrivateYes 1.029e+02 1.005e+02 1.024 0.30633
## Apps 4.106e-01 1.388e-02 29.582 < 2e-16 ***
## Enroll 1.201e+00 1.331e-01 9.020 < 2e-16 ***
## F.Undergrad -4.100e-02 2.806e-02 -1.461 0.14462
## P.Undergrad -3.235e-02 2.209e-02 -1.465 0.14364
## Outstate 5.325e-03 1.531e-02 0.348 0.72811
## Room.Board 2.269e-02 3.251e-02 0.698 0.48565
## Personal -7.465e-02 4.321e-02 -1.728 0.08461 .
## S.F.Ratio 2.774e+01 8.578e+00 3.234 0.00129 **
## Grad.Rate -4.318e+00 1.922e+00 -2.247 0.02508 *
## cluster2 -3.049e+00 1.010e+02 -0.030 0.97593
## cluster3 4.048e+02 2.694e+02 1.502 0.13360
## cluster4 1.245e+00 1.348e+02 0.009 0.99263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 618.4 on 529 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.938
## F-statistic: 632.2 on 13 and 529 DF, p-value: < 2.2e-16
For the models, we first use all the variables; but because the variables in the two data sets are quite different from each other, I decided to choose the three closest common variables (Enroll, Accept, and Outstate) and used those for the clustering in both data sets. Doing this increased the RMSE (from about 627 to about 697). The R-squared value is quite high at 0.9395, indicating that a large fraction of the variance in the response is explained by the predictors, so the model is a good fit for this data. In the cluster plots, the red cluster corresponds to colleges with high out-of-state tuition but low values of Accept and Enroll.
vars <- c("Enroll", "Accept", "Outstate")
set.seed(123)
km2 <- kmeans(College[, vars], centers=3)
College$cluster <- as.factor(km2$cluster)
library(plotly)
fig2 <- plot_ly(College, x = ~Enroll, y = ~Accept, z = ~Outstate,
color = ~cluster, colors = "Set1", marker = list(size = 3),
type = "scatter3d", mode = "markers")
fig2 <- fig2 %>% layout(scene = list(xaxis = list(title = vars[1]),
yaxis = list(title = vars[2]),
zaxis = list(title = vars[3])),
margin = list(l = 0, r = 0, b = 0, t = 0))
fig2
College$cluster <- as.factor(km2$cluster)
set.seed(123)
split <- sample.split(College$Apps, SplitRatio = 0.7)
train2 <- subset(College, split == TRUE)
test2 <- subset(College, split == FALSE)
model2 <- lm(Accept ~ . + cluster, data = train2)
pred <- predict(model, newdata = test2)   # note: this uses the first model; predict(model2, newdata = test2) would evaluate the refitted model
rmse2 <- sqrt(mean((test2$Accept - pred)^2))
rmse2
## [1] 697.23
summary(model2)
##
## Call:
## lm(formula = Accept ~ . + cluster, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5094.0 -136.5 32.4 198.1 3424.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.87315 342.01477 -1.099 0.272268
## PrivateYes 84.42544 105.21123 0.802 0.422660
## Apps 0.41913 0.01350 31.035 < 2e-16 ***
## Enroll 1.18673 0.13258 8.951 < 2e-16 ***
## F.Undergrad -0.02522 0.02545 -0.991 0.322162
## P.Undergrad -0.03111 0.02181 -1.426 0.154397
## Outstate 0.01595 0.01802 0.885 0.376419
## Room.Board 0.01263 0.03274 0.386 0.699882
## Personal -0.08452 0.04310 -1.961 0.050425 .
## S.F.Ratio 29.70417 8.57201 3.465 0.000573 ***
## Grad.Rate -4.91178 1.91370 -2.567 0.010542 *
## cluster2 163.13712 117.25339 1.391 0.164712
## cluster3 48.88521 189.79450 0.258 0.796839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 618 on 530 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9381
## F-statistic: 685.7 on 12 and 530 DF, p-value: < 2.2e-16
Simpson’s paradox occurs when an association between two variables reverses or disappears once the data are segmented into groups. In the College data set, one possible example is the relationship between the number of accepted applicants and the graduation rate. We create two line graphs showing this relationship for private and for public colleges: the first shows a negative relationship for private colleges, while the second shows a positive relationship for public colleges.
This demonstrates how the relationship between two variables can look different when we do not account for the influence of a third, confounding variable; in this case, whether the institution is private or public. We also plot graduation rate against accepted applicants for the full data set and see a positive relationship, which differs from the pattern seen in the separate plots for private and public colleges.
simp <- data.frame(Group = c("Private", "Public"),
Correlation = c(cor(College$Accept[College$Private == "Yes"], College$Grad.Rate[College$Private == "Yes"]),
cor(College$Accept[College$Private == "No"], College$Grad.Rate[College$Private == "No"])))
# Plot the correlations as a bar chart
ggplot(simp, aes(x = Group, y = Correlation, fill = Group)) +
geom_bar(stat = "identity", width = 0.5) +
scale_fill_manual(values = c("red", "blue")) +
labs(title = "Correlation between Accepted Applicants and Graduation Rate",
x = "College Type",
y = "Correlation") +
theme_minimal()
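Note that Accept here is the raw number of accepted applicants rather than an acceptance rate. A short sketch of the same per-group correlations computed with an actual rate (assuming acceptance rate = Accept / Apps) would be:
accept_ratio <- College$Accept / College$Apps   # hypothetical acceptance-rate measure
cor(accept_ratio[College$Private == "Yes"], College$Grad.Rate[College$Private == "Yes"])
cor(accept_ratio[College$Private == "No"], College$Grad.Rate[College$Private == "No"])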
library(dplyr)
college_new <- College %>% select(Accept, Grad.Rate, Private)
private_plot <- ggplot(college_new %>% filter(Private == "Yes"), aes(x = Accept, y = Grad.Rate)) +
geom_line(color = "blue", linewidth = 1) +
labs(title = "Private Colleges",
x = "Accepted Applicants",
y = "Graduation Rate")
public_plot <- ggplot(college_new %>% filter(Private == "No"), aes(x = Accept, y = Grad.Rate)) +
geom_line(color = "red", linewidth = 1) +
labs(title = "Public Colleges",
x = "Accepted Applicants",
y = "Graduation Rate")
private_plot
public_plot
all_plot <- ggplot(College, aes(x = Accept, y = Grad.Rate)) +
geom_line(color = "blue", linewidth = 1) +
labs(title = "All colleges",
x = "Accepted Applicants",
y = "Graduation Rate")
all_plot
We then try to fit a linear model with out-of-state tuition as the response and the number of accepted applicants as the predictor.
The R-squared value is very small at 0.0006633, indicating that the predictor explains only a tiny fraction of the variation in out-of-state tuition; the adjusted R-squared is even negative, which suggests that the model is not a good fit for the data.
out_state <- College$Outstate
accept_rate <- College$Accept   # number of accepted applicants (a count, not a rate)
linmodel <- lm(out_state ~ accept_rate, data = College)
summary(linmodel)
##
## Call:
## lm(formula = out_state ~ accept_rate, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7957.7 -3080.4 -512.2 2425.3 11187.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.053e+04 1.871e+02 56.264 <2e-16 ***
## accept_rate -4.227e-02 5.894e-02 -0.717 0.473
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4024 on 775 degrees of freedom
## Multiple R-squared: 0.0006633, Adjusted R-squared: -0.0006262
## F-statistic: 0.5144 on 1 and 775 DF, p-value: 0.4735
We also fit a random forest model on the data. From the importance output we can see how much each variable contributes to the prediction of Accept.
train_index5 <- sample(nrow(College), nrow(College)*0.7)
train_data5 <- College[train_index5,]
test_data5 <- College[-train_index5 ,]
library(randomForest)
rf_model <- randomForest(Accept ~ Outstate + Room.Board + Apps, data = train_data5, importance = TRUE, ntree = 500)
predictions <- predict(rf_model, test_data5)
# feature importance
importance(rf_model)
## %IncMSE IncNodePurity
## Outstate 11.089176 416839110
## Room.Board 8.379324 345484081
## Apps 77.811787 2311814688
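The predictions on the held-out set computed above are never evaluated; a minimal sketch of the corresponding test-set RMSE, using the same objects, would be:
rf_rmse <- sqrt(mean((test_data5$Accept - predictions)^2))   # test-set RMSE of the random forest
rf_rmse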
University Statistics Dataset
mydata <- jsonlite::fromJSON('https://storage.googleapis.com/kagglesdsdata/datasets/10525/14746/schoolInfo.json?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20230324%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230324T224132Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=4004c2353a583e399b538573846d2aef08434f988f3e436a68330ef6da9c617bafa44dbbcf0af0ccaadc12eb552f56e2a5798450e3ff466865c37fd19ddca8c411d634ab778a4834c37048d63225ac18c0387c416e9b902844647089bfd7f827d1264c4a87137cfcd4c2cabec06403339802726fde7d703238a00d05234a33a4dc06279f3f19533f248bc9d2a787b7054623a8a0b1eb7ee0a68eb44f43df514866c368c27fc1b4df04eee7b2b300e76e3112a6bfc331afbb71f6bd1012b47d97c85694b3a30d4214cef76234d08ba8ec8261ef0311942c18c7d509d650073599bfd9a7282ac6a49fa311330c0db833dea197567c0dd12b86d1c2a96c3509342e')
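The signed storage URL above expires after a few days, so the call will eventually stop working. A sketch assuming the JSON file has been downloaded from Kaggle into the working directory (hypothetical file name schoolInfo.json) would be:
mydata <- jsonlite::fromJSON("schoolInfo.json")   # hypothetical local copy of the U.S. News file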
This data was collected from the U.S. News rankings and contains records for roughly 300 universities. The raw data set has 39 variables; we drop columns such as URLs, photos, city and zip codes, and duplicated ranking fields. After this dimensionality reduction we are left with 11 variables, mostly numerical measures such as enrollment numbers, college rankings, tuition, and room and board costs, along with the names of the colleges.
df <- as.data.frame(mydata)
write.csv(df, "mydata.csv", row.names = FALSE)   # keep a local copy of the raw data
mydata <- mydata[-c(1,2,3,4,5,6,9,11,12,17,19,20,21,22,23,25,26,27,29,30,31,32,33,34,36,37,38,39)]   # drop URLs, photos, location fields, and duplicate ranking columns
I performed k-means clustering on this data to divide it into 3 clusters and then tried to use a linear regression model to calculate the RMSE. Initially I was unable to fit the regression on this data set because of several issues that popped up. After further work, I removed the names of the universities along with the xwalkid and displayranking variables, as they were causing problems and are not related to the variables we use to fit the model.
I was then able to fit the model, but I get a warning stating “prediction from a rank-deficient fit may be misleading” and the RMSE comes out as NaN. The warning means that some of the predictors are exactly collinear (linearly dependent), so lm cannot estimate every coefficient independently and marks the redundant ones as NA. Two details of the code below are also worth noting: because the response is written as train3$tuition rather than tuition, the . on the right-hand side still expands to include the tuition column itself, and the RMSE is NaN because test3 has no Accept column (that variable belongs to the College data set; the response here is tuition). A corrected sketch follows the code below. To address collinearity more generally, we can also use modeling techniques that are less sensitive to it, such as random forests.
modeldata <- mydata[-c(3,5,9)]
cols <- c("enrollment", "acceptance-rate", "tuition")
# Remove incomplete cases from the data
modeldata <- modeldata[complete.cases(modeldata[, cols]), ]
# Run k-means clustering on the selected columns
k <- 3
km <- kmeans(modeldata[,cols], centers=k, nstart=25)
# Create a scatter 3D plot of the clustered data
plot_ly(modeldata, x = ~modeldata$enrollment, y = ~modeldata$`acceptance-rate`, z = ~modeldata$tuition, color = km$cluster, type = "scatter3d", mode = "markers")
modeldata$cluster <- as.factor(km$cluster)
set.seed(123)
split <- sample.split(modeldata$enrollment, SplitRatio = 0.7)
train3 <- subset(modeldata, split == TRUE)
test3 <- subset(modeldata, split == FALSE)
# Fit the linear regression model
model3 <- lm(train3$tuition ~ . + cluster, data = train3)   # note: with the response written as train3$tuition, the . still includes the tuition column
# Make predictions on the test data
pred <- predict(model3, newdata = test3)
## Warning in predict.lm(model3, newdata = test3): prediction from a rank-deficient
## fit may be misleading
# Calculate the RMSE
rmse3 <- sqrt(mean((test3$Accept - pred)^2))   # test3 has no Accept column, so this evaluates to NaN
rmse3
## [1] NaN
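A hedged sketch of the intended regression, assuming tuition is the response and the remaining columns of modeldata are usable predictors: writing the formula as tuition ~ . keeps tuition off the right-hand side, and the RMSE is computed against test3$tuition. If any of the remaining columns are exactly collinear, the rank-deficiency warning can still appear, and dropping one of the offending columns resolves it.
model3b <- lm(tuition ~ ., data = train3)   # cluster is already a column of train3, so it is included by the .
pred3b <- predict(model3b, newdata = test3)
sqrt(mean((test3$tuition - pred3b)^2, na.rm = TRUE))   # test-set RMSE for tuition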
For enrollment plotted against tuition and against overall rank, we do not see any significant relationship. Tuition and overall rank, however, show a negative relationship. When all three are plotted together, most of the points cluster towards the center of the graph.
par(mfrow = c(2, 2))   # four scatterplots on one device
plot(mydata$tuition, mydata$enrollment)
plot(mydata$enrollment,mydata$overallRank)
plot(mydata$overallRank, mydata$`tuition`)
plot(mydata$`acceptance-rate`, mydata$rankingSortRank)
plot_ly(mydata, x = ~mydata$enrollment, y = ~mydata$overallRank, z = ~mydata$tuition, color = "Set1", type = "scatter3d", mode = "markers")
## Warning: Ignoring 11 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
We then try to fit a linear model on this data with tuition as the response and enrollment as the predictor. The R-squared value is 0.0634, indicating that enrollment explains only about 6.3% of the variation in tuition; the adjusted R-squared is only slightly smaller. Although the enrollment coefficient is statistically significant, the residual standard error is relatively high (about 11,410), so the model does not fit the data well.
tut <- mydata$tuition
enrl_rate <- mydata$enrollment
linmodel2 <- lm(tut ~ enrl_rate, data = College)   # tut and enrl_rate are stand-alone vectors taken from mydata, so the data argument is not actually used here
summary(linmodel2)
##
## Call:
## lm(formula = tut ~ enrl_rate, data = College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25428 -8500 -1916 9148 23653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.583e+04 1.133e+03 31.613 < 2e-16 ***
## enrl_rate -2.781e-01 6.192e-02 -4.492 1.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11410 on 298 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.06341, Adjusted R-squared: 0.06026
## F-statistic: 20.17 on 1 and 298 DF, p-value: 1.012e-05
Trying to fit a random forest model:
train_index6 <- sample(nrow(modeldata), nrow(modeldata)*0.7)
train_data6 <- modeldata[train_index6,]
test_data6 <- modeldata[-train_index6,]
rf_model <- randomForest(modeldata$`acceptance-rate` ~ modeldata$cost-after-aid + modeldata$rankingSortRank + modeldata$tuition, data = train_data6, importance = TRUE, ntree = 500)
predictions <- predict(rf_model, test_data6)
# feature importance
importance(rf_model)
When running this code, I tried multiple different combinations of variables but mostly got a “variable lengths differ” error; after resolving that, another error appeared that I was unable to fix. A possible corrected version is sketched below.
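Two likely sources of the errors above: mixing modeldata$ column references in the formula with data = train_data6 (once any bare column name from the training subset enters the formula, the full-data vectors have a different length, hence “variable lengths differ”), and the fact that a hyphenated name such as cost-after-aid must be wrapped in backticks or R parses it as a subtraction. A sketch of the intended call, using bare backticked column names and dropping incomplete rows first (randomForest does not accept missing values by default), would be:
rf_cols <- c("acceptance-rate", "cost-after-aid", "rankingSortRank", "tuition")
rf_data <- modeldata[complete.cases(modeldata[, rf_cols]), ]   # drop rows with missing values in these columns
idx <- sample(nrow(rf_data), nrow(rf_data) * 0.7)
rf_train <- rf_data[idx, ]
rf_test <- rf_data[-idx, ]
rf_model2 <- randomForest(`acceptance-rate` ~ `cost-after-aid` + rankingSortRank + tuition,
                          data = rf_train, importance = TRUE, ntree = 500)
importance(rf_model2)
sqrt(mean((rf_test$`acceptance-rate` - predict(rf_model2, rf_test))^2))   # test-set RMSE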
Conclusions and Impact:
The university statistics data set itself caused some problems: several variables had incomplete data, and it took some time to figure out which variables were causing issues and to fill them in with a stand-in value (0) so that the data for those variables was usable.
This was mainly a research project to see how accessible and usable outside data sets are compared to the data sets in the ISLR textbook, so the project has no particular real-world impact. Overall, the ISLR data set was much easier to use and maneuver because the data for each variable was complete.