The College data set, built into R's "ISLR" package, contains information on 777 US colleges and universities. It has 18 variables covering demographic information, college statistics, and financial data.

We remove 7 columns that are not relevant to the variables we want to look at and have no counterpart in the other data set: the percentage of new students from the top 10% and top 25% of their high-school class, estimated book costs, the percentage of faculty with PhDs and with terminal degrees, the percentage of alumni who donate, and instructional expenditure per student. We do not treat the names of the universities as data; instead, each row is named after the corresponding university.

library(ISLR)   # provides the College data set

# Drop Top25perc, Top10perc, Books, PhD, Terminal, perc.alumni, Expend
College <- College[-c(6,5,11,13,14,16,17)]
summary(College)
##  Private        Apps           Accept          Enroll      F.Undergrad   
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   :  139  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:  992  
##            Median : 1558   Median : 1110   Median : 434   Median : 1707  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   : 3700  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.: 4005  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :31643  
##   P.Undergrad         Outstate       Room.Board      Personal   
##  Min.   :    1.0   Min.   : 2340   Min.   :1780   Min.   : 250  
##  1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597   1st Qu.: 850  
##  Median :  353.0   Median : 9990   Median :4200   Median :1200  
##  Mean   :  855.3   Mean   :10441   Mean   :4358   Mean   :1341  
##  3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050   3rd Qu.:1700  
##  Max.   :21836.0   Max.   :21700   Max.   :8124   Max.   :6800  
##    S.F.Ratio       Grad.Rate     
##  Min.   : 2.50   Min.   : 10.00  
##  1st Qu.:11.50   1st Qu.: 53.00  
##  Median :13.60   Median : 65.00  
##  Mean   :14.09   Mean   : 65.46  
##  3rd Qu.:16.50   3rd Qu.: 78.00  
##  Max.   :39.80   Max.   :118.00

The scatterplots below suggest that universities with higher out-of-state tuition tend to have lower enrollment, although the points are widely scattered. The plot of room and board costs against enrollment shows a similar pattern, and given how many points fall outside the clustered regions, these two variables likely have only a modest effect on enrollment. In contrast, there is a clear positive relationship between the number of applications accepted and enrollment: as a college accepts more applicants, enrollment tends to rise. Private colleges also tend to have higher room and board costs than public ones.
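These associations can also be checked numerically (a small sketch; the values are not reproduced in this report):

# Correlation between acceptances and enrollment
cor(College$Accept, College$Enroll)
# Private is a factor, so compare mean room and board by college type instead
tapply(College$Room.Board, College$Private, mean)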

par(mfrow=c(3,2))
plot(College$Outstate, College$Enroll)
plot(College$Room.Board, College$Enroll)
plot(College$Accept, College$Enroll)
plot(College$Apps, College$Accept)

Creating histograms for a few variables:

par(mfrow=c(3,2))
hist(College$Enroll, col = "purple")
hist(College$Accept, col = "purple")
hist(College$Outstate, col = "purple")
hist(College$Room.Board, col = "purple")

I have performed k-means clustering on the data to divide it into four clusters and then used a linear regression model to calculate the RMSE.
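For reference, the root mean squared error of predictions ŷᵢ against observed values yᵢ is RMSE = sqrt( (1/n) · Σ (yᵢ − ŷᵢ)² ), which is exactly what sqrt(mean((test$Accept - pred)^2)) computes on the test set below.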

K-means clustering is an unsupervised machine learning technique used to partition a dataset into k clusters based on the similarity between the observations. The algorithm iteratively assigns each observation to the cluster whose mean is closest, and updates the mean of each cluster based on the new assignments until convergence.
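Because k-means relies on Euclidean distance, columns on large scales (Apps ranges into the tens of thousands, while S.F.Ratio stays under 40) dominate the clustering. The chunk below follows the original analysis and clusters the raw columns; a common variant (not used here) standardizes them first:

scaled <- scale(College[, 2:11])                 # center and scale the ten numeric columns
km_scaled <- kmeans(scaled, centers = 4, nstart = 25)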

library(ggplot2)   # plotting
library(caTools)   # sample.split(), used for the train/test split below

cols <- c(2:11)    # the ten numeric columns (everything except Private)
k <- 4
set.seed(123)
km <- kmeans(College[,cols], centers=k, nstart=25)
College$cluster <- as.factor(km$cluster)

ggplot(College, aes(Accept, Enroll, color=cluster)) +
  geom_point() +
  labs(title="College Clustering", x="Accept", y="Enroll")

ggplot(College, aes(Outstate, Enroll, color=cluster)) +
  geom_point() +
  labs(title="College Clustering", x="Outstate", y="Enroll")


set.seed(123)
split <- sample.split(College$Apps, SplitRatio = 0.7)
train <- subset(College, split == TRUE)
test <- subset(College, split == FALSE)


model <- lm(Accept ~ . + cluster, data = train)   # "+ cluster" is redundant: "." already includes the cluster column
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$Accept - pred)^2))
rmse
## [1] 626.574
summary(model)
## 
## Call:
## lm(formula = Accept ~ . + cluster, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5006.9  -125.5    35.9   177.0  3462.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.010e+02  2.563e+02  -0.784  0.43337    
## PrivateYes   1.029e+02  1.005e+02   1.024  0.30633    
## Apps         4.106e-01  1.388e-02  29.582  < 2e-16 ***
## Enroll       1.201e+00  1.331e-01   9.020  < 2e-16 ***
## F.Undergrad -4.100e-02  2.806e-02  -1.461  0.14462    
## P.Undergrad -3.235e-02  2.209e-02  -1.465  0.14364    
## Outstate     5.325e-03  1.531e-02   0.348  0.72811    
## Room.Board   2.269e-02  3.251e-02   0.698  0.48565    
## Personal    -7.465e-02  4.321e-02  -1.728  0.08461 .  
## S.F.Ratio    2.774e+01  8.578e+00   3.234  0.00129 ** 
## Grad.Rate   -4.318e+00  1.922e+00  -2.247  0.02508 *  
## cluster2    -3.049e+00  1.010e+02  -0.030  0.97593    
## cluster3     4.048e+02  2.694e+02   1.502  0.13360    
## cluster4     1.245e+00  1.348e+02   0.009  0.99263    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 618.4 on 529 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.938 
## F-statistic: 632.2 on 13 and 529 DF,  p-value: < 2.2e-16

For the models, we first use all of the variables, but because the variables in the two datasets differ considerably, I then chose the three closest common variables and used those to fit the models for both datasets; doing this increased the RMSE. The R-squared value is quite high at 0.9395, indicating that the model explains a large fraction of the variance in the response, so it appears to fit the data well. In the cluster plots, the red cluster corresponds to colleges with high out-of-state tuition but low acceptance and enrollment numbers.

vars <- c("Enroll", "Accept", "Outstate")
set.seed(123)
km2 <- kmeans(College[, vars], centers=3)
College$cluster <- as.factor(km2$cluster)
fig2 <- plot_ly(College, x = ~Enroll, y = ~Accept, z = ~Outstate, 
               color = ~cluster, colors = "Set1", marker = list(size = 3),
               type = "scatter3d", mode = "markers")
fig2 <- fig2 %>% layout(scene = list(xaxis = list(title = vars[1]),
                                   yaxis = list(title = vars[2]),
                                   zaxis = list(title = vars[3])),
                      margin = list(l = 0, r = 0, b = 0, t = 0))
fig2
set.seed(123)
split <- sample.split(College$Apps, SplitRatio = 0.7)
train2 <- subset(College, split == TRUE)
test2 <- subset(College, split == FALSE)

model2 <- lm(Accept ~ . + cluster, data = train2)
pred <- predict(model2, newdata = test2)
rmse2 <- sqrt(mean((test2$Accept - pred)^2))
rmse2
## [1] 697.23
summary(model2)
## 
## Call:
## lm(formula = Accept ~ . + cluster, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5094.0  -136.5    32.4   198.1  3424.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.87315  342.01477  -1.099 0.272268    
## PrivateYes    84.42544  105.21123   0.802 0.422660    
## Apps           0.41913    0.01350  31.035  < 2e-16 ***
## Enroll         1.18673    0.13258   8.951  < 2e-16 ***
## F.Undergrad   -0.02522    0.02545  -0.991 0.322162    
## P.Undergrad   -0.03111    0.02181  -1.426 0.154397    
## Outstate       0.01595    0.01802   0.885 0.376419    
## Room.Board     0.01263    0.03274   0.386 0.699882    
## Personal      -0.08452    0.04310  -1.961 0.050425 .  
## S.F.Ratio     29.70417    8.57201   3.465 0.000573 ***
## Grad.Rate     -4.91178    1.91370  -2.567 0.010542 *  
## cluster2     163.13712  117.25339   1.391 0.164712    
## cluster3      48.88521  189.79450   0.258 0.796839    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 618 on 530 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9381 
## F-statistic: 685.7 on 12 and 530 DF,  p-value: < 2.2e-16

Simpson’s paradox occurs when an association between two variables in a dataset reverses or disappears once the data is segmented into groups. In the College dataset, one possible example is the relationship between acceptances (the Accept variable, labelled "Acceptance Rate" in the plots below) and the graduation rate. We create two line graphs showing this relationship for private and public colleges separately: the first shows a negative relationship for private colleges, while the second shows a positive relationship for public colleges.

This demonstrates how the relationship between two variables can look different when we do not account for the influence of a third variable; here the confounding variable is whether the institution is private or public. When we plot graduation rate against acceptances for all of the data, the two variables show a positive relationship, so the plot for the aggregated data differs from the separate plots for public and private colleges.
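The phenomenon is easy to reproduce on synthetic data (a toy illustration, not drawn from either dataset): within each group x and y are negatively related, but because one group sits higher on both axes, the pooled correlation is positive.

set.seed(1)
g1 <- data.frame(x = 1:50,   y = 100 - (1:50)   + rnorm(50, sd = 5), grp = "A")
g2 <- data.frame(x = 51:100, y = 200 - (51:100) + rnorm(50, sd = 5), grp = "B")
toy <- rbind(g1, g2)
cor(toy$x, toy$y)                             # pooled correlation: positive
by(toy, toy$grp, function(d) cor(d$x, d$y))   # within-group correlations: negative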

simp <- data.frame(Group = c("Private", "Public"),
                     Correlation = c(cor(College$Accept[College$Private == "Yes"], College$Grad.Rate[College$Private == "Yes"]),
                                     cor(College$Accept[College$Private == "No"], College$Grad.Rate[College$Private == "No"])))

# Plot the correlations as a bar chart
ggplot(simp, aes(x = Group, y = Correlation, fill = Group)) +
  geom_bar(stat = "identity", width = 0.5) +
  scale_fill_manual(values = c("red", "blue")) +
  labs(title = "Correlation between Acceptance Rate and Graduation Rate",
       x = "College Type",
       y = "Correlation") +
  theme_minimal()

library(dplyr)    # select(), filter(), %>%

college_new <- College %>% select(Accept, Grad.Rate, Private)

private_plot <- ggplot(college_new %>% filter(Private == "Yes"), aes(x = Accept, y = Grad.Rate)) +
  geom_line(color = "blue", linewidth = 1) +
  labs(title = "Private Colleges",
       x = "Acceptance Rate",
       y = "Graduation Rate")

public_plot <- ggplot(college_new %>% filter(Private == "No"), aes(x = Accept, y = Grad.Rate)) +
  geom_line(color = "red", linewidth = 1) +
  labs(title = "Public Colleges",
       x = "Acceptance Rate",
       y = "Graduation Rate")

private_plot

public_plot

all_plot <- ggplot(College, aes(x = Accept, y = Grad.Rate)) +
  geom_line(color = "blue", size = 1) +
  labs(title = "All colleges",
       x = "Acceptance Rate",
       y = "Graduation Rate")
all_plot

We then try to fit a linear model predicting out-of-state tuition from the number of accepted applications.

The R-squared value is very small at 0.0006633, indicating that the predictor explains only a tiny fraction of the variation in out-of-state tuition; the adjusted R-squared is even slightly negative, which suggests the model is not a good fit for the data.
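The negative adjusted R-squared follows directly from its definition. With n = 777 observations and p = 1 predictor, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) = 1 − (1 − 0.0006633) · 776/775 ≈ −0.000626, matching the value in the summary below: whenever R² is smaller than p/(n − 1), the adjustment pushes it below zero.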

out_state <- College$Outstate
accept_rate <- College$Accept
linmodel <- lm(out_state ~ accept_rate, data = College)   # the vectors above, not data=, supply the variables here
summary(linmodel)
## 
## Call:
## lm(formula = out_state ~ accept_rate, data = College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7957.7 -3080.4  -512.2  2425.3 11187.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.053e+04  1.871e+02  56.264   <2e-16 ***
## accept_rate -4.227e-02  5.894e-02  -0.717    0.473    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4024 on 775 degrees of freedom
## Multiple R-squared:  0.0006633,  Adjusted R-squared:  -0.0006262 
## F-statistic: 0.5144 on 1 and 775 DF,  p-value: 0.4735

We fit a random forest model on the data. The importance output shows how much each variable contributes to predicting Accept: %IncMSE is the increase in prediction error when that variable's values are permuted, and IncNodePurity is the total reduction in node impurity attributable to splits on it.

library(randomForest)

train_index5 <- sample(nrow(College), nrow(College)*0.7)   # 70/30 train/test split
train_data5 <- College[train_index5,]
test_data5 <- College[-train_index5 ,]
rf_model <- randomForest(Accept ~ Outstate + Room.Board + Apps, data = train_data5, importance = TRUE, ntree = 500)
predictions <- predict(rf_model, test_data5)
# feature importance
importance(rf_model)
##              %IncMSE IncNodePurity
## Outstate   11.089176     416839110
## Room.Board  8.379324     345484081
## Apps       77.811787    2311814688
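These measures can also be inspected visually with the package's built-in plot:

varImpPlot(rf_model)   # dot charts of %IncMSE and IncNodePurity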

University Statistics Dataset

# Note: this is a time-limited signed URL from Kaggle (see X-Goog-Expires);
# it stops working once the signature expires.
mydata <- jsonlite::fromJSON('https://storage.googleapis.com/kagglesdsdata/datasets/10525/14746/schoolInfo.json?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20230324%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230324T224132Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=4004c2353a583e399b538573846d2aef08434f988f3e436a68330ef6da9c617bafa44dbbcf0af0ccaadc12eb552f56e2a5798450e3ff466865c37fd19ddca8c411d634ab778a4834c37048d63225ac18c0387c416e9b902844647089bfd7f827d1264c4a87137cfcd4c2cabec06403339802726fde7d703238a00d05234a33a4dc06279f3f19533f248bc9d2a787b7054623a8a0b1eb7ee0a68eb44f43df514866c368c27fc1b4df04eee7b2b300e76e3112a6bfc331afbb71f6bd1012b47d97c85694b3a30d4214cef76234d08ba8ec8261ef0311942c18c7d509d650073599bfd9a7282ac6a49fa311330c0db833dea197567c0dd12b86d1c2a96c3509342e')

This data was collected from the U.S. News dataset and contains data on 298 universities. The dataset had 39 variables; we remove data such as URLs, duplicate fields (e.g., repeated university-name and ranking columns), photos, cities, zip codes, etc. After this dimensionality reduction, we have 11 variables containing numerical values such as enrollment numbers, college rankings, and room and board costs, along with the names of the colleges.

df <- as.data.frame(mydata)
write.csv(df, "mydata.csv", row.names = FALSE)   # keep a local copy of the raw data
mydata <- mydata[-c(1,2,3,4,5,6,9,11,12,17,19,20,21,22,23,25,26,27,29,30,31,32,33,34,36,37,38,39)]
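Selecting this many columns by position is fragile if the JSON schema changes; an index-free sketch (a hypothetical alternative, not used below) keeps every numeric column instead:

mydata_num <- mydata[, sapply(mydata, is.numeric)]   # keep only the numeric columns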

I performed k-means clustering to divide this data into 3 clusters and then tried to use a linear regression model to calculate the RMSE. At first I was unable to perform linear regression on this dataset because of a number of issues. After further work, I removed the names of the universities along with the xwalkid and displayranking variables, as they were causing problems and are not related to the variables used to fit the model.

I was able to fit the model but got a warning stating "prediction from a rank-deficient fit may be misleading", and the RMSE comes out as NaN. Rank deficiency means some predictor columns are linearly dependent (or that there are more coefficients than observations), so the model cannot estimate the effect of each predictor independently. Note also that with the formula train3$tuition ~ ., the dot expands to every column of train3, including tuition itself, so the response also appears as a predictor; the NaN has a separate cause, noted in the code below. To address the collinearity issue we can use modeling techniques that are less sensitive to it, such as random forests.
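To see where the deficiency comes from, a quick diagnostic (a sketch to run after fitting model3 below) compares the rank of the design matrix with its column count and lists the aliased coefficients:

X <- model.matrix(model3)
qr(X)$rank                  # effective rank of the design matrix
ncol(X)                     # number of columns; a gap means rank deficiency
summary(model3)$aliased     # TRUE marks coefficients that could not be estimated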

modeldata <- mydata[-c(3,5,9)]
cols <- c("enrollment", "acceptance-rate", "tuition")

# Remove incomplete cases from the data
modeldata <- modeldata[complete.cases(modeldata[, cols]), ]

# Run k-means clustering on the selected columns
k <- 3
km <- kmeans(modeldata[,cols], centers=k, nstart=25)

# Create a scatter 3D plot of the clustered data
plot_ly(modeldata, x = ~enrollment, y = ~`acceptance-rate`, z = ~tuition,
        color = km$cluster, type = "scatter3d", mode = "markers")
modeldata$cluster <- as.factor(km$cluster)


set.seed(123)
split <- sample.split(modeldata$enrollment, SplitRatio = 0.7)
train3 <- subset(modeldata, split == TRUE)
test3 <- subset(modeldata, split == FALSE)


# Fit the linear regression model (note: with train3$tuition on the left-hand
# side, "." still expands to every column of train3, tuition included)
model3 <- lm(train3$tuition ~ . + cluster, data = train3)

# Make predictions on the test data
pred <- predict(model3, newdata = test3)
## Warning in predict.lm(model3, newdata = test3): prediction from a rank-deficient
## fit may be misleading
# Calculate the RMSE. test3 has no column named Accept (that variable belongs
# to the ISLR College data), so test3$Accept is NULL and the result below is
# NaN; test3$tuition is the intended response here.
rmse3 <- sqrt(mean((test3$Accept - pred)^2))

rmse3
## [1] NaN

Plotting enrollment against tuition and against overall rank, we do not see any significant relationship. Between tuition and overall rank, however, there is a negative relationship. When we plot all three together, most of the points cluster towards the center of the graph.

par(mfrow=c(3,2))
plot(mydata$tuition, mydata$enrollment)
plot(mydata$enrollment,mydata$overallRank)
plot(mydata$overallRank, mydata$tuition)

plot(mydata$`acceptance-rate`, mydata$rankingSortRank)

# Note: color = "Set1" passes a constant rather than a variable, which triggers
# the palette warnings below; rows with missing values are also dropped.
plot_ly(mydata, x = ~mydata$enrollment, y = ~mydata$overallRank, z = ~mydata$tuition, color = "Set1", type = "scatter3d", mode = "markers")
## Warning: Ignoring 11 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

We then try to fit a linear model predicting tuition from enrollment. The R-squared value is 0.0634, indicating that the predictor explains only about 6.3% of the variation in tuition. The adjusted R-squared is slightly smaller than the multiple R-squared, as expected with a single predictor, and the residual standard error is relatively high (about 11,410 dollars), so the model does not appear to fit the data well.

tut <- mydata$tuition
enrl_rate <- mydata$enrollment     # enrollment counts (despite the name, not a rate)
linmodel2 <- lm(tut ~ enrl_rate, data = College)   # the data argument is ignored; the vectors above supply the variables
summary(linmodel2)
## 
## Call:
## lm(formula = tut ~ enrl_rate, data = College)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25428  -8500  -1916   9148  23653 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.583e+04  1.133e+03  31.613  < 2e-16 ***
## enrl_rate   -2.781e-01  6.192e-02  -4.492 1.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11410 on 298 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.06341,    Adjusted R-squared:  0.06026 
## F-statistic: 20.17 on 1 and 298 DF,  p-value: 1.012e-05

We then try to fit a random forest model:

train_index6 <- sample(nrow(modeldata), nrow(modeldata)*0.7)
train_data6 <- modeldata[train_index6,]
test_data6 <- modeldata[-train_index6 ,]
rf_model <- randomForest(modeldata$`acceptance-rate` ~ modeldata$`cost-after-aid` + modeldata$rankingSortRank + modeldata$tuition,
                         data = train_data6, importance = TRUE, ntree = 500)
predictions <- predict(rf_model, test_data6)
# feature importance
importance(rf_model)

When running this code, I tried multiple different combinations of variables but mostly got a "variable lengths differ" error, and after resolving it ran into another error I could not resolve. The "variable lengths differ" error comes from mixing modeldata$... extraction inside the formula with data = train_data6: the formula refers to full-length columns of modeldata while the model frame is built from the shorter training subset. Using bare column names in the formula avoids this, as sketched below.
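One way to get this to run (a sketch, assuming the four columns exist in modeldata and randomForest is already loaded from earlier): drop incomplete rows, rename the hyphenated columns to syntactic names, and let the formula resolve inside the training data.

rf_cols <- c("acceptance-rate", "cost-after-aid", "rankingSortRank", "tuition")
rf_data <- modeldata[complete.cases(modeldata[, rf_cols]), rf_cols]
names(rf_data) <- c("acceptance_rate", "cost_after_aid", "rankingSortRank", "tuition")   # syntactic names
set.seed(123)
idx6 <- sample(nrow(rf_data), nrow(rf_data) * 0.7)
rf_model6 <- randomForest(acceptance_rate ~ ., data = rf_data[idx6, ], importance = TRUE, ntree = 500)
predictions6 <- predict(rf_model6, rf_data[-idx6, ])
importance(rf_model6)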

Conclusions and Impact:

The university statistics dataset itself caused some problems: several variables had incomplete data, and it took some time to identify the variables that were causing trouble and to fill in the missing values with a stand-in value (0) so that the data for those variables was usable.

This was mainly a research project to see how accessible and usable publicly available datasets are compared to the datasets in the ISLR textbook, so the project has no particular real-world impact. Overall, the ISLR dataset was much easier to use and maneuver, as the data for each variable was complete.