Instructions

You get to decide which dataset you want to work on. The dataset must be different. You can work on a problem from your work, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS or another open data portal.

Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:

Describe the problem you are trying to solve.
Describe your dataset and what you did to prepare the data for analysis.
Describe the methodologies you used for analyzing the data.
Explain why you did what you did.
Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

College Admissions

For six years I was an Admissions Officer at a Big Ten flagship university in the Midwest, so university admissions has been a long-term interest of mine. I will use the college dataset provided with the textbook for this course, “Practical Machine Learning in R” by Fred Nwanganga and Mike Chapple, published in 2020 by John Wiley & Sons, Inc., Indianapolis, Indiana.

Acquire the data

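The read.csv function reads the college dataset from its .csv file. The data has 1,270 observations of 17 variables: 8 are stored as character data and 9 as numeric data. Note that loan_default_rate is read as character even though it holds rates, so it is converted to numeric before the PCA.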

college <- read.csv("C:/data/college.csv")
str(college)

## 'data.frame':    1270 obs. of  17 variables:
##  $ id                : int  102669 101648 100830 101879 100858 100663 101480 102049 101709 100751 ...
##  $ name              : chr  "Alaska Pacific University" "Marion Military Institute" "Auburn University at Montgomery" "University of North Alabama" ...
##  $ city              : chr  "Anchorage" "Marion" "Montgomery" "Florence" ...
##  $ state             : chr  "AK" "AL" "AL" "AL" ...
##  $ region            : chr  "West" "South" "South" "South" ...
##  $ highest_degree    : chr  "Graduate" "Associate" "Graduate" "Graduate" ...
##  $ control           : chr  "Private" "Public" "Public" "Public" ...
##  $ gender            : chr  "CoEd" "CoEd" "CoEd" "CoEd" ...
##  $ admission_rate    : num  0.421 0.614 0.802 0.679 0.835 ...
##  $ sat_avg           : int  1054 1055 1009 1029 1215 1107 1041 1165 1070 1185 ...
##  $ undergrads        : int  275 433 4304 5485 20514 11383 7060 3033 2644 29851 ...
##  $ tuition           : int  19610 8778 9080 7412 10200 7510 7092 27324 10660 9826 ...
##  $ faculty_salary_avg: int  5804 5916 7255 7424 9487 9957 6801 8367 7437 9667 ...
##  $ loan_default_rate : chr  "0.077" "0.136" "0.106" "0.111" ...
##  $ median_debt       : num  23250 11500 21335 21500 21831 ...
##  $ lon               : num  -149.9 -87.3 -86.3 -87.7 -85.5 ...
##  $ lat               : num  61.2 32.6 32.4 34.8 32.6 ...

Explore the data

The summary statistics show no missing values in any of the 8 character variables or 9 numeric variables, so we can now decide which two methodologies to apply to the data: K-means clustering from week 10 and principal component analysis (PCA).

library(skimr)  # provides skim()
skim(college)

Data summary
Name	college
Number of rows	1270
Number of columns	17
_______________________
Column type frequency:
character	8
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	8	65	1261
city	1	3	23	834
state	1	2	2	51
region	1	4	9	4
highest_degree	1	8	9	4
control	1	6	7	2
gender	1	3	5	3
loan_default_rate	1	1	5	198

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	187221.94	53474.50	100654.00	153254.75	186327.00	215290.75	484905.00	▆▇▁▁▁
admission_rate	1	0.65	0.19	0.05	0.53	0.67	0.79	1.00	▁▂▆▇▅
sat_avg	1	1059.58	133.01	720.00	973.25	1040.50	1120.75	1545.00	▁▇▆▂▁
undergrads	1	5624.98	7377.91	47.00	1293.75	2554.50	6712.75	52280.00	▇▁▁▁▁
tuition	1	21011.31	12550.28	2732.00	8966.25	19995.00	30354.50	51008.00	▇▃▆▃▂
faculty_salary_avg	1	7655.02	2222.86	1451.00	6191.00	7268.50	8670.50	20650.00	▁▇▂▁▁
median_debt	1	23476.80	4618.13	6056.00	21250.00	24544.25	27000.00	41000.00	▁▂▇▁▁
lon	1	-88.29	13.93	-157.92	-94.17	-84.88	-78.63	-68.59	▁▁▁▅▇
lat	1	38.60	4.64	19.71	35.20	39.74	41.81	61.22	▁▃▇▁▁
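One detail in the summary deserves a closer look: loan_default_rate is stored as a character column even though it holds rates. The following is a minimal sketch of how to find the entries that block numeric conversion. It uses a small hypothetical vector; the values and the "NULL" placeholder are assumptions for illustration, not taken from the dataset.

```r
# Hypothetical sample; the real column would be college$loan_default_rate
rates_chr <- c("0.077", "0.136", "NULL", "0.111")

# as.numeric() returns NA (with a warning) for text it cannot parse
rates_num <- suppressWarnings(as.numeric(rates_chr))

# Entries that were present as text but failed to convert
bad_entries <- rates_chr[is.na(rates_num) & !is.na(rates_chr)]
bad_entries

# Number of values that would need imputation after conversion
sum(is.na(rates_num))
```

A check like this explains why a column can show a complete_rate of 1 and still produce missing values once it is converted to numeric.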

Clustering

Indiana Universities

K-means clustering is one of the most commonly used clustering approaches, so let us see how it fares in this case.

First we need to decide what we want to examine from the dataset. Since I live in Indiana, we can look at Indiana college admission rates and tease out different clusters based on SAT average scores. The resulting dataset has 40 observations of 16 variables.

library(dplyr)   # %>% and filter()
library(tibble)  # column_to_rownames()

indiana_college <- college %>%
  filter(state == "IN") %>%
  column_to_rownames(var = "name")

head(indiana_college)

##                                      id         city state  region
## Oakland City University          152099 Oakland City    IN Midwest
## Indiana University-Kokomo        151333       Kokomo    IN Midwest
## Goshen College                   150668       Goshen    IN Midwest
## Indiana University-East          151388     Richmond    IN Midwest
## University of Notre Dame         152080   Notre Dame    IN Midwest
## Purdue University-Calumet Campus 152248      Hammond    IN Midwest
##                                  highest_degree control gender admission_rate
## Oakland City University                Graduate Private   CoEd         0.5504
## Indiana University-Kokomo              Graduate  Public   CoEd         0.6726
## Goshen College                         Graduate Private   CoEd         0.5400
## Indiana University-East                Graduate  Public   CoEd         0.6376
## University of Notre Dame               Graduate Private   CoEd         0.2114
## Purdue University-Calumet Campus       Graduate  Public   CoEd         0.5968
##                                  sat_avg undergrads tuition faculty_salary_avg
## Oakland City University              960        616   19800               4290
## Indiana University-Kokomo            934       2724    6811               5696
## Goshen College                      1109        746   29700               5910
## Indiana University-East              943       3200    6787               5819
## University of Notre Dame            1450       8427   46237              13102
## Purdue University-Calumet Campus     965       7316    6758               8118
##                                  loan_default_rate median_debt       lon
## Oakland City University                      0.078       16973 -87.34501
## Indiana University-Kokomo                    0.108       19052 -86.13360
## Goshen College                               0.033       20671 -85.83444
## Indiana University-East                      0.132       21172 -84.89024
## University of Notre Dame                     0.006       21250 -86.23534
## Purdue University-Calumet Campus             0.094       22028 -87.50004
##                                       lat
## Oakland City University          38.33866
## Indiana University-Kokomo        40.48643
## Goshen College                   41.58227
## Indiana University-East          39.82894
## University of Notre Dame         41.70557
## Purdue University-Calumet Campus 41.58337

Next we need to decide which variables we will cluster around. We can group by admission_rate and sat_avg to see what happens, so let’s take a look at the distributions of each of those metrics.

indiana_college %>%
  select(admission_rate, sat_avg) %>%
  summary()

##  admission_rate      sat_avg    
##  Min.   :0.2114   Min.   : 913  
##  1st Qu.:0.6279   1st Qu.: 965  
##  Median :0.7096   Median :1032  
##  Mean   :0.7215   Mean   :1060  
##  3rd Qu.:0.8240   3rd Qu.:1132  
##  Max.   :1.0000   Max.   :1450

The ranges of each variable are quite different, so we need to scale them for the K-means process.

indiana_college_scaled <- indiana_college %>%
  select(admission_rate, sat_avg) %>%
  scale()

head(indiana_college_scaled)

##                                  admission_rate    sat_avg
## Oakland City University              -1.1180533 -0.8567232
## Indiana University-Kokomo            -0.3194205 -1.0788050
## Goshen College                       -1.1860221  0.4159763
## Indiana University-East              -0.5481615 -1.0019305
## University of Notre Dame             -3.3335732  3.3286644
## Purdue University-Calumet Campus     -0.8148081 -0.8140152
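To make the scaling step concrete, here is a small sketch of what scale() does to each column: it subtracts the column mean and divides by the column standard deviation. The four rows below are hypothetical values on roughly the same scales as the two variables, not the full Indiana data.

```r
# Hypothetical values on two very different scales
x <- data.frame(admission_rate = c(0.55, 0.67, 0.54, 0.21),
                sat_avg = c(960, 934, 1109, 1450))

# scale() subtracts each column's mean and divides by its standard deviation
scaled <- scale(x)

# The same z-scores computed by hand for one column
manual <- (x$sat_avg - mean(x$sat_avg)) / sd(x$sat_avg)

all.equal(as.numeric(scaled[, "sat_avg"]), manual)
round(colMeans(scaled), 10)  # each column now has mean 0
apply(scaled, 2, sd)         # and standard deviation 1
```

Without this step, the Euclidean distances used by K-means would be dominated by sat_avg, whose raw values are about a thousand times larger than admission_rate.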

How Many Clusters?

Before clustering the data, it would be good to see how many clusters are recommended based on the data itself. There are three methods we can use to recommend the number of clusters, so we will try them all.

The first method is the Elbow method, selected by setting method = "wss". It plots the total within-cluster sum of squares (WCSS), which is the sum of squared distances between each item and its cluster’s centroid, against the number of clusters. The "elbow" where adding clusters stops reducing WCSS substantially marks the suggested number. The Elbow method suggests that 6 clusters are optimal.

library(factoextra)  # fviz_nbclust(), fviz_cluster() and the fviz_pca_* functions
fviz_nbclust(indiana_college_scaled, kmeans, method = "wss")
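The quantity behind the elbow plot can be computed directly with base R. The sketch below runs kmeans for several values of k on simulated data with three well-separated groups (the data is an assumption for illustration) and collects the total WCSS reported in tot.withinss.

```r
set.seed(101)

# Simulated data with three groups of 20 points each (illustration only)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1..8
wcss <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# WCSS falls sharply up to k = 3 and only slowly afterwards
round(wcss, 1)
```

On data like this the drop flattens after the true number of groups; on real data the elbow is often less distinct, which is one reason to compare it with the other two methods.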

The second method is the Average Silhouette method, selected by setting method = "silhouette". The silhouette of an item compares its average distance to the other items in its own cluster with its average distance to the items in the nearest neighboring cluster. The number of clusters with the highest average silhouette is preferred. The Silhouette method suggests that 6 clusters are optimal.

fviz_nbclust(indiana_college_scaled, kmeans, method = "silhouette")
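As a rough sketch of what the silhouette measures, the helper below computes the average silhouette width by hand in base R. The function name avg_silhouette and the simulated two-group data are assumptions for illustration; for each item, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the width is (b - a) / max(a, b).

```r
set.seed(101)

# Simulated data with two well-separated groups (illustration only)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# Hypothetical helper: average silhouette width for a clustering with k >= 2
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  widths <- sapply(seq_len(nrow(x)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)  # common convention for singleton clusters
    a <- sum(d[i, own]) / (sum(own) - 1)  # d[i, i] is 0, so self is excluded
    others <- setdiff(unique(cluster), cluster[i])
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
  mean(widths)
}

# Average silhouette for k = 2..6; the largest value marks the suggested k
avg_sil <- sapply(2:6, function(k) {
  avg_silhouette(x, kmeans(x, centers = k, nstart = 25)$cluster)
})
names(avg_sil) <- 2:6
round(avg_sil, 3)

best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

Values near 1 mean items sit much closer to their own cluster than to any other, while values near 0 mean the clusters overlap.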

The third method is the Gap Statistic, selected by setting method = "gap_stat". It generates random reference datasets and measures the difference in (log) WCSS between the original data and the reference data. The Gap Statistic appears to show that 1 cluster is optimal, but since 1 cluster does not separate the data in any way, the second-best choice appears to be 9 clusters. However, with only 40 colleges in the data, 9 clusters could produce clusters with only one or two colleges, so the third-best choice, 6 clusters, is the most practical option under the Gap Statistic.

fviz_nbclust(indiana_college_scaled, kmeans, method = "gap_stat")
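The numbers behind a gap statistic plot can also be inspected directly. The sketch below assumes the cluster package is installed and uses its clusGap and maxSE functions on simulated two-group data (an assumption for illustration). The "firstSEmax" rule shown is one of several selection rules the package offers, and different rules can recommend different values of k.

```r
library(cluster)  # clusGap() and maxSE()

set.seed(101)

# Simulated data with two well-separated groups (illustration only)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# Gap statistic for k = 1..6, using B = 50 random reference datasets
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 25, verbose = FALSE)

# Gap value and its simulation standard error for each k
round(gap$Tab[, c("gap", "SE.sim")], 3)

# Smallest k whose gap is within one standard error of the first local maximum
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
best_k
```

Looking at the table rather than only the plot makes it easier to see how close the competing values of k are once the standard errors are taken into account.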

The Elbow and Silhouette methods both suggest 6 clusters, and 6 is also the most practical choice under the Gap Statistic, so we can create a six-cluster grouping to visualize the data.

We run K-means with six centers and 25 random initial configurations (nstart = 25). The algorithm keeps the best of the 25 results, which has six clusters of sizes 1, 7, 6, 14, 8 and 4.

set.seed(101)

k_6 <- kmeans(indiana_college_scaled, centers = 6, nstart = 25)
k_6$size

## [1]  1  7  6 14  8  4
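Because K-means was run on scaled data, the cluster centers it reports are in standard-deviation units. The sketch below shows one way to convert centers back to the original units, using the centering and scaling attributes that scale() stores. The data here is made up for illustration and the object names are hypothetical.

```r
set.seed(101)

# Hypothetical raw data: two groups of schools (values are made up)
raw <- data.frame(
  admission_rate = c(runif(10, 0.2, 0.4), runif(10, 0.7, 0.9)),
  sat_avg = c(runif(10, 1300, 1450), runif(10, 920, 1050))
)

scaled <- scale(raw)
km <- kmeans(scaled, centers = 2, nstart = 25)

# Undo the scaling: multiply by each column's sd, then add back its mean
centers_raw <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_raw <- sweep(centers_raw, 2, attr(scaled, "scaled:center"), "+")
centers_raw

# Check: the unscaled centers equal the per-cluster means of the raw data
aggregate(raw, by = list(cluster = km$cluster), FUN = mean)
```

Centers expressed as admission rates and SAT points are easier to interpret than z-scores when describing what each cluster represents.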

Visualization of the Clusters

It may be more useful to analyze the clusters visually using the factoextra package’s fviz_cluster function. Here we see that with six clusters the University of Notre Dame stands alone with the lowest admission rate and the highest SAT average. This makes sense since Notre Dame is a highly competitive private university.

The other five clusters show that private colleges and universities that specialize in engineering are the second most selective, with lower admission rates and higher SAT averages (clusters 2 and 6), while institutions that are more public and liberal arts focused are less selective, with higher admission rates and lower SAT averages (clusters 3, 4 and 5).

fviz_cluster(k_6, data = indiana_college_scaled, repel = TRUE)

Principal Component Analysis (PCA)

Principal component analysis (PCA) summarizes the variation in a dataset with a small number of principal components, each of which is a linear combination of the individual predictor variables. The loadings show how much each predictor contributes to each component. Knowing which predictors drive most of the variation is valuable to a data scientist, so we will now calculate the principal components and their share of the total variance for the Indiana colleges dataset.

Before we can dive into PCA we need to clean up the dataset so that every predictor is numeric. We will remove the id, city, state, region, lon and lat predictors, since they are identifiers or locations and all institutions are in the state of Indiana in the Midwest region. We will convert highest_degree, control and gender to factors and then to numeric codes, and convert loan_default_rate from character to numeric, replacing any values that cannot be converted with the column mean.

Last, we will scale the data so that every variable carries equal weight in the analysis.

indiana_college_PCA <- indiana_college %>%
  select(-id,-city,-state,-region,-lon,-lat) %>%
  mutate(loan_default_rate = as.numeric(loan_default_rate),
         highest_degree = as.numeric(as.factor(highest_degree)),
         control = as.numeric(as.factor(control)),
         gender = as.numeric(as.factor(gender))) %>%
  mutate(loan_default_rate = ifelse(is.na(loan_default_rate), mean(loan_default_rate, na.rm = TRUE), loan_default_rate)) %>%
  scale()

#view first ten rows of data
head(indiana_college_PCA,10)

##                                        highest_degree    control     gender
## Oakland City University                     0.4147997 -0.7245688 -0.2143418
## Indiana University-Kokomo                   0.4147997  1.3456278 -0.2143418
## Goshen College                              0.4147997 -0.7245688 -0.2143418
## Indiana University-East                     0.4147997  1.3456278 -0.2143418
## University of Notre Dame                    0.4147997 -0.7245688 -0.2143418
## Purdue University-Calumet Campus            0.4147997  1.3456278 -0.2143418
## Grace College and Theological Seminary      0.4147997 -0.7245688 -0.2143418
## Indiana University-Southeast                0.4147997  1.3456278 -0.2143418
## Purdue University-Main Campus               0.4147997  1.3456278 -0.2143418
## University of Southern Indiana              0.4147997  1.3456278 -0.2143418
##                                        admission_rate    sat_avg  undergrads
## Oakland City University                   -1.11805331 -0.8567232 -0.67177055
## Indiana University-Kokomo                 -0.31942046 -1.0788050 -0.38736183
## Goshen College                            -1.18602206  0.4159763 -0.65423111
## Indiana University-East                   -0.54816146 -1.0019305 -0.32314051
## University of Notre Dame                  -3.33357324  3.3286644  0.38207977
## Purdue University-Calumet Campus          -0.81480810 -0.8140152  0.23218504
## Grace College and Theological Seminary     0.52103931  0.4245179 -0.55547059
## Indiana University-Southeast               0.53411023 -0.8652648 -0.01606545
## Purdue University-Main Campus             -0.84029639  1.3043034  3.28957871
## University of Southern Indiana            -0.05277382 -0.5065173  0.35590553
##                                           tuition faculty_salary_avg
## Oakland City University                -0.2360107         -1.6917626
## Indiana University-Kokomo              -1.3015085         -0.8615583
## Goshen College                          0.5760939         -0.7351972
## Indiana University-East                -1.3034773         -0.7889302
## University of Notre Dame                1.9326369          3.5114805
## Purdue University-Calumet Campus       -1.3058561          0.5685659
## Grace College and Theological Seminary  0.1634791         -0.4311039
## Indiana University-Southeast           -1.3001960         -0.3135999
## Purdue University-Main Campus          -1.0397483          2.3535641
## University of Southern Indiana         -1.2895320         -0.2185338
##                                        loan_default_rate median_debt
## Oakland City University                        0.4301220  -3.0585360
## Indiana University-Kokomo                      1.3064316  -2.2432741
## Goshen College                                -0.8843424  -1.6083972
## Indiana University-East                        2.0074792  -1.4119344
## University of Notre Dame                      -1.6730211  -1.3813474
## Purdue University-Calumet Campus               0.8974871  -1.0762614
## Grace College and Theological Seminary        -0.5046083  -0.9401883
## Indiana University-Southeast                   1.2188006  -0.8911707
## Purdue University-Main Campus                 -0.8551321  -0.8911707
## University of Southern Indiana                 0.1088084  -0.7476469
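Two of the transformations above are worth seeing on a tiny example: how a character column becomes numeric codes, and how mean imputation fills a missing value. The vectors below are hypothetical and only illustrate the mechanics.

```r
control <- c("Private", "Public", "Private", "Public")

# Factor levels are sorted alphabetically, so Private = 1 and Public = 2
levels(as.factor(control))
control_num <- as.numeric(as.factor(control))
control_num

# Mean imputation: replace NA with the mean of the observed values
rate <- c(0.078, NA, 0.033, 0.132)
rate_filled <- ifelse(is.na(rate), mean(rate, na.rm = TRUE), rate)
rate_filled
```

One caveat with this encoding is that integer codes impose an order and equal spacing on the categories. That is harmless for a two-level variable such as control, but for variables with more levels, such as highest_degree and gender, the alphabetical order may not carry any real meaning.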

Having completed the transformations, we can look at a summary of the data for any other issues that need to be addressed. There are no missing values and the data is appropriately scaled, so we may proceed with the PCA.

skim(indiana_college_PCA)

Data summary
Name	indiana_college_PCA
Number of rows	40
Number of columns	10
_______________________
Column type frequency:
numeric	10
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	sd	p0	p25	p50	p75	p100	hist
highest_degree	1	1	-2.35	0.41	0.41	0.41	0.41	▂▁▁▁▇
control	1	1	-0.72	-0.72	-0.72	1.35	1.35	▇▁▁▁▅
gender	1	1	-0.21	-0.21	-0.21	-0.21	5.50	▇▁▁▁▁
admission_rate	1	1	-3.33	-0.61	-0.08	0.67	1.82	▁▁▆▇▃
sat_avg	1	1	-1.26	-0.81	-0.24	0.61	3.33	▇▅▃▁▁
undergrads	1	1	-0.69	-0.56	-0.42	0.14	3.60	▇▂▁▁▁
tuition	1	1	-1.31	-1.14	0.25	0.57	1.93	▇▁▇▃▂
faculty_salary_avg	1	1	-1.69	-0.63	-0.09	0.28	3.51	▃▇▁▁▁
loan_default_rate	1	1	-1.67	-0.81	-0.18	0.91	2.01	▅▇▅▆▂
median_debt	1	1	-3.06	-0.71	0.19	0.87	1.27	▁▂▃▆▇

Start by calculating the principal components and the loading of each predictor variable on each component. For example, PC1 is most heavily influenced by tuition, sat_avg, loan_default_rate and control (public/private status), while PC2 is most influenced by faculty_salary_avg and the number of undergrads.

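#calculate principal components (the data is already scaled, so scale. = TRUE has no further effect)
results <- prcomp(indiana_college_PCA, scale. = TRUE)

#reverse the signs of the loadings (the sign of each component is arbitrary)
results$rotation <- -1*results$rotation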

#display principal components
results$rotation

##                           PC1        PC2         PC3         PC4         PC5
## highest_degree      0.2430693  0.2080608  0.53477385 -0.26825327  0.30370254
## control             0.4083806  0.2885812 -0.29550726 -0.03929811 -0.13478781
## gender             -0.1934048 -0.1431852 -0.68501589  0.14937905  0.29354754
## admission_rate      0.1375515 -0.3716774 -0.17284915 -0.56628499  0.41318496
## sat_avg            -0.4263132  0.3342207  0.03151237 -0.06296505  0.12259994
## undergrads          0.1326689  0.4834893 -0.26631091 -0.36478111  0.08598119
## tuition            -0.4969885 -0.1340877  0.17234101  0.05177564 -0.03685151
## faculty_salary_avg -0.2397520  0.4926921 -0.14188722 -0.12269200 -0.30675182
## loan_default_rate   0.4240709 -0.1610670 -0.06132624  0.21617780 -0.41864440
## median_debt        -0.1848191 -0.2874935 -0.04538663 -0.61806659 -0.58272094
##                              PC6          PC7         PC8          PC9
## highest_degree      6.553912e-01 -0.092027525  0.07301850 -0.011559371
## control             7.586281e-02  0.080529989 -0.57112957  0.241274194
## gender              5.902866e-01  0.007402989  0.12427419 -0.004697188
## admission_rate     -2.679841e-01 -0.467358804 -0.15829474 -0.070913100
## sat_avg            -2.123008e-02 -0.255622593 -0.23292506  0.668935168
## undergrads         -2.313999e-01  0.171737032  0.64528468  0.151384782
## tuition             6.163027e-05 -0.222822020  0.18965778  0.133473266
## faculty_salary_avg  8.246360e-02 -0.457885352 -0.10967517 -0.582846939
## loan_default_rate   1.198811e-01 -0.589110974  0.32975146  0.303382270
## median_debt         2.632208e-01  0.255675366 -0.02443088  0.129012913
##                           PC10
## highest_degree     -0.06307069
## control            -0.49625299
## gender              0.02005803
## admission_rate     -0.03650373
## sat_avg             0.34502646
## undergrads         -0.12209993
## tuition            -0.77320627
## faculty_salary_avg  0.04788588
## loan_default_rate   0.07914808
## median_debt         0.08753946
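For readers who want to see what prcomp computes, the sketch below relates it to an eigen-decomposition of the correlation matrix. It uses the USArrests dataset that ships with base R as a stand-in for the college data, which is an assumption made only to keep the example self-contained.

```r
# USArrests ships with base R; it stands in for the college data here
pca <- prcomp(USArrests, scale. = TRUE)
eig <- eigen(cor(USArrests))

# The component variances equal the eigenvalues of the correlation matrix
all.equal(pca$sdev^2, eig$values)

# Loadings match the eigenvectors up to sign, so flipping signs is harmless
all.equal(abs(unname(pca$rotation)), abs(eig$vectors))

# Each loading vector has unit length
colSums(pca$rotation^2)
```

Since each loading vector has unit length, a loading near 0.5 in a ten-variable problem indicates a large contribution, which is how the loadings in the table above can be read.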

#reverse the signs of the scores
results$x <- -1*results$x

#display the first 6 scores
head(results$x)

##                                         PC1         PC2         PC3       PC4
## Oakland City University           1.2392756 -0.27951718  1.23972801 2.9960221
## Indiana University-Kokomo         2.8784039  0.26073130  0.01502028 1.9011763
## Goshen College                   -0.7686998  0.33679815  1.30593997 1.6922221
## Indiana University-East           2.9498999  0.08661942 -0.05149099 1.6311566
## University of Notre Dame         -4.2370843  4.58168387  1.16224755 1.5852997
## Purdue University-Calumet Campus  2.0497516  1.26845093 -0.28755370 0.9536492
##                                         PC5         PC6        PC7         PC8
## Oakland City University           1.6658160 -0.32940856  0.3198925  0.71786375
## Indiana University-Kokomo         0.6566547 -0.05943309 -0.2315532 -0.37973910
## Goshen College                    1.1771848 -0.03921299  0.5555178  0.02373297
## Indiana University-East          -0.2230460  0.29423028 -0.3665293 -0.11747332
## University of Notre Dame         -0.4187500  0.55015432 -0.7314700 -0.12013461
## Purdue University-Calumet Campus -0.4096729  0.30042858 -0.0758907 -0.28424879
##                                          PC9        PC10
## Oakland City University          -0.08367304  0.02412672
## Indiana University-Kokomo        -0.00138606 -0.13938760
## Goshen College                    0.11432001 -0.09563016
## Indiana University-East           0.35332599  0.02090632
## University of Notre Dame         -0.13218895 -0.02689376
## Purdue University-Calumet Campus -0.50297001  0.03604563

Graphing the data also helps to show its structure. The following graph plots each institution by its scores on PC1 and PC2.

#Graph of individuals. Individuals with a similar profile are grouped together.

fviz_pca_ind(results,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
)

This graph displays how the variables correlate with each other: positively correlated variables point to the same side of the plot, and negatively correlated variables point to opposite sides. The length of each vector indicates how well the variable is represented by the first two components. For example, sat_avg is strongly correlated with faculty_salary_avg but negatively correlated with loan_default_rate.

#Graph of variables. Positive correlated variables point to the same side of the plot. Negative correlated variables point to opposite sides of the graph.
fviz_pca_var(results,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
)

Here we see a graph that shows both individual institutions and predictor variables mapped together.

#Biplot of individuals and variables
fviz_pca_biplot(results, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

As a reminder, we want to determine which variables, or components, are the strongest predictors of admission rates, so here are the ten Indiana colleges and universities with the highest admission rates.

#display colleges with highest admission rates in original dataset
highest_admission <- indiana_college %>%
  select(city,admission_rate)

head(highest_admission[order(-highest_admission$admission_rate),],10)

##                                                                    city
## Saint Mary-of-the-Woods College                 Saint Mary of the Woods
## Huntington University                                        Huntington
## University of Saint Francis-Fort Wayne                       Fort Wayne
## Indiana Wesleyan University-Marion                               Marion
## Holy Cross College                                           Notre Dame
## Indiana University-Purdue University-Fort Wayne              Fort Wayne
## Taylor University                                                Upland
## Saint Mary's College                                         Notre Dame
## University of Evansville                                     Evansville
## Indiana State University                                    Terre Haute
##                                                 admission_rate
## Saint Mary-of-the-Woods College                         1.0000
## Huntington University                                   0.9711
## University of Saint Francis-Fort Wayne                  0.9686
## Indiana Wesleyan University-Marion                      0.9505
## Holy Cross College                                      0.9316
## Indiana University-Purdue University-Fort Wayne         0.9087
## Taylor University                                       0.8803
## Saint Mary's College                                    0.8334
## University of Evansville                                0.8295
## Indiana State University                                0.8254

The proportion of variance explained by each PC is listed below in PC order. PC1 explains about 35.7% of the total variance in the data and PC2 explains about 25.3%, so the first two PCs together explain about 61.0% of the total variance in the Indiana college data given the predictors in this dataset.

#calculate total variance explained by each principal component
results$sdev^2 / sum(results$sdev^2)

##  [1] 0.357140538 0.252811901 0.124284785 0.107002164 0.061211599 0.043576857
##  [7] 0.028453637 0.018722855 0.005118028 0.001677635

#store the proportions for the scree plot
var_explained <- results$sdev^2 / sum(results$sdev^2)
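The cumulative share of variance can be checked with cumsum. The sketch below re-enters the ten proportions printed above as a literal vector so that it runs on its own; in the analysis itself the same call could be applied to the stored proportions.

```r
# Proportions of variance as printed in the output above
var_explained <- c(0.357140538, 0.252811901, 0.124284785, 0.107002164,
                   0.061211599, 0.043576857, 0.028453637, 0.018722855,
                   0.005118028, 0.001677635)

# Running total of variance explained by the first k components
cum_var <- cumsum(var_explained)
round(cum_var, 4)

# Number of components needed to reach 80% of the total variance
which(cum_var >= 0.80)[1]
```

The first two components reach about 61% of the variance, and four components are needed to pass 80%.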

The scree plot graphically depicts the variance explained by each distinct PC.

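#create scree plot
library(ggplot2)

scree_data <- data.frame(pc = 1:10, var_explained = var_explained)

ggplot(scree_data, aes(x = pc, y = var_explained)) +
  geom_point() +
  geom_line() +
  xlab("Principal Component") +
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  ylim(0, 1)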

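Lastly we can plot the cos2 values (the squared correlation between each variable and each component) to see graphically which predictor variables are best represented by each PC (labeled Dim on the top axis). Again, PC1 is composed mostly of control, sat_avg, tuition and loan_default_rate.

library(corrplot)
#cos2 values lie between 0 and 1, so they are not treated as a correlation matrix
pca_var <- get_pca_var(results)
corrplot(pca_var$cos2, is.corr = FALSE)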
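To show where cos2 values come from, the sketch below rebuilds them in base R from a prcomp result. It again uses the built-in USArrests data as a stand-in, which is an assumption for illustration. For standardized data, the correlation between a variable and a component is the loading multiplied by the component's standard deviation, and cos2 is the square of that correlation.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Correlation between each variable and each PC: loading times the PC's sd
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# The same correlations computed directly from the data and the scores
all.equal(unname(var_coord), unname(cor(USArrests, pca$x)))

# cos2 is the squared correlation: the share of a variable captured by a PC
cos2 <- var_coord^2
round(cos2, 3)

# Across all components, each variable's cos2 values add up to 1
rowSums(cos2)
```

A variable with a large cos2 on a component is well represented by that component, which is the pattern the plot highlights with larger, darker circles.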

Conclusion

Both the clustering exercise and the PCA exercise pointed to certain institutions being more selective than others. The principal reasons were explained more clearly by the PCA than by the clustering: the admission rate of an institution appears to be predictable from a combination of its tuition, the loan default rate of its graduates, its average SAT score and whether it is a public or private institution (control). Secondarily, the admission rate also reflects the total number of undergraduates attending the institution and the average faculty salary, on the assumption that when both the student body and faculty salaries are larger, the admission rate will be higher as well. So if an institution wants to have a good admission rate, it should be a private school that enrolls more students who pay higher tuition and have higher SAT scores, and it should also pay its faculty higher average salaries.

Data 622 HW 4 (Final project)

Douglas Barley

5/7/2022