You get to decide which dataset you want to work on. The data set must be different. You can work on a problem from your work, or on something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS or another open data portal.
Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:
describe the problem you are trying to solve.
describe your dataset and what you did to prepare the data for analysis.
methodologies you used for analyzing the data
why you did what you did
make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.
For six years I was an Admissions Officer at a Big 10 flagship
university in the Midwest, so the topic of university admission has been
a long-term interest for me. I will use the college dataset
provided as part of the materials included in the textbook for this
course, “Practical Machine Learning in R” by Fred Nwanganga and Mike
Chapple, published in 2020 by John Wiley & Sons Inc, Indianapolis,
Indiana.
Reading in the college dataset from .csv format requires
the read.csv function. The data has 1270 observations of 17
variables, 8 of which are categorical data and 9 of which are numeric
data.
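# Packages assumed from the functions used below (the original setup chunk is not shown)
library(tidyverse); library(skimr); library(factoextra); library(corrplot)
college <- read.csv("C:/data/college.csv")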
str(college)
## 'data.frame': 1270 obs. of 17 variables:
## $ id : int 102669 101648 100830 101879 100858 100663 101480 102049 101709 100751 ...
## $ name : chr "Alaska Pacific University" "Marion Military Institute" "Auburn University at Montgomery" "University of North Alabama" ...
## $ city : chr "Anchorage" "Marion" "Montgomery" "Florence" ...
## $ state : chr "AK" "AL" "AL" "AL" ...
## $ region : chr "West" "South" "South" "South" ...
## $ highest_degree : chr "Graduate" "Associate" "Graduate" "Graduate" ...
## $ control : chr "Private" "Public" "Public" "Public" ...
## $ gender : chr "CoEd" "CoEd" "CoEd" "CoEd" ...
## $ admission_rate : num 0.421 0.614 0.802 0.679 0.835 ...
## $ sat_avg : int 1054 1055 1009 1029 1215 1107 1041 1165 1070 1185 ...
## $ undergrads : int 275 433 4304 5485 20514 11383 7060 3033 2644 29851 ...
## $ tuition : int 19610 8778 9080 7412 10200 7510 7092 27324 10660 9826 ...
## $ faculty_salary_avg: int 5804 5916 7255 7424 9487 9957 6801 8367 7437 9667 ...
## $ loan_default_rate : chr "0.077" "0.136" "0.106" "0.111" ...
## $ median_debt : num 23250 11500 21335 21500 21831 ...
## $ lon : num -149.9 -87.3 -86.3 -87.7 -85.5 ...
## $ lat : num 61.2 32.6 32.4 34.8 32.6 ...
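Examining the summary statistics, there are no missing values in any of the 8 character variables or 9 numeric variables. Note that loan_default_rate is stored as character even though its values look numeric, so it will need to be converted before it can be analyzed. We can now decide which two methodologies to apply to the data: K-means clustering from week 10 and, as the second methodology, principal component analysis (PCA).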
skim(college)
| Name | college |
| Number of rows | 1270 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1 | 8 | 65 | 0 | 1261 | 0 |
| city | 0 | 1 | 3 | 23 | 0 | 834 | 0 |
| state | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| region | 0 | 1 | 4 | 9 | 0 | 4 | 0 |
| highest_degree | 0 | 1 | 8 | 9 | 0 | 4 | 0 |
| control | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
| gender | 0 | 1 | 3 | 5 | 0 | 3 | 0 |
| loan_default_rate | 0 | 1 | 1 | 5 | 0 | 198 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 187221.94 | 53474.50 | 100654.00 | 153254.75 | 186327.00 | 215290.75 | 484905.00 | ▆▇▁▁▁ |
| admission_rate | 0 | 1 | 0.65 | 0.19 | 0.05 | 0.53 | 0.67 | 0.79 | 1.00 | ▁▂▆▇▅ |
| sat_avg | 0 | 1 | 1059.58 | 133.01 | 720.00 | 973.25 | 1040.50 | 1120.75 | 1545.00 | ▁▇▆▂▁ |
| undergrads | 0 | 1 | 5624.98 | 7377.91 | 47.00 | 1293.75 | 2554.50 | 6712.75 | 52280.00 | ▇▁▁▁▁ |
| tuition | 0 | 1 | 21011.31 | 12550.28 | 2732.00 | 8966.25 | 19995.00 | 30354.50 | 51008.00 | ▇▃▆▃▂ |
| faculty_salary_avg | 0 | 1 | 7655.02 | 2222.86 | 1451.00 | 6191.00 | 7268.50 | 8670.50 | 20650.00 | ▁▇▂▁▁ |
| median_debt | 0 | 1 | 23476.80 | 4618.13 | 6056.00 | 21250.00 | 24544.25 | 27000.00 | 41000.00 | ▁▂▇▁▁ |
| lon | 0 | 1 | -88.29 | 13.93 | -157.92 | -94.17 | -84.88 | -78.63 | -68.59 | ▁▁▁▅▇ |
| lat | 0 | 1 | 38.60 | 4.64 | 19.71 | 35.20 | 39.74 | 41.81 | 61.22 | ▁▃▇▁▁ |
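The skim output lists loan_default_rate as a character column even though its printed values look numeric, which usually means a few entries are not parseable numbers. The following is a minimal sketch of how one might count such entries. It uses a small made-up vector, and the "NULL" placeholder is an assumption for illustration, not a value confirmed in the data.

```r
# Hypothetical values standing in for college$loan_default_rate;
# the "NULL" placeholder is an assumption for illustration only.
rate_chr <- c("0.077", "0.136", "NULL", "0.111")

# as.numeric() returns NA (with a warning) for strings it cannot parse.
rate_num <- suppressWarnings(as.numeric(rate_chr))

# Entries that were present as text but failed to convert.
n_bad <- sum(is.na(rate_num) & !is.na(rate_chr))
bad_values <- unique(rate_chr[is.na(rate_num)])

print(n_bad)
print(bad_values)
```

Running the same check on the real column would show whether any rows need to be imputed or dropped before the column is treated as numeric.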
K-means clustering is one of the most commonly used clustering approaches, so let us see how it fares in this case.
First we need to decide what we want to examine from the dataset. Since I live in Indiana, we can look at Indiana college admission rates and tease out different clusters based on SAT average scores. The resulting dataset has 40 observations of 16 variables.
indiana_college <- college %>%
  filter(state == "IN") %>%
  column_to_rownames(var = "name")
head(indiana_college)
## id city state region
## Oakland City University 152099 Oakland City IN Midwest
## Indiana University-Kokomo 151333 Kokomo IN Midwest
## Goshen College 150668 Goshen IN Midwest
## Indiana University-East 151388 Richmond IN Midwest
## University of Notre Dame 152080 Notre Dame IN Midwest
## Purdue University-Calumet Campus 152248 Hammond IN Midwest
## highest_degree control gender admission_rate
## Oakland City University Graduate Private CoEd 0.5504
## Indiana University-Kokomo Graduate Public CoEd 0.6726
## Goshen College Graduate Private CoEd 0.5400
## Indiana University-East Graduate Public CoEd 0.6376
## University of Notre Dame Graduate Private CoEd 0.2114
## Purdue University-Calumet Campus Graduate Public CoEd 0.5968
## sat_avg undergrads tuition faculty_salary_avg
## Oakland City University 960 616 19800 4290
## Indiana University-Kokomo 934 2724 6811 5696
## Goshen College 1109 746 29700 5910
## Indiana University-East 943 3200 6787 5819
## University of Notre Dame 1450 8427 46237 13102
## Purdue University-Calumet Campus 965 7316 6758 8118
## loan_default_rate median_debt lon
## Oakland City University 0.078 16973 -87.34501
## Indiana University-Kokomo 0.108 19052 -86.13360
## Goshen College 0.033 20671 -85.83444
## Indiana University-East 0.132 21172 -84.89024
## University of Notre Dame 0.006 21250 -86.23534
## Purdue University-Calumet Campus 0.094 22028 -87.50004
## lat
## Oakland City University 38.33866
## Indiana University-Kokomo 40.48643
## Goshen College 41.58227
## Indiana University-East 39.82894
## University of Notre Dame 41.70557
## Purdue University-Calumet Campus 41.58337
Next we need to decide which variables we will cluster on. We can
cluster on admission_rate and sat_avg to see
what happens, so let’s take a look at the distributions of each of those
metrics.
indiana_college %>%
  select(admission_rate, sat_avg) %>%
  summary()
## admission_rate sat_avg
## Min. :0.2114 Min. : 913
## 1st Qu.:0.6279 1st Qu.: 965
## Median :0.7096 Median :1032
## Mean :0.7215 Mean :1060
## 3rd Qu.:0.8240 3rd Qu.:1132
## Max. :1.0000 Max. :1450
The ranges of each variable are quite different, so we need to scale them for the K-means process.
indiana_college_scaled <- indiana_college %>%
  select(admission_rate, sat_avg) %>%
  scale()
head(indiana_college_scaled)
## admission_rate sat_avg
## Oakland City University -1.1180533 -0.8567232
## Indiana University-Kokomo -0.3194205 -1.0788050
## Goshen College -1.1860221 0.4159763
## Indiana University-East -0.5481615 -1.0019305
## University of Notre Dame -3.3335732 3.3286644
## Purdue University-Calumet Campus -0.8148081 -0.8140152
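To make the scaling step concrete, here is a small sketch of what scale() computes: a z-score, meaning each value minus the column mean, divided by the column's sample standard deviation. The six admission rates are copied from the head() output shown earlier and are used only as a toy sample, so the results will not match the scaled values above, which use the mean and standard deviation of all 40 colleges.

```r
# Six admission rates taken from the head() output, used as a toy sample.
admission_rate <- c(0.5504, 0.6726, 0.5400, 0.6376, 0.2114, 0.5968)

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z_manual <- (admission_rate - mean(admission_rate)) / sd(admission_rate)

# scale() returns a one-column matrix holding the same values.
z_scale <- as.numeric(scale(admission_rate))

print(round(z_manual, 3))
print(isTRUE(all.equal(z_manual, z_scale)))
```

After scaling, every column has mean 0 and standard deviation 1, so admission_rate (a fraction) and sat_avg (in the hundreds or thousands) contribute equally to the distances K-means uses.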
Before clustering the data, it would be good to see how many clusters are recommended based on the data itself. There are three methods we can use to recommend the number of clusters, so we will try them all.
The first method is the Elbow method, which is selected by
setting method = "wss". It uses the within-cluster sum of
squares (WCSS), the sum of squared distances between each item in a
cluster and the cluster’s centroid, and looks for the number of clusters
beyond which WCSS stops dropping sharply. The
Elbow method suggests that 6 clusters are optimal.
fviz_nbclust(indiana_college_scaled, kmeans, method = "wss")
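The quantity behind the elbow plot can also be computed by hand. This is a sketch using base R's kmeans() on synthetic data (the toy matrix and the range of k are assumptions for illustration, not the Indiana data): for each candidate k it records the total within-cluster sum of squares.

```r
set.seed(101)
# Synthetic two-column data with three loose groups (illustrative only).
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)

k_values <- 1:6
wss <- sapply(k_values, function(k) {
  if (k == 1) {
    # One cluster: WCSS is the total sum of squares around the column means.
    sum(scale(toy, scale = FALSE)^2)
  } else {
    kmeans(toy, centers = k, nstart = 25)$tot.withinss
  }
})

# The "elbow" is where adding another cluster stops reducing WCSS by much.
print(round(wss, 1))
```

Plotting wss against k_values gives the same kind of curve that the "wss" method draws; choosing the elbow remains a visual judgment.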
The second method is the Average Silhouette method, which is
selected by setting method = "silhouette". This method
compares how close each item is to the other items in its own cluster
with how close it is to the items in the nearest neighboring cluster.
The Silhouette method suggests that 6 clusters are optimal.
fviz_nbclust(indiana_college_scaled, kmeans, method = "silhouette")
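As a rough illustration of the silhouette calculation, the sketch below uses silhouette() from the cluster package on synthetic data with three well-separated groups (the data and the range of k are assumptions). For each k it averages the silhouette widths and keeps the k with the highest average.

```r
library(cluster)  # provides silhouette()

set.seed(101)
# Synthetic data: three well-separated groups of 20 points each.
toy <- rbind(
  cbind(rnorm(20, 0, 0.5), rnorm(20, 0, 0.5)),
  cbind(rnorm(20, 5, 0.5), rnorm(20, 5, 0.5)),
  cbind(rnorm(20, 10, 0.5), rnorm(20, 0, 0.5))
)
d <- dist(toy)  # pairwise Euclidean distances

avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])  # average silhouette width for this k
})
names(avg_sil) <- 2:6

# The suggested k is the one with the largest average silhouette width.
best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```

Silhouette widths range from -1 to 1, and values near 1 indicate items that sit much closer to their own cluster than to the nearest other cluster.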
The third method is the Gap Statistic, which is selected by setting
method = "gap_stat". It generates random reference datasets
and measures the difference in WCSS between the original data and
the reference data. The Gap Statistic method appears to
show that 1 cluster is optimal, but since 1 cluster does not separate
the data in any way, the second best choice appears to be 9 clusters.
However, given that there are only 40 unique colleges in the data, using
9 clusters could result in clusters with only one or two colleges, so
the third best choice under the Gap Statistic method, 6 clusters,
appears to be the most practical.
fviz_nbclust(indiana_college_scaled, kmeans, method = "gap_stat")
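The judgment call described above can be seen in the numbers themselves. This sketch uses clusGap() and maxSE() from the cluster package on synthetic data (the data, K.max and B are assumptions chosen for speed) and shows that different selection rules applied to the same gap curve can point to different values of k.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(101)
# Synthetic data: three groups of 20 points each (illustrative only).
toy <- rbind(
  cbind(rnorm(20, 0, 0.5), rnorm(20, 0, 0.5)),
  cbind(rnorm(20, 5, 0.5), rnorm(20, 5, 0.5)),
  cbind(rnorm(20, 10, 0.5), rnorm(20, 0, 0.5))
)

# B reference datasets are simulated for each k from 1 to K.max.
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 25, verbose = FALSE)
tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim

# Two of the selection rules offered by maxSE().
k_first  <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
k_global <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "globalmax")

print(round(tab, 3))
print(c(firstSEmax = k_first, globalmax = k_global))
```

When the rules disagree, or when the top choice is not useful (such as a single cluster), inspecting the gap values directly and weighing them against cluster sizes is a reasonable way to settle on k.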
Since the Elbow and Silhouette methods both suggested 6 clusters, and 6 clusters was also the most practical choice under the Gap Statistic, we can create a six-cluster grouping to visualize the data.
First we will try six centers with 25 initial configurations. The K-means algorithm selected the best result with six clusters of sizes 1, 7, 6, 14, 8 and 4.
set.seed(101)
k_6 <- kmeans(indiana_college_scaled, centers = 6, nstart = 25)
k_6$size
## [1] 1 7 6 14 8 4
It may be more useful to analyze the clusters visually using the
factoextra package’s fviz_cluster function.
Here we see that with six clusters the University of Notre Dame stands
alone with the lowest admission rate and highest SAT averages. This
makes sense since Notre Dame is a highly competitive private
university.
The other five clusters show that private colleges and universities that specialize in engineering are the second most selective, with lower admission rates and higher SAT averages (clusters 2 and 6), while public and liberal arts focused institutions are less selective, with higher admission rates and lower SAT averages (clusters 3, 4 and 5).
fviz_cluster(k_6, data = indiana_college_scaled, repel = TRUE)
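Alongside the plot, it can help to summarize each cluster on the original, unscaled units. The sketch below uses a small made-up data frame (the values are hypothetical, not the Indiana data): it clusters the scaled values, attaches the cluster labels back to the unscaled rows, and reports per-cluster means.

```r
set.seed(101)
# Hypothetical unscaled values standing in for admission_rate and sat_avg.
toy <- data.frame(
  admission_rate = c(0.21, 0.25, 0.55, 0.60, 0.58, 0.90, 0.95, 0.88),
  sat_avg        = c(1450, 1400, 1100, 1080, 1120, 950, 930, 960)
)

# Cluster on the scaled values, as in the analysis above.
km <- kmeans(scale(toy), centers = 3, nstart = 25)

# Attach labels to the unscaled data and average within each cluster.
toy$cluster <- km$cluster
profile <- aggregate(cbind(admission_rate, sat_avg) ~ cluster,
                     data = toy, FUN = mean)
print(profile)
```

A table like this makes statements such as "lower admission rates and higher SAT averages" checkable per cluster, since the cluster numbers themselves carry no ordering.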
Principal component analysis (PCA) summarizes the variance in a dataset with a small number of components, each of which is a linear combination of the original predictor variables. Knowing which predictors contribute most to those components is valuable to a data scientist, so we will now calculate the principal components and the share of the total variance each one explains for the Indiana colleges dataset.
Before we can dive into PCA we need to clean up the dataset by either
removing non-numeric predictors or converting them to numeric values. We will
remove the id, city, state,
region, lon and lat predictors,
since all institutions are in the state of Indiana in the Midwest
region. We will convert loan_default_rate from character to numeric,
replacing any entries that fail to convert with the column mean, and we will
convert highest_degree, control and
gender to factors and then to numeric codes.
Last we will scale the data so that every variable carries equal weight in the analysis.
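indiana_college_PCA <- indiana_college %>%
  # drop identifiers and location fields that do not vary meaningfully within Indiana
  select(-id, -city, -state, -region, -lon, -lat) %>%
  # as.numeric() on text returns NA (with a warning) for entries that are not numbers
  mutate(loan_default_rate = as.numeric(loan_default_rate),
         highest_degree = as.numeric(as.factor(highest_degree)),
         control = as.numeric(as.factor(control)),
         gender = as.numeric(as.factor(gender))) %>%
  # replace any NA default rates with the column mean
  mutate(loan_default_rate = ifelse(is.na(loan_default_rate), mean(loan_default_rate, na.rm = TRUE), loan_default_rate)) %>%
  scale()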
#view first ten rows of data
head(indiana_college_PCA,10)
## highest_degree control gender
## Oakland City University 0.4147997 -0.7245688 -0.2143418
## Indiana University-Kokomo 0.4147997 1.3456278 -0.2143418
## Goshen College 0.4147997 -0.7245688 -0.2143418
## Indiana University-East 0.4147997 1.3456278 -0.2143418
## University of Notre Dame 0.4147997 -0.7245688 -0.2143418
## Purdue University-Calumet Campus 0.4147997 1.3456278 -0.2143418
## Grace College and Theological Seminary 0.4147997 -0.7245688 -0.2143418
## Indiana University-Southeast 0.4147997 1.3456278 -0.2143418
## Purdue University-Main Campus 0.4147997 1.3456278 -0.2143418
## University of Southern Indiana 0.4147997 1.3456278 -0.2143418
## admission_rate sat_avg undergrads
## Oakland City University -1.11805331 -0.8567232 -0.67177055
## Indiana University-Kokomo -0.31942046 -1.0788050 -0.38736183
## Goshen College -1.18602206 0.4159763 -0.65423111
## Indiana University-East -0.54816146 -1.0019305 -0.32314051
## University of Notre Dame -3.33357324 3.3286644 0.38207977
## Purdue University-Calumet Campus -0.81480810 -0.8140152 0.23218504
## Grace College and Theological Seminary 0.52103931 0.4245179 -0.55547059
## Indiana University-Southeast 0.53411023 -0.8652648 -0.01606545
## Purdue University-Main Campus -0.84029639 1.3043034 3.28957871
## University of Southern Indiana -0.05277382 -0.5065173 0.35590553
## tuition faculty_salary_avg
## Oakland City University -0.2360107 -1.6917626
## Indiana University-Kokomo -1.3015085 -0.8615583
## Goshen College 0.5760939 -0.7351972
## Indiana University-East -1.3034773 -0.7889302
## University of Notre Dame 1.9326369 3.5114805
## Purdue University-Calumet Campus -1.3058561 0.5685659
## Grace College and Theological Seminary 0.1634791 -0.4311039
## Indiana University-Southeast -1.3001960 -0.3135999
## Purdue University-Main Campus -1.0397483 2.3535641
## University of Southern Indiana -1.2895320 -0.2185338
## loan_default_rate median_debt
## Oakland City University 0.4301220 -3.0585360
## Indiana University-Kokomo 1.3064316 -2.2432741
## Goshen College -0.8843424 -1.6083972
## Indiana University-East 2.0074792 -1.4119344
## University of Notre Dame -1.6730211 -1.3813474
## Purdue University-Calumet Campus 0.8974871 -1.0762614
## Grace College and Theological Seminary -0.5046083 -0.9401883
## Indiana University-Southeast 1.2188006 -0.8911707
## Purdue University-Main Campus -0.8551321 -0.8911707
## University of Southern Indiana 0.1088084 -0.7476469
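To read the scaled control column, it helps to know how a factor becomes a number. A minimal sketch with a made-up vector: as.factor() orders levels alphabetically by default, and as.numeric() then returns each value's level position.

```r
# Made-up vector using the two control values that appear in the dataset.
control <- c("Private", "Public", "Private", "Public")

control_fac <- as.factor(control)
print(levels(control_fac))  # levels are sorted alphabetically

# Each value is replaced by the position of its level.
control_num <- as.numeric(control_fac)
print(control_num)
```

This is why, in the output above, the public institutions carry the larger scaled control value (1.3456278) and the private ones the smaller (-0.7245688). For a factor with more than two levels, such as highest_degree, the same coding imposes an alphabetical ordering on the categories, which is worth keeping in mind when interpreting loadings.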
Having completed the transformations, we can look at a summary of the data for any other issues that need to be addressed. There are no missing values and the data is appropriately scaled, so we may proceed with the PCA.
skim(indiana_college_PCA)
| Name | indiana_college_PCA |
| Number of rows | 40 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| highest_degree | 0 | 1 | 0 | 1 | -2.35 | 0.41 | 0.41 | 0.41 | 0.41 | ▂▁▁▁▇ |
| control | 0 | 1 | 0 | 1 | -0.72 | -0.72 | -0.72 | 1.35 | 1.35 | ▇▁▁▁▅ |
| gender | 0 | 1 | 0 | 1 | -0.21 | -0.21 | -0.21 | -0.21 | 5.50 | ▇▁▁▁▁ |
| admission_rate | 0 | 1 | 0 | 1 | -3.33 | -0.61 | -0.08 | 0.67 | 1.82 | ▁▁▆▇▃ |
| sat_avg | 0 | 1 | 0 | 1 | -1.26 | -0.81 | -0.24 | 0.61 | 3.33 | ▇▅▃▁▁ |
| undergrads | 0 | 1 | 0 | 1 | -0.69 | -0.56 | -0.42 | 0.14 | 3.60 | ▇▂▁▁▁ |
| tuition | 0 | 1 | 0 | 1 | -1.31 | -1.14 | 0.25 | 0.57 | 1.93 | ▇▁▇▃▂ |
| faculty_salary_avg | 0 | 1 | 0 | 1 | -1.69 | -0.63 | -0.09 | 0.28 | 3.51 | ▃▇▁▁▁ |
| loan_default_rate | 0 | 1 | 0 | 1 | -1.67 | -0.81 | -0.18 | 0.91 | 2.01 | ▅▇▅▆▂ |
| median_debt | 0 | 1 | 0 | 1 | -3.06 | -0.71 | 0.19 | 0.87 | 1.27 | ▁▂▃▆▇ |
Start by calculating the principal components and the loading of each predictor variable on each component. For example, PC1 is heavily influenced by tuition, sat_avg, loan_default_rate and control (public/private status), while PC2 is more influenced by faculty_salary_avg and the number of undergrads.
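#calculate principal components (the data are already scaled, so scale. = TRUE changes nothing)
results <- prcomp(indiana_college_PCA, scale. = TRUE)
#reverse the signs of the loadings (the sign of each component is arbitrary)
results$rotation <- -1*results$rotation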
#display principal components
results$rotation
## PC1 PC2 PC3 PC4 PC5
## highest_degree 0.2430693 0.2080608 0.53477385 -0.26825327 0.30370254
## control 0.4083806 0.2885812 -0.29550726 -0.03929811 -0.13478781
## gender -0.1934048 -0.1431852 -0.68501589 0.14937905 0.29354754
## admission_rate 0.1375515 -0.3716774 -0.17284915 -0.56628499 0.41318496
## sat_avg -0.4263132 0.3342207 0.03151237 -0.06296505 0.12259994
## undergrads 0.1326689 0.4834893 -0.26631091 -0.36478111 0.08598119
## tuition -0.4969885 -0.1340877 0.17234101 0.05177564 -0.03685151
## faculty_salary_avg -0.2397520 0.4926921 -0.14188722 -0.12269200 -0.30675182
## loan_default_rate 0.4240709 -0.1610670 -0.06132624 0.21617780 -0.41864440
## median_debt -0.1848191 -0.2874935 -0.04538663 -0.61806659 -0.58272094
## PC6 PC7 PC8 PC9
## highest_degree 6.553912e-01 -0.092027525 0.07301850 -0.011559371
## control 7.586281e-02 0.080529989 -0.57112957 0.241274194
## gender 5.902866e-01 0.007402989 0.12427419 -0.004697188
## admission_rate -2.679841e-01 -0.467358804 -0.15829474 -0.070913100
## sat_avg -2.123008e-02 -0.255622593 -0.23292506 0.668935168
## undergrads -2.313999e-01 0.171737032 0.64528468 0.151384782
## tuition 6.163027e-05 -0.222822020 0.18965778 0.133473266
## faculty_salary_avg 8.246360e-02 -0.457885352 -0.10967517 -0.582846939
## loan_default_rate 1.198811e-01 -0.589110974 0.32975146 0.303382270
## median_debt 2.632208e-01 0.255675366 -0.02443088 0.129012913
## PC10
## highest_degree -0.06307069
## control -0.49625299
## gender 0.02005803
## admission_rate -0.03650373
## sat_avg 0.34502646
## undergrads -0.12209993
## tuition -0.77320627
## faculty_salary_avg 0.04788588
## loan_default_rate 0.07914808
## median_debt 0.08753946
#reverse the signs of the scores
results$x <- -1*results$x
#display the first 6 scores
head(results$x)
## PC1 PC2 PC3 PC4
## Oakland City University 1.2392756 -0.27951718 1.23972801 2.9960221
## Indiana University-Kokomo 2.8784039 0.26073130 0.01502028 1.9011763
## Goshen College -0.7686998 0.33679815 1.30593997 1.6922221
## Indiana University-East 2.9498999 0.08661942 -0.05149099 1.6311566
## University of Notre Dame -4.2370843 4.58168387 1.16224755 1.5852997
## Purdue University-Calumet Campus 2.0497516 1.26845093 -0.28755370 0.9536492
## PC5 PC6 PC7 PC8
## Oakland City University 1.6658160 -0.32940856 0.3198925 0.71786375
## Indiana University-Kokomo 0.6566547 -0.05943309 -0.2315532 -0.37973910
## Goshen College 1.1771848 -0.03921299 0.5555178 0.02373297
## Indiana University-East -0.2230460 0.29423028 -0.3665293 -0.11747332
## University of Notre Dame -0.4187500 0.55015432 -0.7314700 -0.12013461
## Purdue University-Calumet Campus -0.4096729 0.30042858 -0.0758907 -0.28424879
## PC9 PC10
## Oakland City University -0.08367304 0.02412672
## Indiana University-Kokomo -0.00138606 -0.13938760
## Goshen College 0.11432001 -0.09563016
## Indiana University-East 0.35332599 0.02090632
## University of Notre Dame -0.13218895 -0.02689376
## Purdue University-Calumet Campus -0.50297001 0.03604563
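Reversing the signs of both the loadings and the scores does not change what the PCA represents, because each component is only defined up to its sign. A small sketch on synthetic data (the toy matrix is an assumption) shows that the data reconstructed from scores and loadings are identical before and after the flip.

```r
set.seed(101)
# Synthetic, standardized data with 20 rows and 3 columns.
toy <- scale(matrix(rnorm(60), ncol = 3))
pca <- prcomp(toy)

# Flip the signs of the loadings and of the scores together.
flipped_rotation <- -1 * pca$rotation
flipped_scores   <- -1 * pca$x

# Scores times transposed loadings rebuilds the (centered) data.
recon_original <- pca$x %*% t(pca$rotation)
recon_flipped  <- flipped_scores %*% t(flipped_rotation)

print(isTRUE(all.equal(recon_original, recon_flipped)))
# The loading vectors are orthonormal: their cross-product is the identity.
print(round(crossprod(pca$rotation), 10))
```

Flipping only one of the two (loadings without scores, or the reverse) would break this relationship, so the two sign changes belong together.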
Graphing the data also helps show its structure. The following graph plots each institution by its scores on PC1 and PC2.
#Graph of individuals. Individuals with a similar profile are grouped together.
fviz_pca_ind(results,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
This graph displays how the variables correlate with each other: positively correlated variables point to the same side of the plot, and negatively correlated variables point to opposite sides. The length of each vector indicates how well that variable is represented by the first two PCs. For example, sat_avg is positively correlated with faculty_salary_avg, but negatively correlated with loan_default_rate.
#Graph of variables. Positive correlated variables point to the same side of the plot. Negative correlated variables point to opposite sides of the graph.
fviz_pca_var(results,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Here we see a graph that shows both individual institutions and predictor variables mapped together.
#Biplot of individuals and variables
fviz_pca_biplot(results, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
As a reminder, we are interested in which variables, or components, are most strongly associated with admission rates, so here are the 10 Indiana colleges and universities with the highest admission rates.
#display colleges with highest admission rates in original dataset
highest_admission <- indiana_college %>%
  select(city, admission_rate)
head(highest_admission[order(-highest_admission$admission_rate), ], 10)
## city
## Saint Mary-of-the-Woods College Saint Mary of the Woods
## Huntington University Huntington
## University of Saint Francis-Fort Wayne Fort Wayne
## Indiana Wesleyan University-Marion Marion
## Holy Cross College Notre Dame
## Indiana University-Purdue University-Fort Wayne Fort Wayne
## Taylor University Upland
## Saint Mary's College Notre Dame
## University of Evansville Evansville
## Indiana State University Terre Haute
## admission_rate
## Saint Mary-of-the-Woods College 1.0000
## Huntington University 0.9711
## University of Saint Francis-Fort Wayne 0.9686
## Indiana Wesleyan University-Marion 0.9505
## Holy Cross College 0.9316
## Indiana University-Purdue University-Fort Wayne 0.9087
## Taylor University 0.8803
## Saint Mary's College 0.8334
## University of Evansville 0.8295
## Indiana State University 0.8254
The proportion of variance explained by each PC is listed below in PC order. PC1 explains about 35.7% of the total variance in the data and PC2 explains about 25.3%, so the first two PCs together explain about 61.0% of the total variance in the scaled Indiana college data.
#calculate total variance explained by each principal component
results$sdev^2 / sum(results$sdev^2)
## [1] 0.357140538 0.252811901 0.124284785 0.107002164 0.061211599 0.043576857
## [7] 0.028453637 0.018722855 0.005118028 0.001677635
#store the proportion of variance explained by each principal component for plotting
var_explained <- results$sdev^2 / sum(results$sdev^2)
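The cumulative share of variance can be checked with a short calculation. This sketch copies the ten proportions printed above into a vector and accumulates them; the 90% threshold is an arbitrary choice for illustration.

```r
# Proportions of variance copied from the printed output.
var_explained <- c(0.357140538, 0.252811901, 0.124284785, 0.107002164,
                   0.061211599, 0.043576857, 0.028453637, 0.018722855,
                   0.005118028, 0.001677635)

# Running total of variance explained as components are added.
cum_var <- cumsum(var_explained)
print(round(cum_var, 4))

# Smallest number of components reaching an (arbitrary) 90% threshold.
n_90 <- which(cum_var >= 0.90)[1]
print(n_90)
```

The second entry of cum_var is the share explained by PC1 and PC2 together, and a threshold such as the one used here is one common way to decide how many components to keep.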
The scree plot graphically depicts the variance explained by each distinct PC.
#create scree plot
qplot(c(1:10), var_explained) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot") +
ylim(0, 1)
Lastly we can look at the correlation plot of the cos2 values to see graphically which predictor variables are best represented by each PC (shown as Dim on the top axis). Again, PC1 is composed mostly of control, sat_avg, tuition and loan_default_rate.
var <- get_pca_var(results)
corrplot(var$cos2, is.corr=TRUE)
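For readers wondering what cos2 measures, the sketch below reproduces it in base R on synthetic data (the toy matrix is an assumption). For a PCA on standardized variables, each loading multiplied by its component's standard deviation is the correlation between the variable and the component, and cos2 is the square of that correlation. To my understanding this matches what get_pca_var() reports for a prcomp result, but treat that as an assumption rather than a statement about the package internals.

```r
set.seed(101)
# Synthetic data with 50 rows and 4 columns; column 2 is tied to column 1.
toy <- matrix(rnorm(200), ncol = 4)
toy[, 2] <- toy[, 1] + 0.5 * toy[, 2]

pca <- prcomp(toy, scale. = TRUE)

# Variable coordinates: each loading column times that component's sdev.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: squared correlation between each variable and each component.
cos2 <- coord^2

print(round(cos2, 3))
print(round(rowSums(cos2), 6))  # each variable's cos2 values sum to 1
```

Because each row sums to 1 across all components, a large cos2 on PC1 means that most of that variable's variance is captured by PC1 alone.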
Both the clustering exercise and the PCA exercise pointed to certain institutions being more selective than others. The reasons were more clearly explained by the PCA than by the clustering. In PC1, admission rate moves together with tuition, the loan default rate of an institution's graduates, average SAT scores and whether the institution is public or private (control). In PC2, it also moves with the total number of undergraduates who attend the institution and with the average faculty salary; there, undergraduate enrollment and faculty salary load in the opposite direction to admission rate, so larger institutions that pay their faculty more tend to admit a smaller share of applicants. So if an institution wants a more selective admission rate, the pattern in this data suggests the profile of a private school that charges higher tuition, enrolls more students with higher SAT scores, and pays its faculty higher average salaries.