The dataset on American college and university rankings contains
information on 1302 American colleges and universities offering an
undergraduate program. For each university, there are 17 measurements
that include continuous measurements (such as tuition and graduation
rate) and categorical measurements (such as location by state and
whether it is a private or a public school)
library(ggplot2)
library(networkD3)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
setwd("/Users/ajarakudaibergenova/Documents/UNH/USML/")
uni <- read.csv("Universities.csv")
summary(uni)
## College.Name State Public..1...Private..2.
## Length:1302 Length:1302 Min. :1.000
## Class :character Class :character 1st Qu.:1.000
## Mode :character Mode :character Median :2.000
## Mean :1.639
## 3rd Qu.:2.000
## Max. :2.000
##
## X..appli..rec.d X..appl..accepted X..new.stud..enrolled
## Min. : 35.0 Min. : 35.0 Min. : 18.0
## 1st Qu.: 695.8 1st Qu.: 554.5 1st Qu.: 236.0
## Median : 1470.0 Median : 1095.0 Median : 447.0
## Mean : 2752.1 Mean : 1870.7 Mean : 778.9
## 3rd Qu.: 3314.2 3rd Qu.: 2303.0 3rd Qu.: 984.0
## Max. :48094.0 Max. :26330.0 Max. :7425.0
## NA's :10 NA's :11 NA's :5
## X..new.stud..from.top.10. X..new.stud..from.top.25. X..FT.undergrad
## Min. : 1.00 Min. : 6.00 Min. : 59
## 1st Qu.:13.00 1st Qu.: 36.75 1st Qu.: 966
## Median :21.00 Median : 50.00 Median : 1812
## Mean :25.67 Mean : 52.35 Mean : 3693
## 3rd Qu.:32.00 3rd Qu.: 66.00 3rd Qu.: 4540
## Max. :98.00 Max. :100.00 Max. :31643
## NA's :235 NA's :202 NA's :3
## X..PT.undergrad in.state.tuition out.of.state.tuition room
## Min. : 1.0 Min. : 480 Min. : 1044 Min. : 500
## 1st Qu.: 131.2 1st Qu.: 2580 1st Qu.: 6111 1st Qu.:1710
## Median : 472.0 Median : 8050 Median : 8670 Median :2200
## Mean : 1081.5 Mean : 7897 Mean : 9277 Mean :2515
## 3rd Qu.: 1313.0 3rd Qu.:11600 3rd Qu.:11659 3rd Qu.:3040
## Max. :21836.0 Max. :25750 Max. :25750 Max. :7400
## NA's :32 NA's :30 NA's :20 NA's :321
## board add..fees estim..book.costs estim..personal..
## Min. : 531 Min. : 9.0 Min. : 90 Min. : 75
## 1st Qu.:1619 1st Qu.: 130.0 1st Qu.: 480 1st Qu.: 900
## Median :1980 Median : 264.5 Median : 502 Median :1250
## Mean :2061 Mean : 392.0 Mean : 550 Mean :1389
## 3rd Qu.:2402 3rd Qu.: 480.0 3rd Qu.: 600 3rd Qu.:1794
## Max. :6250 Max. :4374.0 Max. :2340 Max. :6900
## NA's :498 NA's :274 NA's :48 NA's :181
## X..fac..w.PHD stud..fac..ratio Graduation.rate
## Min. : 8.00 Min. : 2.30 Min. : 8.00
## 1st Qu.: 57.00 1st Qu.:11.80 1st Qu.: 47.00
## Median : 71.00 Median :14.30 Median : 60.00
## Mean : 68.65 Mean :14.86 Mean : 60.41
## 3rd Qu.: 82.00 3rd Qu.:17.60 3rd Qu.: 74.00
## Max. :105.00 Max. :91.80 Max. :118.00
## NA's :32 NA's :2 NA's :98
str(uni)
## 'data.frame': 1302 obs. of 20 variables:
## $ College.Name : chr "Alaska Pacific University" "University of Alaska at Fairbanks" "University of Alaska Southeast" "University of Alaska at Anchorage" ...
## $ State : chr "AK" "AK" "AK" "AK" ...
## $ Public..1...Private..2. : int 2 1 1 1 1 2 1 1 1 2 ...
## $ X..appli..rec.d : int 193 1852 146 2065 2817 345 1351 4639 7548 805 ...
## $ X..appl..accepted : int 146 1427 117 1598 1920 320 892 3272 6791 588 ...
## $ X..new.stud..enrolled : int 55 928 89 1162 984 179 570 1278 3070 287 ...
## $ X..new.stud..from.top.10.: int 16 NA 4 NA NA NA 18 NA 25 67 ...
## $ X..new.stud..from.top.25.: int 44 NA 24 NA NA 27 78 NA 57 88 ...
## $ X..FT.undergrad : int 249 3885 492 6209 3958 1367 2385 4051 16262 1376 ...
## $ X..PT.undergrad : int 869 4519 1849 10537 305 578 331 405 1716 207 ...
## $ in.state.tuition : int 7560 1742 1742 1742 1700 5600 2220 1500 2100 11660 ...
## $ out.of.state.tuition : int 7560 5226 5226 5226 3400 5600 4440 3000 6300 11660 ...
## $ room : int 1620 1800 2514 2600 1108 1550 NA 1960 NA 2050 ...
## $ board : int 2500 1790 2250 2520 1442 1700 NA NA NA 2430 ...
## $ add..fees : int 130 155 34 114 155 300 124 84 NA 120 ...
## $ estim..book.costs : int 800 650 500 580 500 350 300 500 600 400 ...
## $ estim..personal.. : int 1500 2304 1162 1260 850 NA 600 NA 1908 900 ...
## $ X..fac..w.PHD : int 76 67 39 48 53 52 72 48 85 74 ...
## $ stud..fac..ratio : num 11.9 10 9.5 13.7 14.3 32.8 18.9 18.7 16.7 14 ...
## $ Graduation.rate : int 15 NA 39 NA 40 55 51 15 69 72 ...
# I want to rename columns for better experience
new_colnames <- c(
"University", "State", "Public_Private", "Applications_Received",
"Applications_Accepted", "New_Students_Enrolled", "New_Students_Top_10",
"New_Students_Top_25", "Full-Time_Undergraduates", "Part-Time_Undergraduates",
"In-State_Tuition", "Out-of-State_Tuition", "Room_Cost", "Board_Cost",
"Additional_Fees", "Estimated_Book_Costs", "Estimated_Personal_Costs",
"Faculty_with_PHD", "Student_Faculty_Ratio", "Graduation_Rate"
)
# Assigning the new column names to the dataset
colnames(uni) <- new_colnames
# removing categorical variables
numerical_data <- uni[, sapply(uni, is.numeric)]
# Removing records with missing numerical measurements
cleaned_data <- na.omit(numerical_data)
# Running summary statistics
summary(cleaned_data)
## Public_Private Applications_Received Applications_Accepted
## Min. :1.000 Min. : 77 Min. : 61.0
## 1st Qu.:1.000 1st Qu.: 802 1st Qu.: 635.5
## Median :2.000 Median : 1646 Median : 1227.0
## Mean :1.728 Mean : 3147 Mean : 2063.0
## 3rd Qu.:2.000 3rd Qu.: 3862 3rd Qu.: 2456.0
## Max. :2.000 Max. :48094 Max. :26330.0
## New_Students_Enrolled New_Students_Top_10 New_Students_Top_25
## Min. : 27.0 Min. : 1.00 Min. : 9.00
## 1st Qu.: 264.0 1st Qu.:15.00 1st Qu.: 40.00
## Median : 443.0 Median :23.00 Median : 54.00
## Mean : 780.7 Mean :28.01 Mean : 55.65
## 3rd Qu.: 896.5 3rd Qu.:36.00 3rd Qu.: 69.00
## Max. :6392.0 Max. :96.00 Max. :100.00
## Full-Time_Undergraduates Part-Time_Undergraduates In-State_Tuition
## Min. : 249 Min. : 1.0 Min. : 608
## 1st Qu.: 1018 1st Qu.: 81.5 1st Qu.: 3650
## Median : 1715 Median : 299.0 Median : 9858
## Mean : 3563 Mean : 797.5 Mean : 9407
## 3rd Qu.: 4056 3rd Qu.: 869.0 3rd Qu.:13246
## Max. :31643 Max. :21836.0 Max. :20100
## Out-of-State_Tuition Room_Cost Board_Cost Additional_Fees
## Min. : 1044 Min. : 640 Min. : 531 Min. : 10.0
## 1st Qu.: 7290 1st Qu.:1740 1st Qu.:1750 1st Qu.: 137.5
## Median :10100 Median :2090 Median :2082 Median : 280.0
## Mean :10575 Mean :2221 Mean :2122 Mean : 379.0
## 3rd Qu.:13286 3rd Qu.:2663 3rd Qu.:2420 3rd Qu.: 486.0
## Max. :20100 Max. :4816 Max. :4541 Max. :3247.0
## Estimated_Book_Costs Estimated_Personal_Costs Faculty_with_PHD
## Min. : 90.0 Min. : 250 Min. : 8.00
## 1st Qu.: 500.0 1st Qu.: 850 1st Qu.: 63.00
## Median : 500.0 Median :1200 Median : 76.00
## Mean : 548.8 Mean :1312 Mean : 73.21
## 3rd Qu.: 600.0 3rd Qu.:1600 3rd Qu.: 87.00
## Max. :2340.0 Max. :6800 Max. :103.00
## Student_Faculty_Ratio Graduation_Rate
## Min. : 2.90 Min. : 15.00
## 1st Qu.:11.30 1st Qu.: 53.00
## Median :13.40 Median : 66.00
## Mean :13.96 Mean : 65.56
## 3rd Qu.:16.45 3rd Qu.: 79.00
## Max. :28.80 Max. :118.00
We removed all categorical variables. Then removed all records with
missing numerical measurements from the dataset. Run the summary
statistics of the dataset.
# Running PCA
pca_result <- prcomp(cleaned_data, scale = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3648 2.1884 1.11016 1.03281 0.99083 0.87377 0.83471
## Proportion of Variance 0.3107 0.2661 0.06847 0.05926 0.05454 0.04241 0.03871
## Cumulative Proportion 0.3107 0.5767 0.64522 0.70448 0.75902 0.80143 0.84014
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.77282 0.73387 0.6627 0.62985 0.58466 0.45855 0.43772
## Proportion of Variance 0.03318 0.02992 0.0244 0.02204 0.01899 0.01168 0.01064
## Cumulative Proportion 0.87332 0.90324 0.9276 0.94968 0.96867 0.98035 0.99100
## PC15 PC16 PC17 PC18
## Standard deviation 0.30051 0.18897 0.1472 0.1198
## Proportion of Variance 0.00502 0.00198 0.0012 0.0008
## Cumulative Proportion 0.99602 0.99800 0.9992 1.0000
# Creating a scree plot
fviz_eig(pca_result, addlabels = TRUE)

The first principal component (PC1) presents the highest eigenvalue
approximately equal to 2.36, explaining about 31.07% of the total
variance. The second principal component (PC2) demonstrates an
eigenvalue of approximately 2.19, explaining about 26.6% of the total
variance. We would have explained around 75% by the time we get to PC5
and about 96.87% by PC 12 overall variant. This analysis helps us
determine the number of principal components we can shorten for our
dimensionality redunction. We may retain some of these elements that
describe significant proportion of the variation depending on our
objectives behind the analysis. Herein, I select components that explain
75% of the data. Take for case, we keep the first five principal
components with a summative explanation of more than seventy percent of
variation.