The dataset on American college and university rankings contains information on 1302 American colleges and universities offering an undergraduate program. For each university, there are 17 measurements that include continuous measurements (such as tuition and graduation rate) and categorical measurements (such as location by state and whether it is a private or a public school)

library(ggplot2)
library(networkD3)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
setwd("/Users/ajarakudaibergenova/Documents/UNH/USML/")

uni <- read.csv("Universities.csv")
summary(uni)
##  College.Name          State           Public..1...Private..2.
##  Length:1302        Length:1302        Min.   :1.000          
##  Class :character   Class :character   1st Qu.:1.000          
##  Mode  :character   Mode  :character   Median :2.000          
##                                        Mean   :1.639          
##                                        3rd Qu.:2.000          
##                                        Max.   :2.000          
##                                                               
##  X..appli..rec.d   X..appl..accepted X..new.stud..enrolled
##  Min.   :   35.0   Min.   :   35.0   Min.   :  18.0       
##  1st Qu.:  695.8   1st Qu.:  554.5   1st Qu.: 236.0       
##  Median : 1470.0   Median : 1095.0   Median : 447.0       
##  Mean   : 2752.1   Mean   : 1870.7   Mean   : 778.9       
##  3rd Qu.: 3314.2   3rd Qu.: 2303.0   3rd Qu.: 984.0       
##  Max.   :48094.0   Max.   :26330.0   Max.   :7425.0       
##  NA's   :10        NA's   :11        NA's   :5            
##  X..new.stud..from.top.10. X..new.stud..from.top.25. X..FT.undergrad
##  Min.   : 1.00             Min.   :  6.00            Min.   :   59  
##  1st Qu.:13.00             1st Qu.: 36.75            1st Qu.:  966  
##  Median :21.00             Median : 50.00            Median : 1812  
##  Mean   :25.67             Mean   : 52.35            Mean   : 3693  
##  3rd Qu.:32.00             3rd Qu.: 66.00            3rd Qu.: 4540  
##  Max.   :98.00             Max.   :100.00            Max.   :31643  
##  NA's   :235               NA's   :202               NA's   :3      
##  X..PT.undergrad   in.state.tuition out.of.state.tuition      room     
##  Min.   :    1.0   Min.   :  480    Min.   : 1044        Min.   : 500  
##  1st Qu.:  131.2   1st Qu.: 2580    1st Qu.: 6111        1st Qu.:1710  
##  Median :  472.0   Median : 8050    Median : 8670        Median :2200  
##  Mean   : 1081.5   Mean   : 7897    Mean   : 9277        Mean   :2515  
##  3rd Qu.: 1313.0   3rd Qu.:11600    3rd Qu.:11659        3rd Qu.:3040  
##  Max.   :21836.0   Max.   :25750    Max.   :25750        Max.   :7400  
##  NA's   :32        NA's   :30       NA's   :20           NA's   :321   
##      board        add..fees      estim..book.costs estim..personal..
##  Min.   : 531   Min.   :   9.0   Min.   :  90      Min.   :  75     
##  1st Qu.:1619   1st Qu.: 130.0   1st Qu.: 480      1st Qu.: 900     
##  Median :1980   Median : 264.5   Median : 502      Median :1250     
##  Mean   :2061   Mean   : 392.0   Mean   : 550      Mean   :1389     
##  3rd Qu.:2402   3rd Qu.: 480.0   3rd Qu.: 600      3rd Qu.:1794     
##  Max.   :6250   Max.   :4374.0   Max.   :2340      Max.   :6900     
##  NA's   :498    NA's   :274      NA's   :48        NA's   :181      
##  X..fac..w.PHD    stud..fac..ratio Graduation.rate 
##  Min.   :  8.00   Min.   : 2.30    Min.   :  8.00  
##  1st Qu.: 57.00   1st Qu.:11.80    1st Qu.: 47.00  
##  Median : 71.00   Median :14.30    Median : 60.00  
##  Mean   : 68.65   Mean   :14.86    Mean   : 60.41  
##  3rd Qu.: 82.00   3rd Qu.:17.60    3rd Qu.: 74.00  
##  Max.   :105.00   Max.   :91.80    Max.   :118.00  
##  NA's   :32       NA's   :2        NA's   :98
str(uni)
## 'data.frame':    1302 obs. of  20 variables:
##  $ College.Name             : chr  "Alaska Pacific University" "University of Alaska at Fairbanks" "University of Alaska Southeast" "University of Alaska at Anchorage" ...
##  $ State                    : chr  "AK" "AK" "AK" "AK" ...
##  $ Public..1...Private..2.  : int  2 1 1 1 1 2 1 1 1 2 ...
##  $ X..appli..rec.d          : int  193 1852 146 2065 2817 345 1351 4639 7548 805 ...
##  $ X..appl..accepted        : int  146 1427 117 1598 1920 320 892 3272 6791 588 ...
##  $ X..new.stud..enrolled    : int  55 928 89 1162 984 179 570 1278 3070 287 ...
##  $ X..new.stud..from.top.10.: int  16 NA 4 NA NA NA 18 NA 25 67 ...
##  $ X..new.stud..from.top.25.: int  44 NA 24 NA NA 27 78 NA 57 88 ...
##  $ X..FT.undergrad          : int  249 3885 492 6209 3958 1367 2385 4051 16262 1376 ...
##  $ X..PT.undergrad          : int  869 4519 1849 10537 305 578 331 405 1716 207 ...
##  $ in.state.tuition         : int  7560 1742 1742 1742 1700 5600 2220 1500 2100 11660 ...
##  $ out.of.state.tuition     : int  7560 5226 5226 5226 3400 5600 4440 3000 6300 11660 ...
##  $ room                     : int  1620 1800 2514 2600 1108 1550 NA 1960 NA 2050 ...
##  $ board                    : int  2500 1790 2250 2520 1442 1700 NA NA NA 2430 ...
##  $ add..fees                : int  130 155 34 114 155 300 124 84 NA 120 ...
##  $ estim..book.costs        : int  800 650 500 580 500 350 300 500 600 400 ...
##  $ estim..personal..        : int  1500 2304 1162 1260 850 NA 600 NA 1908 900 ...
##  $ X..fac..w.PHD            : int  76 67 39 48 53 52 72 48 85 74 ...
##  $ stud..fac..ratio         : num  11.9 10 9.5 13.7 14.3 32.8 18.9 18.7 16.7 14 ...
##  $ Graduation.rate          : int  15 NA 39 NA 40 55 51 15 69 72 ...
# I want to rename columns for better experience

new_colnames <- c(
  "University", "State", "Public_Private", "Applications_Received",
  "Applications_Accepted", "New_Students_Enrolled", "New_Students_Top_10",
  "New_Students_Top_25", "Full-Time_Undergraduates", "Part-Time_Undergraduates",
  "In-State_Tuition", "Out-of-State_Tuition", "Room_Cost", "Board_Cost",
  "Additional_Fees", "Estimated_Book_Costs", "Estimated_Personal_Costs",
  "Faculty_with_PHD", "Student_Faculty_Ratio", "Graduation_Rate"
)

# Assigning the new column names to the dataset
colnames(uni) <- new_colnames
# removing categorical variables
numerical_data <- uni[, sapply(uni, is.numeric)]

# Removing records with missing numerical measurements
cleaned_data <- na.omit(numerical_data)

# Running summary statistics
summary(cleaned_data)
##  Public_Private  Applications_Received Applications_Accepted
##  Min.   :1.000   Min.   :   77         Min.   :   61.0      
##  1st Qu.:1.000   1st Qu.:  802         1st Qu.:  635.5      
##  Median :2.000   Median : 1646         Median : 1227.0      
##  Mean   :1.728   Mean   : 3147         Mean   : 2063.0      
##  3rd Qu.:2.000   3rd Qu.: 3862         3rd Qu.: 2456.0      
##  Max.   :2.000   Max.   :48094         Max.   :26330.0      
##  New_Students_Enrolled New_Students_Top_10 New_Students_Top_25
##  Min.   :  27.0        Min.   : 1.00       Min.   :  9.00     
##  1st Qu.: 264.0        1st Qu.:15.00       1st Qu.: 40.00     
##  Median : 443.0        Median :23.00       Median : 54.00     
##  Mean   : 780.7        Mean   :28.01       Mean   : 55.65     
##  3rd Qu.: 896.5        3rd Qu.:36.00       3rd Qu.: 69.00     
##  Max.   :6392.0        Max.   :96.00       Max.   :100.00     
##  Full-Time_Undergraduates Part-Time_Undergraduates In-State_Tuition
##  Min.   :  249            Min.   :    1.0          Min.   :  608   
##  1st Qu.: 1018            1st Qu.:   81.5          1st Qu.: 3650   
##  Median : 1715            Median :  299.0          Median : 9858   
##  Mean   : 3563            Mean   :  797.5          Mean   : 9407   
##  3rd Qu.: 4056            3rd Qu.:  869.0          3rd Qu.:13246   
##  Max.   :31643            Max.   :21836.0          Max.   :20100   
##  Out-of-State_Tuition   Room_Cost      Board_Cost   Additional_Fees 
##  Min.   : 1044        Min.   : 640   Min.   : 531   Min.   :  10.0  
##  1st Qu.: 7290        1st Qu.:1740   1st Qu.:1750   1st Qu.: 137.5  
##  Median :10100        Median :2090   Median :2082   Median : 280.0  
##  Mean   :10575        Mean   :2221   Mean   :2122   Mean   : 379.0  
##  3rd Qu.:13286        3rd Qu.:2663   3rd Qu.:2420   3rd Qu.: 486.0  
##  Max.   :20100        Max.   :4816   Max.   :4541   Max.   :3247.0  
##  Estimated_Book_Costs Estimated_Personal_Costs Faculty_with_PHD
##  Min.   :  90.0       Min.   : 250             Min.   :  8.00  
##  1st Qu.: 500.0       1st Qu.: 850             1st Qu.: 63.00  
##  Median : 500.0       Median :1200             Median : 76.00  
##  Mean   : 548.8       Mean   :1312             Mean   : 73.21  
##  3rd Qu.: 600.0       3rd Qu.:1600             3rd Qu.: 87.00  
##  Max.   :2340.0       Max.   :6800             Max.   :103.00  
##  Student_Faculty_Ratio Graduation_Rate 
##  Min.   : 2.90         Min.   : 15.00  
##  1st Qu.:11.30         1st Qu.: 53.00  
##  Median :13.40         Median : 66.00  
##  Mean   :13.96         Mean   : 65.56  
##  3rd Qu.:16.45         3rd Qu.: 79.00  
##  Max.   :28.80         Max.   :118.00

We removed all categorical variables. Then removed all records with missing numerical measurements from the dataset. Run the summary statistics of the dataset.

# Running PCA
pca_result <- prcomp(cleaned_data, scale = TRUE)
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3648 2.1884 1.11016 1.03281 0.99083 0.87377 0.83471
## Proportion of Variance 0.3107 0.2661 0.06847 0.05926 0.05454 0.04241 0.03871
## Cumulative Proportion  0.3107 0.5767 0.64522 0.70448 0.75902 0.80143 0.84014
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.77282 0.73387 0.6627 0.62985 0.58466 0.45855 0.43772
## Proportion of Variance 0.03318 0.02992 0.0244 0.02204 0.01899 0.01168 0.01064
## Cumulative Proportion  0.87332 0.90324 0.9276 0.94968 0.96867 0.98035 0.99100
##                           PC15    PC16   PC17   PC18
## Standard deviation     0.30051 0.18897 0.1472 0.1198
## Proportion of Variance 0.00502 0.00198 0.0012 0.0008
## Cumulative Proportion  0.99602 0.99800 0.9992 1.0000
# Creating a scree plot
fviz_eig(pca_result, addlabels = TRUE)

The first principal component (PC1) presents the highest eigenvalue approximately equal to 2.36, explaining about 31.07% of the total variance. The second principal component (PC2) demonstrates an eigenvalue of approximately 2.19, explaining about 26.6% of the total variance. We would have explained around 75% by the time we get to PC5 and about 96.87% by PC 12 overall variant. This analysis helps us determine the number of principal components we can shorten for our dimensionality redunction. We may retain some of these elements that describe significant proportion of the variation depending on our objectives behind the analysis. Herein, I select components that explain 75% of the data. Take for case, we keep the first five principal components with a summative explanation of more than seventy percent of variation.