Descriptive analysis

Getting correlations

##      name             location         scores_teaching   scores_research  
##  Length:1448        Length:1448        Min.   :0.00000   Min.   :0.00000  
##  Class :character   Class :character   1st Qu.:0.08222   1st Qu.:0.04649  
##  Mode  :character   Mode  :character   Median :0.14148   Median :0.10865  
##                                        Mean   :0.19144   Mean   :0.17330  
##                                        3rd Qu.:0.24456   3rd Qu.:0.23486  
##                                        Max.   :1.00000   Max.   :1.00000  
##  scores_citations scores_industry_income scores_international_outlook
##  Min.   :0.0000   Min.   :0.00000        Min.   :0.0000              
##  1st Qu.:0.2166   1st Qu.:0.02099        1st Qu.:0.1618              
##  Median :0.4373   Median :0.07646        Median :0.3324              
##  Mean   :0.4675   Mean   :0.18321        Mean   :0.3869              
##  3rd Qu.:0.7034   3rd Qu.:0.22639        3rd Qu.:0.5725              
##  Max.   :1.0000   Max.   :1.00000        Max.   :1.0000              
##  stats_number_students stats_student_staff_ratio stats_pc_intl_students
##  Min.   :0.00000       Min.   :0.0000            Min.   :0.00000       
##  1st Qu.:0.04148       1st Qu.:0.1044            1st Qu.:0.02381       
##  Median :0.07545       Median :0.1414            Median :0.08333       
##  Mean   :0.09787       Mean   :0.1628            Mean   :0.13409       
##  3rd Qu.:0.12727       3rd Qu.:0.1952            3rd Qu.:0.20238       
##  Max.   :1.00000       Max.   :1.0000            Max.   :1.00000       
##  stats_female_share
##  Min.   :0.0000    
##  1st Qu.:0.4330    
##  Median :0.5361    
##  Mean   :0.5045    
##  3rd Qu.:0.5876    
##  Max.   :1.0000

As we can see, the data set contains 11 variables with 1448 observations each. Two of the variables (name and location) are of character type, while the other nine are numeric and range from 0 to 1.

As can be seen from the plot, stats_student_staff_ratio, stats_number_students and stats_female_share have almost no correlation with the other variables. In other words, the student-to-staff ratio, the number of students and the share of females are essentially uncorrelated with the other characteristics.

There are strong correlations between the scores for citations, research and teaching. Moreover, scores_teaching and scores_research are also correlated with the scores for international outlook and industry income.
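The correlation matrix behind such a plot can be obtained along the following lines. This is only a minimal sketch: the data frame `uni` and its simulated values are hypothetical stand-ins for the ranking data, and only three of the nine numeric columns are mimicked.

```r
# Hypothetical stand-in for the ranking data: two character columns
# plus a few numeric columns on a 0-1 scale (simulated, not the real data).
set.seed(1)
n <- 200
base <- runif(n)
uni <- data.frame(
  name = paste0("uni_", seq_len(n)),
  location = sample(c("A", "B"), n, replace = TRUE),
  scores_teaching = pmin(pmax(base + rnorm(n, sd = 0.1), 0), 1),
  scores_research = pmin(pmax(base + rnorm(n, sd = 0.1), 0), 1),
  stats_female_share = runif(n),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then compute pairwise Pearson correlations.
num_cols <- vapply(uni, is.numeric, logical(1))
cor_mat <- cor(uni[, num_cols], use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

In the simulated data the two score columns share a common component and therefore correlate strongly, while the independent column does not, which mirrors the pattern described above.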

Describing results for PCA

Is it possible to produce an acceptable PCA solution on these data?

Yes, it is. Although three variables (the "outsiders") have low or zero correlations with the rest, the remaining variables are correlated with each other, so a few components should capture most of the variance. Because correlated variables carry overlapping information, we can combine them into principal components and reduce the dimensionality of the data.

PCA

First, let us explore the summary of the components.

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8671 1.2097 1.1265 0.94784 0.81619 0.74787 0.66099
## Proportion of Variance 0.3873 0.1626 0.1410 0.09982 0.07402 0.06215 0.04855
## Cumulative Proportion  0.3873 0.5499 0.6909 0.79075 0.86476 0.92691 0.97545
##                            PC8     PC9
## Standard deviation     0.38634 0.26767
## Proportion of Variance 0.01658 0.00796
## Cumulative Proportion  0.99204 1.00000

The summary shows that the proportion of variance for PC1, PC2 and PC3 is higher than 0.1 (PC4 falls just below, at 0.0998), so we can think of these components as the most important for explaining the variance of the data. To check this, let us look at the cumulative proportion. The first 3 components explain almost 70% of the variance (0.6909), so the summary suggests that we can focus on them in the following analysis.

Next, we determine the number of components using the eigenvalues.
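The proportions in the summary follow directly from the standard deviations. As a small worked check, assuming only the printed (rounded) values, the variance of each component is its squared standard deviation, and the proportion is that variance divided by the total:

```r
# Standard deviations copied from the printed summary (rounded values).
sdev <- c(1.8671, 1.2097, 1.1265, 0.94784, 0.81619,
          0.74787, 0.66099, 0.38634, 0.26767)

variances <- sdev^2                    # variance carried by each component
prop_var <- variances / sum(variances) # proportion of variance
cum_var <- cumsum(prop_var)            # cumulative proportion

print(round(data.frame(variances, prop_var, cum_var), 4))
```

The variances sum to about 9, the number of numeric variables, which is what one expects when the PCA is run on standardized variables.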

## [1] 3.48591462 1.46337701 1.26901618 0.89840475 0.66615817 0.55930826 0.43691022
## [8] 0.14926140 0.07164939

To determine the optimal number of principal components, let us look at the eigenvalues. We keep the components whose eigenvalues are greater than 1 (the Kaiser criterion); these are the first 3 components.

Finally, let us look at the scree plot.

The scree plot confirms the number of components. The first component explains the largest proportion of variance (about 39%), and components 1-3 together explain almost 70%.

To sum up, the analysis shows that we can reduce dimensionality by keeping only 3 components, which together explain about 70% of the variance.
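The eigenvalue rule can be applied mechanically to the printed eigenvalues. A minimal sketch, using only the values shown in the output:

```r
# Eigenvalues copied from the printed output.
eig <- c(3.48591462, 1.46337701, 1.26901618, 0.89840475, 0.66615817,
         0.55930826, 0.43691022, 0.14926140, 0.07164939)

keep <- which(eig > 1)                    # components with eigenvalue above 1
n_keep <- length(keep)                    # number of retained components
share_kept <- sum(eig[keep]) / sum(eig)   # variance explained by retained components

print(n_keep)
print(round(share_kept, 4))
```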
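For readers who want to reproduce this workflow, here is a sketch on simulated data. The "Importance of components" layout matches what base R's `prcomp` prints, and eigenvalues summing to 9 suggest standardized variables, so `scale. = TRUE` is assumed. The data frame `sim` and its column names are hypothetical.

```r
# Simulated data (hypothetical): two correlated blocks plus one
# unrelated variable, loosely mimicking the structure described above.
set.seed(123)
n <- 300
quality <- rnorm(n)
size <- rnorm(n)
sim <- data.frame(
  teaching = quality + rnorm(n, sd = 0.5),
  research = quality + rnorm(n, sd = 0.5),
  citations = quality + rnorm(n, sd = 0.7),
  n_students = size + rnorm(n, sd = 0.5),
  staff_ratio = size + rnorm(n, sd = 0.5),
  female_share = rnorm(n)
)

# PCA on centered and standardized variables.
pca <- prcomp(sim, center = TRUE, scale. = TRUE)
print(summary(pca))

# Eigenvalues are the squared standard deviations of the components.
eig <- pca$sdev^2
n_keep <- sum(eig > 1)

# Scores on the retained components only.
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
```

The retained scores can then be used in place of the original variables in later steps, such as the group comparison on PC1.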

Variables' contribution

Next, we explore the variables' contributions to our 3 components.

##                                PC1   PC2   PC3
## scores_teaching              -0.44  0.23 -0.09
## scores_research              -0.48  0.15 -0.17
## scores_citations             -0.40 -0.18  0.00
## scores_industry_income       -0.28  0.46 -0.27
## scores_international_outlook -0.42 -0.31  0.18
## stats_number_students         0.00 -0.26 -0.65
## stats_student_staff_ratio     0.00 -0.29 -0.59
## stats_pc_intl_students       -0.40 -0.19  0.28
## stats_female_share           -0.04 -0.63  0.10

We can see that

  • for the first component, the most important variables are scores_teaching, scores_research, scores_citations, scores_international_outlook and stats_pc_intl_students. They all have the same (negative) sign, so they lie at the same pole. The component mainly captures the rating scores obtained by universities.
  • for the second component, the most important variables are scores_industry_income (0.46) and stats_female_share (-0.63). Since they have opposite signs, they lie at opposite poles. The component includes socio-demographic characteristics.
  • for the third component, the most important variables are stats_number_students and stats_student_staff_ratio. They have the same (negative) sign, so they lie at the same pole. The component includes stats on the actors within universities.
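The choice of "most important" variables can be made explicit by turning the loadings into percentage contributions. The sketch below uses the rounded loadings from the table, so the results are approximate. It flags variables whose contribution exceeds the average, 100/9 percent; that cutoff is a common rule of thumb and an assumption here, not necessarily the rule used for the list above.

```r
# Loadings copied from the printed table (rounded to two decimals).
vars <- c("scores_teaching", "scores_research", "scores_citations",
          "scores_industry_income", "scores_international_outlook",
          "stats_number_students", "stats_student_staff_ratio",
          "stats_pc_intl_students", "stats_female_share")
loadings <- matrix(c(
  -0.44,  0.23, -0.09,
  -0.48,  0.15, -0.17,
  -0.40, -0.18,  0.00,
  -0.28,  0.46, -0.27,
  -0.42, -0.31,  0.18,
   0.00, -0.26, -0.65,
   0.00, -0.29, -0.59,
  -0.40, -0.19,  0.28,
  -0.04, -0.63,  0.10
), ncol = 3, byrow = TRUE, dimnames = list(vars, c("PC1", "PC2", "PC3")))

# Contribution (%) of each variable to each component: squared loading
# divided by the column total (about 1 for unit-length eigenvectors).
contrib <- sweep(loadings^2, 2, colSums(loadings^2), "/") * 100

# Assumed cutoff: the contribution expected if all variables were equal.
threshold <- 100 / nrow(loadings)
top <- lapply(colnames(contrib), function(pc) {
  rownames(contrib)[contrib[, pc] > threshold]
})
names(top) <- colnames(contrib)

print(round(contrib, 1))
print(top)
```

With this cutoff the flagged variables coincide with the ones listed for each component.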

Finally, we want to estimate the quality of representation for our 3 components.

  • High quality of representation stands for a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.
  • Low quality of representation indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle. (see reference)

In our case, all the variables show at least medium quality of representation (more than 50%), so we can expect them to be positioned closer to the circumference of the correlation circle.

Plotting location of the best universities (US+Canada vs. rest)
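The quality of representation (often called cos2) can be approximated from the numbers already printed. For a PCA on standardized variables, the correlation between a variable and a component is the loading times the component's standard deviation, and cos2 is the square of that correlation. The sketch assumes the rounded loadings and standard deviations shown earlier, so the values are approximate.

```r
# Rounded loadings and standard deviations copied from the printed output.
vars <- c("scores_teaching", "scores_research", "scores_citations",
          "scores_industry_income", "scores_international_outlook",
          "stats_number_students", "stats_student_staff_ratio",
          "stats_pc_intl_students", "stats_female_share")
loadings <- matrix(c(
  -0.44,  0.23, -0.09,
  -0.48,  0.15, -0.17,
  -0.40, -0.18,  0.00,
  -0.28,  0.46, -0.27,
  -0.42, -0.31,  0.18,
   0.00, -0.26, -0.65,
   0.00, -0.29, -0.59,
  -0.40, -0.19,  0.28,
  -0.04, -0.63,  0.10
), ncol = 3, byrow = TRUE, dimnames = list(vars, c("PC1", "PC2", "PC3")))
sdev3 <- c(1.8671, 1.2097, 1.1265)

# Variable-component correlations, then their squares (cos2).
coord <- sweep(loadings, 2, sdev3, "*")
cos2 <- coord^2

# Total quality of representation over the three retained components.
quality <- rowSums(cos2)
print(round(sort(quality, decreasing = TRUE), 2))
```

Every total is above 0.5, in line with the statement that all variables reach at least medium quality on the three components.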

As we can see, most blue dots (representing universities in the US and Canada) lie on the left (negative) side of the PC1 dimension; only a few lie on the right or near 0. Because the score variables load negatively on PC1, lower PC1 values correspond to higher scores. This means that universities in the US and Canada mostly get higher scores than universities elsewhere. Therefore, we can conclude that the best universities are mostly located in the US and Canada.

T-test (countries and 1st component)

Checking assumptions

Before conducting a test, we should check normality of the distribution. Let us do it with a histogram.

The distribution is clearly not normal: it is skewed to the left.
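A visual check can be complemented with numbers. The sketch below is an addition, not part of the original analysis: it computes the sample skewness and runs a Shapiro-Wilk test (`shapiro.test` from base R) on simulated left-skewed values. The vector `pc1_scores` is a hypothetical stand-in for the real PC1 scores.

```r
# Hypothetical left-skewed stand-in for the PC1 scores (simulated).
set.seed(42)
pc1_scores <- -rexp(500)

# Sample skewness: negative values indicate a longer left tail.
centered <- pc1_scores - mean(pc1_scores)
skewness <- mean(centered^3) / mean(centered^2)^1.5

# Shapiro-Wilk test: a small p-value speaks against normality.
sw <- shapiro.test(pc1_scores)

print(round(skewness, 2))
print(sw)
```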

Conducting T-test

For T-test, the following hypotheses are proposed:

  • H0: the mean PC1 scores of universities in the US and Canada and of other universities do not differ.
  • H1: the mean PC1 scores differ.

## 
##  Welch Two Sample t-test
## 
## data:  ttest_data$pca1 by ttest_data$country
## t = 9.5491, df = 253.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.132717 1.721327
## sample estimates:
##     mean in group other mean in group US+Canada 
##                0.198088               -1.228934

  • Statistical conclusion: at the 5% significance level, the null hypothesis is rejected in favor of the alternative (p-value < .05).
  • Substantive conclusion: the mean PC1 scores of US+Canada universities (-1.23) and of other universities (0.20) differ significantly; the 95% confidence interval for the difference is 1.13 to 1.72.
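As a consistency check, the confidence interval can be rebuilt from the other numbers in the printed output. The sketch uses only the group means, the t statistic and the degrees of freedom shown above; the standard error is backed out from them rather than computed from the raw data.

```r
# Values copied from the printed Welch test output.
mean_other <- 0.198088
mean_us_ca <- -1.228934
t_stat <- 9.5491
df <- 253.12

diff_means <- mean_other - mean_us_ca   # estimated difference in means
se <- diff_means / t_stat               # standard error implied by t

# 95% confidence interval and two-sided p-value.
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se
p_value <- 2 * pt(-abs(t_stat), df)

print(round(diff_means, 4))
print(round(ci, 4))
print(p_value)
```

The rebuilt interval agrees with the printed one up to rounding, and the p-value is far below the reporting limit of 2.2e-16.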

Double-checking results with non-parametric test

Since the distribution is not normal, we double-check the results with a non-parametric test. The following hypotheses are proposed:

  • H0: the two populations (universities in the US and Canada, and others) have the same distribution, with the same median PC1 score.
  • H1: the two populations (universities in the US and Canada, and others) have different distributions, with different median PC1 scores.

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  pca1 by country
## W = 182748, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

  • Statistical conclusion: the obtained p-value is lower than .05, so H0 is rejected at the 5% significance level.
  • Substantive conclusion: the Wilcoxon test also indicates that the PC1 scores of universities in US+Canada and of other universities differ significantly.
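Both tests in this section can be rerun with calls along the following lines. The names `ttest_data`, `pca1` and `country` are taken from the printed output, but the values are simulated here, so the results will not match the ones reported above.

```r
# Simulated stand-in for ttest_data (group sizes and spreads are assumptions).
set.seed(7)
ttest_data <- data.frame(
  pca1 = c(rnorm(300, mean = 0.2, sd = 1.8),
           rnorm(60, mean = -1.2, sd = 1.5)),
  country = rep(c("other", "US+Canada"), times = c(300, 60))
)

# Welch two-sample t-test (unequal variances is the default in t.test).
welch <- t.test(pca1 ~ country, data = ttest_data)

# Wilcoxon rank sum test as the non-parametric counterpart.
wilcox <- wilcox.test(pca1 ~ country, data = ttest_data)

print(welch)
print(wilcox)
```

Running both tests side by side makes it easy to see whether the conclusion depends on the normality assumption.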

Plots for PCA solution