## name location scores_teaching scores_research
## Length:1448 Length:1448 Min. :0.00000 Min. :0.00000
## Class :character Class :character 1st Qu.:0.08222 1st Qu.:0.04649
## Mode :character Mode :character Median :0.14148 Median :0.10865
## Mean :0.19144 Mean :0.17330
## 3rd Qu.:0.24456 3rd Qu.:0.23486
## Max. :1.00000 Max. :1.00000
## scores_citations scores_industry_income scores_international_outlook
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.2166 1st Qu.:0.02099 1st Qu.:0.1618
## Median :0.4373 Median :0.07646 Median :0.3324
## Mean :0.4675 Mean :0.18321 Mean :0.3869
## 3rd Qu.:0.7034 3rd Qu.:0.22639 3rd Qu.:0.5725
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## stats_number_students stats_student_staff_ratio stats_pc_intl_students
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.04148 1st Qu.:0.1044 1st Qu.:0.02381
## Median :0.07545 Median :0.1414 Median :0.08333
## Mean :0.09787 Mean :0.1628 Mean :0.13409
## 3rd Qu.:0.12727 3rd Qu.:0.1952 3rd Qu.:0.20238
## Max. :1.00000 Max. :1.0000 Max. :1.00000
## stats_female_share
## Min. :0.0000
## 1st Qu.:0.4330
## Median :0.5361
## Mean :0.5045
## 3rd Qu.:0.5876
## Max. :1.0000
The data set contains 11 variables and 1448 observations. Two of the variables (name and location) are of character type; the other nine are numeric, and each ranges from 0 to 1.
As can be seen from the plot, stats_student_staff_ratio, stats_number_students, and stats_female_share have almost no correlation with the other variables. In other words, the number of students, the student-to-staff ratio, and the share of females are largely unrelated to the other characteristics.
There are strong correlations among the scores for citations, research, and teaching. Moreover, scores_teaching and scores_research are also correlated with the scores for international outlook and industry income.
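The correlation patterns described here can be checked numerically as well as visually. The sketch below is only an illustration: the data frame `demo` and its columns are hypothetical stand-ins, not the university data, and it uses base R's `cor()` rather than whatever plotting function produced the original figure.

```r
# Hypothetical stand-in data: three related "score" columns and one unrelated column.
set.seed(1)
n <- 200
base <- rnorm(n)
demo <- data.frame(
  name      = paste0("u", seq_len(n)),      # character column, excluded below
  score_a   = base + rnorm(n, sd = 0.5),
  score_b   = base + rnorm(n, sd = 0.5),
  score_c   = base + rnorm(n, sd = 0.7),
  unrelated = rnorm(n)
)

# Keep the numeric columns only, then compute Pearson correlations.
num <- demo[sapply(demo, is.numeric)]
cor_mat <- cor(num)
print(round(cor_mat, 2))

# For each variable, the strongest absolute correlation with any other variable.
off_diag <- abs(cor_mat)
diag(off_diag) <- 0
max_abs_cor <- apply(off_diag, 1, max)
print(round(max_abs_cor, 2))
```

Variables whose strongest absolute correlation stays close to zero are the ones that will contribute little to the components shared by the rest.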
Is it possible to produce an acceptable PCA solution on these data?
Yes, it is. Although three variables have low or near-zero correlations with the rest, the remaining variables are correlated with one another. A small number of principal components should therefore capture a large share of the total variance, which lets us reduce the dimensionality of the data.
First, let us explore the summary of the components.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8671 1.2097 1.1265 0.94784 0.81619 0.74787 0.66099
## Proportion of Variance 0.3873 0.1626 0.1410 0.09982 0.07402 0.06215 0.04855
## Cumulative Proportion 0.3873 0.5499 0.6909 0.79075 0.86476 0.92691 0.97545
## PC8 PC9
## Standard deviation 0.38634 0.26767
## Proportion of Variance 0.01658 0.00796
## Cumulative Proportion 0.99204 1.00000
The summary shows that the proportion of variance for PC1, PC2, and PC3 is higher than 0.1 in each case, so we can treat these components as the most important for explaining the variance in the data. To confirm this, let us look at the cumulative proportion: the first 3 components explain almost 70% of the variance (0.6909), so the summary suggests that we can focus on them alone in the following analysis.
Next, we determine the number of components to keep using the eigenvalues.
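A summary of this kind can be produced with `prcomp()` and `summary()`. The following is a minimal sketch that uses the built-in `USArrests` data set as a stand-in, since the university data are not shown here; the 0.1 cutoff mirrors the rule of thumb used in the text.

```r
# Stand-in data: the built-in USArrests data set (4 numeric variables).
# scale. = TRUE standardizes each variable, so the PCA works on the correlation matrix.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# The importance matrix has three rows: standard deviation,
# proportion of variance, and cumulative proportion.
imp <- summary(pca_demo)$importance
print(round(imp, 4))

# Components whose individual proportion of variance exceeds 0.1.
important <- colnames(imp)[imp["Proportion of Variance", ] > 0.1]
print(important)

# Cumulative proportion explained by those components.
cum <- imp["Cumulative Proportion", ]
print(cum[length(important)])
```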
## [1] 3.48591462 1.46337701 1.26901618 0.89840475 0.66615817 0.55930826 0.43691022
## [8] 0.14926140 0.07164939
To determine the optimal number of principal component axes, let us look at the eigenvalues. We keep the components whose eigenvalues are greater than 1 (the Kaiser criterion), which here are the first 3 components.
Finally, we present a scree plot and use it to check the number of components. The first component explains the largest proportion of variance, and components 1 to 3 together explain almost 70% of the variance.
To sum up, the analysis shows that we can reduce the dimensionality by keeping only 3 components, which together explain about 70% of the variance.
Next, we explore how the variables contribute to these 3 components.
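The eigenvalue rule can be verified directly from the values printed above. This sketch copies those nine eigenvalues into a vector; the variable names `eig`, `kept`, and `cum_prop` are chosen here for illustration.

```r
# Eigenvalues as printed in the output above.
eig <- c(3.48591462, 1.46337701, 1.26901618, 0.89840475, 0.66615817,
         0.55930826, 0.43691022, 0.14926140, 0.07164939)
names(eig) <- paste0("PC", seq_along(eig))

# With standardized variables the eigenvalues sum to the number of variables (9 here).
total <- sum(eig)
print(total)

# Kaiser criterion: keep components with an eigenvalue greater than 1.
kept <- names(eig)[eig > 1]
print(kept)

# Proportion and cumulative proportion of variance per component.
prop <- eig / total
cum_prop <- cumsum(prop)
print(round(cum_prop, 4))
```

The cumulative proportion at PC3 agrees with the 0.6909 reported in the component summary.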
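A scree plot of this kind can be drawn in base R from the standard deviations reported in the component summary. This is a sketch, not necessarily how the original figure was made; for a fitted `prcomp` object, `screeplot()` would be an alternative.

```r
# Standard deviations of the nine components, copied from the summary output.
sdev <- c(1.8671, 1.2097, 1.1265, 0.94784, 0.81619,
          0.74787, 0.66099, 0.38634, 0.26767)

# Eigenvalues are the squared standard deviations.
eig <- sdev^2

# Scree plot: eigenvalue against component number.
plot(seq_along(eig), eig, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue",
     main = "Scree plot")

# Dashed reference line at 1 marks the eigenvalue cutoff used in the text.
abline(h = 1, lty = 2)

# Number of components above the reference line.
n_above <- sum(eig > 1)
print(n_above)
```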
## PC1 PC2 PC3
## scores_teaching -0.44 0.23 -0.09
## scores_research -0.48 0.15 -0.17
## scores_citations -0.40 -0.18 0.00
## scores_industry_income -0.28 0.46 -0.27
## scores_international_outlook -0.42 -0.31 0.18
## stats_number_students 0.00 -0.26 -0.65
## stats_student_staff_ratio 0.00 -0.29 -0.59
## stats_pc_intl_students -0.40 -0.19 0.28
## stats_female_share -0.04 -0.63 0.10
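One way to read a loading table like this is to list, for each component, the variables whose absolute loading reaches some cutoff. The sketch below re-enters the rounded loadings printed above; the cutoff of 0.4 is an assumption made for illustration, not a value taken from the analysis.

```r
# Rounded loadings copied from the table above (rows = variables, columns = components).
vars <- c("scores_teaching", "scores_research", "scores_citations",
          "scores_industry_income", "scores_international_outlook",
          "stats_number_students", "stats_student_staff_ratio",
          "stats_pc_intl_students", "stats_female_share")
loadings <- matrix(c(
  -0.44,  0.23, -0.09,
  -0.48,  0.15, -0.17,
  -0.40, -0.18,  0.00,
  -0.28,  0.46, -0.27,
  -0.42, -0.31,  0.18,
   0.00, -0.26, -0.65,
   0.00, -0.29, -0.59,
  -0.40, -0.19,  0.28,
  -0.04, -0.63,  0.10
), ncol = 3, byrow = TRUE,
dimnames = list(vars, c("PC1", "PC2", "PC3")))

# Assumed cutoff: a variable "belongs" to a component if |loading| >= 0.4.
cutoff <- 0.4
dominant <- lapply(colnames(loadings), function(pc) {
  rownames(loadings)[abs(loadings[, pc]) >= cutoff]
})
names(dominant) <- colnames(loadings)
print(dominant)
```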
We can see that:
PC1 is defined by scores_teaching, scores_research, scores_citations, scores_international_outlook, and stats_pc_intl_students. Their loadings all have the same (negative) sign, so they lie at the same pole. This component captures the rating scores obtained by universities.
PC2 is defined by scores_industry_income and stats_female_share. Since their loadings have opposite signs, they lie at opposite poles. This component captures socio-demographic characteristics.
PC3 is defined by stats_number_students and stats_student_staff_ratio. Their loadings have the same (negative) sign, so they lie at the same pole. This component captures statistics on the actors within universities.
Finally, we want to estimate the quality of representation for our 3 components.
In our case, all the variables show at least medium quality of representation (more than 50%), so we can expect them to be positioned close to the circumference of the correlation circle.
As we can see, most of the blue dots (representing universities in the US and Canada) lie on the left side of the PC1 dimension, with only a few on the right or near 0. Because the score variables load negatively on PC1, lower PC1 values correspond to higher scores. This means that universities in the US and Canada mostly get higher scores than universities elsewhere, which suggests that the best universities are mostly located in the US and Canada.
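The quality of representation can be approximated from numbers already shown. The sketch below assumes it is measured as the squared correlation between each variable and the first 3 components (often called cos2), which for a PCA on standardized variables is the loading times the component's standard deviation, squared. It uses the rounded loadings and standard deviations printed earlier, so the results are approximate.

```r
# Rounded loadings for the first three components, copied from the loading table.
vars <- c("scores_teaching", "scores_research", "scores_citations",
          "scores_industry_income", "scores_international_outlook",
          "stats_number_students", "stats_student_staff_ratio",
          "stats_pc_intl_students", "stats_female_share")
loadings <- matrix(c(
  -0.44,  0.23, -0.09,
  -0.48,  0.15, -0.17,
  -0.40, -0.18,  0.00,
  -0.28,  0.46, -0.27,
  -0.42, -0.31,  0.18,
   0.00, -0.26, -0.65,
   0.00, -0.29, -0.59,
  -0.40, -0.19,  0.28,
  -0.04, -0.63,  0.10
), ncol = 3, byrow = TRUE,
dimnames = list(vars, c("PC1", "PC2", "PC3")))

# Standard deviations of PC1 to PC3 from the component summary.
sdev <- c(1.8671, 1.2097, 1.1265)

# Variable-component correlations: scale each column of loadings by its sdev.
coord <- sweep(loadings, 2, sdev, "*")

# Quality of representation on the first three components: sum of squared correlations.
cos2 <- rowSums(coord^2)
print(round(sort(cos2), 2))
```

Under these assumptions every variable comes out above 0.5, consistent with the statement in the text.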
Before conducting a test, we should check the normality of the distribution. Let us do so with a histogram.
The distribution is clearly not normal: it is skewed to the left.
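For the t-test, the following hypotheses are proposed:
H0: the mean PC1 score is the same for universities in the US and Canada and for other universities.
H1: the mean PC1 scores of the two groups differ.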
##
## Welch Two Sample t-test
##
## data: ttest_data$pca1 by ttest_data$country
## t = 9.5491, df = 253.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.132717 1.721327
## sample estimates:
## mean in group other mean in group US+Canada
## 0.198088 -1.228934
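A Welch test like the one above is what `t.test()` runs by default when given a formula with a two-level grouping variable. The sketch below uses simulated data; `ttest_demo` and its values are hypothetical and only mimic the structure of the real `ttest_data`.

```r
# Hypothetical stand-in for the real data: a PC1 score and a two-level group label.
set.seed(42)
ttest_demo <- data.frame(
  pca1    = c(rnorm(300, mean = 0.2, sd = 1.8),
              rnorm(60,  mean = -1.2, sd = 1.2)),
  country = rep(c("other", "US+Canada"), times = c(300, 60))
)

# var.equal defaults to FALSE, so this is Welch's t-test (unequal variances).
res_t <- t.test(pca1 ~ country, data = ttest_demo)
print(res_t)

# Pieces of the result that are usually reported.
print(res_t$estimate)   # group means
print(res_t$conf.int)   # 95% confidence interval for the difference in means
print(res_t$p.value)
```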
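Since our distribution is not normal, we have to check the results with a non-parametric test. The following hypotheses are proposed:
H0: there is no location shift in PC1 scores between universities in the US and Canada and other universities.
H1: the location shift between the two groups is not equal to 0.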
##
## Wilcoxon rank sum test with continuity correction
##
## data: pca1 by country
## W = 182748, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
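The rank sum test can be run with `wilcox.test()` using the same formula interface. As before, this is a sketch on simulated, deliberately skewed data; `wilcox_demo` is a hypothetical stand-in for the real data frame.

```r
# Hypothetical skewed data: exponential draws, with one group shifted downward.
set.seed(42)
wilcox_demo <- data.frame(
  pca1    = c(rexp(300, rate = 1),
              rexp(60,  rate = 1) - 1.4),
  country = rep(c("other", "US+Canada"), times = c(300, 60))
)

# Two-sided Wilcoxon rank sum test; no normality assumption is needed.
res_w <- wilcox.test(pca1 ~ country, data = wilcox_demo)
print(res_w)

# Group medians give a sense of the direction of the shift.
print(tapply(wilcox_demo$pca1, wilcox_demo$country, median))
```

When both the t-test and the rank sum test reject the null hypothesis, as in the output above, the conclusion does not hinge on the normality assumption.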