In this project I will apply data scraping skills to retrieve a table of the Times Higher Education University World Ranking from a Wikipedia page. The given data will be summarized in an appropriate plot showing changes in rating of universities.
The first step is, as usual, loading necessary packages.
library(rvest)
library(dplyr)
library(CGPfunctions)
library(ggthemes)
To begin with, I download a data from a web-page, then I retrieve tables from it and check whether the table is one of my interest.
wikidata <- "https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings"
data1 <- wikidata%>%
read_html()%>%
html_table(fill=TRUE)
rmarkdown::paged_table(data1[[3]])
It seems that I’ve got the right table, so I can save it. However, it is seen that data in some columns with is not in appropriate type because of sign “-” meaning that in some year university dropped from the rating. It is better to transform such cases into NAs and then turn all data into one type. Also, it is better to rename columns as there are useless numbers of Wikipedia links.
#Saving a table to data frame
rating <- data1[[3]]
#Replacing "-" with NAs
rating[,2:12] <- replace(rating[,2:12], rating[,2:12]=="-", NA)
#Turning data into numeric type
rating[,2:12] <- (lapply(rating[,2:12], as.numeric))
#Changing heads of columns
names(rating) <- c("University", "2010-2011", "2011-2012", "2012-2013", "2013-2014", "2014-2015", "2015-2016", "2016-2017", "2017-2018", "2018-2019", "2019-2020", "2020-2021")
Let’s see the table one more time to be sure that everything is fine.
rmarkdown::paged_table(rating)
As I successfully retrieved the table and cleaned the data, it is now possible to do some work with it.
In this part I will build a slopegraph that shows how first 10 rating positions of universities changed from 2011 to 2021.
Firstly, I need to limit our table to 10 rows.
top10uni <- head(rating, 10)
top10uni%>%
knitr:: kable()
| University | 2010-2011 | 2011-2012 | 2012-2013 | 2013-2014 | 2014-2015 | 2015-2016 | 2016-2017 | 2017-2018 | 2018-2019 | 2019-2020 | 2020-2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| University of Oxford | 6 | 4 | 2 | 2 | 3 | 2 | 1 | 1 | 1 | 1 | 1 |
| Stanford University | 4 | 2 | 3 | 4 | 4 | 3 | 3 | 3 | 3 | 4 | 2 |
| Harvard University | 1 | 2 | 4 | 2 | 2 | 6 | 6 | 6 | 6 | 7 | 3 |
| California Institute of Technology | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 5 | 2 | 4 |
| Massachusetts Institute of Technology | 3 | 7 | 5 | 5 | 6 | 5 | 5 | 5 | 4 | 5 | 5 |
| University of Cambridge | 6 | 6 | 7 | 7 | 5 | 4 | 4 | 2 | 2 | 3 | 6 |
| University of California, Berkeley | 8 | 10 | 9 | 8 | 8 | 13 | 10 | 18 | 15 | 13 | 7 |
| Yale University | 10 | 11 | 11 | 11 | 9 | 12 | 12 | 12 | 8 | 8 | 8 |
| Princeton University | 5 | 5 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 6 | 9 |
| University of Chicago | 13 | 9 | 10 | 9 | 11 | 10 | 10 | 9 | 10 | 9 | 10 |
I am interested only in 2011 and 2021 year, so other years should be deleted from the table. I make a copy of table and leave in the first one data for 2011 and in the second one for 2021. Also, I add new column with year as it is necessary for building a graph. Then, I unite these two tables into one.
top10uni2<-top10uni
top10uni <- top10uni[,-c(3:12)]
year1 <- c("2011", "2011", "2011", "2011", "2011", "2011", "2011", "2011", "2011", "2011")
top10uni <- cbind(top10uni, year1)
top10uni2 <- top10uni2[,-c(2:11)]
year2 <- c("2021", "2021", "2021", "2021", "2021", "2021", "2021", "2021", "2021", "2021")
top10uni2 <- cbind(top10uni2, year2)
names(top10uni) <- c("uni", "rating", "year")
names(top10uni2) <- c("uni", "rating", "year")
topuni <- rbind(top10uni, top10uni2)
topuni%>%
knitr:: kable()
| uni | rating | year |
|---|---|---|
| University of Oxford | 6 | 2011 |
| Stanford University | 4 | 2011 |
| Harvard University | 1 | 2011 |
| California Institute of Technology | 2 | 2011 |
| Massachusetts Institute of Technology | 3 | 2011 |
| University of Cambridge | 6 | 2011 |
| University of California, Berkeley | 8 | 2011 |
| Yale University | 10 | 2011 |
| Princeton University | 5 | 2011 |
| University of Chicago | 13 | 2011 |
| University of Oxford | 1 | 2021 |
| Stanford University | 2 | 2021 |
| Harvard University | 3 | 2021 |
| California Institute of Technology | 4 | 2021 |
| Massachusetts Institute of Technology | 5 | 2021 |
| University of Cambridge | 6 | 2021 |
| University of California, Berkeley | 7 | 2021 |
| Yale University | 8 | 2021 |
| Princeton University | 9 | 2021 |
| University of Chicago | 10 | 2021 |
This is what I have after manipulations with data. Now, it is possible to construct our slopegraph. I decided to apply newggslopegraph() function from CGPfunctions package.
As seen from the graph, Top-10 of 2011 and Top-10 of 2021 are rather similar in terms of names of universities that fill the rating. There are changes in the highest positions, while lower rankings did not change too much. By 10 years, 5 universities increased their rating, 4 fell to lower positions and 1 have stayed on the same place.
In this project I will build a binary logistic regression model that is intended to predict attending lawful demonstrations. The data for the analysis is taken from 7th wave of World Values Survey. The focus of analysis is Russia. I have 9 variables to select from prepared dataset and there will be up to 2 variables of categorical and continuous type as predictors.
This section is devoted to preparation of data before analysis. First, it is necessary to upload packages and dataset.
library(dplyr)
library(ggplot2)
library(ggthemes)
library(psych)
library(kableExtra)
library(equatiomatic)
library(gridExtra)
library(ggpubr)
library(naniar)
library(VIM)
library(mice)
library(margins)
library(caret)
library(car)
library(sjPlot)
library(rcompanion)
library(tidyr)
df <- readRDS("ruwvs")
Now I have a dataset and it is necessary to look at its structure to be sure that variables are of a proper type.
str(df)
## 'data.frame': 1810 obs. of 10 variables:
## $ Q211 : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 2 2 ...
## $ intinpol : Factor w/ 4 levels "Very interested",..: 4 2 2 3 3 3 3 2 3 4 ...
## $ eduR : Factor w/ 4 levels "Primary","Secondary",..: 3 4 4 4 4 2 2 2 4 2 ...
## $ townsize : Factor w/ 5 levels "Under 5,000",..: 5 5 5 5 5 4 4 4 5 5 ...
## $ settlement: Factor w/ 5 levels "Capital city",..: 1 1 1 2 2 2 2 2 1 1 ...
## $ income : Factor w/ 10 levels "Lower step","second step",..: 8 6 6 3 4 3 6 4 5 9 ...
## $ region : Factor w/ 8 levels "RU: North West Federal District",..: 2 2 2 4 4 2 2 2 2 2 ...
## $ age : Factor w/ 85 levels "16","17","18",..: 21 43 27 32 46 54 25 63 36 13 ...
## $ income1 : num 7 5 5 2 3 2 5 3 4 8 ...
## $ eduT : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 2 1 ...
#Age is factor now, but should be in numeric form
df$age<-as.numeric(as.character(df$age))
#The value of settlement "Another city, town(not a regional or district center)" is rather long and it will be inconvenient to print it in graphics and model summaries. Let's make it shorter.
levels(df$settlement)[4]<-"Another city"
#income is in continuous form, so just for the convenience let's delete its categorical form.
df$income<-NULL
Now, the environment is prepared for work and I can proceed to inspection of data in the next part.
In this part of the project, the variables for further analysis are described in detail. To get information about variables, press buttons.
Attending peaceful demonstrations (Q211) is a binary variable where 1 means that a person will attend a demonstration while 0 refers to rejection to attend. Q211 is an outcome variable in our model. From the barplots below it is seen that both values of the variable are filled enough with observations.
Interest in politics (intinpol) is an ordinal variable with four levels from “Not at all interested” [in politics] to “Very interested”. “intinpol” can be used as a predictor in our model. As seen from the barplots below, both values of the variable are filled enough with observations. It seems that it is possible to include this variable as a predictor.
Education (eduR) is an ordinal variable with four levels (Primary, Secondary, Post-secondary and Tertiary). Because of Russian context where a person should have at least secondary education, it is very rare that someone have only primary education. Considering this fact it is useless to include this variable in analysis because primary education as reference group in the model will always be in inferior position because of small number of cases.
Size of town (townsize) is an ordinal variable representing sizes of population in its levels. From the barplots below it is possible to see that both values of the variable are filled enough with observations. However, it seems that this variable is not much appropriate for the model.
Type of settlement (settlement) is a categorical variable containing different types of residential areas in Russia. It will not be included in a model because some values has not very many observations.
Region is a categorical variable with names of Russian regions. It will not be included in analysis because it is rather hard to define proper reference group here to compare while interpreting model.
Age is a continuous variable presenting age of respondents from WVS. The distribution and descriptive statistics are presented below. By looking at histogram, mean, median, skew and kurtosis, it is clear that distribution is not normal. I will include age in our model since it is necessary to have continuous predictors and age could be a good one for predicting attending demonstrations.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1810 | 45.41 | 17.12 | 43 | 18 | 91 | 0.35 | -0.81 |
Income is a continuous variable transformed from ordinal variable with levels of income. The distribution and descriptive statistics are presented below. I will include age in our model since it is necessary to have continuous predictors and income could be a good one for predicting attending demonstrations.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1747 | 3.77 | 1.93 | 4 | 0 | 9 | 0.1 | -0.39 |
Having tertiary education (eduT) is a binary variable with values 0 and 1 which corresponds to not having and having higher education respectively. From the barplots below it is clear that both values of the variable are filled enough with observations. It seems to be a good idea to include this variable in our analysis.
In this section I am trying to solve the problem of missing data. To begin with, let’s observe presence of NAs in our data
miss_var_summary(df)%>%
kable()%>%
kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)
| variable | n_miss | pct_miss |
|---|---|---|
| income1 | 63 | 3.4806630 |
| eduR | 10 | 0.5524862 |
| eduT | 10 | 0.5524862 |
| intinpol | 6 | 0.3314917 |
| Q211 | 0 | 0.0000000 |
| townsize | 0 | 0.0000000 |
| settlement | 0 | 0.0000000 |
| region | 0 | 0.0000000 |
| age | 0 | 0.0000000 |
vis_miss(df)
As seen from figures above, there are 4 variables with missing data, and overall there is only 0.5% of miss in the whole dataset. This percent is very small, so it is possible to apply imputation to fill missing.
First, let’s try k-Nearest Neighbour Imputation (kNN) from VIM package. As seen from the table below, all missing values were successfully imputed.
df_knn <- kNN(df, k=5, variable=c("income1", "eduR", "eduT", "intinpol"))
miss_var_summary(df_knn)%>%
kable()%>%
kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)
| variable | n_miss | pct_miss |
|---|---|---|
| Q211 | 0 | 0 |
| intinpol | 0 | 0 |
| eduR | 0 | 0 |
| townsize | 0 | 0 |
| settlement | 0 | 0 |
| region | 0 | 0 |
| age | 0 | 0 |
| income1 | 0 | 0 |
| eduT | 0 | 0 |
| income1_imp | 0 | 0 |
| eduR_imp | 0 | 0 |
| eduT_imp | 0 | 0 |
| intinpol_imp | 0 | 0 |
Now, let’s try another method - Multivariate Imputation by Chained Equations (MICE) from mice package. As seen from the table below, all missing values were successfully imputed.
dfmice <- mice(df, m=5, defaultMethod = c("pmm", "logreg", "polyreg", "polr"))
##
## iter imp variable
## 1 1 intinpol eduR income1 eduT
## 1 2 intinpol eduR income1 eduT
## 1 3 intinpol eduR income1 eduT
## 1 4 intinpol eduR income1 eduT
## 1 5 intinpol eduR income1 eduT
## 2 1 intinpol eduR income1 eduT
## 2 2 intinpol eduR income1 eduT
## 2 3 intinpol eduR income1 eduT
## 2 4 intinpol eduR income1 eduT
## 2 5 intinpol eduR income1 eduT
## 3 1 intinpol eduR income1 eduT
## 3 2 intinpol eduR income1 eduT
## 3 3 intinpol eduR income1 eduT
## 3 4 intinpol eduR income1 eduT
## 3 5 intinpol eduR income1 eduT
## 4 1 intinpol eduR income1 eduT
## 4 2 intinpol eduR income1 eduT
## 4 3 intinpol eduR income1 eduT
## 4 4 intinpol eduR income1 eduT
## 4 5 intinpol eduR income1 eduT
## 5 1 intinpol eduR income1 eduT
## 5 2 intinpol eduR income1 eduT
## 5 3 intinpol eduR income1 eduT
## 5 4 intinpol eduR income1 eduT
## 5 5 intinpol eduR income1 eduT
df_mice <- complete(dfmice,3)
miss_var_summary(df_mice)%>%
kable()%>%
kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)
| variable | n_miss | pct_miss |
|---|---|---|
| Q211 | 0 | 0 |
| intinpol | 0 | 0 |
| eduR | 0 | 0 |
| townsize | 0 | 0 |
| settlement | 0 | 0 |
| region | 0 | 0 |
| age | 0 | 0 |
| income1 | 0 | 0 |
| eduT | 0 | 0 |
At this point I can visualize imputed data. Since the share of missings in variables eduR, eduT and intinpol was very small, it is rather useless to do visualization because it will be hard to see anything. However, it is still possible to work with income1 variable. In the histograms below visualization of imputed data by two methods is presented.
According to the histograms above, it can be said that kNN works better here because imputed data repeats distribution of original data.
Now, I am ready to start model building. To remind, I have Q211 as an outcome variable and income1, age, intinpol and eduT as predictors. I will use forward stepwise method meaning adding predictors one by one and looking at whether the model became better.
Before the modeling I want to be sure that there is no empty or small cells by doing a crosstab between categorical predictors and the outcome variable.
xtabs(~ Q211 + intinpol, data = df)
## intinpol
## Q211 Very interested Somewhat interested Not very interested
## 0 69 352 416
## 1 79 286 297
## intinpol
## Q211 Not at all interested
## 0 229
## 1 76
xtabs(~ Q211 + eduT, data = df)
## eduT
## Q211 0 1
## 0 726 339
## 1 470 265
Everything looks rather good. So, let’s start with tertiary education.
model1 <- glm(Q211 ~ eduT, data = df0, family = binomial())
summary(model1)
##
## Call:
## glm(formula = Q211 ~ eduT, family = binomial(), data = df0)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.100 -1.008 -1.008 1.357 1.357
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4133 0.0599 -6.90 5.19e-12 ***
## eduT1 0.2278 0.1031 2.21 0.0271 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2353.1 on 1734 degrees of freedom
## AIC: 2357.1
##
## Number of Fisher Scoring iterations: 4
As seen from the output above, the predictor is significant and the residual deviance shows that having a model is better than no model at all. Let’s add one more predictor - interest in politics.
model2 <- glm(Q211 ~ eduT + intinpol, data = df0, family = binomial())
summary(model2)
##
## Call:
## glm(formula = Q211 ~ eduT + intinpol, family = binomial(), data = df0)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3268 -1.0692 -0.7605 1.2595 1.6624
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1608 0.1731 0.929 0.35284
## eduT1 0.1837 0.1045 1.758 0.07871 .
## intinpolSomewhat interested -0.4207 0.1874 -2.245 0.02479 *
## intinpolNot very interested -0.5356 0.1860 -2.880 0.00398 **
## intinpolNot at all interested -1.2534 0.2156 -5.814 6.09e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2309.4 on 1731 degrees of freedom
## AIC: 2319.4
##
## Number of Fisher Scoring iterations: 4
The output above shows that intinpol is significant while eduT lost its significance.
Now, I will compare given two models with help of anova.
anova(model1, model2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Q211 ~ eduT
## Model 2: Q211 ~ eduT + intinpol
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1734 2353.1
## 2 1731 2309.4 3 43.639 1.801e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output says that the second model is significantly better than the first one, so adding “intinpol” variable improved the model.
Moving forward, I add another predictor - income.
model3 <- glm(Q211 ~ eduT + intinpol + income1, data = df0, family = binomial())
summary(model3)
##
## Call:
## glm(formula = Q211 ~ eduT + intinpol + income1, family = binomial(),
## data = df0)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4405 -1.0619 -0.7961 1.2545 1.8209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35610 0.19621 1.815 0.06954 .
## eduT1 0.24397 0.10848 2.249 0.02451 *
## intinpolSomewhat interested -0.40412 0.18790 -2.151 0.03150 *
## intinpolNot very interested -0.53551 0.18630 -2.874 0.00405 **
## intinpolNot at all interested -1.28521 0.21655 -5.935 2.94e-09 ***
## income1 -0.05748 0.02698 -2.131 0.03312 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2304.9 on 1730 degrees of freedom
## AIC: 2316.9
##
## Number of Fisher Scoring iterations: 4
As seen from the output above, all predictors are significant. And I again make a comparative test.
anova(model2, model3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Q211 ~ eduT + intinpol
## Model 2: Q211 ~ eduT + intinpol + income1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1731 2309.4
## 2 1730 2304.9 1 4.5658 0.03262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value is significant, this model3 is better than model2.
And I do the final step, adding last predictor - age.
model4 <- glm(Q211 ~ eduT + intinpol + income1 + age, data = df0, family = binomial())
summary(model4)
##
## Call:
## glm(formula = Q211 ~ eduT + intinpol + income1 + age, family = binomial(),
## data = df0)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4389 -1.0612 -0.7981 1.2562 1.8317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3838016 0.2705450 1.419 0.15601
## eduT1 0.2421433 0.1091694 2.218 0.02655 *
## intinpolSomewhat interested -0.4053699 0.1880869 -2.155 0.03114 *
## intinpolNot very interested -0.5381749 0.1871663 -2.875 0.00404 **
## intinpolNot at all interested -1.2892335 0.2182435 -5.907 3.48e-09 ***
## income1 -0.0585451 0.0279182 -2.097 0.03599 *
## age -0.0004586 0.0030831 -0.149 0.88176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2304.8 on 1729 degrees of freedom
## AIC: 2318.8
##
## Number of Fisher Scoring iterations: 4
As seen from above, all predictors but age are significant. Comparing models will show if the model was improved.
anova(model3, model4, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Q211 ~ eduT + intinpol + income1
## Model 2: Q211 ~ eduT + intinpol + income1 + age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1730 2304.9
## 2 1729 2304.8 1 0.022128 0.8817
The output shows that model4 is not better than model3. Thus, I can say that model3 is the best model and I will take it for further analysis.
In this part I will interpret coefficients of the model. First, let’s write out the model equation.
extract_eq(model3, wrap = TRUE, use_coefs=TRUE, terms_per_line = 3)
\[ \begin{aligned} \log\left[ \frac { P( \operatorname{Q211} = \operatorname{1} ) }{ 1 - P( \operatorname{Q211} = \operatorname{1} ) } \right] &= 0.36 + 0.24(\operatorname{eduT}_{\operatorname{1}}) - 0.4(\operatorname{intinpol}_{\operatorname{Somewhat\ interested}})\ - \\ &\quad 0.54(\operatorname{intinpol}_{\operatorname{Not\ very\ interested}}) - 1.29(\operatorname{intinpol}_{\operatorname{Not\ at\ all\ interested}}) - 0.06(\operatorname{income1})\ + \\ &\quad \epsilon \end{aligned} \]
Then, I extract odds ratios by exponentiation of coefficients from equation.
exp(cbind(OR = coef(model3), confint(model3)))
## OR 2.5 % 97.5 %
## (Intercept) 1.4277497 0.9733762 2.1029471
## eduT1 1.2763087 1.0317961 1.5788207
## intinpolSomewhat interested 0.6675668 0.4608576 0.9636953
## intinpolNot very interested 0.5853727 0.4053375 0.8423304
## intinpolNot at all interested 0.2765914 0.1802280 0.4215937
## income1 0.9441430 0.8953447 0.9952651
The numbers from output above state the following:
Now, let’s inspect average marginal effects and interpret them.
m <- margins(model3, type = "response")
summary(m)
## factor AME SE z p lower upper
## eduT1 0.0580 0.0258 2.2426 0.0249 0.0073 0.1086
## income1 -0.0136 0.0063 -2.1413 0.0322 -0.0260 -0.0011
## intinpolNot at all interested -0.2973 0.0489 -6.0781 0.0000 -0.3932 -0.2015
## intinpolNot very interested -0.1324 0.0457 -2.8962 0.0038 -0.2220 -0.0428
## intinpolSomewhat interested -0.1002 0.0462 -2.1670 0.0302 -0.1909 -0.0096
The numbers from output above say the following:
In this part I will try to predict values of Y using different values of predictor.
First, let’s try to predict with fixed average value of income.
pol <- c("Not at all interested", "Not very interested", "Somewhat interested", "Very interested")
newdata1 <- crossing(income1 = mean(df0$income1),
intinpol = factor(pol),
eduT = factor(0:1))
newdata1$pred <- predict(model3, newdata = newdata1, type = "response")
newdata1
## # A tibble: 8 x 4
## income1 intinpol eduT pred
## <dbl> <fct> <fct> <dbl>
## 1 3.77 Not at all interested 0 0.241
## 2 3.77 Not at all interested 1 0.289
## 3 3.77 Not very interested 0 0.402
## 4 3.77 Not very interested 1 0.462
## 5 3.77 Somewhat interested 0 0.434
## 6 3.77 Somewhat interested 1 0.495
## 7 3.77 Very interested 0 0.535
## 8 3.77 Very interested 1 0.595
In the table above there are predicted values of Y in “pred” column for given set of predictors’ values.
And now, let’s try to predict Y without holding income at average.
newdata2 <- crossing(income1 = seq(from=1, to=10, length.out = 10),
intinpol = factor(pol),
eduT = factor(0:1))
newdata3 <- cbind(newdata2, predict(model3, newdata = newdata2, type = "response", se = T))
newdata3 <- within(newdata3, {
PredictedProb <- fit
LL <- fit - (1.96 * se.fit)
UL <- fit + (1.96 * se.fit)
})
head(newdata3)
## income1 intinpol eduT fit se.fit residual.scale
## 1 1 Not at all interested 0 0.2715857 0.02864770 1
## 2 1 Not at all interested 1 0.3224315 0.03759049 1
## 3 1 Not very interested 0 0.4410542 0.02634740 1
## 4 1 Not very interested 1 0.5017719 0.03479455 1
## 5 1 Somewhat interested 0 0.4736510 0.02859693 1
## 6 1 Somewhat interested 1 0.5345643 0.03626430 1
## UL LL PredictedProb
## 1 0.3277352 0.2154362 0.2715857
## 2 0.3961088 0.2487541 0.3224315
## 3 0.4926951 0.3894133 0.4410542
## 4 0.5699692 0.4335746 0.5017719
## 5 0.5297009 0.4176010 0.4736510
## 6 0.6056423 0.4634863 0.5345643
In the table above it is possible to see predicted values of Y in “PredictedProb” column. Using this data I can visualize what I have.
From the plots above it is seen that there is no interaction effect neither for interest in politics nor for tertiary education and income. However, the lines indicate that in all cases probability to attend demonstration decrease with the increase of income.
In this section I say about model quality using McFadden pseudo-R-squared since AIC works only when we compare models.
nagelkerke(model3)
## $Models
##
## Model: "glm, Q211 ~ eduT + intinpol + income1, binomial(), df0"
## Null: "glm, Q211 ~ 1, binomial(), df0"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0225104
## Cox and Snell (ML) 0.0301122
## Nagelkerke (Cragg and Uhler) 0.0405338
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -5 -26.539 53.078 3.2414e-10
##
## $Number.of.observations
##
## Model: 1736
## Null: 1736
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
For our model McFadden pseudo-R-squared is equal to 0.02 which indicates bad quality because it should have values 0.2-0.4 to be interpreted as satisfactory.
In this part I will calculate accuracy of the model.
First, I need to create a training and a test subsamples and fit a model on the training data.
bound <- floor((nrow(df0)/2))
df2 <- df0[sample(nrow(df0)), ]
df.train <- df2[1:bound, ]
df.test <- df2[(bound+1):nrow(df2), ]
testmodel <- glm(Q211 ~ eduT + intinpol + income1, data = df.train, family = "binomial",
na.action = na.omit)
And then I build confusion matrix and look at results.
pred <- format(round(predict(testmodel, newdata = df.test, type = "response")))
confusionMatrix(table(pred, df.test$Q211))
## Confusion Matrix and Statistics
##
##
## pred 0 1
## 0 477 331
## 1 25 35
##
## Accuracy : 0.5899
## 95% CI : (0.5563, 0.6228)
## No Information Rate : 0.5783
## P-Value [Acc > NIR] : 0.2572
##
## Kappa : 0.0517
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.95020
## Specificity : 0.09563
## Pos Pred Value : 0.59035
## Neg Pred Value : 0.58333
## Prevalence : 0.57834
## Detection Rate : 0.54954
## Detection Prevalence : 0.93088
## Balanced Accuracy : 0.52291
##
## 'Positive' Class : 0
##
P-value for comparison of accuracy and no information rate is insignificant which means that our model is not better than no model at all.
In this section, I do model diagnostics and check some assumptions of logistic regression.
No prefect multicollinearity - there is no perfect linear relationship between predictors. It is checked by Variance Inflation Factors (VIF) below.
vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## eduT 1.078514 1 1.038515
## intinpol 1.027172 3 1.004478
## income1 1.103277 1 1.050370
For all predictors GVIF is ~ 1 which is good. There is no multicollinearity.
Linearity - linear relationship between continuous predictor variables and the logit of the outcome.
residualPlots(model3)
## Test stat Pr(>|Test stat|)
## eduT
## intinpol
## income1 0.3655 0.5455
The results of the test and plots with continuous predictor (income1) show that there is linearity between it and the logit.
So, I built a logistic regression model according to which 1) having higher education is associated with higher probability to attend peaceful demonstrations; 2) lower interest in politics associated with lower probability to attend peaceful demonstrations; 3) higher income is associated with lower probability to attend peaceful demonstrations. However, the model itself is not very good because some quality tests indicate insignificance of model against situation where no model at all.
In this project I will apply method of principle component analysis (PCA) on the data coming from World University Rankings 2021 by Times Higher Education.
The first step is to prepare packages and data set.
library(dplyr)
library(naniar)
library(kableExtra)
library(corrplot)
library(pca3d)
library(ggplot2)
library(ggthemes)
library(plotly)
data1 <- read.csv('THE2021.csv', header = TRUE, sep = ",")
In this part I explore the data and correlations between variables in it.
To begin with, let’s look at the structure. For PCA I need numeric variables and, according to the output below, all variables that will be put in PCA function are of proper type.
str(data1)
## 'data.frame': 1448 obs. of 11 variables:
## $ name : chr "University of Oxford" "Stanford University" "Harvard University" "California Institute of Technology" ...
## $ location : chr "United Kingdom" "United States" "United States" "United States" ...
## $ scores_teaching : num 91.3 92.2 94.4 92.5 90.7 90.3 85.8 91.9 88.8 88.9 ...
## $ scores_research : num 99.6 96.7 98.8 96.9 94.4 99.2 97.2 93.8 92.5 90.5 ...
## $ scores_citations : num 98 99.9 99.4 97 99.7 95.6 99.1 97.9 98.9 98.6 ...
## $ scores_industry_income : num 68.7 90.1 46.8 92.7 90.4 52.1 84.3 56.1 58 54.9 ...
## $ scores_international_outlook: num 96.4 79.5 77.7 83.6 90 95.7 72.3 68.4 80.2 74 ...
## $ stats_number_students : int 20774 16223 21261 2238 11276 19370 39918 12910 8091 14292 ...
## $ stats_student_staff_ratio : num 11.1 7.4 9.3 6.3 8.4 11 19.8 6 8 5.9 ...
## $ stats_pc_intl_students : int 41 23 25 33 34 38 17 20 23 31 ...
## $ stats_female_share : int 46 44 49 36 39 47 51 50 46 46 ...
Also, it is important to check for missing values in the data set. And as seen from the table below, there are no missings in our data.
miss_var_summary(data1)%>%
kable()%>%
kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)
| variable | n_miss | pct_miss |
|---|---|---|
| name | 0 | 0 |
| location | 0 | 0 |
| scores_teaching | 0 | 0 |
| scores_research | 0 | 0 |
| scores_citations | 0 | 0 |
| scores_industry_income | 0 | 0 |
| scores_international_outlook | 0 | 0 |
| stats_number_students | 0 | 0 |
| stats_student_staff_ratio | 0 | 0 |
| stats_pc_intl_students | 0 | 0 |
| stats_female_share | 0 | 0 |
Next, it is necessary to check descriptive statistics of the data. As seen from the summary below, scales, units of measurements and extremes are quite different.
summary(data1)
## name location scores_teaching scores_research
## Length:1448 Length:1448 Min. :11.70 Min. : 7.10
## Class :character Class :character 1st Qu.:18.50 1st Qu.:11.40
## Mode :character Mode :character Median :23.40 Median :17.15
## Mean :27.53 Mean :23.13
## 3rd Qu.:31.93 3rd Qu.:28.82
## Max. :94.40 Max. :99.60
## scores_citations scores_industry_income scores_international_outlook
## Min. : 1.90 Min. : 33.30 Min. : 13.50
## 1st Qu.: 23.15 1st Qu.: 34.70 1st Qu.: 27.50
## Median : 44.80 Median : 38.40 Median : 42.25
## Mean : 47.76 Mean : 45.52 Mean : 46.96
## 3rd Qu.: 70.90 3rd Qu.: 48.40 3rd Qu.: 63.02
## Max. :100.00 Max. :100.00 Max. :100.00
## stats_number_students stats_student_staff_ratio stats_pc_intl_students
## Min. : 557 Min. : 0.90 Min. : 0.00
## 1st Qu.: 9746 1st Qu.: 12.20 1st Qu.: 2.00
## Median : 17273 Median : 16.20 Median : 7.00
## Mean : 22241 Mean : 18.51 Mean :11.26
## 3rd Qu.: 28753 3rd Qu.: 22.02 3rd Qu.:17.00
## Max. :222102 Max. :109.10 Max. :84.00
## stats_female_share
## Min. : 1.00
## 1st Qu.:43.00
## Median :53.00
## Mean :49.94
## 3rd Qu.:58.00
## Max. :98.00
In addition, let’s look at the variances of variables.
lapply(data1[ , 3:11], var)
## $scores_teaching
## [1] 179.3113
##
## $scores_research
## [1] 293.1243
##
## $scores_citations
## [1] 766.5359
##
## $scores_industry_income
## [1] 275.3449
##
## $scores_international_outlook
## [1] 540.7424
##
## $stats_number_students
## [1] 424682073
##
## $stats_student_staff_ratio
## [1] 105.9024
##
## $stats_pc_intl_students
## [1] 144.5496
##
## $stats_female_share
## [1] 147.2624
The output below shows that variances of variables are quite different and because of this variables will have different weights and variables with high variances will be overrepresented in principal components. To solve this problem, I can standardize variables and make their variances equal to 1.
data1[ , 3:11] <- data1[ , 3:11]%>%
scale()%>%
as.data.frame()
Looking at variances again, I see that standardizing was successful.
lapply(data1[ , 3:11], var)
## $scores_teaching
## [1] 1
##
## $scores_research
## [1] 1
##
## $scores_citations
## [1] 1
##
## $scores_industry_income
## [1] 1
##
## $scores_international_outlook
## [1] 1
##
## $stats_number_students
## [1] 1
##
## $stats_student_staff_ratio
## [1] 1
##
## $stats_pc_intl_students
## [1] 1
##
## $stats_female_share
## [1] 1
Before PCA it is also necessary to explore correlations between variables.
In the plot below it is seen that there are rather high correlations between some variables.
data1[ , 3:11]%>%
cor()%>%
corrplot()
In this part I run PCA and describe the results.
pca_res <- prcomp(data1[ , 3:11])
summary(pca_res)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8671 1.2097 1.1265 0.94784 0.81619 0.74787 0.66099
## Proportion of Variance 0.3873 0.1626 0.1410 0.09982 0.07402 0.06215 0.04855
## Cumulative Proportion 0.3873 0.5499 0.6909 0.79075 0.86476 0.92691 0.97545
## PC8 PC9
## Standard deviation 0.38634 0.26767
## Proportion of Variance 0.01658 0.00796
## Cumulative Proportion 0.99204 1.00000
In the summary above variance explained by components is presented. All in all, I extracted 9 components, but their meaningfulness will be discussed later. The first principal component explains 0.3873 (~39%) of variance. The threshold of 0.7 of explained variance is crossed between PC3 and PC4.
Also, it is necessary to look at eigenvalues of components. As seen from output below, only three components have eigenvalues more than 1. Guiding by Kaiser-Guttman criterion, I should keep 3 components.
pca_res$sdev ^ 2
## [1] 3.48591462 1.46337701 1.26901618 0.89840475 0.66615817 0.55930826 0.43691022
## [8] 0.14926140 0.07164939
The screeplot below confirms the choice of three components.
screeplot(pca_res, type = "lines")
box()
abline(h = 1)
Guiding by all this information, I can say that it is possible to produce an acceptable PCA solution on these data because there are moderate and sometimes high correlations between variables which means components will take place. Also, as seen from PCA results, more than 50% of variance is explained by first two components which makes them meaningful.
Now, let’s extract the loadings of our components.
round(pca_res$rotation[ , 1:3], 2)%>%
kable()%>%
kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| scores_teaching | -0.44 | 0.23 | -0.09 |
| scores_research | -0.48 | 0.15 | -0.17 |
| scores_citations | -0.40 | -0.18 | 0.00 |
| scores_industry_income | -0.28 | 0.46 | -0.27 |
| scores_international_outlook | -0.42 | -0.31 | 0.18 |
| stats_number_students | 0.00 | -0.26 | -0.65 |
| stats_student_staff_ratio | 0.00 | -0.29 | -0.59 |
| stats_pc_intl_students | -0.40 | -0.19 | 0.28 |
| stats_female_share | -0.04 | -0.63 | 0.10 |
The first principal component has highest negative loadings for scores on teaching, research, citations and international outlook as well as percentage of international students. This reflect some parameters of universities educational and research performance, so I can name this component as low performance. For the second component, there are high positive loading for score on industry income and high negative loading for share of women. As for the third component, statistics on number of students at all and per staff member have high negative loadings.
Now, let’s plot the results of PCA.
In order to make some additional tests and visualizations it is necessary to extract scores (or coordinates) of components and add them to our dataset.
data2 <- cbind(data1, pca_res$x[ , 1:3])
Also, let’s create a binary variable indicating whether university locates in the US or Canada or in other countries.
data2$location_bin <- as.factor(ifelse(data1$location == "United States" | data1$location == "Canada", "US and Canada", "Other countries"))
Using newly created binary variable (location_bin) I can try to answer the question whether the best universities located in the US and Canada.
First, I can build a plot with coloring by ‘location_bin’.
Looking at the plot above, it is hard to say exactly from which countries come the best universities because visually there are more or less same amount of points in low values of PC1 (indication high scores on performance) both for American and Canadian universities and universities from other countries.
To get more concrete answer I can run T-test for PC1 scores and compare two groups from variable ‘location_bin’.
t.test(PC1 ~ location_bin, data = data2)
##
## Welch Two Sample t-test
##
## data: PC1 by location_bin
## t = 9.5491, df = 253.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.132717 1.721327
## sample estimates:
## mean in group Other countries mean in group US and Canada
## 0.198088 -1.228934
As was written earlier, the first component (PC1) was named as “low performance” since it has negative correlations with scores on teaching, research, citations and international outlook. Thus, the lower values of PC1 correspond to higher performance of universities.
The result of T-test in the output above suggests that the difference of means between two groups is significant at the p-value < 2.2e-16. The mean of PC1 for universities from the US and Canada (-1.228934) is significantly less than the mean for universities from other countries (0.198088). This fact states that overall American and Canadian universities perform better than universities from other countries. This doesn’t particularly mean that all universities from the US and Canada are strictly better than others, but rather that the former have better performance more frequently than latter.
Finally, I am making an interactive plot with the PCA solution.
In this project I am going to apply method of cluster analysis for the data coming from the World University Rankings 2021 by Times Higher Education. The aim is to clusterise universities for a company targeting university applicants from all over the world. I need to locate clusters of universities so as to match the best suiting university group to each applicant.
First, we load, as usual, necessary packages for work.
library(factoextra)
library(cluster)
library(dplyr)
library(psych)
library(kableExtra)
library(ggplot2)
library(ggthemes)
library(naniar)
library(Rtsne)
library(gridExtra)
library(ggpubr)
library(dendextend)
Just to save a space I left scraping procedure out of the report, so I load prepared RDS file as data. The file is a little bit different from the original table that was extracted from Times Higher Education site. I deleted several columns because they did not carry any information about characteristics of universities but were rather technical things.
the <- readRDS("THE_rating.RDS")
In this part I am choosing variables from dataset for cluster analysis. Since clustering is demanded by a company targeting university applicants, we need to chose variables that would be important for applicants. I suppose that variables about quality of teaching as well as academic abilities (‘scores_teaching’ and ‘scores_research’ respectively) should be included for sure because they are usually main characteristics of university. Also, international outlook of universities may be rather important for applicants to be sure that a university is open for international students.
Also, since universities are presented for applicants from all over the worlds, it can be important where the university is located. Thus, I created new variable based on variable ‘location’ where I specified regions for universities according to countries of their location.
Overall, I take 4 variables for my analysis. Let’s put them in separate dataset.
the1 <- select(the, c(name, scores_teaching, scores_research, scores_international_outlook, region))
In this section I describe the data obtained for analysis. First, I should check for missing values in the data.
vis_miss(the1)
As seen from the plot above, there are no missing values in our data. And now let’s proceed to variables. Each variable is described in separate subsection below.
Teaching (scores_teaching) represents results of quantitative assessment of learning environment of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.
In the histogram it is seen that values distributed unequally with prevalence of low scores. The distribution is right-skewed.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1526 | 27.81 | 13.62 | 23.55 | 11.7 | 94.4 | 2.02 | 4.98 |
Research (scores_research) represents results of quantitative assessment of volume, income and reputation of universities in research field. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.
As seen from the histogram, values distributed unequally with prevalence of low scores. The distribution is right-skewed.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1526 | 23.57 | 17.33 | 17.5 | 7.1 | 99.6 | 1.86 | 3.72 |
International outlook (scores_international_outlook) represents results of quantitative assessment of international staff, students, research of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.
The histogram shows that observations are distributed more or less equally (at least compared to previous variables), however there is still right skewness.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1526 | 47.03 | 23.17 | 42.3 | 13.5 | 100 | 0.59 | -0.73 |
Region is a nominal variable representing information about location of universities in the world. The distribution of the variable is presented below.
As seen from the barplot, regions are distributed quite unequally, however it completely depends on size of territory as well as population.
In this section I will conduct cluster analysis. There will be several subsections with preparations, applying different methods and interpretation.
Before clusterisation it is necessary to decide which number of clusters should be retained. For this, I conduct several tests.
The first approach is ‘elbow method’ to determine the optimal k. To do this, we run several models with k = [1:10] and compare the ‘within-cluster sum of squares to the cluster centroid’ in each case. The optimal number of clusters (k) is where the ‘elbow’ occurs, i.e. a solution where the within-sum-of-squares decreases dramatically.
The results below suggest 2 or 3 clusters.
fviz_nbclust(the1[ , 2:4], kmeans, method = "wss")
The next approach is the ‘silhouette width statistic’. It measures the quality of a clustering. That is, it determines how well each object lies within its cluster.
As seen from the plot below, recommended number of clusters is 2.
fviz_nbclust(the1[ , 2:4], kmeans, method = "silhouette")
Finally, I use gap statistic. It compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).
The plot below suggests 2 clusters for k-means clustering.
fviz_nbclust(the1[ , 2:4], kmeans, method = "gap")
As for hierarchical clustering, it is also suggested to retain 2 clusters.
fviz_nbclust(the1[ , 2:4], hcut, method = "gap")
In this part I apply k-means method of clustering. As was recommended by tests above, I look for 2 clusters. Since k-means is only applicable for numeric variables, I include only scores on teaching, research and international outlook in the analysis.
set.seed(100)
uniclust <- kmeans(the1[ , 2:4], 2)
I executed function of k-means clustering and now we can visualize the result.
In order to conduct hierarchical clustering, it is necessary to count euclidean distances between observations for numeric variables.
d <- dist(the1[ , 2:4], method = 'euclidean')
Now, I build a plot with distances between observations and a dendrogram.
heatmap(as.matrix(d), symm = T,
distfun = function(x) as.dist(x))
The plot is aimed to show number of clusters, however it is hard to identify red squares indicating number of clusters in this particular one.
Now, we get the solution through three different type of distances: minimal, average and maximal.
hfit_ward <- agnes(d, method = "ward")
hfit_average <- agnes(d, method = "average")
hfit_complete <- agnes(d, method = "complete")
To choose better method, let’s look at agglomerative coefficients of each.
hfit_ward$ac
## [1] 0.998086
hfit_average$ac
## [1] 0.9698641
hfit_complete$ac
## [1] 0.9832386
As seen from the output, ward method has the closest to 1 value, so I take this one. Let’s build a dendrogram.
According to the dendrogram, there are two clusters different from each other (judging by height of the first branch).
Let’s visualize clusters.
It is also possible to colour clusters in dendrogram.
In this part I will try to do clustering using data of both numeric and categorical type. I will include ‘region’ variable in the analysis.
In order to do that, I count distances using ‘gower’ metric.
gower_dist <- daisy(the1[ , -1],
metric = "gower")
summary(gower_dist)
## 1163575 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001132 0.312730 0.374980 0.384530 0.461510 0.967030
## Metric : mixed ; Types = I, I, I, N
## Number of objects : 1526
In the summary we see that we have 3 integer and 1 nominal variable which is true.
Then, we build a distance matrix.
gower_mat <- as.matrix(gower_dist)
It is necessary to look at the most similar and dissimilar pairs in the data to see if they make sense
the1[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]
## name scores_teaching scores_research
## 1161 University of Fukui 21.4 10.2
## 891 Kurume University 21.5 10.4
## scores_international_outlook region
## 1161 22.3 East Asia
## 891 22.2 East Asia
the1[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]
## name scores_teaching scores_research
## 1258 Maharaja Sayajirao University of Baroda 13.9 8.8
## 1 University of Oxford 91.3 99.6
## scores_international_outlook region
## 1258 15.1 South Asia
## 1 96.4 West Europe
It seems that they make sense, so we can continue.
Now, it is necessary to understand how many clusters we can retain. I again apply silhouette width statistic.
fviz_nbclust(gower_mat, cluster::pam, method = "silhouette")
The plot shows that we should use 10 clusters.
And now we conduct clustering using partitioning around medoids (PAM).
pam_fit <- pam(gower_dist, diss = TRUE, k = 10)
pam_fit$data <- the1[ , -1]
I did clustering and now we can see the results in visual form.
For now I have three results of cluster analysis: two based on numeric data and one based on both numeric and categorical data. It is necessary to choose the best solution. Thus, let’s look at visualizations one more time.
Results of k-means clustering and hierarchical clustering look similar but do not seem to be very meaningful. In both plots 1st clusters are very scattered. Also, 1st and 2nd clusters in both plots are very close to each other and it is very hard to detect difference. Although PAM clustering with mixed data has quite a large number of clusters (10), it looks more meaningful. Thus, I would take the last result as the best solution.
Since I chose the most appropriate result, I can now define what clusters basically mean. Let’s look at the statistics of each one.
pam_results <- the1 %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit$clustering)%>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## scores_teaching scores_research scores_international_outlook
## Min. :15.10 Min. : 9.80 Min. :42.40
## 1st Qu.:23.00 1st Qu.:19.80 1st Qu.:63.10
## Median :31.30 Median :31.10 Median :76.50
## Mean :33.32 Mean :34.09 Mean :75.16
## 3rd Qu.:39.40 3rd Qu.:44.60 3rd Qu.:87.70
## Max. :91.30 Max. :99.60 Max. :99.60
##
## region cluster
## West Europe :243 Min. :1
## North Europe : 18 1st Qu.:1
## Southeast Asia: 4 Median :1
## Africa : 0 Mean :1
## Australia : 0 3rd Qu.:1
## South Europe : 0 Max. :1
## (Other) : 0
##
## [[2]]
## scores_teaching scores_research scores_international_outlook
## Min. :15.90 Min. : 8.20 Min. :21.30
## 1st Qu.:26.00 1st Qu.:19.95 1st Qu.:38.55
## Median :34.45 Median :27.45 Median :50.05
## Mean :38.97 Mean :35.48 Mean :52.34
## 3rd Qu.:46.62 3rd Qu.:42.70 3rd Qu.:64.58
## Max. :94.40 Max. :98.80 Max. :94.70
##
## region cluster
## North America :211 Min. :2
## North Europe : 8 1st Qu.:2
## Southeast Asia : 6 Median :2
## Central America: 1 Mean :2
## Africa : 0 3rd Qu.:2
## Australia : 0 Max. :2
## (Other) : 0
##
## [[3]]
## scores_teaching scores_research scores_international_outlook
## Min. :14.00 Min. : 7.60 Min. : 15.30
## 1st Qu.:19.10 1st Qu.:11.80 1st Qu.: 22.80
## Median :23.15 Median :16.05 Median : 28.50
## Mean :27.01 Mean :22.65 Mean : 32.42
## 3rd Qu.:29.48 3rd Qu.:27.15 3rd Qu.: 37.25
## Max. :89.60 Max. :94.90 Max. :100.00
##
## region cluster
## East Asia :288 Min. :3
## Southeast Asia: 4 1st Qu.:3
## Africa : 0 Median :3
## Australia : 0 Mean :3
## North Europe : 0 3rd Qu.:3
## South Europe : 0 Max. :3
## (Other) : 0
##
## [[4]]
## scores_teaching scores_research scores_international_outlook
## Min. :16.40 Min. :13.20 Min. :59.80
## 1st Qu.:22.38 1st Qu.:26.77 1st Qu.:84.15
## Median :25.60 Median :32.90 Median :89.25
## Mean :31.48 Mean :37.14 Mean :87.27
## 3rd Qu.:38.58 3rd Qu.:43.10 3rd Qu.:93.42
## Max. :75.90 Max. :90.80 Max. :97.00
##
## region cluster
## Australia :45 Min. :4
## North Europe : 7 1st Qu.:4
## Southeast Asia: 3 Median :4
## Middle East : 1 Mean :4
## Africa : 0 3rd Qu.:4
## South Europe : 0 Max. :4
## (Other) : 0
##
## [[5]]
## scores_teaching scores_research scores_international_outlook
## Min. :12.70 Min. : 8.60 Min. :25.90
## 1st Qu.:17.93 1st Qu.:14.57 1st Qu.:39.20
## Median :20.00 Median :18.00 Median :44.50
## Mean :22.69 Mean :19.74 Mean :46.02
## 3rd Qu.:25.18 3rd Qu.:22.75 3rd Qu.:53.15
## Max. :55.60 Max. :41.40 Max. :75.90
##
## region cluster
## South Europe :129 Min. :5
## North Europe : 7 1st Qu.:5
## Central America: 1 Median :5
## Southeast Asia : 1 Mean :5
## Africa : 0 3rd Qu.:5
## Australia : 0 Max. :5
## (Other) : 0
##
## [[6]]
## scores_teaching scores_research scores_international_outlook
## Min. :12.40 Min. : 7.20 Min. :22.60
## 1st Qu.:15.45 1st Qu.: 7.90 1st Qu.:37.10
## Median :18.30 Median : 9.20 Median :43.10
## Mean :19.21 Mean :12.28 Mean :43.78
## 3rd Qu.:22.50 3rd Qu.:11.30 3rd Qu.:46.95
## Max. :31.30 Max. :43.90 Max. :81.60
##
## region cluster
## Africa :63 Min. :6
## Southeast Asia : 6 1st Qu.:6
## Central America: 4 Median :6
## North Europe : 2 Mean :6
## Australia : 0 3rd Qu.:6
## South Europe : 0 Max. :6
## (Other) : 0
##
## [[7]]
## scores_teaching scores_research scores_international_outlook
## Min. :14.0 Min. : 7.70 Min. :16.50
## 1st Qu.:17.4 1st Qu.: 9.60 1st Qu.:26.90
## Median :19.5 Median :11.60 Median :34.50
## Mean :22.0 Mean :13.99 Mean :37.44
## 3rd Qu.:22.6 3rd Qu.:14.50 3rd Qu.:44.50
## Max. :80.0 Max. :67.60 Max. :75.80
##
## region cluster
## East Europe :124 Min. :7
## Southeast Asia : 10 1st Qu.:7
## North Europe : 2 Median :7
## Central America: 1 Mean :7
## Africa : 0 3rd Qu.:7
## Australia : 0 Max. :7
## (Other) : 0
##
## [[8]]
## scores_teaching scores_research scores_international_outlook
## Min. :11.70 Min. : 7.10 Min. :14.90
## 1st Qu.:17.60 1st Qu.: 9.70 1st Qu.:18.90
## Median :21.40 Median :13.10 Median :24.00
## Mean :23.01 Mean :15.77 Mean :37.53
## 3rd Qu.:26.40 3rd Qu.:18.60 3rd Qu.:53.60
## Max. :46.80 Max. :52.00 Max. :99.40
##
## region cluster
## Middle East :134 Min. :8
## Southeast Asia: 2 1st Qu.:8
## North Europe : 1 Median :8
## Africa : 0 Mean :8
## Australia : 0 3rd Qu.:8
## South Europe : 0 Max. :8
## (Other) : 0
##
## [[9]]
## scores_teaching scores_research scores_international_outlook
## Min. :13.20 Min. : 7.3 Min. :14.60
## 1st Qu.:15.80 1st Qu.: 8.9 1st Qu.:22.30
## Median :17.50 Median :10.1 Median :28.20
## Mean :19.54 Mean :12.1 Mean :32.28
## 3rd Qu.:21.20 3rd Qu.:12.9 3rd Qu.:44.00
## Max. :56.60 Max. :58.9 Max. :58.90
##
## region cluster
## South America :90 Min. :9
## Central America:13 1st Qu.:9
## Southeast Asia :10 Median :9
## Africa : 0 Mean :9
## Australia : 0 3rd Qu.:9
## North Europe : 0 Max. :9
## (Other) : 0
##
## [[10]]
## scores_teaching scores_research scores_international_outlook
## Min. :12.10 Min. : 7.20 Min. :13.50
## 1st Qu.:18.10 1st Qu.: 9.25 1st Qu.:16.15
## Median :23.40 Median :11.00 Median :19.30
## Mean :25.34 Mean :13.08 Mean :24.30
## 3rd Qu.:30.40 3rd Qu.:14.90 3rd Qu.:33.10
## Max. :58.10 Max. :53.10 Max. :46.70
##
## region cluster
## South Asia :84 Min. :10
## Southeast Asia: 3 1st Qu.:10
## Africa : 0 Median :10
## Australia : 0 Mean :10
## North Europe : 0 3rd Qu.:10
## South Europe : 0 Max. :10
## (Other) : 0
It is quite hard to give names to clusters. Most of them are formed by regional characteristic. So, I will just name characteristics of clusters without naming. All in all, since our aim is to present universities to applicants, characteristics are more important than names.
1st cluster - good universities with moderately high teaching and research scores and rather high international orientation; located in Western Europe
2nd cluster - the best universities with the best research and learning environment and moderate international outlook; located in North America
3rd cluster - ordinary universities with moderate scores for everything; located in East Asia
4th cluster - the most internationally oriented universities with good learning and research environment; located in Australia
5th cluster - ordinary universities with lower scores for everything; located in South Europe
6th cluster - universities with the worst research and learning environment from Africa
7th cluster - ordinary universities with lower scores for everything; located in East Europe
8th cluster - ordinary universities with lower scores for everything from Middle East
9th cluster - ordinary universities with lower scores for everything from Central and South America
10th cluster - universities with bad research and learning environment from South Asia
To sum up, I made relatively informative, but still not perfect, clusterization of universities, As was presented in the plot, there are some outliers for many clusters which were assigned to one or another cluster somehow. Also, some regions were represented with small shares in different clusters. Thus, the results could be better, but nevertheless are meaningful.
R suggests various ways of reporting data, descriptive statistics and model summaries. They are presented by different packages. In this paper I will review ‘base’, ‘psych’, ‘sjPlot’ and ‘table1’ packages and functions within them aimed to present data and models.
library(psych)
library(sjPlot)
library(table1)
I will bring brief descriptions of these packages from R documentation. base contains basic functions necessary to do simple operations in R. psych is a general purpose toolbox for personality, psychometric theory and experimental psychology. Some of the functions are written to support a book on psychometric theory as well as publications in personality research. sjPlot is a collection of plotting and table output functions for data visualization. Results of various statistical analyses can be visualized using this package, including simple and cross tabulated frequencies, histograms, box plots, linear models, mixed effects models, principal component analysis and correlation matrices, cluster analyses, scatter plots, stacked scales, effects plots of regression models and much more. Finally, table1 creates HTML tables of descriptive statistics, as one would expect to see as the first table in a medical/epidemiological journal article.
As seen from the description, although all these packages have different functions and purposes, they are quite useful. But still, there are different situations where we can apply one or another packages. Descriptive statistics are an important part of work with quantitative data. Three of four mentioned packages contain tools for doing descriptive statistics.
In base package we apply summary function and it produces simple R output. It presents minimal statistics: means, median, extremes and quantiles. It also works with categorical variables and show number of observations. I apply all functions in this paper on dataset ‘iris’ from ‘datasets’ package which is system library as well as ‘base’.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
It is possible to retrieve more information with decribeBy function from psych package. Standard deviations, skewness and kurtosis appeared in the report. However, categorical variable “Species” is identified as numeric which is not good. But it is possible to group statistics by a variable. Good thing is that this table can be nicely printed through kable function in HTML reports.
describeBy(iris)
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
kableExtra:: kable_styling(kableExtra::kable(describeBy(iris), digits = 2), bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sepal.Length | 1 | 150 | 5.84 | 0.83 | 5.80 | 5.81 | 1.04 | 4.3 | 7.9 | 3.6 | 0.31 | -0.61 | 0.07 |
| Sepal.Width | 2 | 150 | 3.06 | 0.44 | 3.00 | 3.04 | 0.44 | 2.0 | 4.4 | 2.4 | 0.31 | 0.14 | 0.04 |
| Petal.Length | 3 | 150 | 3.76 | 1.77 | 4.35 | 3.76 | 1.85 | 1.0 | 6.9 | 5.9 | -0.27 | -1.42 | 0.14 |
| Petal.Width | 4 | 150 | 1.20 | 0.76 | 1.30 | 1.18 | 1.04 | 0.1 | 2.5 | 2.4 | -0.10 | -1.36 | 0.06 |
| Species* | 5 | 150 | 2.00 | 0.82 | 2.00 | 2.00 | 1.48 | 1.0 | 3.0 | 2.0 | 0.00 | -1.52 | 0.07 |
table1 presents similar to summary information, but in more nice looking format. It is common to see such tables in research papers as was mentioned in description. This function also works fine with categorical data, and it is possible to group statistics by it. There is a little disadvantage that it is necessary to write the formula for table but not just insert data set.
table1(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
| Overall (N=150) |
|
|---|---|
| Sepal.Length | |
| Mean (SD) | 5.84 (0.828) |
| Median [Min, Max] | 5.80 [4.30, 7.90] |
| Sepal.Width | |
| Mean (SD) | 3.06 (0.436) |
| Median [Min, Max] | 3.00 [2.00, 4.40] |
| Petal.Length | |
| Mean (SD) | 3.76 (1.77) |
| Median [Min, Max] | 4.35 [1.00, 6.90] |
| Petal.Width | |
| Mean (SD) | 1.20 (0.762) |
| Median [Min, Max] | 1.30 [0.100, 2.50] |
| Species | |
| setosa | 50 (33.3%) |
| versicolor | 50 (33.3%) |
| virginica | 50 (33.3%) |
table1(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width | Species, data = iris)
| setosa (N=50) |
versicolor (N=50) |
virginica (N=50) |
Overall (N=150) |
|
|---|---|---|---|---|
| Sepal.Length | ||||
| Mean (SD) | 5.01 (0.352) | 5.94 (0.516) | 6.59 (0.636) | 5.84 (0.828) |
| Median [Min, Max] | 5.00 [4.30, 5.80] | 5.90 [4.90, 7.00] | 6.50 [4.90, 7.90] | 5.80 [4.30, 7.90] |
| Sepal.Width | ||||
| Mean (SD) | 3.43 (0.379) | 2.77 (0.314) | 2.97 (0.322) | 3.06 (0.436) |
| Median [Min, Max] | 3.40 [2.30, 4.40] | 2.80 [2.00, 3.40] | 3.00 [2.20, 3.80] | 3.00 [2.00, 4.40] |
| Petal.Length | ||||
| Mean (SD) | 1.46 (0.174) | 4.26 (0.470) | 5.55 (0.552) | 3.76 (1.77) |
| Median [Min, Max] | 1.50 [1.00, 1.90] | 4.35 [3.00, 5.10] | 5.55 [4.50, 6.90] | 4.35 [1.00, 6.90] |
| Petal.Width | ||||
| Mean (SD) | 0.246 (0.105) | 1.33 (0.198) | 2.03 (0.275) | 1.20 (0.762) |
| Median [Min, Max] | 0.200 [0.100, 0.600] | 1.30 [1.00, 1.80] | 2.00 [1.40, 2.50] | 1.30 [0.100, 2.50] |
Besides descriptive statistics, reports on tests and models are important to be shown. In case of base package, we appeal to already known summary. To be honest, summary is rather universal function able to summarize different objects in R. The output for linear models contain all important numbers and statistics.
mod1 <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24675 -0.29657 -0.01515 0.27676 1.00269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
## Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
sjPlot in its turn suggest proper looking tables and even allows to put several models together which is very convenient for comparison of different models. It is less detailed than previous one, but enough for interpretation.
mod2 <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
tab_model(model1, mod2)
| Q211 | Sepal.Length | |||||
|---|---|---|---|---|---|---|
| Predictors | Odds Ratios | CI | p | Estimates | CI | p |
| (Intercept) | 0.66 | 0.59 – 0.74 | <0.001 | 4.19 | 4.00 – 4.38 | <0.001 |
| eduT [1] | 1.26 | 1.03 – 1.54 | 0.027 | |||
| Petal.Length | 0.54 | 0.40 – 0.68 | <0.001 | |||
| Petal.Width | -0.32 | -0.64 – -0.00 | 0.048 | |||
| Observations | 1736 | 150 | ||||
| R2 Tjur | 0.003 | 0.766 / 0.763 | ||||
It is also possible to create pretty correlation tables with tab_corr.
tab_corr(iris[ , 1:4])
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
|---|---|---|---|---|
| Sepal.Length | -0.118 | 0.872*** | 0.818*** | |
| Sepal.Width | -0.118 | -0.428*** | -0.366*** | |
| Petal.Length | 0.872*** | -0.428*** | 0.963*** | |
| Petal.Width | 0.818*** | -0.366*** | 0.963*** | |
| Computed correlation used pearson-method with listwise-deletion. | ||||
Well, I brought some examples of how to do descriptive statistics and model summaries in R. It is clear that different packages can be applied in different situations because there are some requirements to follow. Thus, the choice of tools depends on the conditions. If it is a routine work and a person does some computations for him/herself, then some basic stuff is enough because it is simple, informative and easy to get. In this case base package is the best choice. For situations where some official report is demanded (e.g. in articles) it is better to use more advanced tools to meet publishing standards and requirements. It is possible to use sjPlot or table1 since they produce good tables. Other way is to apply kableExtra package in combination with other packages (e.g. psych) to create great customized tabulated reports. The choice is choice is not limited by these packages and there are many other options to do great data descriptions in more interactive way for some web-pages.
To sum up, if to create some toolset for beginners who could make convincing reports, I would suggest psych :: decribeBy + customized kable table for descriptive statistics because it is quite simple to create and fully informative. As for regression models, which are frequently reported, I recommend sjPlot :: tab_modelbecause it is also easy to make and it produce nice-looking comparable tables. These two tools are more or less enough for newcomers in R. As for serious papers there are customized in classic themes kable tables and table1 just because it is common to see products of these tools in existing articles.
At first, I would say that Bayesian thinking is not the thing that can be understood by myself on the first try to be honest, but I will try to do my best in this paper. Since me and my friends did mostly qualitative research in previous year, it is rather impossible to apply Bayesian thinking to those works. Thus, I will try to analyze my current diploma paper which I work on.
It is necessary to briefly describe the idea of my research. The topic sounds something like ‘association between health-related occupations and risky sexual behavior’. And the general research question is the following: “Is there any association between being a medical worker and engagement in risky sexual behavior?” The data comes from General Social Survey and the method of data analysis will be binary logistic regression.
Well, Bayesian inference works with probabilities. So, in Bayesian perspective the research question will be changed in such a way: “What is the probability to be engaged in risky sexual behavior under condition of being a medical worker?” In order to do Bayesian inference, it is necessary to have generative mode, prior probability and data. So, if I were doing my research in Bayesian perspective, I would need some of these things. As I was mentioned, I use data from General Social Survey. I will need some prior probability of being engaged in risky sexual behavior as well as shares of medical workers in population. And on the one hand I could take this prior probability from some sources or statistics and then use General Social Survey as data for doing inference. On the other hand, if I had enough resources for doing in such a way, I could count prior probability of risky sexual behavior based on General Social Survey data and then conduct my own survey to attain data for making inference. There is even one more possible way of doing. Bayesian thinking also implies some kind of updating of knowledge by multiple trials. And since General Social Survey is a periodic study which is conducted every two years, then I could use several rounds of survey for updating the model and transforming posterior probability into prior probability for the next step.
It seems that the result of the study would be a single probability or a credential interval that would answer the research question, Since Bayesian inference has nothing to do with null hypothesis testing, p-values and other stuff, it is not necessary to provide any additional statistics to justify significance of the results.
As I said previously, Bayesian thinking is not a very easy issue, so ordinary people may not catch the idea of the study if it will be presented in pure scientific words. Still, it is important to deliver the insight of study to public and stakeholders as it carries some social value. I think, the better way for presenting the whole thing is by using graphics showing distributions of probabilities, intersections of conditions, etc. It seems to be more intuitive to understand the meaning through the picture rather than just with some numbers.