Scraping data from Wikipedia

Introduction

In this project I will apply data scraping skills to retrieve a table of the Times Higher Education University World Ranking from a Wikipedia page. The given data will be summarized in an appropriate plot showing changes in rating of universities.

The first step is, as usual, loading necessary packages.

library(rvest)
library(dplyr)
library(CGPfunctions)
library(ggthemes)

Data scraping

To begin with, I download a data from a web-page, then I retrieve tables from it and check whether the table is one of my interest.

wikidata <- "https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings"

data1 <- wikidata%>%
  read_html()%>%
  html_table(fill=TRUE)

rmarkdown::paged_table(data1[[3]])

It seems that I’ve got the right table, so I can save it. However, it is seen that data in some columns with is not in appropriate type because of sign “-” meaning that in some year university dropped from the rating. It is better to transform such cases into NAs and then turn all data into one type. Also, it is better to rename columns as there are useless numbers of Wikipedia links.

#Saving a table to data frame
rating <- data1[[3]]

#Replacing "-" with NAs
rating[,2:12] <- replace(rating[,2:12], rating[,2:12]=="-", NA)  

#Turning data into numeric type
rating[,2:12] <- (lapply(rating[,2:12], as.numeric))

#Changing heads of columns
names(rating) <- c("University", "2010-2011", "2011-2012", "2012-2013", "2013-2014", "2014-2015", "2015-2016", "2016-2017", "2017-2018", "2018-2019", "2019-2020", "2020-2021")

Let’s see the table one more time to be sure that everything is fine.

rmarkdown::paged_table(rating)

As I successfully retrieved the table and cleaned the data, it is now possible to do some work with it.

Building a plot

In this part I will build a slopegraph that shows how first 10 rating positions of universities changed from 2011 to 2021.

Firstly, I need to limit our table to 10 rows.

top10uni <- head(rating, 10)

top10uni%>%
  knitr:: kable()

University	2010-2011	2011-2012	2012-2013	2013-2014	2014-2015	2015-2016	2016-2017	2017-2018	2018-2019	2019-2020	2020-2021
University of Oxford	6	4	2	2	3	2	1	1	1	1	1
Stanford University	4	2	3	4	4	3	3	3	3	4	2
Harvard University	1	2	4	2	2	6	6	6	6	7	3
California Institute of Technology	2	1	1	1	1	1	2	3	5	2	4
Massachusetts Institute of Technology	3	7	5	5	6	5	5	5	4	5	5
University of Cambridge	6	6	7	7	5	4	4	2	2	3	6
University of California, Berkeley	8	10	9	8	8	13	10	18	15	13	7
Yale University	10	11	11	11	9	12	12	12	8	8	8
Princeton University	5	5	6	6	7	7	7	7	7	6	9
University of Chicago	13	9	10	9	11	10	10	9	10	9	10

I am interested only in 2011 and 2021 year, so other years should be deleted from the table. I make a copy of table and leave in the first one data for 2011 and in the second one for 2021. Also, I add new column with year as it is necessary for building a graph. Then, I unite these two tables into one.

top10uni2<-top10uni

top10uni <- top10uni[,-c(3:12)] 
year1 <- c("2011", "2011", "2011", "2011", "2011", "2011", "2011", "2011", "2011", "2011")
top10uni <- cbind(top10uni, year1)

top10uni2 <- top10uni2[,-c(2:11)] 
year2 <- c("2021", "2021", "2021", "2021", "2021", "2021", "2021", "2021", "2021", "2021")
top10uni2 <- cbind(top10uni2, year2)

names(top10uni) <- c("uni", "rating", "year")
names(top10uni2) <- c("uni", "rating", "year")

topuni <- rbind(top10uni, top10uni2)

topuni%>%
  knitr:: kable()

uni	rating	year
University of Oxford	6	2011
Stanford University	4	2011
Harvard University	1	2011
California Institute of Technology	2	2011
Massachusetts Institute of Technology	3	2011
University of Cambridge	6	2011
University of California, Berkeley	8	2011
Yale University	10	2011
Princeton University	5	2011
University of Chicago	13	2011
University of Oxford	1	2021
Stanford University	2	2021
Harvard University	3	2021
California Institute of Technology	4	2021
Massachusetts Institute of Technology	5	2021
University of Cambridge	6	2021
University of California, Berkeley	7	2021
Yale University	8	2021
Princeton University	9	2021
University of Chicago	10	2021

This is what I have after manipulations with data. Now, it is possible to construct our slopegraph. I decided to apply newggslopegraph() function from CGPfunctions package.

Summary

As seen from the graph, Top-10 of 2011 and Top-10 of 2021 are rather similar in terms of names of universities that fill the rating. There are changes in the highest positions, while lower rankings did not change too much. By 10 years, 5 universities increased their rating, 4 fell to lower positions and 1 have stayed on the same place.

Project on logistic regression

Introduction

In this project I will build a binary logistic regression model that is intended to predict attending lawful demonstrations. The data for the analysis is taken from 7th wave of World Values Survey. The focus of analysis is Russia. I have 9 variables to select from prepared dataset and there will be up to 2 variables of categorical and continuous type as predictors.

Getting started

This section is devoted to preparation of data before analysis. First, it is necessary to upload packages and dataset.

library(dplyr)
library(ggplot2)
library(ggthemes)
library(psych)
library(kableExtra)
library(equatiomatic)
library(gridExtra)
library(ggpubr)
library(naniar)
library(VIM)
library(mice)
library(margins)
library(caret)
library(car)
library(sjPlot)
library(rcompanion)
library(tidyr)

df <- readRDS("ruwvs")

Now I have a dataset and it is necessary to look at its structure to be sure that variables are of a proper type.

str(df)

## 'data.frame':    1810 obs. of  10 variables:
##  $ Q211      : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 2 2 ...
##  $ intinpol  : Factor w/ 4 levels "Very interested",..: 4 2 2 3 3 3 3 2 3 4 ...
##  $ eduR      : Factor w/ 4 levels "Primary","Secondary",..: 3 4 4 4 4 2 2 2 4 2 ...
##  $ townsize  : Factor w/ 5 levels "Under 5,000",..: 5 5 5 5 5 4 4 4 5 5 ...
##  $ settlement: Factor w/ 5 levels "Capital city",..: 1 1 1 2 2 2 2 2 1 1 ...
##  $ income    : Factor w/ 10 levels "Lower step","second step",..: 8 6 6 3 4 3 6 4 5 9 ...
##  $ region    : Factor w/ 8 levels "RU: North West Federal District",..: 2 2 2 4 4 2 2 2 2 2 ...
##  $ age       : Factor w/ 85 levels "16","17","18",..: 21 43 27 32 46 54 25 63 36 13 ...
##  $ income1   : num  7 5 5 2 3 2 5 3 4 8 ...
##  $ eduT      : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 2 1 ...

#Age is factor now, but should be in numeric form
df$age<-as.numeric(as.character(df$age))

#The value of settlement "Another city, town(not a regional or district center)" is rather long and it will be inconvenient to print it in graphics and model summaries. Let's make it shorter.
levels(df$settlement)[4]<-"Another city"

#income is in continuous form, so just for the convenience let's delete its categorical form.
df$income<-NULL

Now, the environment is prepared for work and I can proceed to inspection of data in the next part.

Data description

In this part of the project, the variables for further analysis are described in detail. To get information about variables, press buttons.

Attending demonstrations

Attending peaceful demonstrations (Q211) is a binary variable where 1 means that a person will attend a demonstration while 0 refers to rejection to attend. Q211 is an outcome variable in our model. From the barplots below it is seen that both values of the variable are filled enough with observations.

Interest in politics

Interest in politics (intinpol) is an ordinal variable with four levels from “Not at all interested” [in politics] to “Very interested”. “intinpol” can be used as a predictor in our model. As seen from the barplots below, both values of the variable are filled enough with observations. It seems that it is possible to include this variable as a predictor.

Education

Education (eduR) is an ordinal variable with four levels (Primary, Secondary, Post-secondary and Tertiary). Because of Russian context where a person should have at least secondary education, it is very rare that someone have only primary education. Considering this fact it is useless to include this variable in analysis because primary education as reference group in the model will always be in inferior position because of small number of cases.

Size of town

Size of town (townsize) is an ordinal variable representing sizes of population in its levels. From the barplots below it is possible to see that both values of the variable are filled enough with observations. However, it seems that this variable is not much appropriate for the model.

Type of settlement

Type of settlement (settlement) is a categorical variable containing different types of residential areas in Russia. It will not be included in a model because some values has not very many observations.

Region

Region is a categorical variable with names of Russian regions. It will not be included in analysis because it is rather hard to define proper reference group here to compare while interpreting model.

Age

Age is a continuous variable presenting age of respondents from WVS. The distribution and descriptive statistics are presented below. By looking at histogram, mean, median, skew and kurtosis, it is clear that distribution is not normal. I will include age in our model since it is necessary to have continuous predictors and age could be a good one for predicting attending demonstrations.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1810	45.41	17.12	43	18	91	0.35	-0.81

Income

Income is a continuous variable transformed from ordinal variable with levels of income. The distribution and descriptive statistics are presented below. I will include age in our model since it is necessary to have continuous predictors and income could be a good one for predicting attending demonstrations.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1747	3.77	1.93	4	0	9	0.1	-0.39

Tertiary education

Having tertiary education (eduT) is a binary variable with values 0 and 1 which corresponds to not having and having higher education respectively. From the barplots below it is clear that both values of the variable are filled enough with observations. It seems to be a good idea to include this variable in our analysis.

Dealing with missing data

In this section I am trying to solve the problem of missing data. To begin with, let’s observe presence of NAs in our data

miss_var_summary(df)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)

variable	n_miss	pct_miss
income1	63	3.4806630
eduR	10	0.5524862
eduT	10	0.5524862
intinpol	6	0.3314917
Q211	0	0.0000000
townsize	0	0.0000000
settlement	0	0.0000000
region	0	0.0000000
age	0	0.0000000

vis_miss(df)

As seen from figures above, there are 4 variables with missing data, and overall there is only 0.5% of miss in the whole dataset. This percent is very small, so it is possible to apply imputation to fill missing.

Data imputation

First, let’s try k-Nearest Neighbour Imputation (kNN) from VIM package. As seen from the table below, all missing values were successfully imputed.

df_knn <- kNN(df, k=5, variable=c("income1", "eduR", "eduT", "intinpol"))

miss_var_summary(df_knn)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)

variable	n_miss	pct_miss
Q211	0	0
intinpol	0	0
eduR	0	0
townsize	0	0
settlement	0	0
region	0	0
age	0	0
income1	0	0
eduT	0	0
income1_imp	0	0
eduR_imp	0	0
eduT_imp	0	0
intinpol_imp	0	0

Now, let’s try another method - Multivariate Imputation by Chained Equations (MICE) from mice package. As seen from the table below, all missing values were successfully imputed.

dfmice <- mice(df, m=5, defaultMethod = c("pmm", "logreg", "polyreg", "polr"))

## 
##  iter imp variable
##   1   1  intinpol  eduR  income1  eduT
##   1   2  intinpol  eduR  income1  eduT
##   1   3  intinpol  eduR  income1  eduT
##   1   4  intinpol  eduR  income1  eduT
##   1   5  intinpol  eduR  income1  eduT
##   2   1  intinpol  eduR  income1  eduT
##   2   2  intinpol  eduR  income1  eduT
##   2   3  intinpol  eduR  income1  eduT
##   2   4  intinpol  eduR  income1  eduT
##   2   5  intinpol  eduR  income1  eduT
##   3   1  intinpol  eduR  income1  eduT
##   3   2  intinpol  eduR  income1  eduT
##   3   3  intinpol  eduR  income1  eduT
##   3   4  intinpol  eduR  income1  eduT
##   3   5  intinpol  eduR  income1  eduT
##   4   1  intinpol  eduR  income1  eduT
##   4   2  intinpol  eduR  income1  eduT
##   4   3  intinpol  eduR  income1  eduT
##   4   4  intinpol  eduR  income1  eduT
##   4   5  intinpol  eduR  income1  eduT
##   5   1  intinpol  eduR  income1  eduT
##   5   2  intinpol  eduR  income1  eduT
##   5   3  intinpol  eduR  income1  eduT
##   5   4  intinpol  eduR  income1  eduT
##   5   5  intinpol  eduR  income1  eduT

df_mice <- complete(dfmice,3)

miss_var_summary(df_mice)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)

variable	n_miss	pct_miss
Q211	0	0
intinpol	0	0
eduR	0	0
townsize	0	0
settlement	0	0
region	0	0
age	0	0
income1	0	0
eduT	0	0

Visualization of imputed data

At this point I can visualize imputed data. Since the share of missings in variables eduR, eduT and intinpol was very small, it is rather useless to do visualization because it will be hard to see anything. However, it is still possible to work with income1 variable. In the histograms below visualization of imputed data by two methods is presented.

According to the histograms above, it can be said that kNN works better here because imputed data repeats distribution of original data.

Building a model

Now, I am ready to start model building. To remind, I have Q211 as an outcome variable and income1, age, intinpol and eduT as predictors. I will use forward stepwise method meaning adding predictors one by one and looking at whether the model became better.

Before the modeling I want to be sure that there is no empty or small cells by doing a crosstab between categorical predictors and the outcome variable.

xtabs(~ Q211 + intinpol, data = df)

##     intinpol
## Q211 Very interested Somewhat interested Not very interested
##    0              69                 352                 416
##    1              79                 286                 297
##     intinpol
## Q211 Not at all interested
##    0                   229
##    1                    76

xtabs(~ Q211 + eduT, data = df)

##     eduT
## Q211   0   1
##    0 726 339
##    1 470 265

Everything looks rather good. So, let’s start with tertiary education.

model1 <- glm(Q211 ~ eduT, data = df0, family = binomial())
summary(model1)

## 
## Call:
## glm(formula = Q211 ~ eduT, family = binomial(), data = df0)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.100  -1.008  -1.008   1.357   1.357  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4133     0.0599   -6.90 5.19e-12 ***
## eduT1         0.2278     0.1031    2.21   0.0271 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2357.9  on 1735  degrees of freedom
## Residual deviance: 2353.1  on 1734  degrees of freedom
## AIC: 2357.1
## 
## Number of Fisher Scoring iterations: 4

As seen from the output above, the predictor is significant and the residual deviance shows that having a model is better than no model at all. Let’s add one more predictor - interest in politics.

model2 <- glm(Q211 ~ eduT + intinpol, data = df0, family = binomial())
summary(model2)

## 
## Call:
## glm(formula = Q211 ~ eduT + intinpol, family = binomial(), data = df0)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3268  -1.0692  -0.7605   1.2595   1.6624  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     0.1608     0.1731   0.929  0.35284    
## eduT1                           0.1837     0.1045   1.758  0.07871 .  
## intinpolSomewhat interested    -0.4207     0.1874  -2.245  0.02479 *  
## intinpolNot very interested    -0.5356     0.1860  -2.880  0.00398 ** 
## intinpolNot at all interested  -1.2534     0.2156  -5.814 6.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2357.9  on 1735  degrees of freedom
## Residual deviance: 2309.4  on 1731  degrees of freedom
## AIC: 2319.4
## 
## Number of Fisher Scoring iterations: 4

The output above shows that intinpol is significant while eduT lost its significance.

Now, I will compare given two models with help of anova.

anova(model1, model2, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Q211 ~ eduT
## Model 2: Q211 ~ eduT + intinpol
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1734     2353.1                          
## 2      1731     2309.4  3   43.639 1.801e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output says that the second model is significantly better than the first one, so adding “intinpol” variable improved the model.

Moving forward, I add another predictor - income.

model3 <- glm(Q211 ~ eduT + intinpol + income1, data = df0, family = binomial())
summary(model3)

## 
## Call:
## glm(formula = Q211 ~ eduT + intinpol + income1, family = binomial(), 
##     data = df0)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4405  -1.0619  -0.7961   1.2545   1.8209  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    0.35610    0.19621   1.815  0.06954 .  
## eduT1                          0.24397    0.10848   2.249  0.02451 *  
## intinpolSomewhat interested   -0.40412    0.18790  -2.151  0.03150 *  
## intinpolNot very interested   -0.53551    0.18630  -2.874  0.00405 ** 
## intinpolNot at all interested -1.28521    0.21655  -5.935 2.94e-09 ***
## income1                       -0.05748    0.02698  -2.131  0.03312 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2357.9  on 1735  degrees of freedom
## Residual deviance: 2304.9  on 1730  degrees of freedom
## AIC: 2316.9
## 
## Number of Fisher Scoring iterations: 4

As seen from the output above, all predictors are significant. And I again make a comparative test.

anova(model2, model3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Q211 ~ eduT + intinpol
## Model 2: Q211 ~ eduT + intinpol + income1
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1      1731     2309.4                       
## 2      1730     2304.9  1   4.5658  0.03262 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-value is significant, this model3 is better than model2.

And I do the final step, adding last predictor - age.

model4 <- glm(Q211 ~ eduT + intinpol + income1 + age, data = df0, family = binomial())
summary(model4)

## 
## Call:
## glm(formula = Q211 ~ eduT + intinpol + income1 + age, family = binomial(), 
##     data = df0)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4389  -1.0612  -0.7981   1.2562   1.8317  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    0.3838016  0.2705450   1.419  0.15601    
## eduT1                          0.2421433  0.1091694   2.218  0.02655 *  
## intinpolSomewhat interested   -0.4053699  0.1880869  -2.155  0.03114 *  
## intinpolNot very interested   -0.5381749  0.1871663  -2.875  0.00404 ** 
## intinpolNot at all interested -1.2892335  0.2182435  -5.907 3.48e-09 ***
## income1                       -0.0585451  0.0279182  -2.097  0.03599 *  
## age                           -0.0004586  0.0030831  -0.149  0.88176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2357.9  on 1735  degrees of freedom
## Residual deviance: 2304.8  on 1729  degrees of freedom
## AIC: 2318.8
## 
## Number of Fisher Scoring iterations: 4

As seen from above, all predictors but age are significant. Comparing models will show if the model was improved.

anova(model3, model4, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Q211 ~ eduT + intinpol + income1
## Model 2: Q211 ~ eduT + intinpol + income1 + age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      1730     2304.9                     
## 2      1729     2304.8  1 0.022128   0.8817

The output shows that model4 is not better than model3. Thus, I can say that model3 is the best model and I will take it for further analysis.

Interpretation

In this part I will interpret coefficients of the model. First, let’s write out the model equation.

extract_eq(model3, wrap = TRUE, use_coefs=TRUE, terms_per_line = 3)

\[ \begin{aligned} \log\left[ \frac { P( \operatorname{Q211} = \operatorname{1} ) }{ 1 - P( \operatorname{Q211} = \operatorname{1} ) } \right] &= 0.36 + 0.24(\operatorname{eduT}_{\operatorname{1}}) - 0.4(\operatorname{intinpol}_{\operatorname{Somewhat\ interested}})\ - \\ &\quad 0.54(\operatorname{intinpol}_{\operatorname{Not\ very\ interested}}) - 1.29(\operatorname{intinpol}_{\operatorname{Not\ at\ all\ interested}}) - 0.06(\operatorname{income1})\ + \\ &\quad \epsilon \end{aligned} \]

Then, I extract odds ratios by exponentiation of coefficients from equation.

exp(cbind(OR = coef(model3), confint(model3)))

##                                      OR     2.5 %    97.5 %
## (Intercept)                   1.4277497 0.9733762 2.1029471
## eduT1                         1.2763087 1.0317961 1.5788207
## intinpolSomewhat interested   0.6675668 0.4608576 0.9636953
## intinpolNot very interested   0.5853727 0.4053375 0.8423304
## intinpolNot at all interested 0.2765914 0.1802280 0.4215937
## income1                       0.9441430 0.8953447 0.9952651

The numbers from output above state the following:

When a person have tertiary education, his/her odds to attend demonstration change by a factor of 1.28 (or increase by 28%) as compared to person without tertiary education (95% CI = [1.03 ; 1.58])
When a person is somewhat interested in politics, his/her odds to attend demonstration change by a factor of 0.67 (or decrease by 33%) as compared to person who is very interested in politics (95% CI = [0.46 ; 0.96])
When a person is not very interested in politics, his/her odds to attend demonstration change by a factor of 0.58 (or decrease by 42%) as compared to person who is very interested in politics (95% CI = [0.40 ; 0.84])
When a person is not at all interested in politics, his/her odds to attend demonstration change by a factor of 0.27 (or decrease by 73%) as compared to person very interested in politics (95% CI = [0.18 ; 0.42])
When income increases by one unit, the odds that a person will attend a demonstration change by factor of 0.94 (or decrease by 6%) (95% CI = [0.89 ; 0.99])

Now, let’s inspect average marginal effects and interpret them.

m <- margins(model3, type = "response")
summary(m)

##                         factor     AME     SE       z      p   lower   upper
##                          eduT1  0.0580 0.0258  2.2426 0.0249  0.0073  0.1086
##                        income1 -0.0136 0.0063 -2.1413 0.0322 -0.0260 -0.0011
##  intinpolNot at all interested -0.2973 0.0489 -6.0781 0.0000 -0.3932 -0.2015
##    intinpolNot very interested -0.1324 0.0457 -2.8962 0.0038 -0.2220 -0.0428
##    intinpolSomewhat interested -0.1002 0.0462 -2.1670 0.0302 -0.1909 -0.0096

The numbers from output above say the following:

for two hypothetical individuals with average values on income, the predicted probability of attending demonstration is 0.06 greater for the individual with tertiary education than for one without tertiary education
if income increased by some very small amount (0.01), then the predicted probability of attending demonstration would decrease by about 0.01 * 0.01
for two hypothetical individuals with average values on income, the predicted probability of attending demonstration is 0.29 smaller for the individual who is not at all interested in politics than for one who is very interested in politics
for two hypothetical individuals with average values on income, the predicted probability of attending demonstration is 0.13 smaller for the individual who is not very interested in politics than for one who is very interested in politics
for two hypothetical individuals with average values on income, the predicted probability of attending demonstration is 0.10 smaller for the individual who is somewhat interested in politics than for one who is very interested in politics

Predicting outcome

In this part I will try to predict values of Y using different values of predictor.

First, let’s try to predict with fixed average value of income.

pol <- c("Not at all interested", "Not very interested", "Somewhat interested", "Very interested")

newdata1 <- crossing(income1 = mean(df0$income1), 
                     intinpol = factor(pol), 
                     eduT = factor(0:1))

newdata1$pred <- predict(model3, newdata = newdata1, type = "response")

newdata1

## # A tibble: 8 x 4
##   income1 intinpol              eduT   pred
##     <dbl> <fct>                 <fct> <dbl>
## 1    3.77 Not at all interested 0     0.241
## 2    3.77 Not at all interested 1     0.289
## 3    3.77 Not very interested   0     0.402
## 4    3.77 Not very interested   1     0.462
## 5    3.77 Somewhat interested   0     0.434
## 6    3.77 Somewhat interested   1     0.495
## 7    3.77 Very interested       0     0.535
## 8    3.77 Very interested       1     0.595

In the table above there are predicted values of Y in “pred” column for given set of predictors’ values.

And now, let’s try to predict Y without holding income at average.

newdata2 <- crossing(income1 = seq(from=1, to=10, length.out = 10), 
                     intinpol = factor(pol), 
                     eduT = factor(0:1))

newdata3 <- cbind(newdata2, predict(model3, newdata = newdata2, type = "response", se = T))

newdata3 <- within(newdata3, {
  PredictedProb <- fit
  LL <- fit - (1.96 * se.fit)
  UL <- fit + (1.96 * se.fit)
})

head(newdata3)

##   income1              intinpol eduT       fit     se.fit residual.scale
## 1       1 Not at all interested    0 0.2715857 0.02864770              1
## 2       1 Not at all interested    1 0.3224315 0.03759049              1
## 3       1   Not very interested    0 0.4410542 0.02634740              1
## 4       1   Not very interested    1 0.5017719 0.03479455              1
## 5       1   Somewhat interested    0 0.4736510 0.02859693              1
## 6       1   Somewhat interested    1 0.5345643 0.03626430              1
##          UL        LL PredictedProb
## 1 0.3277352 0.2154362     0.2715857
## 2 0.3961088 0.2487541     0.3224315
## 3 0.4926951 0.3894133     0.4410542
## 4 0.5699692 0.4335746     0.5017719
## 5 0.5297009 0.4176010     0.4736510
## 6 0.6056423 0.4634863     0.5345643

In the table above it is possible to see predicted values of Y in “PredictedProb” column. Using this data I can visualize what I have.

From the plots above it is seen that there is no interaction effect neither for interest in politics nor for tertiary education and income. However, the lines indicate that in all cases probability to attend demonstration decrease with the increase of income.

Model quality

In this section I say about model quality using McFadden pseudo-R-squared since AIC works only when we compare models.

nagelkerke(model3)

## $Models
##                                                                
## Model: "glm, Q211 ~ eduT + intinpol + income1, binomial(), df0"
## Null:  "glm, Q211 ~ 1, binomial(), df0"                        
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                            0.0225104
## Cox and Snell (ML)                  0.0301122
## Nagelkerke (Cragg and Uhler)        0.0405338
## 
## $Likelihood.ratio.test
##  Df.diff LogLik.diff  Chisq    p.value
##       -5     -26.539 53.078 3.2414e-10
## 
## $Number.of.observations
##            
## Model: 1736
## Null:  1736
## 
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
## 
## $Warnings
## [1] "None"

For our model McFadden pseudo-R-squared is equal to 0.02 which indicates bad quality because it should have values 0.2-0.4 to be interpreted as satisfactory.

Model accuracy

In this part I will calculate accuracy of the model.

First, I need to create a training and a test subsamples and fit a model on the training data.

bound <- floor((nrow(df0)/2))     
df2 <- df0[sample(nrow(df0)), ]   
df.train <- df2[1:bound, ]              
df.test <- df2[(bound+1):nrow(df2), ]    

testmodel <- glm(Q211 ~ eduT + intinpol + income1, data = df.train, family = "binomial", 
                na.action = na.omit)

And then I build confusion matrix and look at results.

pred <- format(round(predict(testmodel, newdata = df.test, type = "response")))
confusionMatrix(table(pred, df.test$Q211))

## Confusion Matrix and Statistics
## 
##     
## pred   0   1
##    0 477 331
##    1  25  35
##                                           
##                Accuracy : 0.5899          
##                  95% CI : (0.5563, 0.6228)
##     No Information Rate : 0.5783          
##     P-Value [Acc > NIR] : 0.2572          
##                                           
##                   Kappa : 0.0517          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.95020         
##             Specificity : 0.09563         
##          Pos Pred Value : 0.59035         
##          Neg Pred Value : 0.58333         
##              Prevalence : 0.57834         
##          Detection Rate : 0.54954         
##    Detection Prevalence : 0.93088         
##       Balanced Accuracy : 0.52291         
##                                           
##        'Positive' Class : 0               
##

P-value for comparison of accuracy and no information rate is insignificant which means that our model is not better than no model at all.

Model diagnostics

In this section, I do model diagnostics and check some assumptions of logistic regression.

No prefect multicollinearity - there is no perfect linear relationship between predictors. It is checked by Variance Inflation Factors (VIF) below.

vif(model3)

##              GVIF Df GVIF^(1/(2*Df))
## eduT     1.078514  1        1.038515
## intinpol 1.027172  3        1.004478
## income1  1.103277  1        1.050370

For all predictors GVIF is ~ 1 which is good. There is no multicollinearity.

Linearity - linear relationship between continuous predictor variables and the logit of the outcome.

residualPlots(model3)

##          Test stat Pr(>|Test stat|)
## eduT                               
## intinpol                           
## income1     0.3655           0.5455

The results of the test and plots with continuous predictor (income1) show that there is linearity between it and the logit.

Conclusion

So, I built a logistic regression model according to which 1) having higher education is associated with higher probability to attend peaceful demonstrations; 2) lower interest in politics associated with lower probability to attend peaceful demonstrations; 3) higher income is associated with lower probability to attend peaceful demonstrations. However, the model itself is not very good because some quality tests indicate insignificance of model against situation where no model at all.

Project on principle component analysis

Introduction and preparation

In this project I will apply method of principle component analysis (PCA) on the data coming from World University Rankings 2021 by Times Higher Education.

The first step is to prepare packages and data set.

library(dplyr)
library(naniar)
library(kableExtra)
library(corrplot)
library(pca3d)
library(ggplot2)
library(ggthemes)
library(plotly)

data1 <- read.csv('THE2021.csv', header = TRUE, sep = ",")

Exploring the data

In this part I explore the data and correlations between variables in it.

To begin with, let’s look at the structure. For PCA I need numeric variables and, according to the output below, all variables that will be put in PCA function are of proper type.

str(data1)

## 'data.frame':    1448 obs. of  11 variables:
##  $ name                        : chr  "University of Oxford" "Stanford University" "Harvard University" "California Institute of Technology" ...
##  $ location                    : chr  "United Kingdom" "United States" "United States" "United States" ...
##  $ scores_teaching             : num  91.3 92.2 94.4 92.5 90.7 90.3 85.8 91.9 88.8 88.9 ...
##  $ scores_research             : num  99.6 96.7 98.8 96.9 94.4 99.2 97.2 93.8 92.5 90.5 ...
##  $ scores_citations            : num  98 99.9 99.4 97 99.7 95.6 99.1 97.9 98.9 98.6 ...
##  $ scores_industry_income      : num  68.7 90.1 46.8 92.7 90.4 52.1 84.3 56.1 58 54.9 ...
##  $ scores_international_outlook: num  96.4 79.5 77.7 83.6 90 95.7 72.3 68.4 80.2 74 ...
##  $ stats_number_students       : int  20774 16223 21261 2238 11276 19370 39918 12910 8091 14292 ...
##  $ stats_student_staff_ratio   : num  11.1 7.4 9.3 6.3 8.4 11 19.8 6 8 5.9 ...
##  $ stats_pc_intl_students      : int  41 23 25 33 34 38 17 20 23 31 ...
##  $ stats_female_share          : int  46 44 49 36 39 47 51 50 46 46 ...

Also, it is important to check for missing values in the data set. And as seen from the table below, there are no missings in our data.

miss_var_summary(data1)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)

variable	n_miss	pct_miss
name	0	0
location	0	0
scores_teaching	0	0
scores_research	0	0
scores_citations	0	0
scores_industry_income	0	0
scores_international_outlook	0	0
stats_number_students	0	0
stats_student_staff_ratio	0	0
stats_pc_intl_students	0	0
stats_female_share	0	0

Next, it is necessary to check descriptive statistics of the data. As seen from the summary below, scales, units of measurements and extremes are quite different.

summary(data1)

##      name             location         scores_teaching scores_research
##  Length:1448        Length:1448        Min.   :11.70   Min.   : 7.10  
##  Class :character   Class :character   1st Qu.:18.50   1st Qu.:11.40  
##  Mode  :character   Mode  :character   Median :23.40   Median :17.15  
##                                        Mean   :27.53   Mean   :23.13  
##                                        3rd Qu.:31.93   3rd Qu.:28.82  
##                                        Max.   :94.40   Max.   :99.60  
##  scores_citations scores_industry_income scores_international_outlook
##  Min.   :  1.90   Min.   : 33.30         Min.   : 13.50              
##  1st Qu.: 23.15   1st Qu.: 34.70         1st Qu.: 27.50              
##  Median : 44.80   Median : 38.40         Median : 42.25              
##  Mean   : 47.76   Mean   : 45.52         Mean   : 46.96              
##  3rd Qu.: 70.90   3rd Qu.: 48.40         3rd Qu.: 63.02              
##  Max.   :100.00   Max.   :100.00         Max.   :100.00              
##  stats_number_students stats_student_staff_ratio stats_pc_intl_students
##  Min.   :   557        Min.   :  0.90            Min.   : 0.00         
##  1st Qu.:  9746        1st Qu.: 12.20            1st Qu.: 2.00         
##  Median : 17273        Median : 16.20            Median : 7.00         
##  Mean   : 22241        Mean   : 18.51            Mean   :11.26         
##  3rd Qu.: 28753        3rd Qu.: 22.02            3rd Qu.:17.00         
##  Max.   :222102        Max.   :109.10            Max.   :84.00         
##  stats_female_share
##  Min.   : 1.00     
##  1st Qu.:43.00     
##  Median :53.00     
##  Mean   :49.94     
##  3rd Qu.:58.00     
##  Max.   :98.00

In addition, let’s look at the variances of variables.

lapply(data1[ , 3:11], var)

## $scores_teaching
## [1] 179.3113
## 
## $scores_research
## [1] 293.1243
## 
## $scores_citations
## [1] 766.5359
## 
## $scores_industry_income
## [1] 275.3449
## 
## $scores_international_outlook
## [1] 540.7424
## 
## $stats_number_students
## [1] 424682073
## 
## $stats_student_staff_ratio
## [1] 105.9024
## 
## $stats_pc_intl_students
## [1] 144.5496
## 
## $stats_female_share
## [1] 147.2624

The output below shows that variances of variables are quite different and because of this variables will have different weights and variables with high variances will be overrepresented in principal components. To solve this problem, I can standardize variables and make their variances equal to 1.

data1[ , 3:11] <- data1[ , 3:11]%>%
  scale()%>%
  as.data.frame()

Looking at variances again, I see that standardizing was successful.

lapply(data1[ , 3:11], var)

## $scores_teaching
## [1] 1
## 
## $scores_research
## [1] 1
## 
## $scores_citations
## [1] 1
## 
## $scores_industry_income
## [1] 1
## 
## $scores_international_outlook
## [1] 1
## 
## $stats_number_students
## [1] 1
## 
## $stats_student_staff_ratio
## [1] 1
## 
## $stats_pc_intl_students
## [1] 1
## 
## $stats_female_share
## [1] 1

Before PCA it is also necessary to explore correlations between variables.

In the plot below it is seen that there are rather high correlations between some variables.

data1[ , 3:11]%>%
  cor()%>%
  corrplot()

Principle component analysis

In this part I run PCA and describe the results.

pca_res <- prcomp(data1[ , 3:11])

summary(pca_res)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8671 1.2097 1.1265 0.94784 0.81619 0.74787 0.66099
## Proportion of Variance 0.3873 0.1626 0.1410 0.09982 0.07402 0.06215 0.04855
## Cumulative Proportion  0.3873 0.5499 0.6909 0.79075 0.86476 0.92691 0.97545
##                            PC8     PC9
## Standard deviation     0.38634 0.26767
## Proportion of Variance 0.01658 0.00796
## Cumulative Proportion  0.99204 1.00000

In the summary above variance explained by components is presented. All in all, I extracted 9 components, but their meaningfulness will be discussed later. The first principal component explains 0.3873 (~39%) of variance. The threshold of 0.7 of explained variance is crossed between PC3 and PC4.

Also, it is necessary to look at eigenvalues of components. As seen from output below, only three components have eigenvalues more than 1. Guiding by Kaiser-Guttman criterion, I should keep 3 components.

pca_res$sdev ^ 2

## [1] 3.48591462 1.46337701 1.26901618 0.89840475 0.66615817 0.55930826 0.43691022
## [8] 0.14926140 0.07164939

The screeplot below confirms the choice of three components.

screeplot(pca_res, type = "lines")
box()
abline(h = 1)

Guiding by all this information, I can say that it is possible to produce an acceptable PCA solution on these data because there are moderate and sometimes high correlations between variables which means components will take place. Also, as seen from PCA results, more than 50% of variance is explained by first two components which makes them meaningful.

Now, let’s extract the loadings of our components.

round(pca_res$rotation[ , 1:3], 2)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)

	PC1	PC2	PC3
scores_teaching	-0.44	0.23	-0.09
scores_research	-0.48	0.15	-0.17
scores_citations	-0.40	-0.18	0.00
scores_industry_income	-0.28	0.46	-0.27
scores_international_outlook	-0.42	-0.31	0.18
stats_number_students	0.00	-0.26	-0.65
stats_student_staff_ratio	0.00	-0.29	-0.59
stats_pc_intl_students	-0.40	-0.19	0.28
stats_female_share	-0.04	-0.63	0.10

The first principal component has highest negative loadings for scores on teaching, research, citations and international outlook as well as percentage of international students. This reflect some parameters of universities educational and research performance, so I can name this component as low performance. For the second component, there are high positive loading for score on industry income and high negative loading for share of women. As for the third component, statistics on number of students at all and per staff member have high negative loadings.

Now, let’s plot the results of PCA.

Further analysis

In order to make some additional tests and visualizations it is necessary to extract scores (or coordinates) of components and add them to our dataset.

data2 <- cbind(data1, pca_res$x[ , 1:3])

Also, let’s create a binary variable indicating whether university locates in the US or Canada or in other countries.

data2$location_bin <- as.factor(ifelse(data1$location == "United States" | data1$location == "Canada", "US and Canada", "Other countries"))

Using newly created binary variable (location_bin) I can try to answer the question whether the best universities located in the US and Canada.

First, I can build a plot with coloring by ‘location_bin’.

Looking at the plot above, it is hard to say exactly from which countries come the best universities because visually there are more or less same amount of points in low values of PC1 (indication high scores on performance) both for American and Canadian universities and universities from other countries.

To get more concrete answer I can run T-test for PC1 scores and compare two groups from variable ‘location_bin’.

t.test(PC1 ~ location_bin, data = data2)

## 
##  Welch Two Sample t-test
## 
## data:  PC1 by location_bin
## t = 9.5491, df = 253.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.132717 1.721327
## sample estimates:
## mean in group Other countries   mean in group US and Canada 
##                      0.198088                     -1.228934

As was written earlier, the first component (PC1) was named as “low performance” since it has negative correlations with scores on teaching, research, citations and international outlook. Thus, the lower values of PC1 correspond to higher performance of universities.

The result of T-test in the output above suggests that the difference of means between two groups is significant at the p-value < 2.2e-16. The mean of PC1 for universities from the US and Canada (-1.228934) is significantly less than the mean for universities from other countries (0.198088). This fact states that overall American and Canadian universities perform better than universities from other countries. This doesn’t particularly mean that all universities from the US and Canada are strictly better than others, but rather that the former have better performance more frequently than latter.

Finally, I am making an interactive plot with the PCA solution.

Project on cluster analysis

Introduction

In this project I am going to apply method of cluster analysis for the data coming from the World University Rankings 2021 by Times Higher Education. The aim is to clusterise universities for a company targeting university applicants from all over the world. I need to locate clusters of universities so as to match the best suiting university group to each applicant.

Data and package preparation

First, we load, as usual, necessary packages for work.

library(factoextra)
library(cluster)
library(dplyr)
library(psych)
library(kableExtra)
library(ggplot2)
library(ggthemes)
library(naniar)
library(Rtsne)
library(gridExtra)
library(ggpubr)
library(dendextend)

Just to save a space I left scraping procedure out of the report, so I load prepared RDS file as data. The file is a little bit different from the original table that was extracted from Times Higher Education site. I deleted several columns because they did not carry any information about characteristics of universities but were rather technical things.

the <- readRDS("THE_rating.RDS")

Choice of variables

In this part I am choosing variables from dataset for cluster analysis. Since clustering is demanded by a company targeting university applicants, we need to chose variables that would be important for applicants. I suppose that variables about quality of teaching as well as academic abilities (‘scores_teaching’ and ‘scores_research’ respectively) should be included for sure because they are usually main characteristics of university. Also, international outlook of universities may be rather important for applicants to be sure that a university is open for international students.

Also, since universities are presented for applicants from all over the worlds, it can be important where the university is located. Thus, I created new variable based on variable ‘location’ where I specified regions for universities according to countries of their location.

Overall, I take 4 variables for my analysis. Let’s put them in separate dataset.

the1 <- select(the, c(name, scores_teaching, scores_research, scores_international_outlook, region))

Data description

In this section I describe the data obtained for analysis. First, I should check for missing values in the data.

vis_miss(the1)

As seen from the plot above, there are no missing values in our data. And now let’s proceed to variables. Each variable is described in separate subsection below.

scores_teching

Teaching (scores_teaching) represents results of quantitative assessment of learning environment of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.

In the histogram it is seen that values distributed unequally with prevalence of low scores. The distribution is right-skewed.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1526	27.81	13.62	23.55	11.7	94.4	2.02	4.98

scores_research

Research (scores_research) represents results of quantitative assessment of volume, income and reputation of universities in research field. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.

As seen from the histogram, values distributed unequally with prevalence of low scores. The distribution is right-skewed.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1526	23.57	17.33	17.5	7.1	99.6	1.86	3.72

scores_international_outlook

International outlook (scores_international_outlook) represents results of quantitative assessment of international staff, students, research of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.

The histogram shows that observations are distributed more or less equally (at least compared to previous variables), however there is still right skewness.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1526	47.03	23.17	42.3	13.5	100	0.59	-0.73

region

Region is a nominal variable representing information about location of universities in the world. The distribution of the variable is presented below.

As seen from the barplot, regions are distributed quite unequally, however it completely depends on size of territory as well as population.

Cluster analysis

In this section I will conduct cluster analysis. There will be several subsections with preparations, applying different methods and interpretation.

How many clusters to retain?

Before clusterisation it is necessary to decide which number of clusters should be retained. For this, I conduct several tests.

The first approach is ‘elbow method’ to determine the optimal k. To do this, we run several models with k = [1:10] and compare the ‘within-cluster sum of squares to the cluster centroid’ in each case. The optimal number of clusters (k) is where the ‘elbow’ occurs, i.e. a solution where the within-sum-of-squares decreases dramatically.

The results below suggest 2 or 3 clusters.

fviz_nbclust(the1[ , 2:4], kmeans, method = "wss")

The next approach is the ‘silhouette width statistic’. It measures the quality of a clustering. That is, it determines how well each object lies within its cluster.

As seen from the plot below, recommended number of clusters is 2.

fviz_nbclust(the1[ , 2:4], kmeans, method = "silhouette")

Finally, I use gap statistic. It compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).

The plot below suggests 2 clusters for k-means clustering.

fviz_nbclust(the1[ , 2:4], kmeans, method = "gap")

As for hierarchical clustering, it is also suggested to retain 2 clusters.

fviz_nbclust(the1[ , 2:4], hcut, method = "gap")

K-means clustering

In this part I apply k-means method of clustering. As was recommended by tests above, I look for 2 clusters. Since k-means is only applicable for numeric variables, I include only scores on teaching, research and international outlook in the analysis.

set.seed(100)
uniclust <- kmeans(the1[ , 2:4], 2)

I executed function of k-means clustering and now we can visualize the result.

Hierarchical clustering

In order to conduct hierarchical clustering, it is necessary to count euclidean distances between observations for numeric variables.

d <- dist(the1[ , 2:4], method = 'euclidean')

Now, I build a plot with distances between observations and a dendrogram.

heatmap(as.matrix(d), symm = T,
        distfun = function(x) as.dist(x))

The plot is aimed to show number of clusters, however it is hard to identify red squares indicating number of clusters in this particular one.

Now, we get the solution through three different type of distances: minimal, average and maximal.

hfit_ward <- agnes(d, method = "ward") 
hfit_average <- agnes(d, method = "average")
hfit_complete <- agnes(d, method = "complete")

To choose better method, let’s look at agglomerative coefficients of each.

hfit_ward$ac

## [1] 0.998086

hfit_average$ac

## [1] 0.9698641

hfit_complete$ac

## [1] 0.9832386

As seen from the output, ward method has the closest to 1 value, so I take this one. Let’s build a dendrogram.

According to the dendrogram, there are two clusters different from each other (judging by height of the first branch).

Let’s visualize clusters.

It is also possible to colour clusters in dendrogram.

Clustering mixed data

In this part I will try to do clustering using data of both numeric and categorical type. I will include ‘region’ variable in the analysis.

In order to do that, I count distances using ‘gower’ metric.

gower_dist <- daisy(the1[ , -1],
                    metric = "gower")

summary(gower_dist)

## 1163575 dissimilarities, summarized :
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001132 0.312730 0.374980 0.384530 0.461510 0.967030 
## Metric :  mixed ;  Types = I, I, I, N 
## Number of objects : 1526

In the summary we see that we have 3 integer and 1 nominal variable which is true.

Then, we build a distance matrix.

gower_mat <- as.matrix(gower_dist)

It is necessary to look at the most similar and dissimilar pairs in the data to see if they make sense

the1[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]

##                     name scores_teaching scores_research
## 1161 University of Fukui            21.4            10.2
## 891    Kurume University            21.5            10.4
##      scores_international_outlook    region
## 1161                         22.3 East Asia
## 891                          22.2 East Asia

the1[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]

##                                         name scores_teaching scores_research
## 1258 Maharaja Sayajirao University of Baroda            13.9             8.8
## 1                       University of Oxford            91.3            99.6
##      scores_international_outlook      region
## 1258                         15.1  South Asia
## 1                            96.4 West Europe

It seems that they make sense, so we can continue.

Now, it is necessary to understand how many clusters we can retain. I again apply silhouette width statistic.

fviz_nbclust(gower_mat, cluster::pam, method = "silhouette")

The plot shows that we should use 10 clusters.

And now we conduct clustering using partitioning around medoids (PAM).

pam_fit <- pam(gower_dist, diss = TRUE, k = 10)
pam_fit$data <- the1[ , -1]

I did clustering and now we can see the results in visual form.

Comparing results

For now I have three results of cluster analysis: two based on numeric data and one based on both numeric and categorical data. It is necessary to choose the best solution. Thus, let’s look at visualizations one more time.

Results of k-means clustering and hierarchical clustering look similar but do not seem to be very meaningful. In both plots 1st clusters are very scattered. Also, 1st and 2nd clusters in both plots are very close to each other and it is very hard to detect difference. Although PAM clustering with mixed data has quite a large number of clusters (10), it looks more meaningful. Thus, I would take the last result as the best solution.

Interpretaion

Since I chose the most appropriate result, I can now define what clusters basically mean. Let’s look at the statistics of each one.

pam_results <- the1 %>%
  dplyr::select(-name) %>%
  mutate(cluster = pam_fit$clustering)%>%
  group_by(cluster) %>%
  do(the_summary = summary(.)) 

pam_results$the_summary

## [[1]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :15.10   Min.   : 9.80   Min.   :42.40               
##  1st Qu.:23.00   1st Qu.:19.80   1st Qu.:63.10               
##  Median :31.30   Median :31.10   Median :76.50               
##  Mean   :33.32   Mean   :34.09   Mean   :75.16               
##  3rd Qu.:39.40   3rd Qu.:44.60   3rd Qu.:87.70               
##  Max.   :91.30   Max.   :99.60   Max.   :99.60               
##                                                              
##             region       cluster 
##  West Europe   :243   Min.   :1  
##  North Europe  : 18   1st Qu.:1  
##  Southeast Asia:  4   Median :1  
##  Africa        :  0   Mean   :1  
##  Australia     :  0   3rd Qu.:1  
##  South Europe  :  0   Max.   :1  
##  (Other)       :  0              
## 
## [[2]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :15.90   Min.   : 8.20   Min.   :21.30               
##  1st Qu.:26.00   1st Qu.:19.95   1st Qu.:38.55               
##  Median :34.45   Median :27.45   Median :50.05               
##  Mean   :38.97   Mean   :35.48   Mean   :52.34               
##  3rd Qu.:46.62   3rd Qu.:42.70   3rd Qu.:64.58               
##  Max.   :94.40   Max.   :98.80   Max.   :94.70               
##                                                              
##              region       cluster 
##  North America  :211   Min.   :2  
##  North Europe   :  8   1st Qu.:2  
##  Southeast Asia :  6   Median :2  
##  Central America:  1   Mean   :2  
##  Africa         :  0   3rd Qu.:2  
##  Australia      :  0   Max.   :2  
##  (Other)        :  0              
## 
## [[3]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :14.00   Min.   : 7.60   Min.   : 15.30              
##  1st Qu.:19.10   1st Qu.:11.80   1st Qu.: 22.80              
##  Median :23.15   Median :16.05   Median : 28.50              
##  Mean   :27.01   Mean   :22.65   Mean   : 32.42              
##  3rd Qu.:29.48   3rd Qu.:27.15   3rd Qu.: 37.25              
##  Max.   :89.60   Max.   :94.90   Max.   :100.00              
##                                                              
##             region       cluster 
##  East Asia     :288   Min.   :3  
##  Southeast Asia:  4   1st Qu.:3  
##  Africa        :  0   Median :3  
##  Australia     :  0   Mean   :3  
##  North Europe  :  0   3rd Qu.:3  
##  South Europe  :  0   Max.   :3  
##  (Other)       :  0              
## 
## [[4]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :16.40   Min.   :13.20   Min.   :59.80               
##  1st Qu.:22.38   1st Qu.:26.77   1st Qu.:84.15               
##  Median :25.60   Median :32.90   Median :89.25               
##  Mean   :31.48   Mean   :37.14   Mean   :87.27               
##  3rd Qu.:38.58   3rd Qu.:43.10   3rd Qu.:93.42               
##  Max.   :75.90   Max.   :90.80   Max.   :97.00               
##                                                              
##             region      cluster 
##  Australia     :45   Min.   :4  
##  North Europe  : 7   1st Qu.:4  
##  Southeast Asia: 3   Median :4  
##  Middle East   : 1   Mean   :4  
##  Africa        : 0   3rd Qu.:4  
##  South Europe  : 0   Max.   :4  
##  (Other)       : 0              
## 
## [[5]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :12.70   Min.   : 8.60   Min.   :25.90               
##  1st Qu.:17.93   1st Qu.:14.57   1st Qu.:39.20               
##  Median :20.00   Median :18.00   Median :44.50               
##  Mean   :22.69   Mean   :19.74   Mean   :46.02               
##  3rd Qu.:25.18   3rd Qu.:22.75   3rd Qu.:53.15               
##  Max.   :55.60   Max.   :41.40   Max.   :75.90               
##                                                              
##              region       cluster 
##  South Europe   :129   Min.   :5  
##  North Europe   :  7   1st Qu.:5  
##  Central America:  1   Median :5  
##  Southeast Asia :  1   Mean   :5  
##  Africa         :  0   3rd Qu.:5  
##  Australia      :  0   Max.   :5  
##  (Other)        :  0              
## 
## [[6]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :12.40   Min.   : 7.20   Min.   :22.60               
##  1st Qu.:15.45   1st Qu.: 7.90   1st Qu.:37.10               
##  Median :18.30   Median : 9.20   Median :43.10               
##  Mean   :19.21   Mean   :12.28   Mean   :43.78               
##  3rd Qu.:22.50   3rd Qu.:11.30   3rd Qu.:46.95               
##  Max.   :31.30   Max.   :43.90   Max.   :81.60               
##                                                              
##              region      cluster 
##  Africa         :63   Min.   :6  
##  Southeast Asia : 6   1st Qu.:6  
##  Central America: 4   Median :6  
##  North Europe   : 2   Mean   :6  
##  Australia      : 0   3rd Qu.:6  
##  South Europe   : 0   Max.   :6  
##  (Other)        : 0              
## 
## [[7]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :14.0    Min.   : 7.70   Min.   :16.50               
##  1st Qu.:17.4    1st Qu.: 9.60   1st Qu.:26.90               
##  Median :19.5    Median :11.60   Median :34.50               
##  Mean   :22.0    Mean   :13.99   Mean   :37.44               
##  3rd Qu.:22.6    3rd Qu.:14.50   3rd Qu.:44.50               
##  Max.   :80.0    Max.   :67.60   Max.   :75.80               
##                                                              
##              region       cluster 
##  East Europe    :124   Min.   :7  
##  Southeast Asia : 10   1st Qu.:7  
##  North Europe   :  2   Median :7  
##  Central America:  1   Mean   :7  
##  Africa         :  0   3rd Qu.:7  
##  Australia      :  0   Max.   :7  
##  (Other)        :  0              
## 
## [[8]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :11.70   Min.   : 7.10   Min.   :14.90               
##  1st Qu.:17.60   1st Qu.: 9.70   1st Qu.:18.90               
##  Median :21.40   Median :13.10   Median :24.00               
##  Mean   :23.01   Mean   :15.77   Mean   :37.53               
##  3rd Qu.:26.40   3rd Qu.:18.60   3rd Qu.:53.60               
##  Max.   :46.80   Max.   :52.00   Max.   :99.40               
##                                                              
##             region       cluster 
##  Middle East   :134   Min.   :8  
##  Southeast Asia:  2   1st Qu.:8  
##  North Europe  :  1   Median :8  
##  Africa        :  0   Mean   :8  
##  Australia     :  0   3rd Qu.:8  
##  South Europe  :  0   Max.   :8  
##  (Other)       :  0              
## 
## [[9]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :13.20   Min.   : 7.3    Min.   :14.60               
##  1st Qu.:15.80   1st Qu.: 8.9    1st Qu.:22.30               
##  Median :17.50   Median :10.1    Median :28.20               
##  Mean   :19.54   Mean   :12.1    Mean   :32.28               
##  3rd Qu.:21.20   3rd Qu.:12.9    3rd Qu.:44.00               
##  Max.   :56.60   Max.   :58.9    Max.   :58.90               
##                                                              
##              region      cluster 
##  South America  :90   Min.   :9  
##  Central America:13   1st Qu.:9  
##  Southeast Asia :10   Median :9  
##  Africa         : 0   Mean   :9  
##  Australia      : 0   3rd Qu.:9  
##  North Europe   : 0   Max.   :9  
##  (Other)        : 0              
## 
## [[10]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :12.10   Min.   : 7.20   Min.   :13.50               
##  1st Qu.:18.10   1st Qu.: 9.25   1st Qu.:16.15               
##  Median :23.40   Median :11.00   Median :19.30               
##  Mean   :25.34   Mean   :13.08   Mean   :24.30               
##  3rd Qu.:30.40   3rd Qu.:14.90   3rd Qu.:33.10               
##  Max.   :58.10   Max.   :53.10   Max.   :46.70               
##                                                              
##             region      cluster  
##  South Asia    :84   Min.   :10  
##  Southeast Asia: 3   1st Qu.:10  
##  Africa        : 0   Median :10  
##  Australia     : 0   Mean   :10  
##  North Europe  : 0   3rd Qu.:10  
##  South Europe  : 0   Max.   :10  
##  (Other)       : 0

It is quite hard to give names to clusters. Most of them are formed by regional characteristic. So, I will just name characteristics of clusters without naming. All in all, since our aim is to present universities to applicants, characteristics are more important than names.

1st cluster - good universities with moderately high teaching and research scores and rather high international orientation; located in Western Europe
2nd cluster - the best universities with the best research and learning environment and moderate international outlook; located in North America
3rd cluster - ordinary universities with moderate scores for everything; located in East Asia
4th cluster - the most internationally oriented universities with good learning and research environment; located in Australia
5th cluster - ordinary universities with lower scores for everything; located in South Europe
6th cluster - universities with the worst research and learning environment from Africa
7th cluster - ordinary universities with lower scores for everything; located in East Europe
8th cluster - ordinary universities with lower scores for everything from Middle East
9th cluster - ordinary universities with lower scores for everything from Central and South America
10th cluster - universities with bad research and learning environment from South Asia

Conclusion

To sum up, I made relatively informative, but still not perfect, clusterization of universities, As was presented in the plot, there are some outliers for many clusters which were assigned to one or another cluster somehow. Also, some regions were represented with small shares in different clusters. Thus, the results could be better, but nevertheless are meaningful.

Reflection on doing descriptive statistics in R

R suggests various ways of reporting data, descriptive statistics and model summaries. They are presented by different packages. In this paper I will review ‘base’, ‘psych’, ‘sjPlot’ and ‘table1’ packages and functions within them aimed to present data and models.

library(psych)
library(sjPlot)
library(table1)

I will bring brief descriptions of these packages from R documentation. base contains basic functions necessary to do simple operations in R. psych is a general purpose toolbox for personality, psychometric theory and experimental psychology. Some of the functions are written to support a book on psychometric theory as well as publications in personality research. sjPlot is a collection of plotting and table output functions for data visualization. Results of various statistical analyses can be visualized using this package, including simple and cross tabulated frequencies, histograms, box plots, linear models, mixed effects models, principal component analysis and correlation matrices, cluster analyses, scatter plots, stacked scales, effects plots of regression models and much more. Finally, table1 creates HTML tables of descriptive statistics, as one would expect to see as the first table in a medical/epidemiological journal article.

As seen from the description, although all these packages have different functions and purposes, they are quite useful. But still, there are different situations where we can apply one or another packages. Descriptive statistics are an important part of work with quantitative data. Three of four mentioned packages contain tools for doing descriptive statistics.

In base package we apply summary function and it produces simple R output. It presents minimal statistics: means, median, extremes and quantiles. It also works with categorical variables and show number of observations. I apply all functions in this paper on dataset ‘iris’ from ‘datasets’ package which is system library as well as ‘base’.

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

It is possible to retrieve more information with decribeBy function from psych package. Standard deviations, skewness and kurtosis appeared in the report. However, categorical variable “Species” is identified as numeric which is not good. But it is possible to group statistics by a variable. Good thing is that this table can be nicely printed through kable function in HTML reports.

describeBy(iris)

##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

kableExtra:: kable_styling(kableExtra::kable(describeBy(iris), digits = 2), bootstrap_options=c("bordered", "responsive", "striped"), full_width = FALSE)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Sepal.Length	1	150	5.84	0.83	5.80	5.81	1.04	4.3	7.9	3.6	0.31	-0.61	0.07
Sepal.Width	2	150	3.06	0.44	3.00	3.04	0.44	2.0	4.4	2.4	0.31	0.14	0.04
Petal.Length	3	150	3.76	1.77	4.35	3.76	1.85	1.0	6.9	5.9	-0.27	-1.42	0.14
Petal.Width	4	150	1.20	0.76	1.30	1.18	1.04	0.1	2.5	2.4	-0.10	-1.36	0.06
Species*	5	150	2.00	0.82	2.00	2.00	1.48	1.0	3.0	2.0	0.00	-1.52	0.07

table1 presents similar to summary information, but in more nice looking format. It is common to see such tables in research papers as was mentioned in description. This function also works fine with categorical data, and it is possible to group statistics by it. There is a little disadvantage that it is necessary to write the formula for table but not just insert data set.

table1(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)

	Overall (N=150)
Sepal.Length
Mean (SD)	5.84 (0.828)
Median [Min, Max]	5.80 [4.30, 7.90]
Sepal.Width
Mean (SD)	3.06 (0.436)
Median [Min, Max]	3.00 [2.00, 4.40]
Petal.Length
Mean (SD)	3.76 (1.77)
Median [Min, Max]	4.35 [1.00, 6.90]
Petal.Width
Mean (SD)	1.20 (0.762)
Median [Min, Max]	1.30 [0.100, 2.50]
Species
setosa	50 (33.3%)
versicolor	50 (33.3%)
virginica	50 (33.3%)

table1(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width | Species, data = iris)

	setosa (N=50)	versicolor (N=50)	virginica (N=50)	Overall (N=150)
Sepal.Length
Mean (SD)	5.01 (0.352)	5.94 (0.516)	6.59 (0.636)	5.84 (0.828)
Median [Min, Max]	5.00 [4.30, 5.80]	5.90 [4.90, 7.00]	6.50 [4.90, 7.90]	5.80 [4.30, 7.90]
Sepal.Width
Mean (SD)	3.43 (0.379)	2.77 (0.314)	2.97 (0.322)	3.06 (0.436)
Median [Min, Max]	3.40 [2.30, 4.40]	2.80 [2.00, 3.40]	3.00 [2.20, 3.80]	3.00 [2.00, 4.40]
Petal.Length
Mean (SD)	1.46 (0.174)	4.26 (0.470)	5.55 (0.552)	3.76 (1.77)
Median [Min, Max]	1.50 [1.00, 1.90]	4.35 [3.00, 5.10]	5.55 [4.50, 6.90]	4.35 [1.00, 6.90]
Petal.Width
Mean (SD)	0.246 (0.105)	1.33 (0.198)	2.03 (0.275)	1.20 (0.762)
Median [Min, Max]	0.200 [0.100, 0.600]	1.30 [1.00, 1.80]	2.00 [1.40, 2.50]	1.30 [0.100, 2.50]

Besides descriptive statistics, reports on tests and models are important to be shown. In case of base package, we appeal to already known summary. To be honest, summary is rather universal function able to summarize different objects in R. The output for linear models contain all important numbers and statistics.

mod1 <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(mod1)

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24675 -0.29657 -0.01515  0.27676  1.00269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.30660    0.07839   54.94   <2e-16 ***
## Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

sjPlot in its turn suggest proper looking tables and even allows to put several models together which is very convenient for comparison of different models. It is less detailed than previous one, but enough for interpretation.

mod2 <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
tab_model(model1, mod2)

	Q211			Sepal.Length
Predictors	Odds Ratios	CI	p	Estimates	CI	p
(Intercept)	0.66	0.59 – 0.74	<0.001	4.19	4.00 – 4.38	<0.001
eduT [1]	1.26	1.03 – 1.54	0.027
Petal.Length				0.54	0.40 – 0.68	<0.001
Petal.Width				-0.32	-0.64 – -0.00	0.048
Observations	1736			150
R² Tjur	0.003			0.766 / 0.763

It is also possible to create pretty correlation tables with tab_corr.

tab_corr(iris[ , 1:4])

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
Sepal.Length		-0.118	0.872***	0.818***
Sepal.Width	-0.118		-0.428***	-0.366***
Petal.Length	0.872***	-0.428***		0.963***
Petal.Width	0.818***	-0.366***	0.963***
Computed correlation used pearson-method with listwise-deletion.

Well, I brought some examples of how to do descriptive statistics and model summaries in R. It is clear that different packages can be applied in different situations because there are some requirements to follow. Thus, the choice of tools depends on the conditions. If it is a routine work and a person does some computations for him/herself, then some basic stuff is enough because it is simple, informative and easy to get. In this case base package is the best choice. For situations where some official report is demanded (e.g. in articles) it is better to use more advanced tools to meet publishing standards and requirements. It is possible to use sjPlot or table1 since they produce good tables. Other way is to apply kableExtra package in combination with other packages (e.g. psych) to create great customized tabulated reports. The choice is choice is not limited by these packages and there are many other options to do great data descriptions in more interactive way for some web-pages.

To sum up, if to create some toolset for beginners who could make convincing reports, I would suggest psych :: decribeBy + customized kable table for descriptive statistics because it is quite simple to create and fully informative. As for regression models, which are frequently reported, I recommend sjPlot :: tab_modelbecause it is also easy to make and it produce nice-looking comparable tables. These two tools are more or less enough for newcomers in R. As for serious papers there are customized in classic themes kable tables and table1 just because it is common to see products of these tools in existing articles.

Reflection on Bayesian thinking

At first, I would say that Bayesian thinking is not the thing that can be understood by myself on the first try to be honest, but I will try to do my best in this paper. Since me and my friends did mostly qualitative research in previous year, it is rather impossible to apply Bayesian thinking to those works. Thus, I will try to analyze my current diploma paper which I work on.

It is necessary to briefly describe the idea of my research. The topic sounds something like ‘association between health-related occupations and risky sexual behavior’. And the general research question is the following: “Is there any association between being a medical worker and engagement in risky sexual behavior?” The data comes from General Social Survey and the method of data analysis will be binary logistic regression.

Well, Bayesian inference works with probabilities. So, in Bayesian perspective the research question will be changed in such a way: “What is the probability to be engaged in risky sexual behavior under condition of being a medical worker?” In order to do Bayesian inference, it is necessary to have generative mode, prior probability and data. So, if I were doing my research in Bayesian perspective, I would need some of these things. As I was mentioned, I use data from General Social Survey. I will need some prior probability of being engaged in risky sexual behavior as well as shares of medical workers in population. And on the one hand I could take this prior probability from some sources or statistics and then use General Social Survey as data for doing inference. On the other hand, if I had enough resources for doing in such a way, I could count prior probability of risky sexual behavior based on General Social Survey data and then conduct my own survey to attain data for making inference. There is even one more possible way of doing. Bayesian thinking also implies some kind of updating of knowledge by multiple trials. And since General Social Survey is a periodic study which is conducted every two years, then I could use several rounds of survey for updating the model and transforming posterior probability into prior probability for the next step.

It seems that the result of the study would be a single probability or a credential interval that would answer the research question, Since Bayesian inference has nothing to do with null hypothesis testing, p-values and other stuff, it is not necessary to provide any additional statistics to justify significance of the results.

As I said previously, Bayesian thinking is not a very easy issue, so ordinary people may not catch the idea of the study if it will be presented in pure scientific words. Still, it is important to deliver the insight of study to public and stakeholders as it carries some social value. I think, the better way for presenting the whole thing is by using graphics showing distributions of probabilities, intersections of conditions, etc. It seems to be more intuitive to understand the meaning through the picture rather than just with some numbers.

Portfolio of data analysis projects

by Artyom Kulikov

12/12/2020

Scraping data from Wikipedia

Introduction

Data scraping

Building a plot

Summary

Project on logistic regression

Introduction

Getting started

Data description

Attending demonstrations

Interest in politics

Education

Size of town

Type of settlement

Region

Age

Income

Tertiary education

Dealing with missing data

Data imputation

Visualization of imputed data

Building a model

Interpretation

Predicting outcome

Model quality

Model accuracy

Model diagnostics

Conclusion

Project on principle component analysis

Introduction and preparation

Exploring the data

Principle component analysis

Further analysis

Project on cluster analysis

Introduction

Data and package preparation

Choice of variables

Data description

scores_teching

scores_research

scores_international_outlook

region

Cluster analysis

How many clusters to retain?

K-means clustering

Hierarchical clustering

Clustering mixed data

Comparing results

Interpretaion

Conclusion

Reflection on doing descriptive statistics in R

Reflection on Bayesian thinking