https://rpubs.com/staszkiewicz/CCR_rep_EN
https://www.dropbox.com/scl/fo/sajrtchhqeftivhd4zwqu/h?rlkey=u37l184e2khoxfufulsyq00kc&dl=0
Paste the link above into your browser and download the CSV data file "Dobre_dane_Regresja_czyste_8_Wrzesnia_2015gdt.csv" to replicate the code below. Click the "Code" button (upper right corner) to see the original R code.
In systematic review methodology, the inclusion or exclusion of literature items is a major area of researcher discretion, and thus a methodological weakness, because we lack objective and complete criteria for including or excluding items. In practice, various filtering criteria are used, e.g. journal ranking, citation count, language, type of contribution (e.g. only articles, no monographs), time period, etc. This is because the initially identified population can run to several thousand items, while a reasonable literature review covers up to about 100 items. Typically, the PRISMA protocol is used, i.e.
Liberati et al. (2009)
The image shows, in the oval, the location of the population reduction step in the PRISMA flow according to Liberati et al. (2009).
The idea behind citation count regression is that you do not need to choose the filtering mechanism arbitrarily: it can be done algorithmically. Alternatively, CCR can be used in a robustness section, to demonstrate the stability of the study's results.
Let's imagine that we have already identified the entire population of articles relating to an issue of interest (e.g., non-audit components of auditors' fees). There may be a vast number of these items; some are important and recognized, while others may be marginal additions to the mainstream discussion. Someone with expertise in auditing who read all this literature could select the important items for the review. Such a discretionary choice rests on the prestige, knowledge and competence of whoever makes the selection. However, two people with similar levels of knowledge and competence can make different choices - this is called the researcher's bias. Given the above, how can the selection be made objective? PRISMA, in its essence, does not answer this; it only recommends replicability, i.e. explaining the choice. CCR (citation count regression) is built on a different philosophy: it attempts to map the entire research area and select those items that either most affect the mainstream or mark the limits of cognition.
CCR is a regression in which the dependent variable is the annual number of citations of a paper, and the independent variables are drawn from bibliometric studies of factors affecting citations. CCR assumes that the number of citations indicates the scientific quality of a paper and is a stimulant (i.e., more citations equals better quality). This assumption is quite naive and heavily criticized, but the instrument itself is quite reasonable.
\[ y = aX + b \] In our example \(y\) is the annual number of citations, \(X\) is a vector of independent variables, \(a\) a vector of slope coefficients, and \(b\) the intercept.
The vector \(X\) can consist of any logical set of variables typical of the domain. Without exploring the topic further, let's consider the following variables affecting the number of citations: \(Years\), the number of years since the article was published; \(BigSample\), a binary variable equal to 1 if the article uses a sample of more than 1000 items; \(Method\), a binary variable equal to 1 for regression-based studies; \(AngloSaxon\), a binary variable equal to 1 if the market discussed in the article is the US, Australia, Singapore, the UK or New Zealand; \(USA\), a binary variable equal to 1 if the study concerns the US; \(TimeSpan\), the time range of the sample analyzed in the article (number of years); \(SampleSize\), the number of items in the sample; and \(BusinessSupport\), which takes the value 1 when the study was externally funded by business entities. A toy coding example is sketched below.
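As an illustration only (hypothetical values, not the real dataset), such metadata could be coded as follows:

# hypothetical coding of bibliometric metadata into regression variables
papers <- data.frame(
  Years           = c(12, 3, 7),      # years since publication
  BigSample       = c(1, 0, 1),       # 1 if sample > 1000 items
  Method          = c(1, 1, 0),       # 1 if a regression-based study
  AngloSaxon      = c(1, 0, 1),       # 1 if US/AU/SG/UK/NZ market
  USA             = c(1, 0, 0),       # 1 if the study concerns the US
  TimeSpan        = c(5, 1, 10),      # years covered by the sample
  SampleSize      = c(2400, 98, 860), # number of items in the sample
  BusinessSupport = c(0, 0, 1)        # 1 if externally funded by business
)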
Let's imagine that we have a collection of articles imported from an abstract database (e.g. WoS or Scopus); the variables for the regression are built from the metadata, i.e. the information found in the bibliographic description or abstract of each item.
If we draw a scatter plot, i.e. the number of citations versus the independent variables, we basically get either a diffuse random cloud (Panel A) or a set of points resembling some curve (Panel B). In both cases we can fit a linear regression (Panels C and D), but the fitting errors can be significant (interestingly, the method is robust to them).
We can interpret the blue regression line in Panels C and D as the main current of the analyzed literature. Most of the papers lie close (in terms of the independent variables) to the regression line; hence the papers far from it either concern side areas or new perspectives, or represent substantively unsuccessful papers over which other researchers have mercifully drawn the curtain of silence.
From there it is a short step to the idea of CCR. Let's add one more paper to our list, represented by the red dot in Panels E and F. Recalculating the regression and plotting it again, we get a new regression line (the red line in Panels E and F). This one paper affects the mainstream by deflecting it slightly (Panel F) or by significantly changing its direction (Panel E). This change is nothing but \(\Delta\), i.e. the change in the slope coefficient of the regression after the new article is taken into account. In view of the above, our literature essentially consists of articles inside the mainstream and those outside it that significantly affect it.
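In the notation of the equation above, for a newly added paper \(i\):

\[ \Delta_i = \hat{a}^{(+i)} - \hat{a} \]

where \(\hat{a}\) is the slope estimated without paper \(i\), and \(\hat{a}^{(+i)}\) the slope after adding it.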
library(ggplot2)
# generate toy data
y <- runif(100, min = 0, max = 50)
z <- runif(100, min = 1, max = 12)
x <- 2*y + z
model <- lm(y ~ x)
# data frames for plotting (y is picked up from the global
# environment in the plots below, which ggplot allows)
df1 <- data.frame(x, z)
df  <- data.frame(x, y)
df2 <- rbind(df,  c(75, 0.5))  # append the extra "new paper" point
df3 <- rbind(df1, c(75, 0.5))
plot1 <- ggplot(df1, aes(z, y)) +
  geom_point() +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel A")
plot2 <- ggplot(df, aes(x, y)) +
  geom_point() +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel B")
plot1a <- ggplot(df1, aes(z, y)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel C")
plot2a <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  ggtitle("Panel D")
# extract the parameters of the fitted model
coeff <- coefficients(model)
wolny  <- coeff[1]  # intercept
nachyl <- coeff[2]  # slope
plot3 <- ggplot(df3, aes(x, z)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  annotate("point", x = 10, y = 0.5, colour = "red") +  # the added point
  geom_abline(intercept = wolny, slope = nachyl, color = "red",
              linetype = "solid", size = 1.5) +
  ggtitle("Panel E")
plot4 <- ggplot(df2, aes(x, y)) +
  geom_point() +
  geom_smooth(method = lm) +
  geom_smooth() +
  theme_bw() +
  labs(x = 'Independent variables', y = 'Yearly number of citations') +
  annotate("point", x = 75, y = 0.5, colour = "red") +  # the added point
  geom_abline(intercept = wolny, slope = 0.2, color = "red",
              linetype = "solid", size = 1.5) +
  ggtitle("Panel F")
require(gridExtra)
grid.arrange(plot1, plot2, plot1a, plot2a, plot3, plot4, ncol = 2, nrow = 3)
Our task thus boils down to removing one point (article) at a time, recalculating the regression slope for all articles and for all but that one, computing the difference between the slopes, and finding those differences that are statistically significant (there is a test that shows us this). Without going into the technique itself, the method allows us to reduce the original population roughly tenfold, to a sample of the articles that matter. A minimal sketch of the leave-one-out idea on the toy data above follows.
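This sketch is illustrative only (the formal significance test is described in the paper); it reuses the toy data frame df generated above:

# leave-one-out change in the slope on the toy data
slope_all <- coef(lm(y ~ x, data = df))["x"]
delta <- sapply(seq_len(nrow(df)), function(i) {
  slope_all - coef(lm(y ~ x, data = df[-i, ]))["x"]
})
# points whose removal moves the slope the most
head(order(abs(delta), decreasing = TRUE))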
Since the method is abstract and algorithmic, it removes the problem of researcher bias; however, we have no assurance that a fundamental mainstream work will not be missed. This feature is its weakness, so in practice we use a combination of algorithmic and discretionary methods. For methodological details, see Staszkiewicz (2019).
Interestingly, the presented algorithm is suitable both for selecting articles and for checking whether a given literature review covers all relevant items.
Step 1 We load the data and calculate the regression
Step 2 We calculate leveraged items
Step 3 We determine a sample of articles for further narrative description
Step 4 We evaluate the stability of the results and the characteristics of the sample
Step 5 We discuss the technical issues of importing data from WoS
We should prepare metadata about our articles; the definitions of the variables are included in the table below:
Source: Staszkiewicz (2019).
Let’s load the data into the DF object. We can prepare the data in Excel. We will discuss data import in Step 5.
library(readr)
DF <- read_csv(file.choose())
## Rows: 67 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): PublicationYears, BigSample, Method, AngloSaxon, USA, TimeSpan, Sa...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# conversion to a plain data frame, with row names set to the paper index
DF<- as.data.frame(DF)
rownames(DF)<-DF$Nr_Papieru
To load the data into the DF object, we used the readr library and the file.choose() function, which lets us pick the data file from a dialog window (this can be automated as well; see the sketch below). We can view the data with View() or head(DF), and list all the variables available in our database with colnames(DF).
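A fully scripted alternative (a minimal sketch, assuming the CSV file sits in the working directory):

# non-interactive load; assumes the file is in the working directory
DF <- read_csv("Dobre_dane_Regresja_czyste_8_Wrzesnia_2015gdt.csv")
DF <- as.data.frame(DF)
rownames(DF) <- DF$Nr_Papieru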
head(DF)
## PublicationYears BigSample Method AngloSaxon USA TimeSpan Sample
## 2 35 0 0 1 1 0 243
## 3 33 0 0 1 1 0 410
## 5 22 0 1 1 1 1 98
## 6 22 0 1 1 1 3 860
## 7 20 0 0 0 0 0 287
## 8 16 0 0 0 0 0 2428
## BusinessSupport TC TCYear EXCLUDED PY Nr_Papieru
## 2 0 0 0.0000000 0 1980 2
## 3 0 4 0.1212121 0 1982 3
## 5 0 49 2.2272727 0 1993 5
## 6 0 35 1.5909091 0 1993 6
## 7 0 6 0.3000000 0 1995 7
## 8 0 7 0.4375000 0 1999 8
colnames(DF)
## [1] "PublicationYears" "BigSample" "Method" "AngloSaxon"
## [5] "USA" "TimeSpan" "Sample" "BusinessSupport"
## [9] "TC" "TCYear" "EXCLUDED" "PY"
## [13] "Nr_Papieru"
Variable no. 13, Nr_Papieru, is the index of our papers from the database.
Let's see the structure of the data - descriptive statistics. For this we will use the psych library and the describe() function (this can be done in different ways, of course).
# load the library
#install.packages("psych") # if not yet installed
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# descriptive statistics
describe(DF)
## vars n mean sd median trimmed mad min max
## PublicationYears 1 67 8.52 8.16 7.0 7.05 5.93 0 36.00
## BigSample 2 67 0.27 0.45 0.0 0.22 0.00 0 1.00
## Method 3 67 0.70 0.46 1.0 0.75 0.00 0 1.00
## AngloSaxon 4 67 0.69 0.47 1.0 0.73 0.00 0 1.00
## USA 5 67 0.45 0.50 0.0 0.44 0.00 0 1.00
## TimeSpan 6 67 3.87 7.57 1.0 2.20 1.48 0 40.00
## Sample 7 67 1755.17 6348.16 371.0 779.31 413.65 0 51755.00
## BusinessSupport 8 67 0.04 0.21 0.0 0.00 0.00 0 1.00
## TC 9 67 22.52 46.55 3.0 10.87 4.45 0 227.00
## TCYear 10 67 2.17 3.95 0.5 1.19 0.74 0 18.92
## EXCLUDED 11 67 0.00 0.00 0.0 0.00 0.00 0 0.00
## PY 12 67 2006.48 8.16 2008.0 2007.95 5.93 1979 2015.00
## Nr_Papieru 13 67 64.72 44.95 55.0 64.07 65.23 1 137.00
## range skew kurtosis se
## PublicationYears 36.00 1.80 3.10 1.00
## BigSample 1.00 1.02 -0.97 0.05
## Method 1.00 -0.86 -1.28 0.06
## AngloSaxon 1.00 -0.79 -1.40 0.06
## USA 1.00 0.21 -1.99 0.06
## TimeSpan 40.00 3.67 13.88 0.92
## Sample 51755.00 7.29 54.45 775.55
## BusinessSupport 1.00 4.30 16.78 0.03
## TC 227.00 2.65 6.79 5.69
## TCYear 18.92 2.48 5.70 0.48
## EXCLUDED 0.00 NaN NaN 0.00
## PY 36.00 -1.80 3.10 1.00
## Nr_Papieru 136.00 0.13 -1.49 5.49
Let’s compare to the original:
Source: Staszkiewicz (2019).
As you can see, our data differ slightly from the original in the text (due to the exclusion of one paper: here 67 observations, in the original paper 68).
Let’s calculate the regression with this equation:
\[ TC = \beta_0 + \beta_1 PublicationYears + \beta_2 BigSample + \beta_3 Method + \beta_4 AngloSaxon + \beta_5 TimeSpan + \beta_6 Sample + \beta_7 BusinessSupport + \beta_8 USA + \varepsilon \] Source: Staszkiewicz (2019).
For simplicity, we will use a simple regression:
TCY <- lm(TCYear ~ PublicationYears + BigSample + Method + AngloSaxon + TimeSpan + Sample + BusinessSupport, data = DF)
After running this code, an object TCY is created, containing the fitted linear regression model. Let's inspect it with the summary() function.
summary(TCY)
##
## Call:
## lm(formula = TCYear ~ PublicationYears + BigSample + Method +
## AngloSaxon + TimeSpan + Sample + BusinessSupport, data = DF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6610 -1.4347 -0.4097 0.3778 13.8215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.085e+00 1.206e+00 -0.900 0.3718
## PublicationYears 1.019e-01 5.645e-02 1.806 0.0760 .
## BigSample 3.583e+00 1.129e+00 3.173 0.0024 **
## Method 6.020e-01 1.069e+00 0.563 0.5755
## AngloSaxon 8.287e-01 1.027e+00 0.807 0.4231
## TimeSpan 1.106e-01 6.237e-02 1.773 0.0815 .
## Sample -5.256e-05 7.557e-05 -0.696 0.4895
## BusinessSupport 2.185e+00 2.184e+00 1.001 0.3212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.633 on 59 degrees of freedom
## Multiple R-squared: 0.2427, Adjusted R-squared: 0.1529
## F-statistic: 2.701 on 7 and 59 DF, p-value: 0.01693
Our model is equivalent to the M4 model from the paper (circled in red); the differences are due to the missing observation and to the heteroskedasticity correction used in the original calculation, which we do not need to repeat here.
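If we wanted to mimic that correction, one possibility is heteroskedasticity-consistent standard errors from the sandwich and lmtest packages (a sketch, not the author's exact procedure):

# robust (heteroskedasticity-consistent) standard errors - a sketch;
# assumes the sandwich and lmtest packages are installed
library(sandwich)
library(lmtest)
coeftest(TCY, vcov = vcovHC(TCY, type = "HC1"))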
Note that the model has a low adjusted \(R^2\), so its predictive value is quite weak; in our output only \(BigSample\) is significant at the 5% level, with \(PublicationYears\) and \(TimeSpan\) marginally significant. Neither the significance of the variables nor the fit of the model is necessarily relevant for further consideration - why, I refer interested readers to the methodological details.
For now, let's look at the results of our estimation graphically, applying the plot() function.
plot(TCY)
An observation is considered to have high leverage if its values of the predictor variables are significantly more extreme than those of the other observations in the dataset.
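Formally, the leverage of observation \(i\) is the \(i\)-th diagonal element of the hat matrix:

\[ H = X(X^{T}X)^{-1}X^{T}, \qquad h_{ii} = [H]_{ii} \]

where \(X\) is the design matrix; \(h_{ii}\) measures how far observation \(i\) lies from the centre of the predictor space.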
The hatvalues() function is used to calculate the leverage for each observation in the model.
# compute the leverage values
hats <- as.data.frame(hatvalues(TCY))
# list the values
hats
## hatvalues(TCY)
## 2 0.21290798
## 3 0.19167812
## 5 0.07835187
## 6 0.07988768
## 7 0.10611081
## 8 0.09475865
## 9 0.04597844
## 10 0.04404193
## 11 0.07201387
## 12 0.07308525
## 15 0.13581008
## 19 0.07100794
## 20 0.06762549
## 21 0.04526655
## 22 0.37665168
## 23 0.09072711
## 24 0.08400364
## 29 0.04133848
## 30 0.06246831
## 35 0.34292671
## 36 0.08739606
## 37 0.04062334
## 38 0.18367009
## 43 0.07222597
## 44 0.39530021
## 45 0.06844399
## 46 0.04191017
## 51 0.04291890
## 52 0.08724704
## 53 0.04695298
## 54 0.06736653
## 55 0.06529267
## 62 0.04571295
## 63 0.04351669
## 64 0.97998610
## 67 0.04362519
## 68 0.07036614
## 77 0.09122794
## 78 0.06180910
## 88 0.34836347
## 90 0.07183608
## 91 0.04735230
## 101 0.07554022
## 102 0.37576933
## 103 0.10478660
## 104 0.05068387
## 105 0.13679124
## 106 0.07345583
## 107 0.13162536
## 108 0.05073110
## 117 0.06632617
## 118 0.12817979
## 119 0.09669277
## 120 0.08084825
## 121 0.07821484
## 122 0.09649332
## 123 0.07595300
## 124 0.07221547
## 127 0.05541486
## 128 0.09904922
## 129 0.06116892
## 130 0.07186645
## 131 0.06992124
## 136 0.06570245
## 137 0.11032187
## 1 0.25670894
## 4 0.19575441
Please note that the row labels carry the index of the individual papers; we set this when importing the data (here numerically; in practice I suggest the first author and year of publication, e.g. Novak2011 - it is easier to search and remember).
Let's sort the results in descending order:

hats[order(-hats[['hatvalues(TCY)']]), ]
## [1] 0.97998610 0.39530021 0.37665168 0.37576933 0.34836347 0.34292671
## [7] 0.25670894 0.21290798 0.19575441 0.19167812 0.18367009 0.13679124
## [13] 0.13581008 0.13162536 0.12817979 0.11032187 0.10611081 0.10478660
## [19] 0.09904922 0.09669277 0.09649332 0.09475865 0.09122794 0.09072711
## [25] 0.08739606 0.08724704 0.08400364 0.08084825 0.07988768 0.07835187
## [31] 0.07821484 0.07595300 0.07554022 0.07345583 0.07308525 0.07222597
## [37] 0.07221547 0.07201387 0.07186645 0.07183608 0.07100794 0.07036614
## [43] 0.06992124 0.06844399 0.06762549 0.06736653 0.06632617 0.06570245
## [49] 0.06529267 0.06246831 0.06180910 0.06116892 0.05541486 0.05073110
## [55] 0.05068387 0.04735230 0.04695298 0.04597844 0.04571295 0.04526655
## [61] 0.04404193 0.04362519 0.04351669 0.04291890 0.04191017 0.04133848
## [67] 0.04062334
The question that arises is: which of these results are statistically significant?
Let's view the leverage values graphically:
plot(hatvalues(TCY), type = 'h')
Our cut-off statistic is

\[ h^* = \frac{2(k-1)}{n} \]

where \(h^*\) is the cut-off value, \(k\) the number of estimated parameters, and \(n\) the number of observations.

In our case \(k = 8\) parameters (we count the intercept) and \(n = 67\) observations, hence the cut-off value is \(2(8-1)/67 \approx 0.2089552\).
So all papers whose hat value exceeds 0.2089552 are the papers we want for the descriptive review. Let's see which ones they are:
cutoff <- 2*(8-1)/67
z <- hats > cutoff
z
## hatvalues(TCY)
## 2 TRUE
## 3 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## 8 FALSE
## 9 FALSE
## 10 FALSE
## 11 FALSE
## 12 FALSE
## 15 FALSE
## 19 FALSE
## 20 FALSE
## 21 FALSE
## 22 TRUE
## 23 FALSE
## 24 FALSE
## 29 FALSE
## 30 FALSE
## 35 TRUE
## 36 FALSE
## 37 FALSE
## 38 FALSE
## 43 FALSE
## 44 TRUE
## 45 FALSE
## 46 FALSE
## 51 FALSE
## 52 FALSE
## 53 FALSE
## 54 FALSE
## 55 FALSE
## 62 FALSE
## 63 FALSE
## 64 TRUE
## 67 FALSE
## 68 FALSE
## 77 FALSE
## 78 FALSE
## 88 TRUE
## 90 FALSE
## 91 FALSE
## 101 FALSE
## 102 TRUE
## 103 FALSE
## 104 FALSE
## 105 FALSE
## 106 FALSE
## 107 FALSE
## 108 FALSE
## 117 FALSE
## 118 FALSE
## 119 FALSE
## 120 FALSE
## 121 FALSE
## 122 FALSE
## 123 FALSE
## 124 FALSE
## 127 FALSE
## 128 FALSE
## 129 FALSE
## 130 FALSE
## 131 FALSE
## 136 FALSE
## 137 FALSE
## 1 TRUE
## 4 FALSE
sum(z, na.rm=FALSE)
## [1] 8
In the row labels we have the indexes of the papers, while in the second column we have the information on whether or not the paper is influential.
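To pull out the indices of the influential papers directly, a small convenience step (z keeps the row labels set during import):

# list the row labels of the influential papers
rownames(hats)[z[, 1]]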
Let’s see how it worked out in the original paper:
Source: Staszkiewicz (2019).
In the original we have 7 papers; in our replication we get 8. Hence the population is reduced roughly tenfold.
It seems that this method selects the oldest item and the most recent one, usually a review (like Habib here), while the rest depends on how central a paper is to the mainstream discussion. Technically, verification was done against other literature reviews, but there is no formal test for this.
The whole method is based on bibliographic metadata. In practice the DOI is crucial, because it uniquely identifies the paper and its attributes, but it carries no citation counts. Therefore the bibliometric databases WoS and Scopus are used; unfortunately, the two have different algorithms for collecting citations, so they are inconsistent. You can do a review on WoS and check its stability with Scopus or vice versa, although I don't recommend it - a lot of work for little gain. You can also export from Mendeley or Zotero and add citation counts manually; labor-intensive, but when writing a review in Word it may be worth the time investment. For markup work I recommend Zotero because it integrates more smoothly.
Back to bibliometric databases. Access is generally possible after registration; it is advisable to export the full record (including the abstract), which helps in coding the variables.
An example of how to do this:
Importing from WoS from SGH Library https://www.youtube.com/watch?v=8wVqVvEsYUc&t=2s.
Loading CSV into Excel https://www.youtube.com/watch?v=pY895-S1VWE&t=1s.
Of course, the whole procedure can be done faster in Gretl https://www.youtube.com/watch?v=XIFPuDmrAmA.
But I do not recommend it, because it is nevertheless a high-level tool.
In practice, the method is used either to reduce the population of papers for a descriptive (narrative) review, as a supplement to the method for a systematic review, or - more from the reviewer's position - to check whether a presented review article is complete and whether there is significant researcher bias.